We’ve been experiencing an interesting problem on our cluster nodes which causes them to freeze up. It appears to be related to the way the Linux kernel in CentOS deals with memory allocation requests. The freeze happens when the swap partition on a machine fills completely: any attempt to start a new process hangs, waiting for swap space to become available, which it never does. There are several ways of trying to deal with this. The first is to rely on the kernel's OOM killer, a mechanism (not a separate process) that is triggered when the system runs out of memory and kills a process it decides can be sacrificed for the greater good. This is a good description of the memory issues and how to test for them.
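To catch a node before its swap fills completely, one option is to watch the swap counters the kernel exposes in `/proc/meminfo`. The following is a minimal sketch, not what we actually run on the cluster: the function names and the 90% threshold are assumptions chosen for illustration, and it relies only on the standard `SwapTotal` and `SwapFree` fields.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap in use, or 0.0 if no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def swap_nearly_full(meminfo, threshold=0.9):
    """True if swap usage is at or above the (hypothetical) threshold."""
    return swap_used_fraction(meminfo) >= threshold
```

On a node, the input would be the text read from `/proc/meminfo`, for example via `open("/proc/meminfo").read()`. A check like this could run from cron and raise an alert while the machine is still responsive enough to act on it.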