Alan discovered that we can deal with a stale NFS file handle on zeus by using the following command:
umount -l directory
This performs a lazy unmount: the filesystem is detached from the hierarchy immediately, and the references to it are cleaned up once it is no longer busy.
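A stale handle usually shows up as an ESTALE error from stat() or open() on the mount point. Purely as a sketch of automating the same fix (the mount point name is a placeholder, and umount still has to run as root):

import errno
import os
import subprocess

MOUNT_POINT = "/mnt/zeus"   # placeholder; substitute the real NFS mount point

def recover_if_stale(path):
    # A stale NFS file handle surfaces as an ESTALE error from stat().
    try:
        os.stat(path)
    except OSError as err:
        if err.errno == errno.ESTALE:
            # Same fix as above: lazy unmount, after which the share can be remounted.
            subprocess.call(["umount", "-l", path])
            return True
    return False

recover_if_stale(MOUNT_POINT)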
Computers running Rocks software that receive updates via yum or another method have a unique issue when it comes to kernel upgrades. Part of the Rocks system involves changing the grub.conf file at boot time, depending on how the system was shut down: if the computer was shut down properly, grub-orig.conf is copied over grub.conf; if not, rocks.conf is used for the startup.
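The actual swap is handled by Rocks' own boot scripts; purely to illustrate the behavior described above (the /boot/grub path is an assumption on my part), it amounts to something like:

import shutil

GRUB_DIR = "/boot/grub"   # assumed location of the grub configuration files

def select_grub_conf(clean_shutdown):
    # Clean shutdown: put the normal boot configuration back.
    # Unclean shutdown: the Rocks configuration is used for the next startup.
    source = "grub-orig.conf" if clean_shutdown else "rocks.conf"
    shutil.copy("%s/%s" % (GRUB_DIR, source), "%s/grub.conf" % GRUB_DIR)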
The rpm package “file” conflicted with the dependencies for httpd on all the nodes except neuron-0-24, which had recently been re-installed. Running yum clean metadata on the other nodes let them update correctly.
In yum-fastestmirror-1.1.16-13.el5.centos, I discovered a bug involving the exclude option in fastestmirror.conf. If you exclude two mirrors that are adjacent in the mirror array that fastestmirror.py searches for excluded entries, the second one will not be removed from the list of possible mirrors.
This occurs because the for loop that iterates through the array skips the value immediately after an entry is removed. I was able to fix it by changing line 195 from:
for mirror in repomirrors[str(repo)]:
to
for mirror in repomirrors[str(repo)][:]:
Now the loop iterates over a copy of the array, so entries can be deleted from the original array without affecting the next value the loop sees.
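The skip is easy to reproduce with a plain list (the names below are made up for the demonstration, not taken from fastestmirror.py):

mirrors = ["mirror-a", "mirror-b", "mirror-c", "mirror-d"]
excludes = ["mirror-b", "mirror-c"]              # two adjacent excluded mirrors

# Buggy version: removing an entry shifts the list under the iterator,
# so "mirror-c" is never examined and survives the exclusion.
for mirror in mirrors:
    if mirror in excludes:
        mirrors.remove(mirror)
print(mirrors)   # ['mirror-a', 'mirror-c', 'mirror-d']

# Fixed version: iterate over a copy, delete from the original.
mirrors = ["mirror-a", "mirror-b", "mirror-c", "mirror-d"]
for mirror in mirrors[:]:
    if mirror in excludes:
        mirrors.remove(mirror)
print(mirrors)   # ['mirror-a', 'mirror-d']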
It appears that this bug has been resolved in yum-utils-1.1.18 and higher, which uses the filter function to run through each mirror and test it against the exclude list. I’m learning all sorts of neat Python tricks today!
It’s been a while since I’ve posted, and we’ve managed to clear a few things off the task list, so I’m going to talk a little bit about them here. The cluster update was a success, but I found a side effect of having the path include a reference to itself (to allow cluster-specific binaries to be available). When adding packages or changing the default partitioning scheme for the cluster nodes, things went wrong if the root environment was reached via su from a user who had this path setting: the distribution would build, but the install process would fail because key files weren’t copied into the hierarchy. Specifically, updates.img and stage2.img would be missing.
The Active Directory move was accomplished by using several VMware instances to test out the various stages. There is still some work to be done with the security certificates, and then the old domain structure can be removed. The CS disk space move was accomplished without incident.
We’ve been experiencing an interesting problem on our cluster nodes which causes them to freeze up. It appears to be related to the way the Linux kernel in CentOS deals with memory allocation requests. The issue is caused by the swap partition on a machine filling completely, which freezes the system: any attempt to start a new process hangs, waiting for swap space to become available (which it never does). There are several ways of trying to deal with this. The first is to rely on the OOM killer, the kernel mechanism that detects when the memory limit is about to be reached and kills a process it decides can be sacrificed for the greater good. This is a good description of the memory issues and how to test for them.
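As a quick check before a node locks up, swap usage can be read straight out of /proc/meminfo (the field names below are the standard kernel ones; the 90% threshold is an arbitrary choice of mine):

# Warn when a node's swap is close to being exhausted.
def swap_usage():
    values = {}
    for line in open("/proc/meminfo"):
        key, rest = line.split(":", 1)
        values[key] = int(rest.split()[0])   # sizes are reported in kB
    if values["SwapTotal"] == 0:
        return 0.0
    return 1.0 - float(values["SwapFree"]) / values["SwapTotal"]

if swap_usage() > 0.90:
    print("WARNING: swap is nearly full; new allocations may hang this node")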
The cluster downtime was avoided, thanks to some helpful advice from Dell. Our cluster uses a pair of PowerVault 220S enclosures that are configured in a RAID 5 array. When two of the drives in each enclosure went into predictive-failure mode, we needed to replace them to ensure the integrity of our data. Since we’re running Rocks for the cluster OS, the Dell OpenManage tools that would allow us to do the hot swap while the cluster was running weren’t available.
I didn’t want to install the OpenManage software since it seemed to have a large number of modules and really looked like it might be work (which I try to avoid). A Dell rep I talked to recommended DELLmgr instead. He told me to ignore the other rpms and install only Dellmgr-5.25-0.i386.rpm. It installs one file, dellmgr.bin, which talks to the PERC controller card and gives you an interface very similar to the one in the card’s BIOS, no restart required. I was able to fail the faulty drives and do the rebuild without having to alter the cluster’s running state at all.
It’s a shame that Dell no longer supports it and hasn’t released a version for the new controller cards.
We’re trying out PBRT (Physically Based Rendering) on our cluster for some of the students. Rocks OS doesn’t include it as part of the installation package (it has to be installed from source), and it depends on OpenEXR, a package developed at ILM that provides a high dynamic-range (HDR) image file format.
OpenEXR does have an rpm package available, but not in the repository Rocks OS uses. I downloaded it from DAG along with openexr-devel, using the version that corresponds to CentOS 4 (Red Hat EL 4), since I believe that is the source for the Rocks OS version I am using. The only gotcha during the compile was having to change the include directories for OpenEXR in pbrt’s Makefile from /usr/local/include to /usr/include, and the lib directory to lib64, since we’re running 64-bit.
If pbrt works okay on the test node, I’ll install it on the rest of the cluster.