Alan discovered that we can deal with a stale NFS file handle on zeus by using the following command:
umount -l directory
This performs a lazy unmount: the filesystem is detached from the hierarchy immediately, and the references to it are cleaned up once it is no longer busy.
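A stale handle usually shows up as an ESTALE error from stat() or open() on the mount point. Purely as a sketch of automating the same fix (the mount point name is a placeholder, and umount still has to run as root):

import errno
import os
import subprocess

MOUNT_POINT = "/mnt/zeus"   # placeholder; substitute the real NFS mount point

def recover_if_stale(path):
    # A stale NFS file handle surfaces as an ESTALE error from stat().
    try:
        os.stat(path)
    except OSError as err:
        if err.errno == errno.ESTALE:
            # Same fix as above: lazy unmount, after which the share can be remounted.
            subprocess.call(["umount", "-l", path])
            return True
    return False

recover_if_stale(MOUNT_POINT)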
Computers running Rocks software that receive updates via yum or another method have a unique issue when it comes to kernel upgrades. Part of the Rocks system involves changing the grub.conf file at boot time, depending on how the system was shut down: if the computer was shut down properly, grub-orig.conf is copied over grub.conf; if not, rocks.conf is used for the startup.
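The actual swap is handled by Rocks' own boot scripts; purely to illustrate the behavior described above (the /boot/grub path is an assumption on my part), it amounts to something like:

import shutil

GRUB_DIR = "/boot/grub"   # assumed location of the grub configuration files

def select_grub_conf(clean_shutdown):
    # Clean shutdown: put the normal boot configuration back.
    # Unclean shutdown: the Rocks configuration is used for the next startup.
    source = "grub-orig.conf" if clean_shutdown else "rocks.conf"
    shutil.copy("%s/%s" % (GRUB_DIR, source), "%s/grub.conf" % GRUB_DIR)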
The rpm package “file” conflicted with the dependencies for httpd on all the nodes except neuron-0-24, which had recently been re-installed. Running yum clean metadata on the other nodes let them update correctly.
In yum-fastestmirror-1.1.16-13.el5.centos, I discovered a bug involving the exclude option in fastestmirror.conf. If you exclude two mirrors that are adjacent in the mirror array that fastestmirror.py searches for excluded entries, the second one will not be removed from the list of possible mirrors.
This occurs because the for loop that iterates through the array skips the value immediately after an entry is removed. I was able to fix it by changing line 195 from:
for mirror in repomirrors[str(repo)]:
to
for mirror in repomirrors[str(repo)][:]:
Now the loop iterates over a copy of the array, so entries can be deleted from the original array without affecting the next value the loop sees.
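The skip is easy to reproduce with a plain list (the names below are made up for the demonstration, not taken from fastestmirror.py):

mirrors = ["mirror-a", "mirror-b", "mirror-c", "mirror-d"]
excludes = ["mirror-b", "mirror-c"]              # two adjacent excluded mirrors

# Buggy version: removing an entry shifts the list under the iterator,
# so "mirror-c" is never examined and survives the exclusion.
for mirror in mirrors:
    if mirror in excludes:
        mirrors.remove(mirror)
print(mirrors)   # ['mirror-a', 'mirror-c', 'mirror-d']

# Fixed version: iterate over a copy, delete from the original.
mirrors = ["mirror-a", "mirror-b", "mirror-c", "mirror-d"]
for mirror in mirrors[:]:
    if mirror in excludes:
        mirrors.remove(mirror)
print(mirrors)   # ['mirror-a', 'mirror-d']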
It appears that this bug has been resolved in yum-utils-1.1.18 and higher, which uses the filter function to run through each mirror and test it against the exclude list. I’m learning all sorts of neat Python tricks today!
It’s been a while since I’ve posted, and we’ve managed to clear a few things off the task list, so I’m going to talk a little bit about them here. The cluster update was a success, but I found a side effect of having the path include a reference to itself (to allow cluster-specific binaries to be available). When adding packages or changing the default partitioning scheme for the cluster nodes, things went wrong if the root environment was reached via su from a user who had this path setting: the distribution would build, but the install process would fail because key files weren’t copied into the hierarchy. Specifically, updates.img and stage2.img would be missing.
The Active Directory move was accomplished by using several VMware instances to test out the various stages. There is still some work to be done with the security certificates, and then the old domain structure can be removed. The CS disk space move was accomplished without incident.
We’ve been experiencing an interesting problem on our cluster nodes which causes them to freeze up. It appears to be related to the way the Linux kernel in CentOS deals with memory allocation requests. The issue is caused by the swap partition on a machine filling completely, which freezes the system: any attempt to start a new process hangs, waiting for swap space to become available (which it never does). There are several ways of trying to deal with this. The first is to rely on the OOM killer, the kernel mechanism that detects when the memory limit is about to be reached and kills a process it decides can be sacrificed for the greater good. This is a good description of the memory issues and how to test for them.
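As a quick check before a node locks up, swap usage can be read straight out of /proc/meminfo (the field names below are the standard kernel ones; the 90% threshold is an arbitrary choice of mine):

# Warn when a node's swap is close to being exhausted.
def swap_usage():
    values = {}
    for line in open("/proc/meminfo"):
        key, rest = line.split(":", 1)
        values[key] = int(rest.split()[0])   # sizes are reported in kB
    if values["SwapTotal"] == 0:
        return 0.0
    return 1.0 - float(values["SwapFree"]) / values["SwapTotal"]

if swap_usage() > 0.90:
    print("WARNING: swap is nearly full; new allocations may hang this node")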
The cluster downtime was avoided, thanks to some helpful advice from Dell. Our cluster uses a pair of PowerVault 220S enclosures that are configured in a RAID 5 array. When two of the drives in each enclosure went into predictive-failure mode, we needed to replace them to ensure the integrity of our data. Since we’re running Rocks for the cluster OS, the Dell OpenManage tools that would allow us to do the hot swap while the cluster was running weren’t available.
I didn’t want to install the OpenManage software since it seemed to have a large number of modules and really looked like it might be work (which I try to avoid). A Dell rep I talked to recommended DELLmgr instead. He told me to ignore the other rpms and install only Dellmgr-5.25-0.i386.rpm. It installs one file, dellmgr.bin, which talks to the PERC controller card and gives you an interface very similar to the one in the card’s BIOS, no restart required. I was able to fail the faulty drives and do the rebuild without having to alter the cluster’s running state at all.
It’s a shame that Dell no longer supports it and hasn’t released a version for the new controller cards.
We’re trying out PBRT (Physically Based Rendering) on our cluster for some of the students. Rocks OS doesn’t include it as part of the installation package (it has to be installed from source), and it depends on OpenEXR, a package developed at ILM that provides a high dynamic-range (HDR) image file format.
OpenEXR does have an rpm package available, but not in the repository Rocks OS uses. I downloaded it from DAG along with openexr-devel, using the version that corresponds to CentOS 4 (Red Hat EL 4), since I believe that is the source for the Rocks OS version I am using. The only gotcha during the compile was having to change the include directories for OpenEXR in pbrt’s Makefile from /usr/local/include to /usr/include, and the lib directory to lib64, since we’re running 64-bit.
If pbrt works okay on the test node, I’ll install it on the rest of the cluster.