Thu, May 17, 2012
apps0.cs downtime this morning
This morning apps0.cs suffered a drive controller failure, effectively crippling the machine to the extent that anything requiring disk access was failing. We were unable to resuscitate the drive controller, but fortunately we did have a spare server of similar specifications available. We swapped apps0's disks to the spare machine, and after some massaging apps0 is once again up and running.
/updates/2012    permanent link
Fri, May 11, 2012
Replacement comps4, updated webmail, bookable compute server status
A number of system improvements have been made in the past term that are worth mentioning here:
/updates/2012    permanent link
Thu, Mar 22, 2012
Matlab 7.13 (2011b) is now the default version of matlab (/opt/matlab/bin/matlab) on CSLab's comps servers.
/updates/2012    permanent link
Sandford Fleming, which houses our primary machine room, suffered a building-wide power outage today. This took down virtually every piece of infrastructure that we have. To make a long story short, as of now (4:10 pm) almost everything should be back and working properly again, although some services such as email may be slow while everything catches up. We're still on track for our 6:00 pm downtime tonight; we did not want to further delay recovering the systems by attempting to piggyback the downtime tasks on top of the power outage (and we were not ready for some of them in any event).
/updates/2012    permanent link
Mon, Mar 19, 2012
Another failure on the same fileserver
In keeping with the theme of 'It never rains but it pours', we have just had a second, unrelated disk failure on a different back-end machine serving the same fileserver that suffered yesterday's failure. At least we are physically present this time, so we are currently going through the replacement procedure for the second failed disk. We will have to reboot the fileserver, and all NFS clients are expected to be slow and unresponsive until we can get the fileserver back on an even keel. The outage is expected to last about 15 minutes.
/updates/2012    permanent link
We've replaced the failed component in the fileserver back-end, and all NFS clients appear to be restored to normal functionality. Should you notice any lingering issues, please report them to your PoC or directly to software@cs, depending on the severity. Thank you, and we regret the inconvenience this event may have caused.
CSLab Staff
/updates/2012    permanent link
Sun, Mar 18, 2012
Cascade problems from failed fileserver
We appear to have had a hardware failure on a fileserver, which has left the NFS filesystems exported from that fileserver unresponsive. This in turn has cascaded to problems with NFS clients of that fileserver, as in some cases an NFS client will tie itself in knots attempting to access the now unresponsive filesystems. We will not be able to address most of these problems until we are physically on premises Monday morning. Even then, it may take some time to work through the chain of problems and get everything back to normal. Should you find a particular server responding poorly or not at all, please try a similar machine; for example, if apps0 is unresponsive, you may find better results with apps2 or apps3. Some clients will have weathered the storm better than others.
/updates/2012    permanent link
Fri, Jan 20, 2012
Over the last little while (two weeks or so), our wireless router has been experiencing intermittent lockups. Today at approximately 12:55 pm the router failed again and we decided to replace it with a hot spare. The work took about 20 minutes, and unfortunately the CSLab wireless network would have been very unstable during this time. Our testing indicates that the new hardware is running as expected and we do not expect to see additional interruptions, although we shall monitor it closely over the coming days. We are sorry for the inconvenience that this will have caused.
Thanks,
CSLab Staff
/updates/2012    permanent link