Tue, Nov 22, 2011
Around midnight, fs4.cs, one of our NFS fileservers, failed for reasons that we are still investigating. This stalled all access to its filesystems, which had various bad effects on other machines. Among other things, our mail machine stopped processing email, the IMAP server became very unhappy with life, apps3.cs crashed at some point, and apps0.cs (aka cs.toronto.edu) stopped responding to ssh logins due to being overloaded.
When core staff arrived here this morning, they were able to revive the fileserver (in the end it required deploying new hardware), return fs4's filesystems to life, and get all of the affected machines back in service. We believe that everything should now be back to normal.
There may be minor disruptions to fs4 later today, because we believe that one of the disks it uses is failing and needs to be replaced.
/updates/2011    permanent link
Fri, Oct 07, 2011
Last night our secure webserver, which runs such services as csweb.cs and webmail.cs, suffered a kernel bug and became unresponsive. The machine was rebooted this morning and now seems to be fine. We will be investigating the cause of the incident.
/updates/2011    permanent link
Sat, Oct 01, 2011
Unfortunately, although our backbone router seemed to be working fine after the reboot yesterday, it was in fact heading slowly downhill again. At around 8:40 am this morning, the net 3 port stopped passing traffic. This was more significant than yesterday's outage, and would have stalled external email delivery and access to the webserver(s), among other things. We remotely reset the router at about 9:20 am, which brought it back to proper operation again. However, experience indicated that this was not a permanent solution, so we remotely prepared a replacement board, and when one of the sysadmins was able to get into the office, connections were migrated off the untrustworthy board to the replacement one. This seems to have been successful. Time will tell whether the problem was the board, or whether there is a fault with the chassis and we will have to completely migrate to a new router.
/updates/2011    permanent link
Fri, Sep 30, 2011
Backbone router rebooted today
We had a minor hardware failure on our backbone router today, which interrupted network traffic to our campus secondary DNS server. In order to address this issue, and to insure against the situation worsening over the weekend, we decided to reboot the router and verify that it was working properly. The reboot occurred around 1:10pm today, and lasted approximately 30 seconds.
/updates/2011    permanent link
Thu, Sep 08, 2011
comps4.cs has been returned to service
After the original server for comps4.cs suffered a series of hardware failures that took it completely out of commission, we have rebuilt comps4.cs on new hardware and it is once again in service as a generally accessible compute server. Unfortunately the new hardware does not have as much memory as the old comps4.cs, but it is the best we can do with the hardware that we currently have available.
The new comps4.cs is the same as the current comps0.cs and comps1.cs; it is a Dell 2950 with 16 GB of RAM and dual quad-core Intel E5355 CPUs.
/updates/2011    permanent link
Thu, Sep 01, 2011
comps4.cs is continuing to experience hardware issues and will be unavailable until further notice. We apologize for the inconvenience.
/updates/2011    permanent link
Wed, Aug 31, 2011
comps4.cs is back in service but with a reduced capacity due to a hardware failure.
/updates/2011    permanent link
Tue, Aug 30, 2011
comps4.cs is still unavailable
We are continuing to work on restoring comps4.cs to service. It did not successfully reboot after the power outage. We apologize for the inconvenience.
/updates/2011    permanent link
Matlab 7.12 (2011a) has been installed on CSLab's comps servers and it is now the default version of matlab (/opt/matlab/bin/matlab).
/updates/2011    permanent link
wapps windows application server upgraded
We have eliminated the two older Windows Server machines, wapps2003 and wapps2007, in favour of a single machine on faster hardware, wapps2008, which supplies both Office 2003 and Office 2010. Profiles do not carry over from the older versions of Windows Server to Server 2008, so you may find that you have to re-customize your environment. Domain logins are somewhat different as well; from Linux rdesktop clients you may be required to preface your username with the domain, i.e. cslab\[username]. Some Linux rdesktop clients will require that you delete a file in your .rdesktop directory before starting a new session to wapps2008. Hopefully newer versions of Linux rdesktop clients will address this problem and remove the need for a workaround. Windows RDP clients seem to perform flawlessly, including remembering domain membership details.
/updates/2011    permanent link
systems restored after power outage
All services (except for comps4) have been restored after the planned power outage.
/updates/2011    permanent link
Sun, Aug 28, 2011
apps1.cs suffered a crash or freeze around 10:45pm Saturday night. It has just been remote power-cycled and has returned to service.
/updates/2011    permanent link
Wed, Aug 17, 2011
apps3.cs crashed and has been reset
Another one of our application servers has unfortunately crashed (or locked up) and had to be reset. We apologize for the inconvenience and for the ongoing problems with these machines locking up; unfortunately there are few to no messages logged when this happens and no apparent consistency about the circumstances, so diagnosing this problem is very difficult.
/updates/2011    permanent link
Tue, Aug 16, 2011
apps2 crashed and has been reset
At approximately 5:00 pm today (August 16th, 2011) apps2.cs crashed and had to be manually reset. We are hopeful that the kernel upgrade it performed as it returned to service will address this issue. We apologize for any inconvenience.
/updates/2011    permanent link
Wed, Jul 27, 2011
apps0.cs.toronto.edu rebooted this morning
At about 8:15 am, Wednesday July 27th, apps0 stopped responding and we had to manually reset the machine at about 8:20 am. The server seems stable now but it is unclear why it crashed. We are going to scrutinize the logs to see if we can find an explanation. Sorry for any inconvenience this may have caused.
/updates/2011    permanent link
Thu, Jul 21, 2011
Localized Power Outage: Sandford Fleming Building 4th Floor
At about 12:30 pm today, July 21st, an electrician accidentally cut through an electrical conduit in one of Sandford Fleming's 4th floor maintenance closets. This temporarily cut off power to some of the DCS offices on the 4th floor as well as the power to some of the AIS clusters in SF4301J. As of about 1:05 pm the power has been restored.
/updates/2011    permanent link
Wed, Jul 13, 2011
apps3 crashed and has been reset
At approximately 4:30 today (July 13th, 2011), for reasons that are still unclear, apps3.cs crashed and had to be manually reset. You will probably recall that apps3.cs also crashed on Monday afternoon earlier this week. We are still unsure what is causing this but we are actively investigating. We apologize for any inconvenience.
/updates/2011    permanent link
Mon, Jul 11, 2011
apps3 crashed and has been reset
At approximately 15:30 today (July 11th, 2011), for reasons we are unsure of, apps3.cs crashed and had to be manually reset. We conjecture that a power spike earlier this morning (05:30) might have put the machine into an unstable state, but we are still looking into the cause. We apologize for any inconvenience.
/updates/2011    permanent link
Mon, Jul 04, 2011
Our strong system spam filtering option now works again
During a system upgrade back in May we accidentally made it so that our strong (system) spam filtering option (discussed on our System Spam Filtering page) was significantly less effective. We've now repaired that problem.
Technical details: normally, one of the things that strong system spam filtering does is that it automatically discards messages that PureMessage tags as spam. Our mistake made it stop doing that due to a necessary file being missing; now it's doing that again. Unfortunately, because of how the system is set up this failure was a silent one, with no messages logged about the missing file.
We regret any annoyance that this has caused people.
/updates/2011    permanent link
Mon, Jun 27, 2011
Last night (Sunday night / Monday morning) our VPN server crashed and became unresponsive. We are now in the process of swapping over to our hot-spare VPN server. VPN services should soon be back to normal, and we will be investigating the cause of the crash once services have been restored. We apologize for any inconvenience.
/updates/2011    permanent link
Last night (Sunday Jun 26) we lost one of the disks in the storage array attached to fs4.cs, which serves homedirs and workdisks for the vision group. This failure had a deleterious effect on the fileserver, which then percolated down to various clients that mounted filesystems from that fileserver. One of the sysadmins who happened to be checking email in the evening swapped in a spare for the failed disk, and returned redundancy to the storage array. Now, Monday morning, we have physically replaced the failed disk and are re-syncing from the spare to the replaced original. This may have a minor impact on performance of the storage pool until the operation is complete.
/updates/2011    permanent link
Tue, Jun 07, 2011
We've just completed kernel and security upgrades for all of our linux servers, and everything appears to be back in service and functioning properly. We piggy-backed a server migration of the automatic red-network self-registration system on this downtime, and it too seems to be working properly. Please don't hesitate to contact us or your PoC should you detect anything out of the ordinary that might be related to this downtime.
/updates/2011    permanent link
Wed, May 11, 2011
Red network dhcp server replaced
As part of our (almost finished) campaign to replace all older ubuntu 6.06 machines before the support for them runs out, we've migrated red network dhcp services from an older 6.06 server to a newer 10.04 server. This affects ltsps and red network dhcp clients, and the self-red and auto-red registration systems. Any anomalous behaviour noticed on the red network should be mentioned to your PoC, who will pass it on to us if appropriate. This transition should be invisible to the end-user community.
/updates/2011    permanent link
Thu, May 05, 2011
We observed and were notified by some of our monitoring systems that the network was overloaded. A short investigation revealed that a research machine in a sandbox had been attempting to make at least two hundred thousand simultaneous connections to outside of our networks. This caused sufficient load to hit the limiters on some of the networking gear, and resulted in very poor performance across our networks. We have responded by isolating the machine from the networks outside of that sandbox via the sandbox firewall, and have notified the PoC responsible for the sandbox so that they can either examine the machine or pass on the issue to the researcher responsible for it. Network performance should now be back to normal.
/updates/2011    permanent link
Mon, May 02, 2011
External mail gateway machine replaced
We have replaced our old ubuntu 6.06 incoming mail gateway with a faster ubuntu 10.04 one as part of phasing out ubuntu 6.06 prior to its upcoming end of life, and in line with our desire to generally replace older hardware with newer when reasonable. There was a period of approximately two minutes during which external mailers trying to deliver to us would have been unable to connect, but all mailers should deal with this in the same manner they would deal with any network interruption; i.e. they would queue the mail for a short time and then attempt another delivery, which would succeed. Mail from outside is being delivered properly, is being spam tagged, and everything looks normal.
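For the curious, the queue-and-retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code of any real mailer: the function names, the `max_attempts` limit, and the simulated outage are all assumptions made for the example, and the waiting time between attempts is left out.

```python
from collections import deque


def deliver_with_retry(messages, try_deliver, max_attempts=5):
    """Queue messages and retry failed deliveries, roughly as a sending mailer would.

    try_deliver(message, attempt) is a hypothetical callback that returns True
    on success. Real mailers wait between attempts; that delay is omitted here.
    """
    queue = deque((msg, 1) for msg in messages)
    delivered, failed = [], []
    while queue:
        msg, attempt = queue.popleft()
        if try_deliver(msg, attempt):
            delivered.append(msg)
        elif attempt < max_attempts:
            # Delivery failed: put the message back on the queue for another try.
            queue.append((msg, attempt + 1))
        else:
            failed.append(msg)
    return delivered, failed


def gateway_down_for_first_attempt(message, attempt):
    """Simulated outage (an assumption): only the first attempt fails."""
    return attempt > 1
```

With the simulated outage above, every message fails once, is re-queued, and is delivered on its second attempt, which is what we expect external mailers to have done during the changeover.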
/updates/2011    permanent link
Tue, Apr 19, 2011
colony.cs, the machine that is www.cs.toronto.edu and hosts user home pages and user-managed webservers (among other things), has been upgraded from Ubuntu 8.04 to Ubuntu 10.04. In the process it has changed IP addresses, although this change should be transparent to most people.
People with user-managed webservers should check them, as you may need to take some additional steps to make them fully functional again. If this applies to you, please talk to your Point of Contact for the full details.
In the process of this upgrade, we discovered an unanticipated issue with the ordering of events on system startup. This problem caused everyone with user-managed webservers to get several mail messages about their web servers being unable to start due to scripts not being found. We apologize for this inconvenience; this problem has now been resolved.
/updates/2011    permanent link
Wed, Mar 30, 2011
One of our fileservers, fs4, crashed overnight (we believe around 5:38am at the latest, although symptoms may have been visible earlier). It was returned to service around 9:10am. While fs4 only has (some) AI filesystems, AI people use apps and comps machines and so on, so fs4 being down eventually badly affected many of CSLab's shared machines and made a number of them basically inaccessible until fs4 returned to service.
We apologize for the disruption.
/updates/2011    permanent link
Tue, Mar 29, 2011
apps0 LTSP logins made to work again
This morning we discovered that apps0 had started to suffer from the same problem with LTSP logins as apps1 had last week: when you logged in the initial session start was quite slow, and then you couldn't start most or all new programs. We have successfully fixed this problem; however, in the process we accidentally logged out all current LTSP sessions.
We apologize for the disruption and the inconvenience.
(The problem turned out to be in the system DBus daemon, which was being mostly unresponsive. It turns out that Gnome will log you out if it loses contact with the system DBus daemon, as happens when you restart said daemon.)
/updates/2011    permanent link
Mon, Mar 28, 2011
Late last week, apps1.cs suffered an unidentified failure in some system-wide component of the Gnome desktop, causing LTSP logins (and anyone else trying to run a Gnome session from apps1.cs) to stall and malfunction. Since we don't know what internal Gnome component failed or how to restart it, we had no choice but to reboot apps1.cs in order to get it back to normal.
This has now been done, and apps1.cs is back in service.
(Because we have several apps machines and apps1.cs was still working for some people, especially people logged in remotely, we did not want to reboot it abruptly. Instead we blocked further access to it and got people to end their logins in an orderly way.)
/updates/2011    permanent link
Mon, Mar 14, 2011
comps0.cs crashed and was rebooted
comps0.cs crashed today at approximately 3:00pm. It has been rebooted and is back in service. The cause of the crash is being investigated. Neil
/updates/2011    permanent link
Fri, Mar 11, 2011
University experiencing network connectivity issues [ resolved ]
This is the official word from the University's Network Operations Centre:
11-March 12:54 The University of Toronto's Internet connectivity has been lost. Our technicians are currently investigating the problem.
11-March 13:24 Internet connectivity was restored at 13:15. Our technicians discovered a routing issue on one of our gateway routers which they took action to resolve.
/updates/2011    permanent link
University experiencing network connectivity issues
12:59 pm, March 11th, 2011
Over the last 20 minutes or so the University has been experiencing network connectivity issues. It seems that one of their backbone connections is having an intermittent problem that causes certain remote sites to become inaccessible or extremely slow. We are waiting on more information from the University's Network Operations Group.
/updates/2011    permanent link
Thu, Mar 10, 2011
Brief outage on the wireless network
At approximately 4:00pm today our wireless gateway machine, through which all wireless traffic passes, crashed and had to be rebooted. The machine was back in service at 4:15pm. We are investigating the cause of this incident and will keep an eye on it over the immediate future. In case it interests anyone, the machine had an uptime of over a year.
/updates/2011    permanent link
Mon, Feb 28, 2011
We have rebooted colony.cs, the machine behind www.cs.toronto.edu and some other websites here, due to problems with user-managed webservers; most user-managed web servers stopped working and just logged endless problem reports to their error logs.
/updates/2011    permanent link
Thu, Feb 17, 2011
Apps and comps machines rebooted this morning
For reasons that we don't know yet, all of our apps machines, many of our comps machines, and a number of other machines rebooted this morning at around 7:40am. At this time our best guess is another of the periodic power glitches that Toronto appears to be experiencing lately. A few machines did not automatically recover from this situation and we are restoring them to service now.
In the long run, we are working to obtain and deploy some power conditioning hardware that we hope will protect us from these problems.
/updates/2011    permanent link
Thu, Feb 10, 2011
Feb 10 10:29:27 EST 2011
It appears that the University as a whole is experiencing slow internet connectivity, although this has not yet been reported on the Network Operations Centre's main site: http://www.noc.utoronto.ca/net-ops/events.shtml
We are continuing to monitor the situation.
Thanks
CSLab
/updates/2011    permanent link
Mon, Feb 07, 2011
At around 2:15 pm, we determined that it was necessary to remove one of fs4's redundant data disks since it was causing user-visible stalls. We have replaced this disk with a new one. If your CSLab home directory is hosted on fs4, you may notice a small slowdown in service as the new disk synchronizes its data. We expect this to be finished within about 12-16 hours.
/updates/2011    permanent link
At 8:49am on Monday Feb 7 comps0 rebooted. We believe that Toronto experienced another minor power glitch which luckily resulted in only one of our servers rebooting.
/updates/2011    permanent link
Wed, Feb 02, 2011
10:30 am, February 2nd, 2011
Each morning we apply system upgrades to our various servers. Today's upgrade caused our two Samba servers -- smb/dcssmb -- to restart their processes. This may have caused some users to lose connectivity to their shares. We apologize for the interruption.
/updates/2011    permanent link
Tue, Jan 25, 2011
Last night, at approximately 3:45am, Cogent Communications lost power at one of their downtown core routers. This interrupted network service for the University of Toronto until the router could be sorted out. The internet connectivity outage lasted approximately ten minutes.
/updates/2011    permanent link
It appears that we experienced a small, possibly localized power outage in our machine room last night. It may be that this was simply a brief reduction in voltage, which we know affects some of our more sensitive machines by causing them to reboot. The machines rebooted by the momentary loss of power were apps0, apps1, apps2, comps3, oldapps, oldcomps, the wapps machines, and compsbk1. Only compsbk1 did not return to service after the outage. It is being addressed now.
/updates/2011    permanent link
Mon, Jan 24, 2011
The University as a whole seems to have lost network connectivity to the outside world. The NOC (Network Operations Centre) has not yet posted an informational email regarding this, nor have they updated their website: https://csweb.cs.toronto.edu/sysadmin/niconotes/WIFIspeeds.html
We shall continue to monitor. As we write, partial connectivity has been restored.
Thanks
CSLab Staff
/updates/2011    permanent link
Thu, Jan 20, 2011
apps1.cs rebooted due to process termination issues
Earlier this week apps1.cs began to display an inability to properly terminate gnome-related processes, to the extent that we had over 500 (and growing) defunct processes. The broken gnome processes were also making it impossible for some ltsp users to start up new applications under gnome. In response, we blocked logins and removed apps1.cs from the ltsp chooser window, to prevent more people from attempting to use the machine while at the same time allowing connected users a grace period to finish up their work and disconnect. Having waited a few days now, we've rebooted the machine to clear out the problems. apps1.cs is once again accepting remote connections and ltsp logins.
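As an aside for anyone who wants to check their own machine for the same symptom, here is a minimal sketch of counting defunct (zombie) processes. It assumes text in the format printed by `ps -eo stat,comm`, where a process state beginning with `Z` marks a zombie. The function name and the sample listing are made up for illustration; this is not the tooling we used on apps1.cs.

```python
def count_defunct(ps_output):
    """Count zombie (defunct) processes in `ps -eo stat,comm` style output."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        # The first column is the process state; "Z" (possibly followed by
        # modifiers such as "+") means the process is defunct.
        if fields and fields[0].startswith("Z"):
            count += 1
    return count


# Made-up sample listing, for illustration only.
SAMPLE = """STAT COMMAND
Ss   init
Z    gnome-panel <defunct>
Sl   gnome-terminal
Z+   nautilus <defunct>
"""
```

A count that keeps growing over time, as ours did, suggests that some parent process is no longer reaping its children.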
/updates/2011    permanent link
Wed, Jan 19, 2011
January 19: NFS file-server reset
At about 13:00, Jan 19, fs4 started to report that it was not able to field NFS requests. Further investigation revealed that one of fs4's storage backends had experienced an atypical disk failure such that fs4 was stalling when attempting to read or write to it. This in turn caused fs4 to grind to a halt. Even though we removed the failed disk from fs4's configuration, the machine still remained unresponsive. So at about 14:00, we reluctantly decided that we had to reset the machine to fix the problem. This worked and fs4 was returned to full service at about 15:03. We apologize for any inconvenience this may have caused, especially since fs4's reset will have caused a general slowdown. It is still unclear why this particular disk failure, unlike many others, caused fs4 to stall, but we are actively investigating.
/updates/2011    permanent link
Tue, Jan 11, 2011
Fwd: Internet outage - 11-Jan-2011-10:55 AM
Message from Campus NOC - the campus is having internet issues today.
-------- Original Message --------
Subject: Internet outage - 11-Jan-2011-10:55 AM
Date: Tue, 11 Jan 2011 11:21:56 -0500
From: I+TS Network Operations (EIS)
Reply-To: net-ops@noc.utoronto.ca
To: NETWORK-SERVICE-STATUS@listserv.utoronto.ca
11-Jan-2011 10:55 AM
University of Toronto network is facing issues with internet connectivity, the reason of outage is under investigation.
Regards
*Vivin Thomas*
Network Operations Centre. Enterprise Infrastructure Solutions Group (EIS) Phone 416-978-4621
/updates/2011    permanent link