CSLab System Updates
These updates can also be obtained via an Atom or RSS feed. Alternatively, to be emailed any new updates as they appear, or to cease being emailed such alerts, send email to systemupdates-request@cs.

Tue, Nov 22, 2011

Recent system instabilities

Around midnight, fs4.cs, one of our NFS fileservers, failed for reasons that we are still investigating. This stalled all access to its filesystems, which had various bad effects on other machines. Among other things, our mail machine stopped processing email, the IMAP server became very unhappy with life, apps3.cs crashed at some point, and apps0.cs (aka cs.toronto.edu) stopped responding to ssh logins due to being overloaded.

When core staff arrived here this morning, they were able to revive the fileserver (in the end it required deploying new hardware), return fs4's filesystems to life, and get all of the affected machines back in service. We believe that everything should now be back to normal.

There may be minor disruptions to fs4 later today, because we believe that one of the disks it uses is failing and needs to be replaced.

/updates/2011    permanent link

Fri, Oct 07, 2011

Secure webserver outage

Last night our secure webserver, which runs such services as csweb.cs 
and webmail.cs, suffered a kernel bug and became unresponsive.

The machine was rebooted this morning and now seems to be fine.  We will 
be investigating the cause of the incident.

/updates/2011    permanent link

Sat, Oct 01, 2011

Router's failed board replaced

Unfortunately, although our backbone router seemed to be working fine 
after the reboot yesterday, it was in fact heading slowly downhill 
again.  At around 8:40 am this morning, the net 3 port stopped passing 
traffic.  This was more significant than yesterday's outage, and would 
have stalled external email delivery and access to the webserver(s), 
and caused various other issues.

We remotely reset the router at about 9:20 am, which brought it back to 
proper operation again.  However, experience indicated that it was not a 
permanent solution, so we remotely prepared a replacement board and when 
one of the sysadmins was able to get into the office, connections were 
migrated off the untrustworthy board to the replacement one.

This seems to have been successful.  Time will tell whether the problem 
was the board, or whether there is a fault with the chassis and we will 
have to completely migrate to a new router.



/updates/2011    permanent link

Fri, Sep 30, 2011

Backbone router rebooted today

We had a minor hardware failure on our backbone router today, which 
interrupted network traffic to our campus secondary DNS server.  In 
order to address this issue, and to insure against the situation 
worsening over the weekend, we decided to reboot the router and verify 
that it was working properly.

The reboot occurred around 1:10pm today, and lasted approximately 30 
seconds.

/updates/2011    permanent link

Thu, Sep 08, 2011

comps4.cs has been returned to service

After the original server for comps4.cs suffered a series of hardware failures that took it completely out of commission, we have built a new version of comps4.cs on new hardware and it is once again back in service as a generally accessible compute server. Unfortunately the new hardware does not have as much memory as the old comps4.cs, but it is the best we can do with the hardware that we currently have available.

The new comps4.cs is the same as the current comps0.cs and comps1.cs; it is a Dell 2950 with 16 GB of RAM and dual quad-core Intel E5355 CPUs.

/updates/2011    permanent link

Thu, Sep 01, 2011

comps4.cs unavailable

comps4.cs is continuing to experience hardware issues and will be 
unavailable until further notice.

We apologize for the inconvenience.

/updates/2011    permanent link

Wed, Aug 31, 2011

comps4.cs restored

comps4.cs is back in service but with a reduced capacity due to a 
hardware failure.

/updates/2011    permanent link

Tue, Aug 30, 2011

comps4.cs is still unavailable

We are continuing to work on restoring comps4.cs to service.  It did not 
successfully reboot after the power outage.

We apologize for the inconvenience.

/updates/2011    permanent link

matlab on comps servers

Matlab 7.12 (2011a) has been installed on CSLab's comps servers and it 
is now the default version of matlab (/opt/matlab/bin/matlab).

/updates/2011    permanent link

wapps windows application server upgraded

We have eliminated the two older Windows Server machines, wapps2003 and 
wapps2007, in favour of a single machine on faster hardware which 
supplies both office 2003 and office 2010.

Profiles do not carry over from the older versions of Windows Server to 
Server 2008, so you may find that you have to re-customize your environment.

Domain logins are somewhat different as well; from linux rdesktop 
clients you may be required to preface your username with the domain, 
i.e. cslab\[username].
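
As an illustration only, the domain can also be handed to rdesktop as a separate option rather than typed as a prefix. The sketch below builds such a command line; -d (domain) and -u (username) are standard rdesktop options, while the host name and username shown are placeholders and not our actual server or account names.

```python
import shlex

def rdesktop_command(username, host, domain="cslab"):
    """Build an rdesktop command line that supplies the domain explicitly.

    -d sets the logon domain and -u the username. The host and username
    passed in below are placeholders for illustration.
    """
    return ["rdesktop", "-d", domain, "-u", username, host]

# Hypothetical username and host name.
cmd = rdesktop_command("jdoe", "wapps.example")
print(" ".join(shlex.quote(part) for part in cmd))
```

Whether a given rdesktop version still needs the domain prefix typed at the login screen is something to check locally.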

Some linux rdesktop clients will require that you delete a file in your 
.rdesktop directory before starting a new session to wapps2008. 
Hopefully newer versions of linux rdesktop clients will address this 
problem and negate the need for a workaround.

Windows rdp clients seem to perform flawlessly, including remembering 
domain membership details.





/updates/2011    permanent link

systems restored after power outage

All services (except for comps4) have been restored after the planned 
power outage.

/updates/2011    permanent link

Sun, Aug 28, 2011

Apps1.cs crash and reboot

Apps1.cs suffered a crash or freeze around 10:45pm Saturday night.  It 
has just been remote power-cycled and has returned to service.

/updates/2011    permanent link

Wed, Aug 17, 2011

apps3.cs crashed and has been reset

Another one of our application servers has unfortunately crashed (or locked up) and had to be reset. We apologize for the inconvenience and for the ongoing problems with these machines locking up; unfortunately there are few to no messages logged when this happens and no apparent consistency about the circumstances, so diagnosing this problem is very difficult.

/updates/2011    permanent link

Tue, Aug 16, 2011

apps2 crashed and has been reset

At approximately 5:00 pm today (August 16th, 2011) apps2.cs crashed and
had to be manually reset.

We are hopeful that the kernel upgrade it performed as it returned to
service will address this issue.

We apologize for any inconvenience.

/updates/2011    permanent link

Wed, Jul 27, 2011

apps0.cs.toronto.edu rebooted this morning

At or about 8:15 am, Wednesday July 27th, apps0 stopped responding and
we had to manually reset the machine at about 8:20 am.

The server seems stable now but we are unsure why it crashed. We are
going to scrutinize the logs to see if we can find an explanation. 

Sorry for any inconvenience this may have caused.

/updates/2011    permanent link

Thu, Jul 21, 2011

Localized Power Outage: Sandford Fleming Building 4th Floor

At or about 12:30 pm today, July 21st, an electrician accidentally cut
through an electrical conduit in one of Sandford Fleming's 4th floor
maintenance closets. This temporarily cut off power to some of the DCS
offices on the 4th floor as well as the power to some of the AIS
clusters in SF4301J. As of about 1:05 pm the power has been restored. 

/updates/2011    permanent link

Wed, Jul 13, 2011

apps3 crashed and has been reset

At approximately 4:30 today (July 13th, 2011), for reasons that are
unclear to us, apps3.cs crashed and had to be manually reset. 

You will probably recall that apps3.cs also crashed on Monday afternoon
earlier this week. 

We are still unsure what is causing this but we are actively
investigating.

We apologize for any inconvenience.

/updates/2011    permanent link

Mon, Jul 11, 2011

apps3 crashed and has been reset

At approximately 15:30 today (July 11th, 2011), for reasons we are unsure
of, apps3.cs crashed and had to be manually reset. 

We are conjecturing that a power spike earlier this morning (05:30)
might have put the machine into an unstable state but we are still
looking into the cause.

We apologize for any inconvenience.

/updates/2011    permanent link

Mon, Jul 04, 2011

Our strong system spam filtering option now works again

During a system upgrade back in May we accidentally made it so that our strong (system) spam filtering option (discussed on our System Spam Filtering page) was significantly less effective. We've now repaired that problem.

Technical details: normally, one of the things that strong system spam filtering does is that it automatically discards messages that PureMessage tags as spam. Our mistake made it stop doing that due to a necessary file being missing; now it's doing that again. Unfortunately, because of how the system is set up this failure was a silent one, with no messages logged about the missing file.
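
The kind of silent failure described above can be made visible with an explicit check at startup. This is a minimal sketch and not our actual mail configuration: the logger name and the idea of a single required file path are assumptions for illustration.

```python
import logging
import os

log = logging.getLogger("spamfilter")  # hypothetical logger name

def load_required(path):
    """Read a file the filter depends on, logging loudly if it is absent.

    Returns the file contents, or None when the file is missing, so the
    caller can decide whether to continue with filtering disabled.
    """
    if not os.path.exists(path):
        log.error("required file missing: %s; spam discarding is disabled", path)
        return None
    with open(path) as handle:
        return handle.read()
```

The point of the design is only that a missing dependency produces a log line instead of quietly changing behaviour.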

We regret any annoyance that this has caused people.

/updates/2011    permanent link

Mon, Jun 27, 2011

VPN server crashed

Last night (Sunday night / Monday morning) our VPN server crashed and 
became unresponsive.  We are now in the process of swapping over to our 
hot-spare VPN server.  VPN services should soon be back to normal, and 
we will be investigating the cause of the crash once services have been 
restored.

We apologize for any inconvenience.

/updates/2011    permanent link

Disk failure on fs4.cs

Last night (Sunday Jun 26) we lost one of the disks in the storage array 
attached to fs4.cs, which serves homedirs and workdisks for the vision 
group.   This failure had a deleterious effect on the fileserver, which 
then percolated down to various clients that mounted filesystems from 
that fileserver.

One of the sysadmins who happened to be checking email in the evening 
swapped in a spare for the failed disk, and returned redundancy to the 
storage array.  Now, Monday morning, we have physically replaced the 
failed disk and are re-syncing from the spare to the replaced original. 
This may have a minor impact on performance of the storage pool until 
the operation is complete.

/updates/2011    permanent link

Tue, Jun 07, 2011

Linux upgrades completed

We've just completed kernel and security upgrades for all of our linux 
servers, and everything appears to be back in service and functioning 
properly.

We piggy-backed a server migration of the automatic red-network 
self-registration system on this downtime, and it too seems to be 
working properly.

Please don't hesitate to contact us or your PoC should you detect 
anything out of the ordinary that might be related to this downtime.

/updates/2011    permanent link

Wed, May 11, 2011

Red network dhcp server replaced

As part of our (almost finished) campaign to replace all older ubuntu 
6.06 machines before the support for them runs out, we've migrated red 
network dhcp services from an older 6.06 server to a newer 10.04 server. 
This affects ltsps and red network dhcp clients, and the self-red and 
auto-red registration systems.

Any anomalous behaviour noticed on the red network should be mentioned 
to your PoC, who will pass it on to us if appropriate.  This transition 
should be invisible to the end-user community.

/updates/2011    permanent link

Thu, May 05, 2011

Network issue resolved

We observed, and were notified by some of our monitoring systems, that the 
network was overloaded.

A short investigation revealed that a research machine in a sandbox had 
been attempting to make at least two hundred thousand simultaneous 
connections to outside of our networks.  This caused sufficient load to 
hit the limiters on some of the networking gear, and resulted in very 
poor performance across our networks.
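
For illustration, a check of this kind can be sketched as a count of open connections per source address. This is a minimal sketch under stated assumptions, not a description of our monitoring: the input format (pairs of source and destination addresses, as might be parsed from a firewall's connection table), the addresses, and the threshold are all hypothetical.

```python
from collections import Counter

def noisy_sources(connections, limit):
    """Return {source: count} for sources with more than `limit` connections.

    `connections` is an iterable of (source_ip, dest_ip) pairs; the format
    is assumed here and would depend on the networking gear in use.
    """
    counts = Counter(src for src, _dst in connections)
    return {src: n for src, n in counts.items() if n > limit}

# Hypothetical sample: one host opening many connections, one behaving normally.
sample = [("10.0.0.5", f"192.0.2.{i % 250}") for i in range(1000)]
sample += [("10.0.0.7", "192.0.2.1")] * 3
print(noisy_sources(sample, limit=100))
```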

We have responded by isolating the machine from the networks outside of 
that sandbox via the sandbox firewall, and have notified the PoC 
responsible for the sandbox so that they can either examine the machine 
or pass on the issue to the researcher responsible for it.

Network performance should now be back to normal.

/updates/2011    permanent link

Mon, May 02, 2011

External mail gateway machine replaced

We have replaced our old ubuntu 6.06 incoming mail gateway with a faster 
ubuntu 10.04 one as part of phasing out ubuntu 6.06 prior to its 
upcoming end of life, and in line with our desire to generally replace 
older hardware with newer when reasonable.

There was a period of approximately two minutes during which mail would have 
been queued by external mailers trying to deliver, but all mailers 
should deal with this in the same manner they would deal with any 
network interruption; i.e. they would queue the mail for a short time 
and then attempt another delivery, which would succeed.
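
The queue-and-retry behaviour described above can be sketched as follows. This is an illustration only, not the code of any particular mailer: the function names are hypothetical, and real mailers typically wait minutes between attempts rather than the short delay used here.

```python
import time

def deliver_with_retry(send, attempts=5, delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out.

    send is any callable that raises ConnectionError while the gateway
    is unreachable. Returns the number of attempts that were used.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise  # give up after the final attempt
            sleep(delay)  # the message stays queued until the next try

# Hypothetical gateway that is unreachable for the first two attempts.
state = {"calls": 0}

def flaky_send():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise ConnectionError("gateway unreachable")

print(deliver_with_retry(flaky_send, sleep=lambda seconds: None))
```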

Mail from outside is being delivered properly, is being spam tagged, and 
everything looks normal.

/updates/2011    permanent link

Tue, Apr 19, 2011

colony.cs has been upgraded

colony.cs, the machine that is www.cs.toronto.edu and hosts user home pages and user-managed webservers (among other things), has been upgraded from Ubuntu 8.04 to Ubuntu 10.04. In the process it has changed IP addresses, although this change should be transparent to most people.

People with user-managed webservers should check them, as you may need to take some additional steps to make them fully functional again. If this applies to you, please talk to your Point of Contact for the full details.

In the process of this upgrade, we discovered an unanticipated issue with the ordering of events on system startup. This problem caused everyone with user-managed webservers to get several mail messages about their web servers being unable to start due to scripts not being found. We apologize for this inconvenience; this problem has now been resolved.

/updates/2011    permanent link

Wed, Mar 30, 2011

fs4.cs returned to service

One of our fileservers crashed overnight (we believe around 5:38am at the latest, although symptoms may have been visible earlier). It was returned to service around 9:10am. While fs4 only has (some) AI filesystems, AI people use apps and comps machines and so on, so fs4 being down eventually badly affected many of CSLab's shared machines and made a number of them basically inaccessible until fs4 returned to service.

We apologize for the disruption.

/updates/2011    permanent link

Tue, Mar 29, 2011

apps0 LTSP logins made to work again

This morning we discovered that apps0 had started to suffer from the same problem with LTSP logins as apps1 had last week: when you logged in the initial session start was quite slow, and then you couldn't start most or all new programs. We have successfully fixed this problem; however, in the process we accidentally logged out all current LTSP sessions.

We apologize for the disruption and the inconvenience.

(The problem turned out to be in the system DBus daemon, which was being mostly unresponsive. It turns out that Gnome will log you out if it loses contact with the system DBus daemon, as happens when you restart said daemon.)

/updates/2011    permanent link

Mon, Mar 28, 2011

apps1.cs is back in service

Late last week, apps1.cs suffered an unidentified failure in some system-wide component of the Gnome desktop, causing LTSP logins (and anyone else trying to run a Gnome session from apps1.cs) to stall and malfunction. Since we don't know what internal Gnome component failed or how to restart it, we had no choice but to reboot apps1.cs in order to get it back to normal.

This has now been done, and apps1.cs is back in service.

(Because we have several apps machines and apps1.cs was still working for some people, especially people logged in remotely, we did not want to reboot it abruptly. Instead we blocked further access to it and got people to end their logins in an orderly way.)

/updates/2011    permanent link

Mon, Mar 14, 2011

comps0.cs crashed and was rebooted

comps0.cs crashed today at approximately 3:00pm.  It has been rebooted 
and is back in service.  The cause of the crash is being investigated.

	Neil

/updates/2011    permanent link

Fri, Mar 11, 2011

University experiencing network connectivity issues [ resolved ]

This is the official word from the University's Network Operations
Centre:

11-March 12:54 The University of Toronto's Internet connectivity has
been lost. Our technicians are currently investigating the problem.

11-March 13:24 Internet connectivity was restored at 13:15. Our
technicians discovered a routing issue on one of our gateway routers
which they took action to resolve.

/updates/2011    permanent link

University experiencing network connectivity issues

12:59 pm, March 11th, 2011
 
Over the last 20 minutes or so the University has been experiencing
network connectivity issues. It seems that one of their backbone
connections is having an intermittent problem such that it causes
certain remote sites to become inaccessible or extremely slow. We are
waiting on more information from the University's Network Operations
Group.

/updates/2011    permanent link

Thu, Mar 10, 2011

Brief outage on the wireless network

At approximately 4:00pm today our wireless gateway machine, through 
which all wireless traffic passes, crashed and had to be rebooted.  The 
machine was back in service at 4:15pm.

We are investigating the cause of this incident and will keep an eye on 
it over the immediate future.

In case it interests anyone, the machine had an uptime of over a year.

/updates/2011    permanent link

Mon, Feb 28, 2011

colony.cs rebooted

We have rebooted colony.cs, the machine behind www.cs.toronto.edu and some other websites here, due to problems with user-managed webservers; most user-managed web servers stopped working and just logged endless problem reports to their error logs.

/updates/2011    permanent link

Thu, Feb 17, 2011

Apps and comps machines rebooted this morning

For reasons that we don't know yet, all of our apps machines, many of our comps machines, and a number of other machines rebooted this morning at around 7:40am. At this time our best guess is another of the periodic power glitches that Toronto appears to be experiencing lately. A few machines did not automatically recover from this situation and we are restoring them to service now.

In the long run, we are working to obtain and deploy some power conditioning hardware that we hope will protect us from these problems.

/updates/2011    permanent link

Thu, Feb 10, 2011

Slow internet connectivity

Feb 10 10:29:27 EST 2011

It appears that the University as a whole is experiencing slow internet
connectivity, although this has not yet been reported on the Network
Operations Centre's main site:

 http://www.noc.utoronto.ca/net-ops/events.shtml

We are continuing to monitor the situation.

Thanks

CSLab

/updates/2011    permanent link

Mon, Feb 07, 2011

fs4 disk failure

At about 2:15 pm, we determined that it was necessary to
remove one of fs4's redundant data disks since it was causing
user-visible stalls. We have replaced this disk with a new one. 

If your CSLab home directory is hosted on fs4, you may notice a slight
slowdown in service as the new disk synchronizes its data. We expect
this to be finished within about 12-16 hours. 

/updates/2011    permanent link

comps0 rebooted

At 8:49am on Monday Feb 7 comps0 rebooted.

We believe that Toronto experienced another minor power glitch which 
luckily resulted in only one of our servers rebooting.

/updates/2011    permanent link

Wed, Feb 02, 2011

smb/dcssmb reset themselves

10:30 am, February 2nd, 2011

Each morning we apply system upgrades to our various servers. Today's
upgrade caused our two Samba servers -- smb/dcssmb -- to restart their
processes. This may have caused some users to lose connectivity to their
shares. We apologize for the interruption.

/updates/2011    permanent link

Tue, Jan 25, 2011

Network outage

Last night, at approximately 3:45am, Cogent Communications lost power at 
one of their downtown core routers.  This interrupted network service 
for the University of Toronto until the router could be sorted out.

The internet connectivity outage lasted approximately ten minutes.

/updates/2011    permanent link

Power outage last night

It appears that we experienced a small, possibly localized power outage 
in our machine room last night.  It may be that this was simply a 
reduction in voltage for a brief moment, which we know affects some of 
our more sensitive machines by causing them to reboot.

Rebooted by momentary loss of power were apps0, apps1, apps2, comps3, 
oldapps, oldcomps, the wapps machines, and compsbk1.

Only compsbk1 did not return to service after the outage.  It is being 
addressed now.

/updates/2011    permanent link

Mon, Jan 24, 2011

Network connectivity issues

The University as a whole seems to have lost network connectivity to the
outside world. The NOC (Network Operations Centre) has not yet posted an
informational email regarding this nor have they updated their website:

 https://csweb.cs.toronto.edu/sysadmin/niconotes/WIFIspeeds.html

We shall continue to monitor.

As we write, partial connectivity has been restored.

Thanks

CSLab Staff

/updates/2011    permanent link

Thu, Jan 20, 2011

apps1.cs rebooted due to process termination issues

Earlier this week apps1.cs began to display an inability to properly 
terminate gnome-related processes, to the extent that we had over 500 
(and growing) defunct processes.  The broken gnome processes were also 
making it impossible for some ltsp users to start up new applications 
under gnome.
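
As an illustration of how a count like this can be obtained, the sketch below tallies defunct (zombie) processes from the text output of `ps -eo stat,comm`, where a state beginning with Z marks a defunct process on Linux. The sample output and process names are hypothetical, and this is not necessarily how we arrived at the figure above.

```python
def count_defunct(ps_output):
    """Count zombie processes in the text output of `ps -eo stat,comm`.

    Each data line starts with the process state; a state string that
    begins with 'Z' marks a defunct process.
    """
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 1)
        if fields and fields[0].startswith("Z"):
            count += 1
    return count

# Hypothetical sample output for illustration.
sample = """STAT COMMAND
Ss   init
Z    gnome-panel
S    sshd
Z+   gconfd-2
"""
print(count_defunct(sample))
```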

As a response, we blocked logins and removed apps1.cs from the ltsp 
chooser window, to prevent more people from attempting to use the 
machine while at the same time allowing connected users a grace period 
to finish up their work and disconnect.  Having waited a few days now, 
we've rebooted the machine to clear out the problems.

apps1.cs is once again accepting remote connections and ltsp logins.

/updates/2011    permanent link

Wed, Jan 19, 2011

January 19: NFS file-server reset

At about 13:00, Jan 19, fs4 started to report that it was not able to
field NFS requests.  Further investigation revealed that one of fs4's
storage backends had experienced an atypical disk failure such that fs4
was stalling when attempting to read or write to it. This in turn caused
fs4 to grind to a halt. 

Even though we removed the failed disk from fs4's configuration, the
machine still remained unresponsive. So at about 14:00, we reluctantly
decided that we had to reset the machine to fix the problem.

This worked and fs4 was returned to full service at about 15:03. 

We apologize for any inconvenience this may have caused, especially
since fs4's reset will have caused a general slowdown.

We are still unsure why this particular disk failure, unlike many
others, caused fs4 to stall but we are actively investigating.

/updates/2011    permanent link

Tue, Jan 11, 2011

Fwd: Internet outage - 11-Jan-2011-10:55 AM

Message from Campus NOC - the campus is having internet issues today.


-------- Original Message --------
Subject: 	Internet outage - 11-Jan-2011-10:55 AM
Date: 	Tue, 11 Jan 2011 11:21:56 -0500
From: 	I+TS Network Operations (EIS) 
Reply-To: 	net-ops@noc.utoronto.ca
To: 	NETWORK-SERVICE-STATUS@listserv.utoronto.ca



11-Jan-2011 10:55 AM University of Toronto network is facing issues with
internet connectivity, the reason of outage is under investigation.

Regards

*Vivin Thomas*

Network Operations Centre.

Enterprise Infrastructure Solutions Group (EIS)

Phone 416-978-4621

/updates/2011    permanent link

