Report Explaining Recent Outages

From Rob Stanicic

All-

I wanted to give you a report explaining the system outages we recently experienced.

On Sunday, January 9, at 10:00 AM, a brief power outage at District shut down our District data room.

The resulting systems outage lasted approximately 2 hours while we recovered all systems.

We have systems in place to help prevent such outages.

These systems include UPS battery supply, a power generator and automated notification when UPS battery power is engaged.

Upon investigation we discovered that at the time of the outage the UPS battery system was set to bypass mode.

This means that the batteries were offline and power was being supplied directly to our systems, with no battery backup behind it.

The bypass mode was triggered by a brief UPS system capacity issue.

This particular type of failover did not engage our automated notification system.

We have since corrected both the immediate capacity issue and the automated notification for this type of outage.
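For readers interested in the notification gap, here is a minimal sketch of the idea behind the fix: alert on bypass mode as well as on battery power. It is an illustration only, not our actual monitoring configuration. The state names, the `should_notify` and `check_ups` functions, and the `send_alert` callback are all hypothetical.

```python
from enum import Enum


class UpsState(Enum):
    """Hypothetical UPS output states, named for illustration only."""
    ONLINE = "online"          # normal operation, batteries available
    ON_BATTERY = "on_battery"  # load is running from the batteries
    BYPASS = "bypass"          # batteries offline, power fed straight through


# Assumption: any state other than ONLINE should page someone.
# Leaving BYPASS out of this set reproduces the gap described above.
ALERT_STATES = {UpsState.ON_BATTERY, UpsState.BYPASS}


def should_notify(state: UpsState) -> bool:
    """Return True when the given UPS state warrants a notification."""
    return state in ALERT_STATES


def check_ups(state: UpsState, send_alert) -> bool:
    """Call send_alert with a message if the state needs attention.

    send_alert is any callable that accepts one string, for example a
    function that sends an email or a page. Returns True if it was called.
    """
    if should_notify(state):
        send_alert(f"UPS state is '{state.value}': systems may be unprotected")
        return True
    return False
```

In this sketch, a monitoring job would call `check_ups` with the state it reads from the UPS on each polling cycle, so a switch to bypass raises the same kind of alert as a switch to battery.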

We continued to experience some system performance issues as a result of the outage, primarily affecting our website services.

These were corrected on Monday, January 10.

On Tuesday, January 11, at 11:00 AM, we experienced an issue with our virtual server farm.

The virtual server farm supports services such as the College’s website and email system.

Banner was not affected as it resides on physical servers.

Virtual services were running but with irregular performance.

Performance degraded to a point where we needed to reboot all virtual services.

The resulting systems outage, during which our technicians powered down services and brought them back online one by one, lasted between 1 and 4 hours.

Prior to the issue our technicians were performing routine maintenance duties.

This activity triggered an unexpected corruption in the virtual server system.

Upon investigation we discovered a bug in the latest version of the vSphere product (Version 4.0).

vSphere is the underlying platform for the virtual farm and is designed to support routine systems maintenance.

It is a widely used product and is considered a standard.

We are in contact with VMware to identify a patch to correct the bug.

In the meantime we have changed our procedures to perform systems maintenance only during our regular after-hours window.

All systems are now performing well and we expect continued delivery of reliable services to our students, staff and community.

Thank you for your patience.

Rob

South

Posted in SanJac ITS News