Cooling Systems Incident at Data Centre

Incident Report

Introduction

On Friday, November 16, 2018, controlled emergency shutdown procedures were executed when temperatures at the Pyrmont Data Centre exceeded acceptable thresholds due to a cooling system failure.

No customer data was lost and all hardware was protected from thermal failure due to this timely response. Once the cooling system was restored, we were able to bring all customer services back online.

We have received a post-incident review from the data centre, including remedial action and confirmation of upgrades and testing, to ensure that a repeat of the same or a similar incident is extremely unlikely.

Summary

At 12:32pm, our monitoring systems indicated that temperatures in the data centre had started to increase. System administrators conducted an immediate audit of all servers in the facility to determine whether this was isolated to some areas or was a facility-wide event. It quickly became evident that this was a facility-wide event.

At 12:44pm we received confirmation that the facility had seamlessly moved to UPS power after an area-wide Ausgrid power outage at 12:22pm. However, an unexpected cooling issue then began, which the facility was actively working to address.

As temperatures continued to rise, the decision was made at 1:02pm to execute emergency shutdown procedures to ensure data integrity across our platforms. This involved a graceful shutdown of all services.

The situation was an emergency, and so it was not possible to begin the process of alerting customers.

At 1:31pm engineers on-site noted that the cooling systems had resumed normal operation.

At 1:42pm our monitoring systems indicated that temperatures in the data centre facility were approaching normal levels. Engineers monitored the situation closely to ensure thermal stability prior to powering servers on.

At 1:45pm engineers started restoring power to servers incrementally, beginning the process of restoring all dedicated servers and associated services.

The majority of dedicated servers and VPSs were restored by 3:30pm. All remaining services, i.e. reseller accounts, standard single web hosting accounts and associated services, were restored by 5:56pm.
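
For reference, the intervals between the times reported above can be worked out with a short script. This is an illustrative sketch only: the event names are labels chosen here for convenience, not terms from the data centre's review, and the two restoration times are "restored by" times, so the intervals ending at them are upper bounds.

```python
from datetime import datetime

# Times are taken from the summary above (Friday November 16, 2018, local time).
# The dictionary keys are hypothetical labels chosen for this sketch.
TIMELINE = {
    "power_outage": datetime(2018, 11, 16, 12, 22),        # 12:22pm Ausgrid outage
    "temperature_alert": datetime(2018, 11, 16, 12, 32),   # 12:32pm monitoring alert
    "shutdown_decision": datetime(2018, 11, 16, 13, 2),    # 1:02pm emergency shutdown
    "cooling_resumed": datetime(2018, 11, 16, 13, 31),     # 1:31pm cooling back to normal
    "power_on_started": datetime(2018, 11, 16, 13, 45),    # 1:45pm servers powered on
    "majority_restored": datetime(2018, 11, 16, 15, 30),   # by 3:30pm
    "all_restored": datetime(2018, 11, 16, 17, 56),        # by 5:56pm
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named timeline events."""
    delta = TIMELINE[end] - TIMELINE[start]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("power_outage", "temperature_alert"),
        ("temperature_alert", "shutdown_decision"),
        ("shutdown_decision", "power_on_started"),
        ("shutdown_decision", "majority_restored"),
        ("shutdown_decision", "all_restored"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {minutes_between(start, end)} min")
```

On these figures, the shutdown decision came 30 minutes after the first temperature alert, and all services were back online within 4 hours 54 minutes of that decision.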

Root Cause

The root cause of the incident was an extended cooling system interruption at the data centre.

Corrective and Preventative Measures

Independent engineers determined that remedial action was needed: replacing some components in the cooling system and installing an override facility on the auxiliary emergency heat extraction system. The faulty components have been replaced and a number of full failover tests have been successfully completed without incident.