Gladserv suffered a prolonged outage on Friday afternoon affecting our Manchester servers. The root cause was a hardware failure on a server that had been running for over 470 days. Some web, email and database services were affected for up to four hours. Normally under these circumstances we would fail over to another datacentre in London or Edinburgh.
However, attempts to move services to the Edinburgh datacentre were hampered by high load on those servers. New servers had been installed in London and Edinburgh but were not yet in service. Earlier in the afternoon we had installed a new server in a separate Edinburgh facility to take some of the load, but this server had only just been plugged in when the Manchester outage happened and so was not ready.
To recover from the outage, an engineer was despatched to the Manchester datacentre to replace the faulty hardware while we worked on preparing the new server in Edinburgh. The Manchester systems were brought back online just as the Edinburgh server came into service.
By early Friday evening all systems were online, including new servers in London and Edinburgh, and an upgraded Manchester server. We continued to work through the weekend to ensure we can recover much more quickly from a similar outage in future. On Sunday evening we performed a fire drill for the most critical systems that failed on Friday and are pleased to say that all went to plan.
Thank you to our affected customers for your patience throughout.
Brett Sheffield
Managing Director