Team Data Services

Sterling outage 2

The Sterling main brand site was unavailable for around 1 hour on the evening of Friday 18th, starting at around 7:45PM. The cause was an issue at the external hosting provider for one of the site's services. It was unrelated to the previous outage.

We were immediately notified of the outage by the now correctly configured monitoring service, and we began investigating straight away. We found that one of the site's dependent microservices was unreachable, which was causing requests to the main brand site to time out. The hosting provider, Heroku, was very slow in posting a status incident, so it was not clear whether the issue lay with the infrastructure or with the service itself. Our own investigation determined that it was most likely an issue with the provider. We identified an unaffected node in our architecture and moved the microservice to it in order to bring the main site back up. Eventually a status update was posted (https://status.heroku.com/incidents/2584), which confirmed our conclusions. As we had already identified and deployed the workaround, we thankfully didn’t need to wait for the provider to restore the service.

To mitigate future issues, we will make the main brand site handle the microservice being unavailable more gracefully. We are also considering building small, stripped-down versions of the brand sites with callback and quoting capability. These would be deployed on alternative platforms so that, if there is an interruption to service on the main brand sites, traffic can immediately be redirected to these instances on a temporary basis.
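
As an illustration of what "more gracefully" could look like, the sketch below wraps a call to a dependent microservice in a short timeout and returns placeholder content if the call fails, so the page can still render. It is a minimal sketch and not the actual implementation: the function names, the 2-second timeout and the fallback message are all assumptions made for the example.

```python
import http.client
import json
import urllib.request
from typing import Any, Callable, Dict

# Hypothetical placeholder shown when the microservice cannot be reached.
FALLBACK_CONTENT: Dict[str, Any] = {
    "available": False,
    "message": "This feature is temporarily unavailable. Please request a callback.",
}


def fetch_json(url: str, timeout_seconds: float = 2.0) -> Dict[str, Any]:
    """Fetch JSON from the microservice, giving up after a short timeout."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


def load_content(
    url: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
) -> Dict[str, Any]:
    """Return the microservice response, or the fallback if the call fails.

    Connection errors and timeouts are OSError subclasses, malformed JSON
    raises ValueError, and protocol-level failures raise HTTPException.
    """
    try:
        return fetch(url)
    except (OSError, ValueError, http.client.HTTPException):
        # Degrade gracefully instead of letting the whole page time out.
        return dict(FALLBACK_CONTENT)
```

The `fetch` parameter is there so the failure path can be exercised without a real outage, for example by passing a function that raises `TimeoutError`.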