The exchange service suffered an outage this morning, resulting in several hours of downtime. This update provides details of the cause of the incident and its resolution.
The exchange service was recently updated to run on Amazon Linux 2 on the AWS Elastic Beanstalk (EB) container service. This upgrade was rolled out a couple of weeks ago and had been running successfully since. As part of the upgrade, the opportunity was taken to introduce a new shared load balancer for all of the new and updated EB services. This improves resiliency, reduces costs and reduces the overall complexity of managing single-instance load balancers for the microservices.
At approximately 02:45 GMT the AWS container management service performed an update on the exchange-service containers (production and test, as they share a maintenance window). This update caused the instances to be deregistered from the load balancer's target groups. The load balancer was then left with no available targets for the exchange-service listener rules, so requests were met with a 503 Service Unavailable response. The outage lasted until approximately 09:00 GMT, when the issue was resolved.
Once the cause was identified this morning (it was not immediately obvious), the application instances were re-registered with the appropriate target groups and normal service resumed; a sketch of the equivalent API calls follows the impact summary below. The impact of the outage was:
- Online quotes, although sent and processed correctly in TGSL, were not sent to exchange.
- Responses to Customer/Quotation surveys would not have been sent.
- Responses to contact forms would not have been sent.
Of these, the online quote data was sent to exchange almost immediately, as it is retained within the quote service so that it can be resent after exactly this kind of issue. The other form-based responses are sent directly to the exchange microservice and cannot be resent; this affects any such responses submitted by customers between 02:45 and 09:00 GMT.
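For reference, the manual fix amounted to re-registering the environment's instances with the shared load balancer's target group. The boto3 sketch below illustrates the equivalent API calls; the environment name, target group ARN and port are placeholders rather than the actual production values.

```python
"""Sketch of the manual fix: re-register an EB environment's instances with
its load balancer target group. Names/ARNs below are placeholders."""
import boto3

eb = boto3.client("elasticbeanstalk")
elbv2 = boto3.client("elbv2")

ENVIRONMENT_NAME = "exchange-service-prod"  # placeholder
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/exchange/..."  # placeholder
APP_PORT = 80  # placeholder

# Look up the EC2 instances currently backing the EB environment.
resources = eb.describe_environment_resources(EnvironmentName=ENVIRONMENT_NAME)
instance_ids = [i["Id"] for i in resources["EnvironmentResources"]["Instances"]]

# Re-register them with the shared load balancer's target group so the
# exchange-service listener rules have healthy targets again.
elbv2.register_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": iid, "Port": APP_PORT} for iid in instance_ids],
)
print(f"Re-registered {len(instance_ids)} instance(s) with the target group.")
```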
To mitigate against similar issues in the future, the following actions have been or will be carried out immediately:
- The autoscaling configuration will be amended so that the target groups automatically register updated application instances (see the sketch after this list).
- The AWS maintenance windows for the test environments will be configured to fall a few days before the production ones, so that any update-related issues surface in test first.
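One way to achieve the automatic registration is to attach the shared load balancer's target group to the Auto Scaling group that Elastic Beanstalk manages for the environment, so that replacement instances created during platform maintenance are registered without manual intervention. The sketch below shows this with boto3; again, the environment name and target group ARN are placeholders, and the final configuration may be applied via EB configuration rather than the API.

```python
"""Sketch of the first mitigation: attach the shared load balancer's target
group to the EB environment's Auto Scaling group. Names/ARNs are placeholders."""
import boto3

eb = boto3.client("elasticbeanstalk")
autoscaling = boto3.client("autoscaling")

ENVIRONMENT_NAME = "exchange-service-prod"  # placeholder
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/exchange/..."  # placeholder

# Find the Auto Scaling group that Elastic Beanstalk manages for the environment.
resources = eb.describe_environment_resources(EnvironmentName=ENVIRONMENT_NAME)
asg_name = resources["EnvironmentResources"]["AutoScalingGroups"][0]["Name"]

# With the target group attached to the ASG, any instance the group launches
# (including replacements during managed platform updates) is registered with
# the load balancer automatically.
autoscaling.attach_load_balancer_target_groups(
    AutoScalingGroupName=asg_name,
    TargetGroupARNs=[TARGET_GROUP_ARN],
)
print(f"Attached target group to Auto Scaling group {asg_name}.")
```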
As part of the forward architecture plan, an intermediate message-processing layer could be added to buffer and retry failed sends to the exchange microservice. This layer could be a fully managed AWS service or a serverless function, and would be highly resilient to outages of this kind. However, it would add considerable complexity, as well as as-yet-undefined AWS service costs. The pros and cons will be weighed up as part of any long-term architecture roadmap.
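To make the idea concrete, one possible serverless shape for this layer is an SQS queue in front of the exchange microservice with a Lambda consumer: failed deliveries are returned to the queue for retry and eventually dead-lettered. The sketch below is illustrative only; the endpoint URL and queue wiring are hypothetical and no decision has been made between a managed service and a serverless function.

```python
"""Illustrative sketch only: a queue-backed retry layer in front of the
exchange microservice, using SQS + Lambda. The endpoint URL is hypothetical."""
import urllib.request

EXCHANGE_ENDPOINT = "https://exchange.internal.example/api/messages"  # hypothetical

def handler(event, context):
    """Lambda handler for an SQS event source. Messages that fail to send are
    reported back so SQS redelivers them (and eventually dead-letters them)."""
    failures = []
    for record in event["Records"]:
        try:
            req = urllib.request.Request(
                EXCHANGE_ENDPOINT,
                data=record["body"].encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            urllib.request.urlopen(req, timeout=10)
        except Exception:
            # Report this message as failed; SQS redelivers it after the
            # visibility timeout, or moves it to a dead-letter queue.
            failures.append({"itemIdentifier": record["messageId"]})
    # Requires "ReportBatchItemFailures" to be enabled on the event source mapping.
    return {"batchItemFailures": failures}
```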