Date: Friday, October 6, 2017, 16:56 - 17:04
All times are in UTC
On October 6th, 2017 at 16:56, one of our database clusters unexpectedly failed. The specific way it failed prevented our automation from correctly removing it and promoting the backup cluster. Engineers were paged and manually intervened, promoting the backup database cluster to primary at 17:04, at which point normal service was restored. During this period, merchants would have seen HTTP 504 responses to API and Control Panel requests.
On October 6th, 2017 at 16:56, one of our Postgres clusters experienced a hardware failure. Because the failure was only partial, connections to Postgres were held open rather than refused. As a result, requests for merchants located on this shard timed out after a long wait instead of failing quickly. These timeouts cascaded, causing most requests to see elevated latency or time out entirely.
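The difference between a connection that is refused and one that is held open is what drove the cascade. A minimal sketch of the fail-fast behavior we want (all names here are illustrative, not our actual stack): bound each dependency call with a hard timeout so a connection held open by a half-failed node becomes a quick error instead of a long hang.

```python
# Sketch: convert a hung dependency call into a fast failure with a hard
# timeout. Function and variable names are hypothetical.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; raise TimeoutError after timeout_s."""
    return pool.submit(fn).result(timeout=timeout_s)

def hung_query():
    # Simulates a connection that a partially failed node holds open.
    time.sleep(2)
    return "rows"

try:
    call_with_timeout(hung_query, timeout_s=0.1)
    outcome = "slow success"
except TimeoutError:
    # The request worker is freed almost immediately instead of waiting
    # out the full hang, so other requests are not starved.
    outcome = "fast failure"
```

Without such a bound, each hung connection ties up a request worker for the full wait, which is how a single shard's failure degraded unrelated requests.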
During this period, automation correctly detected the issue and attempted to fence the failed node. Due to a network misconfiguration, the fencing attempt was unsuccessful. Had this automation worked, merchants located on the impacted node would have seen only a brief increase in latency, and other merchants would have seen minimal impact.
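Because the fencing path was only exercised during an actual failure, the network misconfiguration went unnoticed until it mattered. A hedged sketch of the kind of proactive check that would catch this earlier: periodically probe each node's fencing endpoint for reachability. The endpoint names and ports below are hypothetical.

```python
# Sketch: a reachability probe for fencing endpoints, so a network
# misconfiguration surfaces in monitoring rather than during a failover.
# Hosts and ports are illustrative only.
import socket

FENCE_ENDPOINTS = {
    "db-node-1": ("10.0.0.11", 623),  # e.g. an out-of-band management port
}

def can_reach_fence(host, port, timeout_s=1.0):
    """Return True if a TCP connection to the fencing endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A real check would go further and verify fencing credentials and permissions, but even a reachability probe run on a schedule turns this class of misconfiguration into an alert instead of an outage amplifier.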
Engineers were paged at 16:56 and correctly identified the root cause of the failures. Work began to manually promote the backup cluster to be the primary. This work was completed at 17:04 and API and Control Panel requests began functioning normally shortly thereafter.
Correcting the network misconfiguration that prevented automation from successfully fencing the node.
Adding monitoring to specifically confirm the ability to fence every node.
Ensuring that requests directed at dependencies that fail in this manner do not disrupt other, unrelated requests.
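The last item is commonly addressed with a circuit-breaker pattern: after repeated failures against a dependency, further calls fail immediately rather than tying up request workers. A minimal sketch under assumed thresholds (this is not our production implementation):

```python
# Sketch: a simple circuit breaker. After max_failures consecutive errors,
# calls fail fast until reset_after_s has elapsed, then one trial call is
# allowed through. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let one trial call through to probe recovery.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

With a breaker in front of the failed shard, requests for other shards would have kept their workers free instead of queueing behind held-open connections.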