Elevated API Errors
Incident Report for Braintree

RFO: Database Cluster Failure

Date: Friday 6 October 2017, 16:56 - 17:04

All Times in UTC

Executive Summary

On October 6th 2017 at 16:56, one of our database clusters failed unexpectedly. The specific way this cluster failed prevented our automation from correctly removing it and promoting the backup cluster. Engineers were paged and intervened manually, promoting the backup database cluster to primary at 17:04, at which time normal service was restored. During this period, merchants would have seen HTTP 504 responses to API and Control Panel requests.

Technical Summary

On October 6th 2017 at 16:56, one of our Postgres clusters experienced a hardware failure. Because the failure was only partial, connections to Postgres were held open rather than being closed. As a result, requests for merchants located on this shard waited until they timed out instead of failing quickly. These timeouts had a cascading effect: most requests saw increased response times or timed out.
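To illustrate the difference between waiting on a hung dependency and failing quickly, the following is a minimal sketch of a per-shard circuit breaker. It is not a description of Braintree's implementation. The names `ShardCircuitBreaker`, `ShardUnavailable`, and `call_shard`, as well as the threshold and cool-down values, are hypothetical and chosen only for illustration.

```python
import time


class ShardUnavailable(Exception):
    """Raised immediately when a shard's circuit is open (hypothetical name)."""


class ShardCircuitBreaker:
    """Stops sending requests to a shard after repeated timeouts.

    Illustrative only: the threshold and cool-down are assumed values.
    """

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_shard(breaker, query):
    """Run `query` against a shard, failing fast if its circuit is open."""
    if not breaker.allow():
        raise ShardUnavailable("shard circuit open; failing fast")
    try:
        result = query()
    except TimeoutError:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

With a breaker like this, only the first few requests to the failed shard pay the full timeout. Later requests are rejected immediately until the cool-down elapses, which frees workers for requests to healthy shards.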

During this period, automation correctly detected an issue and attempted to fence the node. Due to a network misconfiguration, the fencing attempt was unsuccessful. Had this automation worked, merchants located on the impacted node would have seen a brief increase in latency and other merchants would have seen minimal impact.

Engineers were paged at 16:56 and correctly identified the root cause of the failures. Work began to manually promote the backup cluster to be the primary. This work was completed at 17:04 and API and Control Panel requests began functioning normally shortly thereafter.

Follow-Up

  • Correcting the network misconfiguration that prevented automation from successfully fencing the node

  • Adding monitoring that specifically confirms the ability to fence all nodes

  • Ensuring that requests directed at dependencies that have failed in this manner do not disrupt other, unrelated requests
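One common way to keep a failed dependency from disrupting unrelated requests is the bulkhead pattern, which caps the number of in-flight requests per dependency. The sketch below is an assumption about how such isolation could look, not the approach Braintree adopted. `ShardBulkhead`, `BulkheadFull`, and the per-shard limit are hypothetical names and values.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a shard already has its maximum in-flight requests."""


class ShardBulkhead:
    """Caps concurrent requests per shard so one hung shard cannot
    consume every worker (illustrative sketch, assumed limit)."""

    def __init__(self, max_in_flight_per_shard):
        self._limit = max_in_flight_per_shard
        self._lock = threading.Lock()
        self._slots = {}  # shard name -> BoundedSemaphore

    def _semaphore(self, shard):
        with self._lock:
            if shard not in self._slots:
                self._slots[shard] = threading.BoundedSemaphore(self._limit)
            return self._slots[shard]

    @contextmanager
    def slot(self, shard):
        sem = self._semaphore(shard)
        # Do not wait for a slot: reject immediately if the shard is saturated.
        if not sem.acquire(blocking=False):
            raise BulkheadFull(f"too many in-flight requests for shard {shard!r}")
        try:
            yield
        finally:
            sem.release()
```

A request handler would wrap its database call in `with bulkhead.slot(shard_name):`. If one shard hangs, it can hold at most its own limit of workers, and requests for other shards continue to acquire slots normally.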

Posted Oct 06, 2017 - 20:51 UTC

Resolved
API response time and error rate have returned to normal.
Posted Oct 06, 2017 - 17:10 UTC
Monitoring
Braintree service is recovering. Engineers are monitoring.
Posted Oct 06, 2017 - 17:05 UTC
Investigating
Some merchants are experiencing an elevated rate of errors when attempting to call the Braintree API. Engineers are working urgently to resolve the issue.

Scope: This issue potentially affects any merchant attempting to make requests using the Braintree API. The Control Panel is not affected.
Posted Oct 06, 2017 - 17:00 UTC
This incident affected: API, United States Processing, Canadian Processing, European Processing, PayPal Processing, and APAC Processing.