Production Errors
Incident Report for Braintree
Postmortem

Root Cause Analysis

Date: Wednesday, 22 August 2018 22:06 - 22:22

All Times in UTC

Executive Summary

At 22:06, a networking event inside one of our datacenters triggered an automated database failover. Our automated database failover did not complete, leaving specific database clusters in a nonfunctional state. Engineers intervened and completed the failover at 22:22. The impacted databases handle traffic for American Express, Pay with PayPal, and some other transaction processing for a select group of merchants. As a result, impacted merchants may have observed the following between 22:06 and 22:22:

  • Elevated HTTP 500 errors
  • Failed PayPal and American Express transactions

    • Some failed PayPal transactions may have charged the customer. These are currently being voided/refunded to match the failed state in the gateway.
  • Authorized transactions where the request to Submit for Settlement failed

    • 14 of these requests succeeded. The current/correct status is reflected in the Control Panel and queryable via API.
    • Some of these requests for PayPal transactions failed, resulting in the original authorization being marked as failed.
  • Authorized transactions that do not appear in the gateway

    • There are 9 of these, and they will naturally expire from customer statements.

Technical Summary

A planned maintenance operation on a failed core network switch resulted in an unexpected loss of connectivity to one datacenter rack that contained primary databases for some of Braintree's production database clusters. This loss of connectivity triggered an automated failover process for those clusters. However, management equipment in the same rack was unreachable, which left the failover in an incomplete state. New primary members were promoted for each cluster, but database clients could not connect to the affected clusters because subsequent steps in the automation were not completed. On-call engineers intervened to complete the failover process, restoring service. The affected databases served American Express, Pay with PayPal, and one member of our transaction database pool (which impacted the group of merchants assigned to that cluster).
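
The failure mode described above, a multi-step failover that stops partway, can be illustrated with a minimal sketch. It is not Braintree's automation: the step names, the Cluster fields, and the assumption that the step needing the management equipment sits between promotion and client re-pointing are all hypothetical, chosen only to show how a promoted primary can coexist with clients that still cannot connect.

```python
from dataclasses import dataclass
from typing import List


class StepFailed(Exception):
    """Raised when a failover step cannot complete."""


@dataclass
class Cluster:
    # All fields are hypothetical, for illustration only.
    primary: str
    replicas: List[str]
    client_endpoint: str         # host that database clients connect to
    mgmt_reachable: bool = True  # management equipment in the old primary's rack


def promote_replica(cluster: Cluster) -> None:
    """Promote the first replica to primary."""
    cluster.primary = cluster.replicas.pop(0)


def fence_old_primary(cluster: Cluster) -> None:
    """Hypothetical step that depends on the management equipment."""
    if not cluster.mgmt_reachable:
        raise StepFailed("management equipment unreachable")


def repoint_clients(cluster: Cluster) -> None:
    """Send client traffic to the new primary."""
    cluster.client_endpoint = cluster.primary


def run_failover(cluster: Cluster) -> List[str]:
    """Run the steps in order; stop at the first failure.

    Returns the names of the steps that completed. Steps already
    applied are not rolled back, so a failure leaves partial state.
    """
    steps = [
        ("promote_replica", promote_replica),
        ("fence_old_primary", fence_old_primary),
        ("repoint_clients", repoint_clients),
    ]
    completed = []
    for name, step in steps:
        try:
            step(cluster)
        except StepFailed:
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    cluster = Cluster(primary="db-a", replicas=["db-b"],
                      client_endpoint="db-a", mgmt_reachable=False)
    print(run_failover(cluster))                     # steps that completed
    print(cluster.primary, cluster.client_endpoint)  # new primary vs. client target
```

In this sketch, with the management equipment unreachable, only the promotion step completes: the cluster has a new primary, but clients still point at the old one until someone finishes the remaining steps by hand.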

Follow-Up

  • Engineers are investigating how to make our automated failover more robust to the specific conditions under which the failure occurred.
  • Engineers have already rectified minor data visibility issues with API search and reporting, and any impacted transactions are now up to date or in the process of being voided/refunded to match the failed state relayed to merchants.
Posted Aug 24, 2018 - 21:19 UTC

Resolved
Error rates have returned to normal and PayPal and American Express transactions are processing successfully as of 22:24 UTC. Impacted requests received an HTTP 500 response, and impacted transactions were recorded as Processor Network Unavailable - Try Again (3000) between 22:05 and 22:24 UTC.
Posted Aug 22, 2018 - 22:33 UTC
Update
In addition to some API requests, Pay with PayPal and American Express transactions are failing as a result of this issue. Engineers continue to work towards resolution.
Posted Aug 22, 2018 - 22:22 UTC
Investigating
Some merchants are experiencing errors when attempting to call the Braintree API or access the Control Panel. We're aware of the issue and working to resolve it as quickly as possible.
Posted Aug 22, 2018 - 22:13 UTC
This incident affected: API, Control Panel, United States Processing, Canadian Processing, European Processing, PayPal Processing, and APAC Processing.