Production Errors
Incident Report for Braintree
Postmortem

Impact Description

During the incident window of 10:18 to 12:00 UTC on 13 May 2020, impacted merchants encountered timeouts or increased latency when attempting to reach the Braintree Gateway API endpoint. Some merchants may also have seen an elevated rate of HTTP 500-level responses to API requests. Control Panel users may also have received error messages or had difficulty downloading reports.

Root Cause

Braintree routes traffic to the Gateway API through multiple internet service providers (ISPs). During the incident, an ISP servicing one of our data centers had an upstream problem that caused packet loss for incoming requests. Simultaneously, our engineers were alerted to asymmetric routing issues that also caused timeouts. Because there were two ongoing issues, it was not immediately clear that the ISP was the sole cause of the symptoms we observed, and it took some time to determine the root cause and mitigate the impact. At 11:58 UTC, Braintree engineers rerouted traffic around the problematic ISP, restoring service for impacted traffic.
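The mitigation above amounts to withdrawing a degraded upstream path once its packet loss is confirmed. A minimal sketch of that kind of path-selection logic, assuming hypothetical ISP names and an illustrative loss threshold (this is not Braintree's actual implementation):

```python
LOSS_THRESHOLD = 0.05  # withdraw a path above 5% packet loss (assumed value)

def select_paths(loss_by_isp: dict[str, float]) -> list[str]:
    """Return the upstream ISPs that should keep carrying traffic."""
    healthy = [isp for isp, loss in loss_by_isp.items() if loss < LOSS_THRESHOLD]
    # If every path is degraded, keep them all rather than blackhole traffic.
    return healthy or list(loss_by_isp)

# Hypothetical measurements during the incident: one upstream dropping packets.
measurements = {"isp-a": 0.31, "isp-b": 0.002, "isp-c": 0.004}
print(select_paths(measurements))  # traffic is routed around isp-a
```

The fallback branch matters: automated withdrawal must never remove the last usable path, which is one reason such reroutes are often confirmed by an engineer, as happened here.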

Corrective Actions & Preventative Measures

  • We have reviewed our monitoring and mitigation measures for prolonged outbound connectivity issues and identified areas for improvement.
  • We are working to improve application resiliency through better decoupling of DNS and improved outbound Network Address Translation (NAT), among other steps.
  • We have begun work to improve monitoring of our router and load balancer metrics.
  • We are working on more granular, higher-precision system health monitoring that surfaces related dependencies across the platform to aid debugging and decision making.
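One building block of the higher-precision health monitoring described above is a sliding-window alarm on the rate of 500-level responses. A self-contained sketch, with the window size and alert threshold chosen purely for illustration:

```python
from collections import deque

class ErrorRateMonitor:
    """Alarm when the 500-level response rate over a sliding window
    exceeds a threshold. Window and threshold values are assumptions."""

    def __init__(self, window: int = 1000, threshold: float = 0.02):
        self.samples = deque(maxlen=window)  # 1 = server error, 0 = OK
        self.threshold = threshold

    def record(self, status_code: int) -> None:
        self.samples.append(1 if status_code >= 500 else 0)

    def alarming(self) -> bool:
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold
```

A sliding window reacts faster than a per-minute average and automatically "forgets" the incident once healthy responses push the errors out of the window.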
Posted May 15, 2020 - 21:46 UTC

Resolved
Braintree API error rates and traffic have returned to normal; merchants should no longer encounter timeouts or errors. The incident window was approximately 10:18 to 12:00 UTC on 13 May 2020. A postmortem will be published in the coming days.
Posted May 13, 2020 - 12:52 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 13, 2020 - 12:05 UTC
Update
We are continuing to investigate this issue. Some merchants may also experience elevated timeouts or increased latency when making API calls.
Posted May 13, 2020 - 11:39 UTC
Investigating
We're currently investigating an elevated rate of errors with the API and Control Panel. An update will be provided as soon as possible.

Symptoms
An elevated rate of 500-level responses for API requests and/or error pages when using the Control Panel.

Cause
Engineers are investigating internally.
Posted May 13, 2020 - 10:47 UTC
This incident affected: Production API (Forward API, Gateway API) and Control Panel.