Production Networking Issue
Incident Report for Braintree
Postmortem

Impact Description

Between 8:44 UTC and 9:46 UTC on 13 June 2019, impacted merchants encountered timeouts when attempting to reach Braintree’s API endpoint. Some Control Panel users may also have had difficulty using features such as downloads.

Root Cause

Braintree routes traffic to the Gateway API via multiple internet service providers (ISPs) across multiple datacenters. One ISP servicing one of our datacenters had an upstream issue that prevented inbound traffic from reaching our servers. Although engineers were quickly alerted to the connectivity errors, it took some time to determine which ISP was having issues. Once engineers confirmed the impacted ISP, they removed both the ISP and the datacenter from serving public traffic, and traffic levels and error rates recovered.

Corrective Actions & Preventative Measures

  • We are auditing and refining our automation around removing ISPs and datacenters from serving traffic (de-peering).

  • We are introducing higher-precision ISP-level monitoring into primary system dashboards.

  • We are improving application resiliency during prolonged outbound connectivity issues.
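
As an illustration of what ISP-level monitoring can look like, the sketch below groups synthetic inbound probe results by datacenter and ISP and flags the paths whose failure rate crosses a threshold. It is a minimal example under stated assumptions, not a description of Braintree's tooling: the `ProbeResult` fields, the datacenter and ISP labels, and the `threshold` and `min_probes` values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic inbound check. All field names are hypothetical."""
    datacenter: str
    isp: str
    ok: bool


def failure_rates(results):
    """Return {(datacenter, isp): fraction of failed probes}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        key = (r.datacenter, r.isp)
        totals[key] += 1
        if not r.ok:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def unhealthy_paths(results, threshold=0.5, min_probes=5):
    """List (datacenter, isp) paths whose failure rate meets the threshold.

    Paths with fewer than min_probes samples are skipped so that a single
    failed probe does not flag a path.
    """
    counts = defaultdict(int)
    for r in results:
        counts[(r.datacenter, r.isp)] += 1
    rates = failure_rates(results)
    return sorted(
        key for key, rate in rates.items()
        if counts[key] >= min_probes and rate >= threshold
    )


# Hypothetical sample: one ISP at one datacenter fails 9 of 10 probes.
probes = (
    [ProbeResult("dc-a", "isp-1", True)] * 10
    + [ProbeResult("dc-a", "isp-2", False)] * 9
    + [ProbeResult("dc-a", "isp-2", True)]
    + [ProbeResult("dc-b", "isp-1", True)] * 10
)
print(unhealthy_paths(probes))  # [('dc-a', 'isp-2')]
```

Keying the results by (datacenter, ISP) pair rather than by datacenter alone is what lets a dashboard point at a single provider, which was the step that took time during this incident.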

Posted Jun 14, 2019 - 15:45 UTC

Resolved
This incident has been resolved. The full incident impact window was 8:44 to 9:46 UTC.
Posted Jun 13, 2019 - 11:12 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 13, 2019 - 10:59 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jun 13, 2019 - 10:44 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 13, 2019 - 10:15 UTC
Update
We are continuing to investigate this issue.
Posted Jun 13, 2019 - 09:45 UTC
Update
We are continuing to investigate this issue.
Posted Jun 13, 2019 - 09:33 UTC
Investigating
We're investigating a networking issue that may be impacting the availability of the API or Control Panel for some merchants.

Symptoms
Timeouts or increased latency when making API calls.

Cause
An upstream networking issue with a third-party Internet Service Provider (ISP).
Posted Jun 13, 2019 - 09:16 UTC
This incident affected: Transaction Processing (United States Processing, Canadian Processing, European Processing, APAC Processing), Control Panel, and Production API (Gateway API).