Elevated Forward API Errors
Incident Report for Braintree
Postmortem

Impact Description 

During the incident window, CenturyLink/Level(3), a major ISP and Internet transit provider, experienced an outage that impacted several Braintree services in addition to a significant number of other services and providers across the Internet.  

Braintree utilizes several ISPs, so most services were not impacted by this issue. However, merchants utilizing the Forward API may have experienced an increase in HTTP 5xx errors between 10:02 and 12:58 UTC. Additionally, merchants using 3D Secure may have seen an increase in “authentication_unavailable” 3DS results between 10:02 and 14:23 UTC.

Root Cause 

Braintree services use multiple ISPs, including CenturyLink/Level(3), for inbound and outbound traffic across physical data centers. When CenturyLink began having issues at approximately 10:02 UTC, requests using the Forward API, 3D Secure, or Fraud Protection experienced intermittent packet loss. Engineers were engaged, but it was not immediately clear where in the network path the packet loss was occurring. As soon as CenturyLink was identified as a possible root cause, engineers began moving Forward API traffic to alternate ISP paths. Fraud Protection and 3D Secure rely on external providers who were themselves affected. We worked with those providers and tried multiple configurations under our control to restore connectivity to them, without effect; we ultimately relied on those providers to mitigate the issues.

Corrective Actions & Preventative Measures 

  • CenturyLink/Level(3) claimed responsibility at 13:33 UTC via Twitter 
  • We are reviewing playbooks and mitigation strategies to ensure ISP issues can be addressed quickly and reliably

Posted Sep 01, 2020 - 22:30 UTC

Resolved
This incident has been resolved. The Forward API was degraded between approximately 10:00 and 12:58 UTC due to an upstream internet service provider outage.
Posted Aug 30, 2020 - 13:14 UTC

Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 30, 2020 - 13:01 UTC

Investigating
We're currently investigating an elevated rate of errors with the Forward API. An update will be provided as soon as possible.

Symptoms
An elevated rate of 500-level responses for Forward API requests.

Cause
Engineers are investigating an upstream networking issue.
Posted Aug 30, 2020 - 10:15 UTC
This incident affected: Production API (Forward API).