Production Networking Issue
Incident Report for Braintree
Postmortem

Impact Description

Between 17:56 and 18:18 UTC on 12 December 2019, an elevated share of merchant requests to the Gateway API failed with connection timeout errors. This included approximately 100% of Pay with Venmo traffic. Merchants hosted within US AWS regions may have been unable to reach the Gateway API.

Root Cause

At 17:56 UTC, a load balancer serving one of our data centers suffered a hardware failure and became unable to service traffic. A failure of this kind is typically resolved quickly by automatic failover mechanisms. In this case, the automatic failover mechanisms did not trigger, and engineers had to intervene manually to mitigate the issue at 18:18 UTC.
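
As a generic illustration of the kind of automatic failover described above, the sketch below promotes a standby load balancer after several consecutive failed health checks. It is not the production implementation: the class name, the threshold of three failures, and the identifiers passed in are all hypothetical.

```python
class FailoverMonitor:
    """Hypothetical health-check monitor for an active/standby load balancer pair."""

    def __init__(self, primary, standby, failure_threshold=3):
        self.active = primary            # load balancer currently serving traffic
        self.standby = standby           # spare to promote, or None if already used
        self.failure_threshold = failure_threshold  # assumed value, for illustration
        self.consecutive_failures = 0

    def record_probe(self, healthy):
        """Record one health-check result and return the load balancer to use."""
        if healthy:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failure_threshold
                and self.standby is not None):
            # Promote the standby; no spare remains until one is restored.
            self.active, self.standby = self.standby, None
            self.consecutive_failures = 0
        return self.active
```

In a mechanism of this shape, failover depends on the probe results actually reaching the monitor and the threshold being crossed, which is why a failover path that does not trigger requires manual intervention.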

Corrective Actions & Preventative Measures

  • The impacted load balancer will remain out of service until engineers are able to ensure a successful return to service.
  • Automated failover mechanisms are being checked and improved to ensure full functionality during future hardware failures.
Posted Dec 13, 2019 - 18:40 UTC

Resolved
We have seen no further issues and this incident is now resolved. A full root cause analysis will be completed and published here in the coming days.

Failed transactions can be safely retried.
Posted Dec 12, 2019 - 18:30 UTC
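
For merchants acting on the note above that failed transactions can be safely retried, one common approach is to retry timed-out calls with exponential backoff and jitter. The following is a minimal sketch under stated assumptions: `GatewayTimeout`, `retry_with_backoff`, and the delay values are hypothetical names and settings chosen for illustration, not part of any Braintree client library.

```python
import random
import time


class GatewayTimeout(Exception):
    """Hypothetical stand-in for a connection timeout raised by a client call."""


def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Run call(); on GatewayTimeout, wait and try again up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except GatewayTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep after an outage.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A caller would wrap its own request function, for example `retry_with_backoff(lambda: submit_payment(order))`, where `submit_payment` is a hypothetical function that raises `GatewayTimeout` when the connection times out.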
Monitoring
Engineers have put a fix in place and traffic has shown signs of recovery since 18:18 UTC.
Posted Dec 12, 2019 - 18:25 UTC
Investigating
We're investigating a networking issue that may be impacting the availability of the API or Control Panel for some merchants.

Symptoms
Timeouts or increased latency when making API calls.

Cause
Engineers are investigating network-related causes.
Posted Dec 12, 2019 - 18:16 UTC
This incident affected: Production API (Gateway API) and Control Panel.