Production Errors
Incident Report for Braintree
Postmortem

Impact Description

Between 20:16 and 20:51 UTC on 14 May 2019, impacted merchants received an elevated rate of HTTP 500-level responses when issuing a variety of API calls related to the Vault (i.e. customer/payment method create, update, search, or find), recurring billing, or transaction find/search functions. Some Control Panel users may also have had difficulty using some functions during the incident window. Transaction create calls were largely unimpacted due to transaction resiliency processes.

Root Cause

The root cause was an operation that incorrectly assigned a conflicting IP address to a newly-provisioned database server that was not yet ready to accept production traffic.

Engineers were bootstrapping new production database servers and assigning unused IP addresses to them. After identifying what appeared to be an unused IP address, they ran several tests to confirm it was available for use. Despite these tests, an in-use IP address was assigned to one of the newly-provisioned servers. That address already belonged to the primary node of a production transaction database shard. The resulting IP conflict routed production traffic to the newly-provisioned server, which was not yet ready to accept traffic, causing HTTP 5xx errors beginning at 20:16 UTC.
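Liveness probes such as ping cannot prove an address is unallocated: a host may be momentarily unresponsive or filtering probes, which is how an in-use address can pass availability tests. A minimal sketch (hypothetical; the addresses and function are illustrative, not Braintree's tooling) of allocating from an authoritative inventory instead of probing:

```python
import ipaddress

def allocate_ip(subnet: str, allocated: set) -> str:
    """Return the first host address in `subnet` not recorded in the
    authoritative inventory `allocated`; raise if the pool is exhausted."""
    for host in ipaddress.ip_network(subnet).hosts():
        if str(host) not in allocated:
            return str(host)
    raise RuntimeError("no free addresses left in " + subnet)

# Hypothetical inventory: the shard primary's address is recorded,
# so it can never be handed out again.
inventory = {"10.0.0.1", "10.0.0.2"}
print(allocate_ip("10.0.0.0/29", inventory))  # -> 10.0.0.3
```

Because the inventory is the source of truth, a temporarily silent host still counts as allocated.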

Due to recently-implemented transaction resiliency measures, transaction create calls were balanced across other transaction shards and processed successfully. Other API and Control Panel traffic errored when database queries were incorrectly routed to the newly-provisioned database server.
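The resiliency behavior described above amounts to routing a create call to any shard whose primary is healthy, rather than failing when one shard is unreachable. A hedged sketch of that routing idea (the shard names and health check are hypothetical, not the actual implementation):

```python
import random

def route_transaction_create(txn, shards, is_healthy):
    """Pick a healthy shard for a transaction create, skipping any shard
    whose primary is unreachable (e.g. because of a misrouted IP)."""
    candidates = [s for s in shards if is_healthy(s)]
    if not candidates:
        raise RuntimeError("no healthy transaction shards available")
    return random.choice(candidates), txn

shards = ["shard-a", "shard-b", "shard-c"]
healthy = lambda s: s != "shard-b"  # pretend shard-b's primary is misrouted
shard, _ = route_transaction_create({"amount": "10.00"}, shards, healthy)
print(shard)  # shard-a or shard-c, never shard-b
```

Calls that must hit one specific shard (such as a find on an existing record) cannot be rebalanced this way, which is why Vault and search traffic errored while creates succeeded.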

Corrective Actions & Preventative Measures

  • Once engineers determined that a conflicting IP address was the cause of the errors, they forced down the network interface on the newly-provisioned server, redirecting traffic back to the intended database servers.

  • Improved database connection monitoring will be introduced.

  • Additional database resiliency will be added to protect other critical API traffic, such as customer/payment method creates.

  • Networking capacity available to database servers will be expanded to eliminate the need to reuse IP addresses.
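One way the connection-monitoring measure could catch this failure class is to verify that the host answering at each database IP is the node the inventory expects before routing queries to it. A hypothetical sketch (the inventory, probe, and names are assumptions for illustration):

```python
def find_ip_conflicts(expected, probe):
    """Compare each database IP's self-reported identity (via `probe`)
    against the inventory; return the IPs whose identity does not match.
    A mismatch suggests another host has claimed the address."""
    mismatched = []
    for ip, expected_name in expected.items():
        if probe(ip) != expected_name:
            mismatched.append(ip)
    return mismatched

inventory = {"10.0.0.1": "txn-shard-1-primary"}      # hypothetical mapping
probe = lambda ip: "bootstrap-node-7"                # conflicting host answers
print(find_ip_conflicts(inventory, probe))           # -> ['10.0.0.1']
```

An alert on any mismatch would have flagged the conflict as soon as the bootstrap server claimed the shard primary's address, rather than after 5xx errors appeared.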

May 16, 2019 - 21:02 UTC

Resolved
This incident is now resolved. The full incident window was 20:16 to 20:51 UTC.
May 14, 2019 - 20:58 UTC
Investigating
We're currently investigating an elevated rate of errors with the API and Control Panel. An update will be provided as soon as possible.

Symptoms
An elevated rate of 500-level responses for API requests and/or error pages when using the Control Panel.

Cause
Engineers are investigating internally.
May 14, 2019 - 20:32 UTC
This incident affected: Production API (Gateway API) and Control Panel.