Elevated API and GraphQL API errors
Incident Report for Braintree
Postmortem

Impact Description

During the incident window, Braintree merchants experienced elevated GraphQL API and client-side error rates. Customers may also have been unable to complete Braintree-hosted checkout flows. Some merchants may also have experienced increased latency and timeouts on requests to the Braintree Gateway API.

Root Cause

At 09:35 UTC, our engineers were alerted to a drop in Braintree API traffic. On further investigation, we determined that a large increase in incoming requests to Braintree’s network had been passed to our load balancers. As a result of the increased volume, a portion of our networking devices experienced service degradation, resulting in latency, increased error rates, and timeouts for some merchants.

Corrective Actions & Preventative Measures

  • As the volume of requests subsided, all services were restored and Braintree traffic returned to normal.
  • We have reviewed our resiliency and mitigation measures to ensure adequate protection from sudden spikes in network traffic volume.
  • We have begun work to improve monitoring of our router and load balancer metrics.
  • We will implement more granular monitoring and alert messaging to aid investigation and decision-making.
  • We will improve our runbooks to strengthen our mitigation steps and help us isolate root causes more quickly.
Posted Apr 21, 2020 - 22:50 UTC

Resolved
GraphQL API and client-side error rates have returned to normal.
Posted Apr 19, 2020 - 10:28 UTC
Identified
GraphQL API and client-side error rates are improving but still elevated. Engineers have identified solutions and will have them deployed shortly.
Posted Apr 19, 2020 - 10:19 UTC
Investigating
We're currently investigating an elevated rate of GraphQL and client-side errors. An update will be provided as soon as possible.

Symptoms
An elevated rate of GraphQL API and/or client-side errors.

Cause
Engineers are investigating internally.
Posted Apr 19, 2020 - 09:54 UTC
This incident affected: Production API (Gateway API, GraphQL API).