Elevated GraphQL API and Client Errors
Incident Report for Braintree
Postmortem

Impact Description

During the incident window of 12:00 to 13:00 UTC on 6 August 2019, merchants using the GraphQL API received HTTP 502 (Bad Gateway) error responses for all GraphQL API requests. Additionally, customers of merchants using the following client SDKs may have been unable to load Braintree-hosted checkout forms, such as the Drop-in UI or Hosted Fields, or to tokenize payment methods:

  • Android 2.9.0+
  • iOS 4.14.0+
  • JavaScript 3.28.1+

Root Cause

A certificate for applications hosted in our regional AWS ELBs serving our GraphQL API endpoint expired at 12:00 UTC on 6 August 2019. This caused these applications to return HTTP 502 (Bad Gateway) errors to clients making any request to the GraphQL API and other services fronted by the applications, including tokenization for the aforementioned client SDKs.

While engineers were aware of the expiring certificate and renewed it in March 2019, the renewed certificate was installed only on the CloudFront CDN. It was not deployed to the regional ELBs, which continued to present the old certificate. Once that certificate expired, CloudFront could no longer securely communicate with the regional ELBs, and it returned HTTP 502 (Bad Gateway) errors.
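
For illustration only, the gap described above (a renewed certificate at the CDN edge, an expiring one at the origin) is the kind of condition a per-endpoint expiry check can surface. The following is a minimal Python sketch, not Braintree's actual tooling. The function names, the 30-day warning threshold, and any host names passed in are hypothetical. It assumes each endpoint is directly reachable over TLS on port 443.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter time (negative once expired)."""
    # notAfter strings look like "Aug  6 12:00:00 2019 GMT".
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_endpoints(hosts, warn_days=30, now=None, fetch=fetch_not_after):
    """Classify every endpoint (edge and origin alike) as OK, WARN, or INVALID."""
    now = now or datetime.now(timezone.utc)
    report = {}
    for host in hosts:
        try:
            remaining = days_until_expiry(fetch(host), now)
        except ssl.SSLCertVerificationError:
            # The handshake itself failed verification, e.g. the
            # certificate has already expired.
            report[host] = "INVALID"
            continue
        report[host] = "WARN" if remaining < warn_days else "OK"
    return report
```

The point of the sketch is that the check runs against every host that terminates TLS, including origins behind a CDN, rather than only the public edge. The `fetch` parameter exists so the classification logic can be exercised with canned notAfter values instead of live connections.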

Corrective Actions & Preventative Measures

  • Engineers restored service at 13:00 UTC after manually deploying the renewed certificate to all regional ELBs.
  • We are working to improve the tracking and renewal process for all application certificates across Braintree infrastructure.
  • Alert timing and context have been improved so that engineers are alerted more intelligently and quickly to major outages.
  • We are working to establish an emergency process for obtaining certificates on short notice as a failsafe against similar issues in the future.
Posted Aug 07, 2019 - 21:36 UTC

Resolved
Traffic returned to normal at 13:00 UTC. A full root cause analysis will be provided.
Posted Aug 06, 2019 - 13:11 UTC
Update
Engineers are continuing to troubleshoot the issue. We will provide the next update at 13:10 UTC.
Posted Aug 06, 2019 - 12:56 UTC
Investigating
We're currently investigating an elevated rate of GraphQL and client-side errors and difficulties loading Braintree-hosted checkout forms. An update will be provided as soon as possible.

Symptoms
An elevated rate of GraphQL API and/or client-side errors.

Cause
Engineers are investigating internally.
Posted Aug 06, 2019 - 12:30 UTC
This incident affected: Production API (GraphQL API).