Production Errors
Incident Report for Braintree
Postmortem

Impact Description 

There was an outage affecting the Braintree Payments platform globally from 2020-09-16 15:57 to 2020-09-16 16:41 UTC, and in certain cases until 17:02 UTC. During the incident window, impacted API calls to authenticated endpoints received HTTP 5xx and/or timeout responses. Users of the Control Panel UI may have experienced “Looks like something is wrong” error messages. Webhooks and outbound emails related to user creation and password resets were also affected. All data integrity was preserved and failed requests can be retried safely. 

Root Cause 

Preliminary investigation revealed that this outage was due to an mTLS client certificate expiration that was not flagged ahead of time by our certificate management system.  Automated operations monitors detected the outage, alerted, and engineering began troubleshooting immediately. New certificates were rolled out at 16:14 UTC, however, this did not completely solve the problem. Engineers continued troubleshooting and discovered that, due to a recent infrastructure change, the new certificate did not roll out to all services properly. After manual intervention, service was restored to the Gateway API and Gateway Control Panel at 16:42 UTC. While monitoring for service restoration the engineers found continued errors from our edge services in AWS. It was determined those errors were due to cached stale data and the service was restarted to clear the cache. That action restored service to remaining client APIs at 17:02 UTC.  

Corrective Actions & Preventative Measures 

  • A new certificate was issued and deployed to application servers and edge servers were restarted to clear cached stale data. 
  • A comprehensive review of the monitoring and alerting around TLS certificates is under review and identified gaps will be closed. 
  • Code changes to unify the certificate deploy process given the new infrastructure changes are underway 
  • Certificate expiry reports will be manually generated until automation is in place.
Posted Sep 16, 2020 - 21:53 UTC

Resolved
This incident has been resolved and all services are operating normally.

Incident Description
There was an outage affecting the Braintree Payments platform that lasted from 2020-09-16 15:57 to 2020-09-16 16:41 UTC, and in certain cases until 17:02 UTC. Automated alarms detected the outage and engineers began troubleshooting immediately. Engineers identified the issue and redeployed applications to restore service. Service was restored to the Gateway API at 16:41 UTC. An additional operation was needed to clear stale cached values and fully restore client-side flows, which completed at 17:02 UTC. A full Root Cause Analysis document will be provided later today.
Posted Sep 16, 2020 - 17:26 UTC
Update
We have seen recovery for all traffic as of 17:02 UTC. Engineers are monitoring to ensure full recovery.
Posted Sep 16, 2020 - 17:12 UTC
Update
The Gateway API is fully recovered as of 16:42 UTC. Errors have remained high for some client-side flows. Engineers are working to resolve as quickly as possible.
Posted Sep 16, 2020 - 17:00 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 16, 2020 - 16:39 UTC
Update
We are continuing to work on a fix for this issue.
Posted Sep 16, 2020 - 16:28 UTC
Update
We are continuing to work on a fix for this issue.
Posted Sep 16, 2020 - 16:22 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 16, 2020 - 16:21 UTC
Investigating
We're currently investigating an elevated rate of errors with the API and Control Panel. An update will be provided as soon as possible.

Symptoms
An elevated rate of 500-level responses for API requests and/or error pages when using the Control Panel.

Cause
Engineers are investigating internally.
Posted Sep 16, 2020 - 16:03 UTC
This incident affected: Production API (Forward API, Gateway API, GraphQL API) and Control Panel.