An outage affected the Braintree Payments platform globally from 2020-09-16 15:57 to 2020-09-16 16:41 UTC, and in certain cases until 17:02 UTC. During the incident window, impacted API calls to authenticated endpoints received HTTP 5xx responses, timeouts, or both. Users of the Control Panel UI may have seen “Looks like something is wrong” error messages. Webhooks and outbound emails related to user creation and password resets were also affected. Data integrity was preserved, and failed requests can be retried safely.
Preliminary investigation revealed that this outage was caused by the expiration of an mTLS client certificate, which our certificate management system did not flag ahead of time. Automated operations monitors detected the outage and raised alerts, and engineering began troubleshooting immediately. New certificates were rolled out at 16:14 UTC, but this did not completely solve the problem. Engineers continued troubleshooting and discovered that, due to a recent infrastructure change, the new certificate had not rolled out to all services properly. After manual intervention, service was restored to the Gateway API and Gateway Control Panel at 16:42 UTC. While monitoring for service restoration, the engineers found continued errors from our edge services in AWS. Those errors were traced to stale cached data, and the service was restarted to clear the cache. That action restored service to the remaining client APIs at 17:02 UTC.
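
To illustrate the kind of check that flags an expiring certificate ahead of time, here is a minimal sketch in Python. It is not a description of our certificate management system: the function name `expiring_soon`, the 30-day warning window, and the certificate names and dates in the usage example are all hypothetical, and the sketch assumes the `notAfter` timestamps have already been extracted from each certificate.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_soon(
    certs: Dict[str, datetime],
    now: datetime,
    warn_days: int = 30,  # hypothetical warning window
) -> List[str]:
    """Return the names of certificates that have expired or will expire
    within `warn_days` of `now`, sorted alphabetically.

    `certs` maps a certificate name to its notAfter timestamp (UTC).
    """
    threshold = now + timedelta(days=warn_days)
    return sorted(name for name, not_after in certs.items() if not_after <= threshold)


if __name__ == "__main__":
    # Hypothetical inventory; names and dates are illustrative only.
    inventory = {
        "internal-mtls-client": datetime(2021, 1, 10, tzinfo=timezone.utc),
        "edge-service": datetime(2021, 12, 1, tzinfo=timezone.utc),
    }
    today = datetime(2021, 1, 1, tzinfo=timezone.utc)
    for name in expiring_soon(inventory, today):
        print(f"WARNING: certificate {name} expires within 30 days")
```

A check like this is only useful if the inventory covers every certificate in use, including client certificates, and if it runs on a schedule that alerts well before the warning window closes.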