Production Networking Issue
Incident Report for Braintree
Postmortem

RCA: Production Networking Issues

Date: Monday, 26 March 2018 12:33 - 13:43

All Times in UTC

Executive Summary
Beginning at 12:33, engineers observed a small decrease in traffic to the Braintree production API endpoints. After investigating our network paths, engineers determined that the decrease stemmed from an upstream problem at one of our Internet Service Providers (ISPs). Engineers then re-routed traffic away from the affected ISP, which reduced latency and returned traffic to normal levels by 13:43.

Follow-Up

  • We are working with the affected ISP to determine what caused the issue and what they are doing to prevent this type of incident in the future.
  • We plan to explore adding more ISPs to improve redundancy and provide additional ingress routes.
  • We are reviewing internal procedures so that we can remove an affected ISP from service as soon as we detect a drop in traffic.
  • We are actively working on several initiatives to improve latency, uptime, and redundancy by leveraging failover processing paths as well as routing API traffic through alternate services.
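
As an illustration of the third follow-up item only, the check below is a minimal sketch of how a per-ISP traffic drop might be flagged for removal. It is not Braintree's actual tooling: the IspSample type, the isps_to_drain function, the 20% threshold, and the ISP labels are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class IspSample:
    name: str            # hypothetical ISP label, e.g. "isp-a"
    baseline_rps: float  # typical requests/second for this time of day
    current_rps: float   # requests/second seen in the latest window


def isps_to_drain(samples, max_drop=0.2, min_healthy=1):
    """Return ISPs whose traffic fell more than max_drop below baseline.

    The result is ordered worst drop first and is capped so that at
    least min_healthy ISPs always stay in service.
    """
    degraded = []
    for s in samples:
        if s.baseline_rps <= 0:
            continue  # no usable baseline, so the drop cannot be judged
        drop = (s.baseline_rps - s.current_rps) / s.baseline_rps
        if drop > max_drop:
            degraded.append((drop, s.name))

    degraded.sort(reverse=True)  # largest fractional drop first
    allowed = max(len(samples) - min_healthy, 0)
    return [name for _, name in degraded[:allowed]]


if __name__ == "__main__":
    window = [
        IspSample("isp-a", baseline_rps=1000, current_rps=950),
        IspSample("isp-b", baseline_rps=1000, current_rps=600),
    ]
    print(isps_to_drain(window))
```

The min_healthy cap reflects one design assumption: if every provider looks degraded at once, the cause is more likely to be on our side than upstream, so the sketch never recommends removing all ingress routes.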
Posted Mar 26, 2018 - 23:36 UTC

Resolved
Traffic is now flowing normally and we have not seen additional latency or timeouts since 13:43 UTC.
Posted Mar 26, 2018 - 14:21 UTC
Monitoring
We have re-routed traffic away from one of our ISPs. Latency has decreased and engineers are monitoring.
Posted Mar 26, 2018 - 13:52 UTC
Investigating
We have detected a problem routing inbound requests to the production gateway and Control Panel via one of our Internet Service Providers (ISPs). Merchants may experience timeouts or increased latency when connecting to the Braintree Production API or Control Panel. Engineers are investigating.
Posted Mar 26, 2018 - 13:31 UTC
This incident affected: API.