RCA: Production Networking Issues
Date: Monday, 26 March 2018 12:33 - 13:43
All Times in UTC
Beginning at 12:33, engineers observed a small decrease in traffic to the Braintree production API endpoints. After investigating our network paths, engineers determined that the decrease stemmed from an upstream problem at one of our Internet Service Providers (ISPs). Engineers then re-routed traffic away from the affected ISP, which improved latency and brought traffic back to normal levels as of 13:43.
- We are working with the affected ISP to determine what caused the issue and what they are doing to prevent this type of incident in the future.
- We plan to explore adding more ISPs to improve redundancy and provide additional ingress routes.
- We are reviewing internal procedures to improve the process for routing traffic away from an affected ISP as soon as we detect a drop in traffic.
- We are actively working on several initiatives to improve latency, uptime, and redundancy by leveraging failover processing paths as well as routing API traffic through alternate services.