Post-Incident Review: Platform-Wide Service Degradation, February 2025
Author: Alex Lee (@alexjoelee)
Incident Overview
Event: Service degradation resulting in extended response times and 5XX errors
Severity: High (Level 1)
Customer Impact: All sites
Duration: 29 hours, 15 minutes (February 18, 17:45 UTC - February 19, 23:00 UTC)
Status: Fully Resolved
Impact Details
Beginning February 18 at 17:45 UTC, our systems experienced progressively degrading performance across a portion of network locations. Beta users reported:
- Extended page load times
- Connection timeouts
- 5XX server errors
The disruption affected network traffic intermittently, with performance gradually worsening as the incident progressed. At the peak of the issue, nearly half of the requests entering our network waited 2-3 seconds for a response, and some requests failed entirely.
Root Cause Analysis
The service degradation resulted from an unintended interaction between two configuration changes deployed on February 17:
- Health Check Configuration: A change intended to increase the interval between active health checks inadvertently modified the timeout parameter instead. This caused health check connections to remain open for up to three minutes waiting for a response, rather than closing promptly when requests failed.
- Connection Management: Simultaneously, we modified our security controls to limit connection volume from individual sources. The new configuration incorrectly applied restrictions meant only for public internet traffic to connections between our edge servers as well (see the sketch below).
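For illustration, here is a minimal sketch of the two changes side by side. The parameter names and values are hypothetical, not our actual configuration schema:

```python
# Illustrative only: hypothetical parameter names/values, not our real config schema.

# Intended change: probe less often, but still fail fast when a peer does not respond.
intended_health_check = {
    "interval_seconds": 60,   # raise the interval between active health checks
    "timeout_seconds": 5,     # keep failed probes short-lived
}

# Deployed change: the timeout was modified instead of the interval.
deployed_health_check = {
    "interval_seconds": 15,   # unchanged
    "timeout_seconds": 180,   # failed probes now linger for up to three minutes
}

# Intended scope of the new per-source connection limits vs. what actually shipped.
intended_limit_scope = {"public_internet": True, "edge_to_edge": False}
deployed_limit_scope = {"public_internet": True, "edge_to_edge": True}
```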
These changes created a cascading effect where:
- Our L4 firewall began blocking connections between our edge servers
- Health checks between edge servers no longer failed promptly and began to consume excessive resources
- Resource exhaustion progressively worsened network performance (a back-of-envelope illustration follows this list)
- A few servers crashed
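To see why the longer timeout mattered, consider a rough estimate of how quickly hanging probes accumulate. The probe counts below are assumed for the example, not measured values:

```python
# If a probe is sent every `interval_s` seconds and a failed probe is held open for
# `timeout_s` seconds, roughly timeout_s / interval_s probe connections are in flight
# per unreachable peer at any given moment.

def concurrent_probe_connections(interval_s: float, timeout_s: float, peers: int) -> float:
    """Approximate open health-check connections when probes to `peers` all hang."""
    return (timeout_s / interval_s) * peers

# Assumed example: an edge server probing 200 blocked peers every 15 seconds.
print(concurrent_probe_connections(15, 5, 200))    # ~67 connections with a 5 s timeout
print(concurrent_probe_connections(15, 180, 200))  # 2400 connections with a 180 s timeout
```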
Resolution Process
To resolve the incident, we:
- identified that the resource consumption was being caused by excessive health checks that were not failing properly,
- traced the behavior to the configuration changes made on February 17,
- rolled back the health check configuration change and fixed our firewall rules to classify traffic between our edge servers properly (a sketch of the classification fix follows below),
- brought crashed servers back online,
- and verified that service was restored and remained consistent
before calling this incident resolved.
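For context, the firewall fix amounted to classifying traffic before applying per-source limits. This is a minimal sketch assuming a hypothetical internal address range; our production rules are more granular:

```python
# Sketch only: 10.0.0.0/8 is a placeholder for our internal edge-to-edge range.
from ipaddress import ip_address, ip_network

EDGE_NETWORKS = [ip_network("10.0.0.0/8")]

def should_rate_limit(source_ip: str) -> bool:
    """Apply per-source connection limits only to public internet traffic."""
    addr = ip_address(source_ip)
    return not any(addr in net for net in EDGE_NETWORKS)

assert should_rate_limit("203.0.113.7") is True   # public client: limits apply
assert should_rate_limit("10.12.0.4") is False    # edge-to-edge: exempt
```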
Moving Forward
To prevent this from happening again, we have:
- implemented additional alerting and notifications to provide earlier warning of abnormal resource usage (an example check is sketched after this list),
- expanded our development and testing infrastructure and extended our testing periods,
- and updated our logging dashboards and view filters.
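As an example of the first item, here is a simplified sketch of the kind of early-warning check we added. The metric names and thresholds are placeholders, not our production alerting configuration:

```python
# Placeholder metric names and thresholds; real alerting runs in our monitoring pipeline.

def connection_pressure_alert(open_connections: int, baseline: int, factor: float = 3.0) -> bool:
    """Return True when open connections exceed `factor` times the recent baseline."""
    return open_connections > baseline * factor

if connection_pressure_alert(open_connections=2400, baseline=70):
    print("ALERT: open connection count far above baseline; investigate health checks")
```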
Support
If you have any ongoing issues, please create a ticket at https://support.skip2.net/