Post-Incident Review: Platform-Wide Service Degradation, February 2025
Author: Alex Lee (@alexjoelee)
Incident Overview
Event: Service degradation resulting in extended response times and 5XX errors
Severity: High (Level 1)
Customer Impact: All sites
Duration: 29 hours, 15 minutes (February 18, 17:45 UTC - February 19, 23:00 UTC)
Status: Fully Resolved
Impact Details
Beginning February 18 at 17:45 UTC, our systems experienced progressively degrading performance across a portion of network locations. Beta users reported:
- Extended page load times
- Connection timeouts
- 5XX server errors
The disruption affected network traffic intermittently, with performance gradually worsening as the incident progressed. At the peak of the issue, nearly half of the requests entering our network waited 2-3 seconds for a response, and some requests failed entirely.
Root Cause Analysis
The service degradation resulted from an unintended interaction between two configuration changes deployed on February 17:
- Health Check Configuration: A change intended to increase the interval between active health checks inadvertently modified the timeout parameter instead. This caused health check connections to remain open for up to three minutes waiting for a response, rather than closing promptly when requests failed.
- Connection Management: Simultaneously, we modified our security controls to limit connection volume from individual sources. The new configuration incorrectly applied restrictions meant only for public internet traffic to connections between our edge servers as well (see the sketch below).
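For illustration, here is a minimal sketch of the two changes side by side. The parameter names and values are hypothetical, not our actual configuration schema:

```python
# Illustrative only: hypothetical parameter names/values, not our real config schema.

# Intended change: probe less often, but still fail fast when a peer does not respond.
intended_health_check = {
    "interval_seconds": 60,   # raise the interval between active health checks
    "timeout_seconds": 5,     # keep failed probes short-lived
}

# Deployed change: the timeout was modified instead of the interval.
deployed_health_check = {
    "interval_seconds": 15,   # unchanged
    "timeout_seconds": 180,   # failed probes now linger for up to three minutes
}

# Intended scope of the new per-source connection limits vs. what actually shipped.
intended_limit_scope = {"public_internet": True, "edge_to_edge": False}
deployed_limit_scope = {"public_internet": True, "edge_to_edge": True}
```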
These changes created a cascading effect where:
- Our L4 firewall began blocking connections between our edge servers
- Health checks between edge servers no longer failed promptly and began to consume excessive resources
- Resource exhaustion progressively worsened network performance (a back-of-envelope illustration follows this list)
- A few servers crashed
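To see why the longer timeout mattered, consider a rough estimate of how quickly hanging probes accumulate. The probe counts below are assumed for the example, not measured values:

```python
# If a probe is sent every `interval_s` seconds and a failed probe is held open for
# `timeout_s` seconds, roughly timeout_s / interval_s probe connections are in flight
# per unreachable peer at any given moment.

def concurrent_probe_connections(interval_s: float, timeout_s: float, peers: int) -> float:
    """Approximate open health-check connections when probes to `peers` all hang."""
    return (timeout_s / interval_s) * peers

# Assumed example: an edge server probing 200 blocked peers every 15 seconds.
print(concurrent_probe_connections(15, 5, 200))    # ~67 connections with a 5 s timeout
print(concurrent_probe_connections(15, 180, 200))  # 2400 connections with a 180 s timeout
```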
Resolution Process
To resolve the incident, we:
- identified that the resource consumption was being caused by excessive health checks that were not failing properly,
- traced the behavior to the configuration changes made on February 17,
- rolled back the health check configuration change and fixed our firewall rules to classify traffic between our edge servers properly (a sketch of the classification fix follows below),
- brought crashed servers back online,
- and verified that service was restored and remained consistent
before calling this incident resolved.
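For context, the firewall fix amounted to classifying traffic before applying per-source limits. This is a minimal sketch assuming a hypothetical internal address range; our production rules are more granular:

```python
# Sketch only: 10.0.0.0/8 is a placeholder for our internal edge-to-edge range.
from ipaddress import ip_address, ip_network

EDGE_NETWORKS = [ip_network("10.0.0.0/8")]

def should_rate_limit(source_ip: str) -> bool:
    """Apply per-source connection limits only to public internet traffic."""
    addr = ip_address(source_ip)
    return not any(addr in net for net in EDGE_NETWORKS)

assert should_rate_limit("203.0.113.7") is True   # public client: limits apply
assert should_rate_limit("10.12.0.4") is False    # edge-to-edge: exempt
```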
Moving Forward
To prevent this from happening again, we have:
- implemented additional alerting and notifications to provide earlier warning of abnormal resource usage (an example check is sketched after this list),
- expanded our development and testing infrastructure and extended our testing periods,
- and updated our logging dashboards and view filters.
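As an example of the first item, here is a simplified sketch of the kind of early-warning check we added. The metric names and thresholds are placeholders, not our production alerting configuration:

```python
# Placeholder metric names and thresholds; real alerting runs in our monitoring pipeline.

def connection_pressure_alert(open_connections: int, baseline: int, factor: float = 3.0) -> bool:
    """Return True when open connections exceed `factor` times the recent baseline."""
    return open_connections > baseline * factor

if connection_pressure_alert(open_connections=2400, baseline=70):
    print("ALERT: open connection count far above baseline; investigate health checks")
```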
Support
If you have any ongoing issues, please create a ticket at https://support.skip2.net/