Increased Latency and Reduced Availability in the US, EU, and Global Region Groups

Incident Report for Fauna

Postmortem

Between January 9th and January 13th UTC, users in our US, EU, and Global Region Groups experienced multiple periods of increased latency and reduced availability. The issue was caused by a series of DDoS attacks targeting a set of our customers, resulting in extreme waves of traffic to Fauna, each of which was many orders of magnitude above our baseline traffic patterns.

For additional context, Fauna’s data plane is broken into four logical layers: routing, compute, log, and storage. The routing layer lives outside Fauna Region Groups, handling requests as they ingress into the Fauna network, and uses an intelligent routing algorithm to optimally route requests to the closest region of the relevant Region Group. More information about Fauna’s architecture is available in our technical whitepaper.

On January 9th at 5:30 AM UTC, a large increase in customer traffic caused higher latency and unavailability across Region Groups. Multiple on-call engineers were engaged by 5:34 AM and a message was posted to our status page at 5:39 AM. After initial investigation, engineers found that although the compute layer was throttling these requests and returning 429 responses because they exceeded throughput limits, routing nodes were still starved of resources by the number of connections being held and the request-handling state kept in memory. Engineers identified and blocked the offending accounts, which caused the compute layer to short-circuit the normal request handling process and return 410 responses for requests from those accounts. This shortened the time that each connection was held and mitigated the impact of the issue.
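
To illustrate why shortening the time each connection is held relieves the routing nodes, the sketch below applies Little's Law (average connections in flight = arrival rate x average hold time). The request rate and hold times are hypothetical values chosen for illustration; they are not measurements from this incident.

```python
def connections_in_flight(requests_per_second: float, hold_ms: float) -> float:
    """Little's Law: average concurrency = arrival rate * average time held."""
    return requests_per_second * hold_ms / 1000


# Hypothetical numbers, not taken from the incident.
ATTACK_RPS = 200_000

# Request travels through normal handling before being rejected.
full_path = connections_in_flight(ATTACK_RPS, hold_ms=50)

# Request is rejected early by a short-circuit path.
short_circuit = connections_in_flight(ATTACK_RPS, hold_ms=5)

print(f"full path:     {full_path:,.0f} connections held on average")
print(f"short circuit: {short_circuit:,.0f} connections held on average")
```

Under these assumed numbers, a tenfold reduction in hold time gives a tenfold reduction in the connections and per-request memory that the routing fleet must carry at the same request rate.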

Note that Fauna has a number of layers of protection against DDoS attacks, including IP-based blocking mechanisms at the routing layer, throughput limits at the compute and storage layers, and other systems that cannot be mentioned for security reasons. In this specific case, the volume, shape, and distributed nature of the requests caused them to bypass some of these mechanisms, resulting in degradation of the service even though throttling based on throughput limits worked as expected.

Subsequent large waves of traffic started on January 9th, 10th, 11th, and 13th UTC – for specific impact periods, please refer to our status page. Most events lasted for around an hour, and in each case requests came from hundreds of IP addresses. Later waves included more than an order of magnitude more requests than earlier ones.

Engineers were engaged around the clock to mitigate the impact of each wave, taking a number of mitigation steps that included:

Scaling the routing fleet, which was challenging because the request volume was constantly in flux.
Reducing the number of connections allowed to each routing node from each IP address.
Profiling the router under production load to find optimizations to the routing pipeline.
Adding a new mechanism to block disabled accounts closer to the edge at the routing layer instead of the compute layer.
Adding a new mechanism to enforce throttling for accounts that exceeded throughput limits closer to the edge at the routing layer instead of the compute and storage layers.
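
The last three mitigation steps above can be sketched as a single admission check that runs at the edge before a request is forwarded. This is a minimal illustration, not Fauna's implementation: the class names (EdgeGate, TokenBucket), the token-bucket throttling scheme, and all limits are assumptions. The 410 and 429 status codes mirror the responses described earlier in this report.

```python
from collections import defaultdict
from typing import Optional


class TokenBucket:
    """Allows `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EdgeGate:
    """Hypothetical routing-layer gate: per-IP connection cap, account
    block list, and per-account throttling, all checked before forwarding."""

    def __init__(self, max_conns_per_ip: int, rate: float, burst: float):
        self.max_conns_per_ip = max_conns_per_ip
        self.rate = rate
        self.burst = burst
        self.blocked_accounts = set()
        self.conns_by_ip = defaultdict(int)
        self.buckets = {}

    def open_connection(self, ip: str) -> bool:
        """Return False if this IP already holds its maximum connections."""
        if self.conns_by_ip[ip] >= self.max_conns_per_ip:
            return False
        self.conns_by_ip[ip] += 1
        return True

    def close_connection(self, ip: str) -> None:
        if self.conns_by_ip[ip] > 0:
            self.conns_by_ip[ip] -= 1

    def check_request(self, account: str, now: float) -> Optional[int]:
        """Return an HTTP status to reject with, or None to forward."""
        if account in self.blocked_accounts:
            return 410  # blocked account, rejected at the edge
        bucket = self.buckets.get(account)
        if bucket is None:
            bucket = self.buckets[account] = TokenBucket(self.rate, self.burst, now)
        return None if bucket.allow(now) else 429  # throttled at the edge


gate = EdgeGate(max_conns_per_ip=2, rate=1.0, burst=2)
gate.blocked_accounts.add("attacked-account")
print(gate.check_request("attacked-account", now=0.0))  # blocked
print(gate.check_request("normal-account", now=0.0))    # forwarded
```

The caller passes the current time explicitly (for example from time.monotonic()), which keeps the sketch deterministic. The design point is that every rejection happens before the request occupies a slot in the layers behind the router, so the connection can be released sooner.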

At the same time, engineers were working with impacted customers to help them modify their architecture to minimize the impact of attacks on Fauna and other downstream dependencies.

These changes have collectively improved our security and availability posture, but there is more work to do. Over the next few weeks and months, we will aggressively prioritize additional improvements including:

Modifying the request routing infrastructure in front of the routing layer to provide additional tools to block specific types of traffic closer to the client.
Implementing reactive autoscaling at the routing layer to scale up more quickly under heavy load.
Enhancing our test environments so that we can quickly run load tests that simulate very specific types of customer load.
Optimizing our routing node configuration (e.g., rightsizing the total number of connections allowed to each routing node).
Creating a faster configuration deployment pipeline for the routing layer to speed up deployment of changes during an event.
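
As one illustration of what reactive autoscaling at the routing layer could look like, the sketch below sizes the fleet in proportion to observed load against a per-node target, clamped to a minimum and maximum. The function name, the choice of connections per node as the metric, and all numbers are assumptions for illustration; the report does not describe the actual design.

```python
import math


def desired_nodes(
    current_nodes: int,
    conns_per_node: float,
    target_conns_per_node: float,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Scale the fleet so that average load per node approaches the target."""
    raw = math.ceil(current_nodes * conns_per_node / target_conns_per_node)
    # Clamp so that a traffic spike or lull cannot push the fleet out of bounds.
    return max(min_nodes, min(max_nodes, raw))


# Hypothetical example: 10 nodes each holding 9,000 connections,
# with a target of 3,000 connections per node.
print(desired_nodes(10, 9_000, 3_000, min_nodes=4, max_nodes=100))
```

Because the request volume in this event was constantly in flux, a real policy would likely also need a cooldown or smoothing window to avoid repeatedly scaling up and down; that is omitted here for brevity.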

We prioritize the security, availability, and performance of our service above everything else and apologize for any inconvenience that this event caused you. If you have further questions or comments about the event or want to speak with us personally to discuss the event in greater detail, please reach out to our support team.

Posted Jan 19, 2024 - 16:19 PST

Resolved

On January 9th between 5:30 AM and 6:30 AM UTC we experienced increased latency and reduced availability in the US, EU, and Global Region Groups. The issue has been resolved and the service is operating normally.

Posted Jan 08, 2024 - 23:21 PST

Update

We are continuing to investigate this issue.

Posted Jan 08, 2024 - 21:58 PST

Update

We are continuing to investigate an issue that is causing increased latency and reduced availability in the US, EU, and Global Region Groups.

Posted Jan 08, 2024 - 21:56 PST

Investigating

We are investigating an issue that is causing reduced availability in the US Region Group.

Posted Jan 08, 2024 - 21:49 PST

This incident affected: Global Region Group (FQL API), US Region Group (FQL API), and EU Region Group (FQL API).