Between January 9th and January 13th UTC, users in our US, EU, and Global Region Groups experienced multiple periods of increased latency and reduced availability. The issue was caused by a series of DDoS attacks targeting a set of our customers, which sent extreme waves of traffic to Fauna, each many orders of magnitude above our baseline traffic patterns.
For additional context, Fauna’s data plane is broken into four logical layers: routing, compute, log, and storage. The routing layer sits outside Fauna Region Groups; it handles requests as they enter the Fauna network and uses an intelligent routing algorithm to route each request to the closest region of the relevant Region Group. More information about Fauna’s architecture is available in our technical whitepaper.
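To make the request flow concrete, here is a minimal TypeScript sketch of latency-based region selection at a routing node. The region names, fields, and selection heuristic are illustrative assumptions, not Fauna's actual routing algorithm.

```typescript
// Illustrative only: a simplified view of latency-based region selection.
// Region names, fields, and the heuristic are assumptions, not Fauna's code.

interface RegionCandidate {
  region: string;          // e.g. "us-east" (hypothetical name)
  regionGroup: string;     // e.g. "us"
  healthy: boolean;        // result of health checks
  latencyMs: number;       // latency observed from this routing node
}

function pickRegion(
  candidates: RegionCandidate[],
  requestRegionGroup: string
): RegionCandidate | undefined {
  // Consider only healthy regions in the request's Region Group,
  // then route to the one with the lowest observed latency.
  return candidates
    .filter((c) => c.healthy && c.regionGroup === requestRegionGroup)
    .sort((a, b) => a.latencyMs - b.latencyMs)[0];
}

// Example: a routing node choosing between two US regions.
const choice = pickRegion(
  [
    { region: "us-east", regionGroup: "us", healthy: true, latencyMs: 12 },
    { region: "us-west", regionGroup: "us", healthy: true, latencyMs: 48 },
  ],
  "us"
);
console.log(choice?.region); // "us-east"
```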
On January 9th at 5:30 AM UTC, a large increase in customer traffic caused higher latency and unavailability across Region Groups. Multiple on-call engineers were engaged by 5:34 AM, and a message was posted to our status page at 5:39 AM. Initial investigation showed that although the requests were being throttled at the compute layer with 429 responses for exceeding throughput limits, routing nodes were still experiencing resource starvation: connections were held open and the resources associated with request handling were kept in memory. Engineers identified and blocked the offending accounts, which caused the compute layer to short-circuit the normal request handling process and return 410 responses for requests from those accounts. This shortened the time each connection was held and mitigated the impact of the issue.
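As an illustration of that short-circuit behavior, the TypeScript sketch below rejects requests from blocked accounts with a 410 before any normal request handling runs, so connections are released quickly. The blocklist, handler shape, and account identifier are hypothetical; this is not Fauna's implementation.

```typescript
// A minimal sketch of short-circuiting blocked accounts, under assumed types.

type Handler = (req: {
  accountId: string;
  body: string;
}) => Promise<{ status: number; body?: string }>;

const blockedAccounts = new Set<string>(["acct_abc123"]); // hypothetical account

function withAccountBlocking(next: Handler): Handler {
  return async (req) => {
    // Reject blocked accounts immediately with 410 Gone, before any expensive
    // request handling, so the connection is released quickly and no
    // per-request state is held in memory.
    if (blockedAccounts.has(req.accountId)) {
      return { status: 410 };
    }
    return next(req); // normal path: may still be throttled (429) downstream
  };
}

// Example: wrapping a trivial handler.
const handle = withAccountBlocking(async () => ({ status: 200, body: "ok" }));
```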
Note that Fauna has multiple layers of protection against DDoS attacks, including IP-based blocking mechanisms at the routing layer, throughput limits at the compute and storage layers, and other systems that we do not describe for security reasons. In this specific case, the volume, shape, and distributed nature of the requests allowed them to bypass some of these mechanisms, degrading the service even though throttling based on throughput limits worked as expected.
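For readers unfamiliar with throughput limiting, the sketch below shows one common way such limits are enforced: a token bucket per account. The class, capacities, and refill rates are illustrative assumptions and do not reflect Fauna's internal limits or implementation.

```typescript
// A hedged sketch of per-account throughput limiting with a token bucket.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,       // maximum burst size
    private readonly refillPerSecond: number // sustained ops/sec
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true if the request is allowed, false if it should receive a 429.
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: an account limited to 100 ops/sec with a burst allowance of 200.
const limiter = new TokenBucket(200, 100);
const status = limiter.tryAcquire() ? 200 : 429;
console.log(status);
```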
Subsequent large waves of traffic started on January 9th, 10th, 11th, and 13th UTC – for specific impact periods, please refer to our status page. Most events lasted for around an hour, and in each case requests came from hundreds of IP addresses. Later waves included more than an order of magnitude more requests than earlier ones.
Engineers were engaged around the clock to mitigate the impact of each wave, taking steps that included:
At the same time, engineers worked with impacted customers to help them modify their architectures to minimize the impact of the attacks on Fauna and on other downstream dependencies.
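One common client-side pattern in this situation is to retry throttled (429) requests with exponential backoff and jitter so that bursts are absorbed rather than amplified. The sketch below is a generic example, not guidance specific to Fauna; the retry budget and delays are assumptions.

```typescript
// Generic client-side retry of 429 responses with exponential backoff and
// full jitter. Assumes a runtime with a global fetch (browsers, Node 18+).

async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxAttempts = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) {
      return res; // success or a non-retryable error
    }
    // Backoff with full jitter: up to 100ms, 200ms, 400ms, ... capped at 5s.
    const baseMs = Math.min(100 * 2 ** attempt, 5000);
    await new Promise((resolve) => setTimeout(resolve, Math.random() * baseMs));
  }
  throw new Error(`Request to ${url} still throttled after ${maxAttempts} attempts`);
}
```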
These changes have collectively improved our security and availability posture, but there is more work to do. Over the next few weeks and months, we will aggressively prioritize additional improvements including:
We prioritize the security, availability, and performance of our service above everything else and apologize for any inconvenience that this event caused you. If you have further questions or comments, or would like to discuss the event with us in greater detail, please reach out to our support team.