Timeouts in the US Region Group

Incident Report for Fauna

Postmortem

On March 3, 2023, between 01:00 and 06:00 UTC, users in the US Region Group experienced periods of increased latency and API timeouts. Users in all Region Groups also experienced intermittent failures when logging into the Dashboard during the same periods.

Two events were taking place in the US Region Group when the first impact period started:

As part of regular maintenance, one of the regions in the US Region Group was being rebuilt with updated hardware and a full data sync between regions had started on March 2 at 22:00.
The service had been experiencing bursts of extremely read-heavy traffic with an unusual shape that was causing slightly elevated latency across the US Region Group.

At 01:00, nodes in each region in the US Region Group started running out of memory. An operator was paged immediately and began to follow the runbook for this situation by manually restarting affected nodes, which temporarily brought the Region Group back to a healthy state. Minutes later, nodes began running out of memory again, so the operator shut down the newly rebuilt region and used operational tooling to throttle customer traffic. At 01:40, after another manual restart, the Region Group was healthy and throttles were gradually removed. Multiple engineers and leaders were engaged by this point; the team believed that the root cause was related to the sudden bursts of extremely read-heavy traffic.

At 02:40, nodes in the Region Group started running out of memory again. The operator followed the same runbook, but the manual restarts didn’t restore the service. The operator then reinstated customer throttling and retried the restarts, which also didn’t work. JVM profiles indicated that the source of memory allocation was incoming requests, so additional capacity was added to the region, but that did not improve the situation. Eventually, the newly rebuilt region was removed from the Region Group and the situation resolved, as it had during the first period. Full availability was reached at 04:00. The root cause of both impact periods was now thought to be related to the data transfers in conjunction with a specific type of traffic pattern. Only a subset of the nodes from the rebuilt region was brought back, in order to let them finish their data transfer without impacting the availability of the service.

At 05:45, availability dropped again. The operator immediately shut down the remaining nodes from the newly rebuilt region and availability was restored minutes later.

Fauna uses Optimistic Concurrency Control (OCC) in order to avoid the use of locks. Once transactions are committed to the transaction log, data nodes apply committed transactions for the subset of data that they own. Before applying a transaction, nodes can request data from other data nodes as needed to ensure that writes do not contend with other simultaneous transactions. If nodes in the local region are unable to return necessary data, requests are directed to other regions in the Region Group.
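The general shape of an OCC check can be illustrated with a minimal sketch. This is a generic, single-process toy and not Fauna's implementation: the `Store`, `Transaction`, and `apply_if_valid` names are hypothetical, and the per-key version counter is an assumption made for illustration. What it shows is that validating a transaction costs one extra read per key the transaction depended on.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy versioned key-value store (hypothetical): key -> (value, version)."""
    data: dict = field(default_factory=dict)

    def read(self, key):
        # Returns (value, version); a key that was never written has version 0.
        return self.data.get(key, (None, 0))


@dataclass
class Transaction:
    """Records the versions it read and the writes it wants to apply."""
    read_versions: dict = field(default_factory=dict)
    writes: dict = field(default_factory=dict)

    def read(self, store, key):
        value, version = store.read(key)
        self.read_versions[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value


def apply_if_valid(store, txn):
    """Apply txn only if every key it read is unchanged (the OCC check)."""
    for key, seen_version in txn.read_versions.items():
        _, current_version = store.read(key)  # one extra read per key checked
        if current_version != seen_version:
            return False  # conflict: the caller would retry the transaction
    for key, value in txn.writes.items():
        _, version = store.read(key)
        store.data[key] = (value, version + 1)
    return True
```

In this sketch, the cost of validation grows with the size of the read set, which is one way to see why a burst of read-heavy transactions can multiply the number of reads a node must serve.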

In this case, the combination of the region rebuild and extremely read-heavy traffic patterns created a perfect storm in which the system could not keep up with the number of reads associated with OCC checks. The majority of reads in the rebuilt region were routed to nodes in other regions, amplifying the problem and making the read-heavy traffic much more impactful for all regions in the Region Group. Once the most impactful traffic was identified and throttled, the rebuilt region was added back to the Region Group and the data sync completed, significantly reducing the risk of a similar issue occurring.

We know that downtime is unacceptable and we are prioritizing the following work in order to improve our operational posture and prevent issues like this one moving forward:

Augmenting existing service protection mechanisms with more aggressive limits based on read, write, and compute operations performed.
Adding safeguards in front of the transaction log that will prevent too many read-heavy transactions from being scheduled at the same time.
Optimizing how data replicas perform OCC so that read-heavy transactions cannot exhaust available memory.
Augmenting our observability data with new metrics on requests that trigger a 5xx response to allow operators to understand the cause of complex issues like this one more rapidly.
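As an illustration of the second item, a safeguard that limits concurrently scheduled read-heavy transactions could take the form of a simple admission budget. The sketch below is a hypothetical, single-threaded example and not a description of Fauna's design: the `ReadBudget` class, its method names, and the idea of an up-front read estimate per transaction are all assumptions.

```python
class ReadBudget:
    """Admits transactions while the sum of their estimated reads fits a cap."""

    def __init__(self, max_in_flight_reads):
        self.max_in_flight_reads = max_in_flight_reads
        self.in_flight_reads = 0

    def try_admit(self, estimated_reads):
        # Reject if admitting this transaction would exceed the budget;
        # the caller is expected to queue it or retry later.
        if self.in_flight_reads + estimated_reads > self.max_in_flight_reads:
            return False
        self.in_flight_reads += estimated_reads
        return True

    def release(self, estimated_reads):
        # Called once the transaction has been applied or aborted.
        self.in_flight_reads = max(0, self.in_flight_reads - estimated_reads)
```

A budget of this kind bounds the total read work in flight rather than the number of transactions, so many small transactions can proceed while a few very large ones wait. Note that in this sketch a single transaction whose estimate exceeds the cap is never admitted, so a real system would need a separate policy for that case.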

We prioritize the security, availability, and performance of our service above everything else and apologize for any inconvenience that this event caused you. We will aggressively prioritize the work to ensure that we hit our availability targets in the future. If you have further questions/comments about the event or want to speak with us personally to discuss the event in greater detail, please reach out to support@fauna.com.

Posted Mar 17, 2023 - 09:05 PDT

Resolved

On March 3, between 2:36 AM and 4:03 AM UTC, we experienced increased API timeouts in the US Region Group. The issue has been resolved and the service is operating normally.

Posted Mar 02, 2023 - 20:32 PST

Investigating

We are investigating API timeouts in the US Region Group.

Posted Mar 02, 2023 - 18:49 PST

This incident affected: US Region Group (FQL API).