On March 3, 2023, between 01:00 and 06:00 UTC, users in the US Region Group experienced periods of increased latency and API timeouts. Users in all Region Groups also experienced intermittent failures when logging into the Dashboard during the same periods.
Two events were taking place in the US Region Group when the first impact period started:
At 01:00, nodes in each region in the US Region Group started running out of memory. An operator was paged immediately and began following the runbook for this situation by manually restarting affected nodes, which temporarily brought the Region Group back to a healthy state. Minutes later, nodes began running out of memory again, so the operator shut down the newly rebuilt region and used operational tooling to throttle customer traffic. At 01:40, after another manual restart, the Region Group was healthy and throttles were gradually removed. Multiple engineers and leaders were engaged by this point; the team believed that the root cause was related to the sudden bursts of extremely read-heavy traffic.
At 02:40, nodes in the Region Group started running out of memory again. The operator followed the same runbook, but the manual restarts did not restore the service. The operator then reinstated customer throttling and retried the restarts, which also failed. JVM profiles indicated that incoming requests were the source of the memory allocation, so additional capacity was added to the region, but that did not improve the situation. Eventually, the newly rebuilt region was removed from the Region Group and the situation resolved as it had during the first period. Full availability was restored at 04:00. The root cause of both impact periods was now thought to be related to the data transfers in conjunction with a specific type of traffic pattern. Only a subset of the nodes from the rebuilt region were brought back, in order to let them finish their data transfer without impacting the availability of the service.
At 05:45, availability dropped again. The operator immediately shut down the remaining nodes from the newly rebuilt region and availability was restored minutes later.
Fauna uses Optimistic Concurrency Control (OCC) in order to avoid the use of locks. Once transactions are committed to the transaction log, data nodes apply committed transactions for the subset of data that they own. Before applying a transaction, nodes can request data from other data nodes as needed to ensure that writes do not contend with other simultaneous transactions. If nodes in the local region are unable to return necessary data, requests are directed to other regions in the Region Group.
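To illustrate the mechanism described above, here is a minimal sketch of an OCC check over a versioned key-value store. This is not Fauna's implementation or API; the `Store`, `commit`, and `Conflict` names are hypothetical, and the sketch only shows the core idea that a transaction commits only if every value it read is still at the version it observed.

```python
# Minimal sketch of Optimistic Concurrency Control (OCC) over a
# versioned key-value store. All names here are illustrative,
# not Fauna's actual API.

class Conflict(Exception):
    """Raised when another transaction modified data we read."""

class Store:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, reads, writes):
        # OCC check: every key read during the transaction must still
        # be at the version that was observed; otherwise a concurrent
        # transaction committed first and this one must abort.
        for key, seen_version in reads.items():
            _, current_version = self.read(key)
            if current_version != seen_version:
                raise Conflict(key)
        # No contention detected: apply the writes, bumping versions.
        for key, value in writes.items():
            _, version = self.read(key)
            self.data[key] = (value, version + 1)

store = Store()
store.commit(reads={}, writes={"x": 1})        # blind write succeeds
_, v = store.read("x")
store.commit(reads={"x": v}, writes={"x": 2})  # version matches: commits
try:
    # Same stale version again: a concurrent commit already advanced it.
    store.commit(reads={"x": v}, writes={"x": 3})
except Conflict:
    print("transaction aborted: contention on 'x'")
```

Note that every commit requires re-reading the observed keys to validate versions, which is why contention checks generate read traffic of their own.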
In this case, the combination of the region rebuild and extremely read-heavy traffic patterns contributed to a perfect storm in which the system could not keep up with the number of reads associated with OCC checks. The majority of reads in the rebuilt region were routed to nodes in other regions, amplifying the problem and making the read-heavy traffic much more impactful for all regions in the Region Group. Once the most impactful traffic was identified and throttled, the rebuilt region was added back to the Region Group and the data sync completed, significantly reducing the risk of a similar issue occurring.
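The amplification effect can be sketched as follows. This is an illustrative model, not Fauna's routing code: the point is simply that a freshly rebuilt region holds little local data, so most of its reads become remote reads that land on every other region's nodes.

```python
# Illustrative sketch (not Fauna's implementation) of why a rebuilt
# region amplifies read load: reads the local region cannot serve
# fall back to the other regions in the Region Group.

def route_read(key, local_region, remote_regions):
    """Serve a read locally if possible; otherwise try peer regions."""
    if key in local_region:
        return local_region[key], "local"
    # Local miss (e.g. the region is still syncing after a rebuild):
    # the read is retried against peer regions until one answers, so
    # every local miss adds load to the rest of the Region Group.
    for name, region in remote_regions.items():
        if key in region:
            return region[name] if False else (region[key], name)
    raise KeyError(key)

rebuilt = {}  # freshly rebuilt region: data transfer not yet complete
peers = {"us-east": {"k": 1}, "us-west": {"k": 1}}
value, served_by = route_read("k", rebuilt, peers)
print(f"read served by {served_by}")  # every read becomes a remote read
```

Under this model, throttling the heaviest read traffic while the rebuilt region finishes its data transfer reduces the remote-read fan-out, which matches the remediation described above.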
We know that downtime is unacceptable, and we are prioritizing the following work to improve our operational posture and prevent issues like this one going forward:
We prioritize the security, availability, and performance of our service above everything else and apologize for any inconvenience this event caused you. We will aggressively prioritize the work needed to ensure that we hit our availability targets in the future. If you have further questions or comments about the event, or want to speak with us personally to discuss it in greater detail, please reach out to email@example.com.