On Nov 17 at approximately 11:00 UTC, a customer contacted Fauna Cloud Support to report long index build times. The support engineer engaged our engineering on-call, who began to investigate. The on-call quickly realized that the asynchronous task execution system that powers index builds was backed up, and determined that the root cause was a combination of two separate issues. First, a code change intended to optimize the throughput of index rebuilds in our task execution system had been deployed, but that change wasn’t behaving as expected. Second, a single customer had initiated a large number of index builds at one time. The interaction between these two issues left the system in a state where tasks were not being executed and a large number of tasks were waiting in the queue.
The problematic change was rolled back, which allowed task execution to restart. The on-call then used operational tooling to manually pause lower-priority tasks so that urgent index builds could complete first. The task execution system worked through the queue over the course of the next day, and index build times returned to normal. The escalation on-call, who had also been engaged and was tracking the issue, marked it resolved on Nov 19 at 03:15 UTC.
We’re taking the following steps to improve:
We prioritize the availability, security, and performance of our service above everything else, and we apologize for any inconvenience that this event caused you. If you have further questions or comments about the event, or suggestions on additional steps that we could have taken to provide a better customer experience during it, please reach out to email@example.com.