Delayed Index Builds
Incident Report for Fauna
Postmortem

On Nov 17 at approximately 11:00 UTC, a customer contacted Fauna Cloud Support to report long index build times. The support engineer engaged our engineering on-call, who began to investigate. The on-call quickly found that the asynchronous task execution system that powers index builds was backed up, and determined that the root cause was a combination of two separate issues. First, a code change intended to optimize the throughput of index rebuilds in our task execution system had been deployed, but it wasn’t behaving as expected. Second, a single customer had initiated a large number of index builds at one time. Together, these issues left the system in a state where tasks were not being executed and a large number of tasks were waiting in the queue.

The problematic change was rolled back, which allowed task execution to restart. The on-call then used operational tooling to manually pause lower-priority tasks so that urgent index builds could complete. The task execution system worked through the queue over the course of the next day, and index build times returned to normal. The escalation on-call, who had also been engaged and was tracking the issue, marked it resolved on Nov 19 at 03:15 UTC.

We’re taking the following steps to improve:

  1. Increasing capacity in each replica to improve task execution throughput.
  2. Improving monitoring and alerting by adding new alarms on the number of tasks in the queue and task execution latency.
  3. Adding a new mechanism to round-robin tasks between tenants to increase execution fairness in the task execution system and make it more difficult for a single customer to brown out the system.
  4. Improving the performance of index builds.
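
To illustrate the idea behind step 3, here is a minimal sketch of round-robin scheduling between tenants. It is not Fauna's implementation: the class name `RoundRobinTaskQueue`, its methods, and the tenant and task names are all hypothetical, and a real task execution system would also need persistence, priorities, and concurrency control. The sketch only shows why one tenant submitting many tasks cannot starve another.

```python
from collections import OrderedDict, deque


class RoundRobinTaskQueue:
    """Hypothetical per-tenant queue that serves tenants in rotation."""

    def __init__(self):
        # Maps tenant_id -> FIFO queue of that tenant's pending tasks.
        # The dict order is the order in which tenants will be served.
        self._queues = OrderedDict()

    def submit(self, tenant_id, task):
        # A tenant with no pending work joins at the back of the rotation.
        self._queues.setdefault(tenant_id, deque()).append(task)

    def next_task(self):
        # Returns (tenant_id, task), or None when nothing is pending.
        if not self._queues:
            return None
        # Take the tenant at the front of the rotation and run one task.
        tenant_id, queue = self._queues.popitem(last=False)
        task = queue.popleft()
        if queue:
            # The tenant still has work, so it moves to the back.
            self._queues[tenant_id] = queue
        return tenant_id, task

    def __len__(self):
        return sum(len(q) for q in self._queues.values())


# One tenant submits three index builds, then another submits one.
queue = RoundRobinTaskQueue()
for n in range(3):
    queue.submit("tenant-a", f"build-{n}")
queue.submit("tenant-b", "build-0")

while (item := queue.next_task()) is not None:
    print(item)
```

In this sketch, tenant-b's single build runs second rather than waiting behind all three of tenant-a's builds, which is the fairness property the step describes.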

We prioritize the availability, security, and performance of our service above everything else, and we apologize for any inconvenience that this event caused you. If you have further questions or comments about the event, or suggestions on additional steps that we could have taken to provide a better customer experience during it, please reach out to support@fauna.com.

Posted Nov 30, 2020 - 14:54 PST

Resolved
This incident has been resolved.
Posted Nov 18, 2020 - 19:16 PST
Investigating
We are aware of an issue that has been causing index builds to take longer than normal since 13:00 UTC. We have identified the root cause and are taking steps to remedy the issue.
Posted Nov 17, 2020 - 11:22 PST
This incident affected: Global Region Group (FQL API).