On May 24, 2020 at 0222 UTC an incident was identified. An initial fix was designed and deployed within 3 minutes. Monitoring continued until 0334 UTC. The incident was resolved at 0543 UTC on May 30, 2020.
Internal SLO monitoring identified that both query and write-op latency were elevated above acceptable levels. The on-call engineer began investigating and restarted the affected nodes in the cluster. This was initially successful, but as the cluster stabilised, read permits were lost on additional nodes and the symptoms recurred.
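Illustratively, the kind of SLO latency check described above can be sketched as follows. The thresholds, metric names, and percentile choice here are assumptions for the sake of example, not Fauna's actual monitoring configuration:

```python
# Hypothetical sketch of a latency SLO check; thresholds and metric
# names are illustrative, not Fauna's actual configuration.
from statistics import quantiles

# Assumed acceptable p99 latency per operation type, in milliseconds.
SLO_P99_MS = {"query": 250.0, "write": 500.0}

def p99(samples_ms):
    """Return the 99th percentile of a list of latency samples (ms)."""
    return quantiles(samples_ms, n=100)[98]

def breached_slos(latency_samples):
    """Return the operation types whose p99 latency exceeds the SLO."""
    return [op for op, samples in latency_samples.items()
            if p99(samples) > SLO_P99_MS[op]]

# Example: a tail of slow writes trips the write SLO but not the query SLO.
samples = {
    "query": [20.0] * 99 + [100.0],        # p99 well under 250 ms
    "write": [30.0] * 90 + [900.0] * 10,   # p99 above 500 ms
}
```

A real monitoring system would compute these percentiles over rolling windows and page on-call when a breach persists, but the core comparison is the same.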
Next, asynchronous index build tasks were paused. It took multiple attempts to pause the right combination of tasks. Once paused, the tasks could be restarted individually and allowed to complete successfully.
While work continues on optimising the index build pipeline, much of the delay and impact in this incident stemmed from escalation paths and processes that slowed updates to the community and the status page. A more timely response would have eased some of the community concern and minimised the impact.
To that end, Fauna is reworking its SSO solution to ensure that all engineers have access to our status tool. We are also revisiting our escalation path and our weekend notification process.
This work has been completed and will be validated during the next incident or practice incident.