On May 24, 2020 at 0222 UTC an incident was identified. An initial fix was designed and deployed within 3 minutes. Monitoring continued until 0334 UTC. The incident was resolved at 0543 UTC on May 30, 2020.
Internal SLO monitoring identified that both query and write-op latency were elevated above acceptable levels. The on-call engineer began investigating and restarted the affected nodes in the cluster. This was initially successful, but as the cluster stabilised, read permits were lost on additional nodes and the symptoms recurred.
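Illustratively, the kind of SLO latency check described above can be sketched as follows. The thresholds, metric names, and percentile choice here are assumptions for the sake of example, not Fauna's actual monitoring configuration:

```python
# Hypothetical sketch of a latency SLO check; thresholds and metric
# names are illustrative, not Fauna's actual configuration.
from statistics import quantiles

# Assumed acceptable p99 latency per operation type, in milliseconds.
SLO_P99_MS = {"query": 250.0, "write": 500.0}

def p99(samples_ms):
    """Return the 99th percentile of a list of latency samples (ms)."""
    return quantiles(samples_ms, n=100)[98]

def breached_slos(latency_samples):
    """Return the operation types whose p99 latency exceeds the SLO."""
    return [op for op, samples in latency_samples.items()
            if p99(samples) > SLO_P99_MS[op]]

# Example: a tail of slow writes trips the write SLO but not the query SLO.
samples = {
    "query": [20.0] * 99 + [100.0],        # p99 well under 250 ms
    "write": [30.0] * 90 + [900.0] * 10,   # p99 above 500 ms
}
```

A real monitoring system would compute these percentiles over rolling windows and page on-call when a breach persists, but the core comparison is the same.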
Next, asynchronous index build tasks were paused. It took multiple attempts to pause the right combination of tasks. Once paused, the tasks could be restarted individually and allowed to complete successfully.
While work continues on optimising the index build pipeline, much of the delay and impact in this incident stemmed from escalation paths and processes that slowed updates to the community and the status page. A more timely response would have eased some of the community concern and minimised the impact.
To that end, Fauna is reworking its SSO solution to ensure that all engineers have access to our status tool. We are also revisiting our escalation path and our weekend notification process.
This work has been completed and will be validated during the next incident or practice incident.