We're experiencing a major service disruption
Incident Report for Fauna
Postmortem

Incident Summary

On May 29, 2020 at 20:08 UTC an incident was identified. An initial fix was designed and deployed within 8 minutes. Unfortunately, this fix proved insufficient and degradation of the write pipeline continued. At 21:39 UTC an additional fix was applied; inbound volume was managed, committed, and returned to normal levels. The incident was fully resolved at 00:19 UTC on May 30, 2020.

Root Cause Analysis

During troubleshooting, it was determined that a single binary log file for one log segment was approximately 7 times its normal size. Two of the three nodes had replicated these log files. The initial effort focused on moving these logs aside and restarting the affected node, with the expectation that it would pull log files from the other, healthy nodes. This did not happen, and manual intervention was necessary. The problem cascaded throughout the cluster and required multiple node restarts as the oversized transaction was handled manually.
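
As an illustration only, the sketch below shows one way a monitoring check could flag a log segment that has grown to several times the size of its peers, which was the symptom observed here. It is a hypothetical example, not Fauna's internal tooling; the directory path, file pattern, and threshold are all assumptions.

    # Hypothetical check: flag binary log segments far larger than their peers
    # (the segment in this incident was roughly 7x normal size). The directory
    # path, "*.log" pattern, and 5x threshold are illustrative assumptions.
    from pathlib import Path
    from statistics import median

    def oversized_segments(log_dir: str, factor: float = 5.0):
        sizes = {p: p.stat().st_size for p in Path(log_dir).glob("*.log")}
        if len(sizes) < 2:
            return []
        typical = median(sizes.values())
        return [(p, s) for p, s in sizes.items() if s > factor * typical]

    for path, size in oversized_segments("/path/to/log/segments"):
        print(f"WARNING: {path} is {size} bytes, far above the typical segment size")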

Lessons Learned & Corrective Actions

It was determined that Fauna’s index rebuild method created a single transaction entry consisting of more than 3.5 million writes for a single tenant database.

During the incident itself, a fix that limits the overall size of transactions (applied previously in a narrower scope, now expanded) was built and deployed. Internal QoS was disabled via configuration, and internal work to ensure scheduler persistence was deployed.
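
To make the corrective idea concrete, here is a minimal sketch, under stated assumptions, of splitting a bulk operation such as an index rebuild into size-bounded transactions rather than committing it as a single multi-million-write entry. The names used (submit_transaction, MAX_WRITES_PER_TXN) and the limit shown are hypothetical and do not reflect Fauna's internals or its actual cap.

    # Hypothetical sketch: cap transaction size by splitting a bulk operation
    # (such as an index rebuild) into bounded batches. Names and the limit
    # below are assumptions for illustration, not Fauna internals.
    from typing import Callable, Iterable, List

    MAX_WRITES_PER_TXN = 10_000  # illustrative cap, not Fauna's actual limit

    def apply_in_batches(writes: Iterable[dict],
                         submit_transaction: Callable[[List[dict]], None],
                         max_writes: int = MAX_WRITES_PER_TXN) -> None:
        batch: List[dict] = []
        for write in writes:
            batch.append(write)
            if len(batch) >= max_writes:
                submit_transaction(batch)  # each commit stays within the cap
                batch = []
        if batch:
            submit_transaction(batch)      # flush the remainder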

The engineering team has taken on several discrete action items focused on minimizing potential downtime should a similar issue ever manifest in the future. In particular, these focus on automating the manual activities required to restore the cluster and on hardening the fixes applied to resolve the incident as it initially manifested.

Posted Jun 11, 2020 - 14:45 PDT

Resolved
This incident has been resolved.
Posted May 29, 2020 - 17:19 PDT
Monitoring
A cause has been identified and an initial fix has been applied. The Fauna team is actively monitoring the situation as volume continues to ramp to normal levels.
Posted May 29, 2020 - 14:39 PDT
Update
The issue has been identified. A malformed transaction has resulted in degradation throughout the write pipeline. There is no expectation of loss of user data. Remediation for the issue has been identified and is underway.
Posted May 29, 2020 - 13:55 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted May 29, 2020 - 13:16 PDT
Investigating
We are currently investigating this issue.
Posted May 29, 2020 - 13:08 PDT
This incident affected: Global Region Group (FQL API).