Reduced availability in all regions

Incident Report for Fauna

Postmortem

On Jan 18, 2021 at 07:00 UTC, the leader of the Raft cluster in one of Fauna’s log partitions became unable to write transactions, and the partition was unable to recover. To preserve determinism, the Fauna cluster stopped processing transactions and began returning 503s for requests to all three regions in Fauna’s cloud production environment. A canary test that runs regularly against production detected the issue immediately, and the database team’s on-call was paged. The on-call quickly diagnosed the issue and initiated a rolling restart of production nodes, which triggered a new leader election. By 07:24 UTC, service was restored across all regions.
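For readers unfamiliar with canary testing, the following is a minimal sketch of how a probe might turn a run of 5xx responses, such as the 503s seen here, into a page. It is an illustration only, not Fauna’s canary: the function names, the three-failure threshold, and the rule of paging on consecutive failures are all assumptions.

```python
from typing import Sequence


def is_unavailable(status: int) -> bool:
    """Treat any 5xx HTTP status (for example 503) as a failed probe."""
    return 500 <= status <= 599


def should_page(recent_statuses: Sequence[int], consecutive_failures: int = 3) -> bool:
    """Return True when the most recent probes all failed.

    Hypothetical rule: page only after `consecutive_failures` failed probes
    in a row, so that a single transient error does not wake anyone up.
    """
    if consecutive_failures <= 0:
        raise ValueError("consecutive_failures must be positive")
    if len(recent_statuses) < consecutive_failures:
        return False
    tail = recent_statuses[-consecutive_failures:]
    return all(is_unavailable(status) for status in tail)


# Example: one healthy probe followed by three 503s triggers a page.
print(should_page([200, 503, 503, 503]))  # True
print(should_page([503, 200, 503]))       # False
```

Requiring several consecutive failures trades a little detection latency for fewer false pages; a real canary would tune that threshold against its probe interval.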

We know that downtime is unacceptable, and we are prioritizing work to improve. Specifically, to better understand this edge case and mitigate similar classes of issues in the future, we’re taking the following steps immediately:

Disabling a quiescence feature in our Raft implementation. The feature reduces the frequency of heartbeats between Raft nodes but appears to have contributed to the log partition failure.
Adding more granular logging/tracing to our Raft implementation.
Augmenting our unit testing of Raft and surrounding network code.
Creating a long-lived environment for chaos testing where we will simulate network and hardware failures.
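
This report does not describe how quiescence works inside Fauna’s Raft implementation. As a purely illustrative sketch of why suppressing heartbeats can interact badly with failure detection, assume one common design: a follower calls an election when it has not heard from the leader within an election timeout, and a quiesced follower stops running that timer. The class, its fields, and its methods below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    """Toy model of a Raft follower's failure detector (hypothetical)."""

    election_timeout: float       # seconds of silence tolerated before an election
    last_heartbeat: float = 0.0   # time of the last message from the leader
    quiesced: bool = False        # True while heartbeats are deliberately suppressed

    def on_heartbeat(self, now: float) -> None:
        # A regular heartbeat resets the timer and wakes the follower.
        self.last_heartbeat = now
        self.quiesced = False

    def on_quiesce(self, now: float) -> None:
        # Assumed behavior: the leader tells an idle follower to stop
        # expecting heartbeats, so silence is no longer treated as failure.
        self.last_heartbeat = now
        self.quiesced = True

    def should_start_election(self, now: float) -> bool:
        if self.quiesced:
            return False  # the timer is paused, so a stalled leader goes unnoticed
        return now - self.last_heartbeat > self.election_timeout


active = Follower(election_timeout=1.0)
active.on_heartbeat(now=0.0)
print(active.should_start_election(now=5.0))  # True: silence triggers an election

idle = Follower(election_timeout=1.0)
idle.on_quiesce(now=0.0)
print(idle.should_start_election(now=5.0))    # False: silence is expected
```

In this toy model, a leader that stalls while its followers are quiesced is never challenged, which is one way a partition could fail to recover on its own. Whether this matches the actual failure is not stated in the report.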

We prioritize the availability, security, and performance of our service above everything else and apologize for any inconvenience that this event caused you. If you have further questions/comments about the event or suggestions on additional steps that we could have taken to provide a better customer experience during the event, please reach out to support@fauna.com.

Posted Jan 23, 2021 - 09:18 PST

Resolved

On January 18, between 07:00 and 07:24 UTC, we experienced reduced availability in all regions. The issue has been resolved and the service is operating normally.

Posted Jan 18, 2021 - 02:34 PST

Investigating

We are aware of an issue that caused reduced availability in all regions between 07:00 and 07:24 UTC. We have mitigated the issue and are investigating the root cause.

Posted Jan 18, 2021 - 00:14 PST

This incident affected: Global Region Group (FQL API).