On Jan 18 at 7:00 UTC, the leader of the Raft cluster in one of Fauna’s log partitions became unable to write transactions, and the partition was unable to recover. To preserve determinism, the Fauna cluster stopped processing transactions and started returning 503s for requests to all three regions in Fauna’s cloud production environment. The issue was detected immediately by a canary test that runs regularly against production, and the database team’s on-call engineer was paged. The on-call engineer quickly diagnosed the issue and initiated a rolling restart of production nodes, which triggered a new leader election. By 7:24 UTC, service was restored across all regions.
We know that downtime is unacceptable, and we are prioritizing work to prevent a recurrence. Specifically, to better understand the edge case and mitigate similar issues in the future, we’re taking the following steps immediately:
We prioritize the availability, security, and performance of our service above everything else, and we apologize for any inconvenience this event caused you. If you have further questions or comments about the event, or suggestions for additional steps we could have taken to provide a better customer experience, please reach out to email@example.com.