On August 2, 2022, between 16:55 and 19:05 UTC, users in the Classic Region Group experienced API timeouts, and users in all Region Groups were unable to log in to the Dashboard.
Internally, Fauna is separated into three logical layers: compute coordination, transaction log, and data storage. Transaction log nodes use Raft to reach quorum between regions on the set of transactions that are committed.
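The commit rule this implies can be sketched in a few lines. This is an illustrative sketch of the standard Raft majority rule, not Fauna's actual implementation; the function names are hypothetical:

```python
# Illustrative sketch of Raft-style commitment (names are hypothetical,
# not Fauna's API): a transaction log entry is committed once a strict
# majority (quorum) of log replicas across regions acknowledge it.

def quorum_size(replica_count: int) -> int:
    """Raft requires a strict majority of replicas to agree."""
    return replica_count // 2 + 1

def is_committed(acks: int, replica_count: int) -> bool:
    """An entry is committed once a quorum has acknowledged it."""
    return acks >= quorum_size(replica_count)

# With 5 log replicas, 3 acknowledgements commit an entry,
# so the cluster tolerates 2 unavailable replicas.
```

The majority requirement is why a cluster of 2n+1 log nodes can tolerate n failures, and why losing too many members of a log segment stalls commitment entirely.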
On July 21 and August 1, AWS notified Fauna of hardware failures affecting two different EC2 hosts that function as both log and storage nodes in the Classic Region Group. Storage nodes are stateful, which means that the system must rebalance how data is partitioned across nodes whenever cluster topology changes. To keep the rebalancing process under close supervision, node additions and removals are typically handled by an operator who first describes the steps they will follow in a Live Site Change Plan (LSCP). The plan is then carefully reviewed by at least two other team members before it is executed in production.
During the execution of the LSCP to replace the problematic hosts, a host was terminated unintentionally, so the operator paused the remaining repair actions in the cluster to allow data rebalancing to catch up.
The operator eventually resumed the LSCP and brought the first new node online, expecting it to sync transaction log data from the other nodes and begin functioning normally. However, the transaction log was not being truncated and had grown very large while waiting for one of the hosts to come back up. As a result, the live members of the log segment could not keep up with the I/O requirements of both processing live traffic and syncing the transaction backlog to the new node, which are currently handled by the same process. With the log unable to make progress, the service began failing all incoming requests. Further, because Fauna customer metadata is itself stored in Fauna in the Classic Region Group, customers were unable to log in to the Dashboard regardless of which Region Group their data resided in, although APIs for data access remained available in both the US and EU Region Groups.
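The truncation behavior described above follows from a common invariant in replicated logs: the log can only be discarded up to the lowest position that every segment member has durably applied. Here is a minimal sketch of that invariant, with hypothetical names and a simplified model rather than Fauna's implementation:

```python
# Simplified model (not Fauna's implementation): a replicated log may only
# be truncated up to the slowest member's applied position. A member that
# is down or syncing from scratch pins the truncation point at its
# position, so the log retained by live members keeps growing.

def truncation_point(member_positions: dict[str, int]) -> int:
    """Highest log position that is safe to truncate: the minimum
    position durably applied across all segment members."""
    return min(member_positions.values())

# Healthy segment: all members are close together, so almost the
# entire log can be truncated.
healthy = {"node-a": 10_500, "node-b": 10_498, "node-c": 10_495}

# A replacement node starting from position 0 pins truncation at 0:
# live members must retain the full backlog and stream it to the new
# node while also serving live traffic.
resyncing = {"node-a": 10_500, "node-b": 10_498, "node-c": 0}
```

Under this model, the longer the new node takes to catch up, the larger the backlog the live members must hold and serve, which is the I/O amplification that stalled the segment.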
Several Fauna operators were already engaged due to alarms on internal metrics when the Classic Region Group started failing to serve requests at 16:55 UTC. The team first attempted to reinitialize the failing log segment, but by 17:25 UTC it was apparent that a different approach was needed. The team then reinitialized all log segments, a process that does not result in data loss but is currently manual and non-trivial to complete. By 18:35 UTC operators were able to bring the log back into a healthy state, and storage nodes began to catch up on unapplied transactions. The team then brought compute coordinator nodes online, and service was fully restored in the Classic Region Group by 19:05 UTC.
We know that downtime is unacceptable, and we are prioritizing the following work to improve our operational posture and prevent issues like this one going forward:
We prioritize the security, availability, and performance of our service above everything else, and we apologize for any inconvenience this event caused you. We target 99.95% availability for all of our Region Groups and have achieved that target for the last several quarters; we will aggressively prioritize the work needed to ensure that we hit our availability targets in the future. If you have further questions or comments about the event, or want to speak with us personally to discuss it in greater detail, please reach out to firstname.lastname@example.org.