Timeouts in the Classic Region Group

Incident Report for Fauna

Postmortem

On August 2, 2022, between 16:55 and 19:05 UTC, users in the Classic Region Group experienced API timeouts and users in all Region Groups were unable to log in to the Dashboard.

Internally, Fauna is separated into three logical layers: compute coordination, transaction log, and data storage. Transaction log nodes use Raft to reach quorum between regions on the set of transactions that are committed.
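For readers unfamiliar with Raft, the majority rule behind "reaching quorum" can be sketched as follows. This is a generic illustration of majority quorum, not Fauna's implementation; the function names and the member counts in the usage note are hypothetical.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a log segment needs at least one voter")
    return voters // 2 + 1


def can_commit(voters: int, reachable: int) -> bool:
    """True if enough voters are reachable to commit new log entries."""
    return reachable >= quorum_size(voters)


def tolerated_failures(voters: int) -> int:
    """How many voters can be unavailable while commits still proceed."""
    return voters - quorum_size(voters)
```

Under these assumptions, a segment with three voters keeps committing with one voter down (`can_commit(3, 2)` is true) but not with two down (`can_commit(3, 1)` is false).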

On July 21 and August 1, AWS notified Fauna of hardware failures on two different EC2 hosts that function as both log and storage nodes in the Classic Region Group. Storage nodes are stateful, which means that the system needs to rebalance the way that data is partitioned across nodes when cluster topology changes. In order to monitor the rebalancing process, node additions or removals are typically handled by an operator who first describes the steps that they will follow in a Live Site Change Plan (LSCP). The plan is then carefully reviewed by at least two other team members before it is executed in production.

During the execution of the LSCP to replace the problematic hosts, a host was terminated unintentionally, so the operator paused the remaining repair actions in the cluster to let data rebalancing catch up.

The operator eventually resumed the LSCP and brought the first new node online, expecting it to sync transaction log data from other nodes and start functioning normally. However, the transaction log was not being truncated and had grown very large while waiting for one of the hosts to come back up. As a result, the live members of the log segment could not keep up with the I/O requirements of both processing live data and syncing the transaction backlog to the new node, which are currently handled by the same process. With the log unable to make progress, the service started failing to respond to all incoming requests. Further, because Fauna customer metadata is stored in Fauna in the Classic Region Group, customers became unable to log in to the Dashboard regardless of which Region Group their data resided in, although APIs for data access remained available in both the US and EU Region Groups.

Several Fauna operators were already engaged due to alarms on internal metrics when the Classic Region Group started failing to serve requests at 16:55 UTC. The team first attempted to reinitialize the log segment that was failing, but by 17:25 UTC it was apparent that a different approach was needed. The team then reinitialized all log segments, a process that does not result in data loss, but is currently manual and non-trivial to complete. By 18:35 UTC operators were able to bring the log back into a healthy state and storage nodes began to catch up on unapplied transactions. The team began to bring compute coordinator nodes online, and service was fully restored in the Classic Region Group by 19:05 UTC.
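The timestamps in the paragraph above imply how long each phase of the recovery took. The short sketch below only does that arithmetic; the milestone labels are paraphrases chosen for illustration, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on the incident date from an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2022, 8, 2, hour, minute, tzinfo=timezone.utc)


# Times are taken from the report; the labels are paraphrased.
MILESTONES = [
    ("requests start failing", utc("16:55")),
    ("single-segment reinitialization abandoned", utc("17:25")),
    ("log healthy again", utc("18:35")),
    ("service fully restored", utc("19:05")),
]


def phase_minutes(milestones):
    """Minutes elapsed between each pair of consecutive milestones."""
    return [
        int((later[1] - earlier[1]).total_seconds() // 60)
        for earlier, later in zip(milestones, milestones[1:])
    ]


def total_minutes(milestones):
    """Minutes from the first milestone to the last."""
    return sum(phase_minutes(milestones))
```

With these inputs the phases last 30, 70, and 30 minutes, for a total of 130 minutes; the manual reinitialization of all log segments accounts for the longest phase.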

We know that downtime is unacceptable and we are prioritizing the following work in order to improve our operational posture and prevent issues like this one moving forward:

Making transaction log backlog synchronization asynchronous. This would have prevented the members of the log segment from reaching a state in which they couldn’t make forward progress while a new node was joining.
Adding safeguards to our internal operator tooling. This would have prevented an operator from inadvertently terminating the wrong host.
Adding operator tools to truncate the log and/or update log topology on the fly. This would have significantly reduced the time to resolution during the event.
Exploring the feasibility of moving our internal metadata to our own Private Region Group. This would have prevented the Dashboard login issues for all customers and removed impact for customers with data in the US and EU Region Groups.
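To illustrate why decoupling backlog synchronization from live processing matters, here is a toy model of a log member with a fixed I/O budget per tick. It is a simplified sketch, not Fauna's design: it swaps in plain prioritization (live work is served first, catch-up gets the leftover) as one possible way to keep catch-up off the critical path, and every name and number in it is hypothetical.

```python
def simulate(io_budget, live_demand, backlog, ticks, live_first):
    """Return (peak live lag, backlog remaining) after `ticks` ticks.

    io_budget   -- I/O units the node can spend per tick (assumed constant)
    live_demand -- new live work arriving each tick
    backlog     -- log entries still to be shipped to the joining node
    live_first  -- if True, live work is served before backlog catch-up
    """
    live_lag = 0
    peak_lag = 0
    for _ in range(ticks):
        pending_live = live_lag + live_demand
        if live_first:
            # Decoupled: catch-up only uses what live traffic leaves over.
            live_served = min(pending_live, io_budget)
            backlog_sent = min(backlog, io_budget - live_served)
        else:
            # Coupled: catch-up is served in the same loop, ahead of live work.
            backlog_sent = min(backlog, io_budget)
            live_served = min(pending_live, io_budget - backlog_sent)
        backlog -= backlog_sent
        live_lag = pending_live - live_served
        peak_lag = max(peak_lag, live_lag)
    return peak_lag, backlog
```

With a budget of 100, live demand of 80, and a backlog of 1000 over 20 ticks, the coupled variant clears the backlog but lets live work fall 800 units behind, while the live-first variant never falls behind and still has 600 units of backlog left. The trade-off is a slower join for the new node in exchange for uninterrupted live traffic.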

We prioritize the security, availability, and performance of our service above everything else and apologize for any inconvenience that this event caused you. We target 99.95% availability for all of our Region Groups and have achieved that target for the last several quarters. We will aggressively prioritize the work to ensure that we hit our availability targets in the future. If you have further questions or comments about the event, or want to speak with us personally to discuss it in greater detail, please reach out to support@fauna.com.

Posted Aug 05, 2022 - 14:12 PDT

Resolved

On Aug 2, 2022, between 16:55 and 19:05 UTC, users in the Classic Region Group experienced API timeouts and users in all Region Groups were unable to log in to the Dashboard. All services are operating normally. We will post a detailed post-mortem on the incident.

Posted Aug 02, 2022 - 16:32 PDT

Monitoring

On Aug 2, 2022, between 16:55 and 19:05 UTC, users in the Classic Region Group experienced API timeouts and users in all Region Groups were unable to log in to the Dashboard. We have identified and addressed the root cause of this problem and the service is operating normally. We will continue to monitor the service and will post a detailed post-mortem on the incident.

Posted Aug 02, 2022 - 12:18 PDT

Update

We are in the process of applying a mitigation that we believe will resolve API timeouts in the Classic Region Group and Dashboard logins across Region Groups.

Posted Aug 02, 2022 - 11:41 PDT

Identified

We have identified the root cause of increased API timeouts in the Classic Region Group and are working towards resolution. Dashboard logins are also affected.

Posted Aug 02, 2022 - 10:19 PDT

Investigating

We are investigating API timeouts in the Classic Region Group.

Posted Aug 02, 2022 - 10:10 PDT

This incident affected: Global Region Group (FQL API) and Dashboard.