Index mismatches for some customers

Incident Report for Fauna

Postmortem

Between August 22nd and November 18th, 2021, some Fauna customers in the Classic Region Group experienced index mismatches that manifested as either documents missing from an index or index entries pointing to documents that were missing from a collection. The issue typically impacted only a single region within a region group, meaning that customers could see non-deterministic behavior depending on which region their requests were routed to.
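To make the two symptoms concrete, the following is a minimal sketch of a consistency check between a collection and an index. It is an illustration only, not Fauna's tooling: the function name `find_index_mismatches` and the in-memory layout (a dict of documents keyed by ID, and an index mapping each term to a set of document IDs) are assumptions made for this example.

```python
def find_index_mismatches(collection, index):
    """Compare a collection with an index and report both kinds of mismatch.

    collection: dict mapping document ID -> document (assumed layout)
    index:      dict mapping index term -> set of document IDs (assumed layout)
    Returns (missing_from_index, dangling_entries), each a sorted list of IDs.
    """
    # Gather every document ID that any index entry refers to.
    indexed_ids = set()
    for doc_ids in index.values():
        indexed_ids.update(doc_ids)

    stored_ids = set(collection)

    # Documents that exist in the collection but have no index entry.
    missing_from_index = sorted(stored_ids - indexed_ids)
    # Index entries that point to documents absent from the collection.
    dangling_entries = sorted(indexed_ids - stored_ids)
    return missing_from_index, dangling_entries
```

Running a check like this against each region separately would show why behavior looked non-deterministic: a region with intact data reports no mismatches, while an affected region reports one or both kinds.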

Documents and indexes are stored separately in Fauna and are kept in sync by Fauna’s internal transaction processing pipeline, which is based on Calvin. At the storage layer, both documents and indexes are stored in log-structured merge-trees (LSM trees) as sorted string tables (sstables). As sstables accumulate, the number of disk seeks required to execute a transaction increases, so a background process regularly compacts the sstables down to fewer files. Compaction is safe because live sstables are not deleted until the compaction has finished and an atomic, in-memory operation swaps the live sstables for newly compacted sstables. If the process restarts during compaction, running compactions are aborted and partially compacted sstables are cleaned up.
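The compaction scheme described above can be sketched with a toy in-memory model. This is a simplified illustration under stated assumptions, not Fauna's storage engine: the class `SSTableSet`, its methods, and the representation of an sstable as a sorted list of key/value pairs are all hypothetical.

```python
import threading


class SSTableSet:
    """Toy model of live sstables that are compacted via an atomic swap."""

    def __init__(self, tables):
        # Each "sstable" is a sorted list of (key, value) pairs.
        # Tables later in the tuple hold newer writes.
        self._live = tuple(tables)
        self._lock = threading.Lock()

    def live(self):
        return self._live

    def add(self, table):
        # A newly flushed sstable becomes the newest live table.
        with self._lock:
            self._live = self._live + (sorted(table),)

    def get(self, key):
        # Search newest first; each table consulted stands in for a disk seek.
        for table in reversed(self._live):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Snapshot the inputs. They stay live and readable while we merge.
        inputs = self._live
        merged = {}
        for table in inputs:  # oldest to newest, so newer values win
            merged.update(table)
        compacted = sorted(merged.items())

        # Only after the merge is complete do we swap, in one step.
        # Tables added during the merge are kept after the compacted one.
        with self._lock:
            self._live = (compacted,) + self._live[len(inputs):]
        return compacted
```

Because the old tables remain in the live set until the swap, an aborted compaction in this model loses nothing: the merged output is simply discarded and the inputs are still in place.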

During the impact period, a bug in Fauna’s storage engine occasionally caused live sstables to be misidentified as partially compacted sstables on process restart, which resulted in those live sstables being deleted. The bug only manifested during a process restart in the middle of a compaction when a specific race condition was hit, which made it extremely difficult to deterministically reproduce the issue and understand the root cause.
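As a generic illustration of why restart cleanup is sensitive to this kind of misidentification, the sketch below plans a cleanup from a record of which sstables are live. It does not reproduce the actual race condition, which this report does not describe in detail; the function `plan_restart_cleanup`, the manifest concept, and the file names are assumptions made for this example.

```python
def plan_restart_cleanup(files_on_disk, live_manifest):
    """Decide which sstable files to keep and which to delete after a restart.

    files_on_disk: iterable of file names found in the data directory
    live_manifest: iterable of file names recorded as live (assumed record)
    Returns (keep, delete, missing) as sorted lists of file names.
    """
    on_disk = set(files_on_disk)
    live = set(live_manifest)

    # Files recorded as live are never cleanup candidates.
    keep = sorted(on_disk & live)
    # Anything else is treated as leftover output of an aborted compaction.
    delete = sorted(on_disk - live)
    # Live files that are absent from disk indicate data loss.
    missing = sorted(live - on_disk)
    return keep, delete, missing
```

The weakness of any such scheme is that it is only as accurate as its record of live files: if a live sstable is wrongly absent from that record at restart, it lands in the delete list, which is the failure mode described above.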

Upon becoming aware of the issue on August 22nd, Fauna engineers started working around the clock to mitigate the impact, understand the root cause, and address the problem. When customers reported issues with specific indexes, engineers either manually rebuilt those indexes on behalf of customers or ran targeted region-to-region repair tooling to address those issues. On October 13th, Fauna engineers were able to observe an instance of the bug in production with enough forensics in place to understand the root cause. The next day, a fix was committed, tested, and deployed to production. No new index mismatches have occurred since that time.

After the fix was deployed, engineers began using repair tooling to copy Classic Region Group data offline, rebuild all indexes into new sstables, and load those new sstables back into the live production data set to address index issues. Because of the volume of data in the Region Group and a desire to be extremely cautious when loading repaired data, this process took several weeks. Repairs were completed on November 18th, concluding the impact of the incident for all Fauna customers.

We prioritize the availability, security, and performance of our service above all else and we recognize that data consistency and correctness are of the utmost importance to our customers, so we are taking the following steps to improve:

Enhancing our test suite that runs against all builds as part of our automated release pipeline and checks for database correctness under real world conditions.
Adding capabilities to our entropy detection mechanism, an added layer of security that monitors for data correctness issues in the event that a bug makes it through our test suite, to more quickly identify index mismatches in production.
Augmenting existing operational tooling to allow engineers to respond to index correctness issues more quickly.
Kicking off a larger body of work to refactor our storage engine, which will improve our testability, scalability, and performance.

We value your business and sincerely apologize for any inconvenience that this issue has caused you. If you have any questions about the issue, the actions that we took to mitigate it, or the steps that we’re taking to improve, please contact support@fauna.com.

Posted Dec 01, 2021 - 17:29 PST

Resolved

We have resolved the issue that was causing index mismatches for some customers in the Classic Region Group. The service is now operating normally.

Posted Oct 13, 2021 - 19:52 PDT

Identified

We have identified an issue that is causing index mismatches for some customers in the Classic Region Group and are working to resolve the issue.

Posted Oct 13, 2021 - 18:26 PDT

Investigating

We have identified an issue that is causing index mismatches for some customers in the Classic Region Group and are working to resolve the issue.

Posted Oct 13, 2021 - 16:18 PDT

This incident affected: Global Region Group (FQL API).