Between August 22nd and November 18th, 2021, some Fauna customers in the Classic Region Group experienced index mismatches that manifested as either documents missing from an index, or an index entry pointing to a document that was missing from a collection. The issue typically impacted only a single region within a region group, meaning that customers could see non-deterministic behavior depending on which region their requests were routed to.
Documents and indexes are stored separately in Fauna and are kept in sync by Fauna’s internal transaction processing pipeline, which is based on Calvin. At the storage layer, both documents and indexes are stored in log-structured merge-trees (LSM trees) as sorted string tables (sstables). As sstables accumulate, the number of disk seeks required to execute a transaction increases, so a background process regularly compacts the sstables down to fewer files. Compaction is safe because live sstables are not deleted until the compaction has finished and an atomic, in-memory operation swaps the live sstables for the newly compacted sstables. If the process restarts during compaction, running compactions are aborted and partially compacted sstables are cleaned up.
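The compaction lifecycle described above can be illustrated with a small in-memory model. This is a minimal sketch and not Fauna's storage engine: the `ToyStore` class, the `sst-N` file names, and the `.tmp` suffix used to mark unfinished compaction output are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

TMP_SUFFIX = ".tmp"  # hypothetical marker for unfinished compaction output


class ToyStore:
    """In-memory stand-in for a directory of sstables (illustrative only)."""

    def __init__(self) -> None:
        self.disk: Dict[str, Dict[str, str]] = {}  # file name -> key/value data
        self.live: List[str] = []                  # live tables, oldest first
        self._next_id = 0

    def flush(self, data: Dict[str, str]) -> str:
        """Write a new immutable sstable and make it live."""
        name = f"sst-{self._next_id}"
        self._next_id += 1
        self.disk[name] = dict(sorted(data.items()))
        self.live = self.live + [name]
        return name

    def get(self, key: str) -> Optional[str]:
        # Newer tables shadow older ones, so search newest first.
        for name in reversed(self.live):
            if key in self.disk[name]:
                return self.disk[name][key]
        return None

    def compact(self, crash_before_swap: bool = False) -> None:
        """Merge all live tables into one, then swap it in."""
        inputs = list(self.live)
        merged: Dict[str, str] = {}
        for name in inputs:  # oldest first, so newer values overwrite older ones
            merged.update(self.disk[name])
        tmp_name = f"sst-{self._next_id}{TMP_SUFFIX}"
        self._next_id += 1
        self.disk[tmp_name] = dict(sorted(merged.items()))
        if crash_before_swap:
            return  # simulate a restart in the middle of a compaction
        final_name = tmp_name[: -len(TMP_SUFFIX)]
        self.disk[final_name] = self.disk.pop(tmp_name)
        # The swap is a single assignment; readers see either the old or new set.
        self.live = [final_name] + [n for n in self.live if n not in inputs]
        # Only after the swap are the input tables deleted.
        for name in inputs:
            del self.disk[name]

    def recover(self) -> None:
        """On restart, discard partial output and reload the live set from disk."""
        for name in list(self.disk):
            if name.endswith(TMP_SUFFIX):
                del self.disk[name]
        self.live = sorted(self.disk, key=lambda n: int(n.split("-")[1]))
```

In this model, safety depends entirely on `recover` classifying files correctly: if a live table were ever mistaken for partial output, the same cleanup step would delete good data.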
During the impact period, a bug in Fauna’s storage engine occasionally caused live sstables to be misidentified as partially compacted sstables on process restart, which resulted in those live sstables being deleted. The bug only manifested during a process restart in the middle of a compaction when a specific race condition was hit, which made it extremely difficult to deterministically reproduce the issue and understand the root cause.
Upon becoming aware of the issue on August 22nd, Fauna engineers started working around the clock to mitigate the impact, understand the root cause, and address the problem. When customers reported issues with specific indexes, engineers either manually rebuilt those indexes on behalf of customers or ran targeted region-to-region repair tooling to address those issues. On October 13th, Fauna engineers were able to observe an instance of the bug in production with enough forensics in place to understand the root cause. The next day, a fix was committed, tested, and deployed to production. No new index mismatches have occurred since that time.
After the fix was deployed, engineers began using repair tooling to copy Classic Region Group data offline, rebuild all indexes into new sstables, and load those new sstables back into the live production data set to address index issues. Because of the volume of data in the Region Group and a desire to be extremely cautious when loading repaired data, this process took several weeks. Repairs were completed on November 18th, concluding the impact of the incident for all Fauna customers.
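As an illustration of what an index rebuild involves, the sketch below detects the two kinds of mismatch described in this report and regenerates an index from the documents. It is not Fauna's repair tooling: the function names, the representation of an index as a set of `(value, document id)` pairs, and the assumption that the collection is the source of truth are all hypothetical.

```python
from typing import Dict, Set, Tuple

Doc = Dict[str, str]
IndexEntry = Tuple[str, str]  # (indexed value, document id)


def rebuild_index(collection: Dict[str, Doc], field: str) -> Set[IndexEntry]:
    """Derive the index purely from the documents (assumed source of truth)."""
    return {
        (doc[field], doc_id)
        for doc_id, doc in collection.items()
        if field in doc
    }


def find_mismatches(
    collection: Dict[str, Doc], index: Set[IndexEntry], field: str
) -> Tuple[Set[IndexEntry], Set[IndexEntry]]:
    """Return (entries missing from the index, entries whose document is gone)."""
    expected = rebuild_index(collection, field)
    missing = expected - index
    dangling = {entry for entry in index if entry[1] not in collection}
    return missing, dangling


# Example: document "2" has no index entry, and the index still points at "9".
users = {"1": {"email": "a@example.com"}, "2": {"email": "b@example.com"}}
by_email = {("a@example.com", "1"), ("z@example.com", "9")}

missing, dangling = find_mismatches(users, by_email, "email")
repaired = rebuild_index(users, "email")
```

Here `missing` contains the entry for document "2" and `dangling` contains the entry for the nonexistent document "9"; after the rebuild, `find_mismatches` reports neither.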
We prioritize the availability, security, and performance of our service above all else and we recognize that data consistency and correctness are of the utmost importance to our customers, so we are taking the following steps to improve:
We value your business and sincerely apologize for any inconvenience that this issue has caused you. If you have any questions about the issue, the actions that we took to mitigate it, or the steps that we’re taking to improve, please contact support@fauna.com.