On May 29, 2020 at 2008 UTC an incident was identified. An initial fix was designed and deployed within 8 minutes. Unfortunately, this fix proved insufficient and the degradation of the write pipeline continued. At 2139 UTC an additional fix was applied and inbound volume was managed, committed, and returned to normal levels. The entire incident was resolved at 0019 UTC on 30 May 2020.
During issue troubleshooting, it was determined that a single binary log file for a log segment was approximately 7 times normal size. Two of three nodes had replicated these log files. The initial effort was spent in moving these logs aside and subsequently restarting the node with the expectation it would pull log files from other, healthy nodes. This did not happen and manual intervention was necessary. This cascaded throughout the cluster and required multiple node restarts as the transaction was handled manually.
It was determined that Fauna’s index rebuild method created a single transaction entry consisting of more than 3.5 million writes for a single tenant database.
During the issue itself, a fix that limits the overall size of transactions (previously applied but now expanded in scope) was built and deployed. Internal QoS was disabled via config and internal work to ensure scheduler persistence was deployed.
The engineering team has taken several discrete action items focused on minimising potential downtime should a similar issue ever manifest in the future. This, in particular, is focused on automation of manual activities taken to restore the cluster as well as ensuring the hardening of fixes taken to remove the incident as initially manifested.