Elevated latency in the EU region

Incident Report for Fauna

Postmortem

During a software update rollout, a host in the EU region had trouble becoming operational after a restart. It turned out that for several days it had failed to properly truncate its local copy of a transaction log, so on restart it tried to work through a very large backlog. Normally, FaunaDB nodes aggressively truncate their local copies of transaction logs to discard entries that are no longer needed. This ordinarily keeps the logs small, containing just the most recent few minutes of transaction data, so they are easily reprocessed when a node is restarted.

We solved the problem by removing the files representing the problematic node’s local copy of the transaction log. The node then reacquired the (correctly truncated) log from its replication peers and started up without issue.

We will institute internal monitoring and alerting on the age of transaction log files until we can diagnose the root problem: nodes sporadically failing to keep their transaction logs small.
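
As an illustration only, a check on transaction log file age could be sketched roughly as follows. The function name, the directory argument, and the use of file modification time as the measure of age are assumptions made for this example, not details of FaunaDB's actual tooling or on-disk layout.

```python
import time
from pathlib import Path


def stale_log_files(log_dir, max_age_seconds, now=None):
    """Return (file name, age in seconds) for files older than the threshold.

    Hypothetical helper: assumes each transaction log segment is a regular
    file directly inside `log_dir` and that its modification time reflects
    the age of the data it holds.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in Path(log_dir).iterdir():
        if not entry.is_file():
            continue
        age = now - entry.stat().st_mtime
        if age > max_age_seconds:
            stale.append((entry.name, age))
    # Oldest files first, so the worst offender leads any alert message.
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

A monitoring job could call `stale_log_files(some_log_dir, 3600)` on each node at a regular interval and raise an alert whenever the result is non-empty. Since healthy logs are described above as holding only a few minutes of data, a threshold on the order of an hour would leave generous headroom, but the right value is a judgment call.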

Posted Aug 03, 2020 - 11:29 PDT

Resolved

This incident has been resolved.

Posted Jul 28, 2020 - 12:48 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 28, 2020 - 12:40 PDT

Update

We are continuing to investigate this issue.

Posted Jul 28, 2020 - 12:33 PDT

Investigating

We are currently investigating this issue.

Posted Jul 28, 2020 - 12:33 PDT

This incident affected: Global Region Group (FQL API).