During a software update rollout, a host in the EU region had trouble becoming operational after a restart. It turned out that the host had failed to truncate its local copy of a transaction log for several days, so it had to work through a very large backlog on restart. Normally, FaunaDB nodes aggressively truncate their local copies of transaction logs to discard entries that are no longer needed. This keeps the logs small, containing only the most recent few minutes of transaction data, so they are easily reprocessed when a node restarts.
We solved the problem by removing the files representing the problematic node’s local copy of the transaction log. The node then reacquired the (correctly truncated) log from its replication peers and started up without issue.
We will be instituting internal monitoring and alerting on the age of transaction log files until we can diagnose the root problem: nodes sporadically failing to keep their transaction logs small.
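To illustrate the kind of check this implies, here is a minimal sketch of a file-age monitor. It is not our actual monitoring code. It assumes the log is stored as ordinary files in one directory and that a file's modification time is a reasonable proxy for how long it has gone untruncated; the function name, directory argument, and threshold are all hypothetical.

```python
import os
import time
from pathlib import Path


def find_stale_log_files(log_dir, max_age_seconds, now=None):
    """Return (path, age_in_seconds) pairs for files older than the threshold.

    Assumption: file modification time stands in for the age of the log
    data. The directory layout and threshold are illustrative only.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(log_dir).iterdir()):
        if not entry.is_file():
            continue  # ignore subdirectories and other non-file entries
        age = now - entry.stat().st_mtime
        if age > max_age_seconds:
            stale.append((entry, age))
    return stale
```

A scheduled job could call `find_stale_log_files` with a threshold well above the few minutes of data a healthy log holds (an hour, say) and raise an alert whenever the result is non-empty.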