fix: non-determinism from actor debug flag #638
Merged
This took me forever to diagnose. After testing many different configurations (different fendermint configs, different commit refs, old localnet vs. new localnet, node restart sequences), I saw that changing the fendermint log level on a live network led to a consensus failure, which is super weird and didn't make any sense. I finally realized this is related to actor debugging.
The actor debug flag cannot change on a live network due to how debugging works with the FVM / Wasmtime. Since WASM modules can't actually log to stdout/err, their logs are returned via syscall, which affects the response data for a function call. The upshot is that switching actor debugging on or off on a live network leads to non-determinism and consensus failures. This can now happen when the fendermint log level goes from `info` to `debug` or `debug` to `info`, due to this change, introduced in #610, which ties the actor debugging flag to the fendermint log level. I hysterically commented "perfect" on this change :)

Initial consensus failure:
Later you will see:
Ideally, we could somehow alter this flag without rebuilding fendermint, but we need to think about how to do that in a way that isn't "configuration".
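To make the failure mode described above concrete, here is a minimal toy model in Rust. It is not fendermint or FVM code: `CallResult`, `execute`, and `result_digest` are hypothetical names, and the hashing is only a stand-in for whatever result data nodes must agree on. The sketch assumes only what is stated above, namely that debug output travels back with the call's response data, so two nodes running the same message with different debug flags end up with different results.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the outcome of executing one actor call.
#[derive(Hash)]
struct CallResult {
    return_data: Vec<u8>,
    /// Debug output captured alongside the response; empty when debugging is off.
    debug_logs: Vec<String>,
}

/// Hypothetical executor: same message, same state, only the node-local
/// debug flag differs.
fn execute(message: &str, actor_debug: bool) -> CallResult {
    let mut debug_logs = Vec::new();
    if actor_debug {
        debug_logs.push(format!("actor handled {message}"));
    }
    CallResult {
        return_data: message.as_bytes().to_vec(),
        debug_logs,
    }
}

/// Stand-in for the digest of execution results that nodes compare.
fn result_digest(result: &CallResult) -> u64 {
    let mut hasher = DefaultHasher::new();
    result.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let debug_off = result_digest(&execute("transfer", false));
    let debug_on = result_digest(&execute("transfer", true));
    println!("debug off: {debug_off:x}");
    println!("debug on:  {debug_on:x}");
    println!("nodes agree: {}", debug_off == debug_on);
}
```

In this model, a node that flips its log level mid-network computes a different digest from its peers for the same message, which is the shape of the consensus failure reported here.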