All issues

WAL truncation race during failover loses last batch#3001

New issue
Open
LELeon Fischercommented 1w ago

Describe the bug

WAL truncation race during failover loses last batch — surfaces on atlas-edge in the high-priority path. The full reproduction is below, along with the workaround we use in staging.

Reproduce

  • Clone ln-dev7/atlas-edge at the tip of main.
  • Run the smoke suite: pnpm test --workspace atlas-edge.
  • Notice the test for the affected contract flakes within the first 50 iterations.

Expected behavior

The contract should hold deterministically across the full matrix — staging, production, and the local dev runtime.

Environment

  • node 22.x
  • pnpm 9.x
  • OS: macOS 15.2 (also reproduced on Linux 6.x)

Labels: bug

LEleon-fischerself-assigned this1w ago
LEleon-fischeradded the labelbug1w ago
LEleon-fischeradded this to the milestoneQ4 reliability1w ago
LEleon-fischeradded a commit that references this issuefix(atlas): guard the regression surfaced in #300158a99f21w ago
LEleon-fischermentioned this in#30541w ago
LELeon Fischercommented 6d ago

Pulled this locally, the workaround in the description is solid. Happy to land it as a follow-up.

LELeon Fischercommented 6d ago

Bumping priority — three customers hit this last week, two of them on the enterprise tier.

LELeon Fischercommented 6d ago

Could we add a regression test before the patch lands? Otherwise this will resurface as soon as the surrounding cleanup happens.

LELeon Fischercommented 5d ago

Re-reading the spec, I think the right contract is the one in the bug report, not the one we're shipping. Going to open a small PR.

Add a comment

M↓Markdown is supported