Claim-state CAS: solving non-idempotent actions in a workflow engine
Optimistic locking via @VersionColumn catches double-write. It does not catch double-action — the case where two concurrent requests both run the action body before either commits, producing duplicate emails, duplicate clones, double payment credits. The fix is a config-driven CAS primitive: declare a transient "claiming" status, the engine flips the entity into it via the optimistic lock before running the action, and concurrent callers fail the CAS and abort. A per-module janitor handles the only case the wrapper can't catch — process crash mid-action.
The narrow gap optimistic locking misses
If you use TypeORM's @VersionColumn, every entity has a version field that increments on save, and a save against an entity whose in-memory version is stale throws an OptimisticLockVersionMismatchError. This is the well-known compare-and-swap for entity state.
Two concurrent requests arrive, both reading version N of an invoice in status approved. Both run validation. Both run the action. Both attempt to persist with version N. One wins on the CAS and writes version N+1. The other gets OptimisticLockVersionMismatchError and fails cleanly.
For idempotent steps this is enough. The losing request did wasted work, but the entity's state is correct. Running mark_as_overdue twice has the same observable effect as running it once.
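The persist-time CAS can be modeled in a few lines of plain TypeScript. This is an in-memory sketch of the @VersionColumn semantics, not TypeORM itself, and all names are hypothetical. Both callers read version N and both run their action body; only the first save wins:

```typescript
// In-memory model of versioned-save semantics: save() succeeds only when the
// caller's in-memory version matches the stored version — the CAS.
type Row = { id: number; status: string; version: number };

class VersionMismatchError extends Error {}

class Store {
  private rows = new Map<number, Row>();
  seed(row: Row) { this.rows.set(row.id, { ...row }); }
  read(id: number): Row { return { ...this.rows.get(id)! }; }
  save(row: Row): Row {
    const current = this.rows.get(row.id)!;
    if (current.version !== row.version) throw new VersionMismatchError("stale version");
    const next = { ...row, version: row.version + 1 };
    this.rows.set(row.id, next);
    return next;
  }
}

// Two callers read version 3 of an approved invoice; both "run the action";
// only the first persist wins the CAS.
const store = new Store();
store.seed({ id: 1, status: "approved", version: 3 });
const a = store.read(1);
const b = store.read(1);
store.save({ ...a, status: "closed" });          // wins, writes version 4
let lost = false;
try { store.save({ ...b, status: "closed" }); }  // stale: holds version 3, store is at 4
catch (e) { lost = e instanceof VersionMismatchError; }
```

Note that both callers got to "run their action" before the loser failed — which is exactly the gap the rest of this post is about.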
For non-idempotent steps it's a leaky guard. Two concrete examples from our codebase where CAS is genuinely required:
- Regenerate sent invoice. Soft-delete the original, clone with the same invoice_number, recompute. Two concurrent regenerates → two clones with the same invoice number, original soft-deleted twice. The writes are the transition — they must be transactional with the status flip — so they can't be moved out of the action body to after-commit.
- Apply payment. Insert payment row, mark invoice paid. Two concurrent applies → double-credit possible if the payment side effect isn't naturally idempotent. The payment row must exist before the invoice flips to paid — they're one transactional unit.
(One earlier example — close (billing) enqueuing one email per invoice in the period — was originally a CAS site too. With after_effect shipped a few days after the original ADR-031, the emails moved to a success_after_effect use case that fires after commit. Optimistic locking catches the concurrent status flip; only one commit succeeds; only one batch of emails fires. CAS turns out to be the wrong tool for pure-side-effect actions — the right reach is after-effect. The criterion that stabilized in production: if the side effects can move after the commit without changing correctness, prefer after-effect; if they are the writes that constitute the transition, you need CAS.)
The optimistic lock fires on the final persist. By that point, the side effects of the action body have already happened. The lock keeps the database state consistent; it doesn't keep the side effects single.
What we needed
A way for a workflow step to declare "this action has non-idempotent side effects; serialize callers before the action body runs." Three properties:
- Config-driven. The serialization should be visible in the workflow JSON, not buried inside the action's code. A reader of the step config should see that it's protected.
- No new infrastructure. Adding pg_cron, Redlock, or an external coordination service for this would be a big hammer for a narrow problem. Reuse what's already there.
- Self-contained recovery. If a process crashes between "I claimed it" and "I released it," some out-of-band mechanism has to free the entity. Not the engine's job to know how — but the contract must be explicit.
The primitive
A step can declare a claim_state field. The value is a two-element array: the transient status to flip into when claiming, and the source status to revert to if the action fails or is rejected. Per-requester granularity is supported (the same as every other workflow field), so admin clicks can claim while system cron — which is already serialized elsewhere — can skip claiming.
```json
{
  "step_id": 20,
  "step_key": "close",
  "status": "approved",
  "successful_claim_state": { "admin": ["closing", "approved"] },
  "successful_state": { "admin": "closed" },
  "action": { "admin": "CloseBillingActionUseCase" }
}
```
The engine wraps the action body. Pseudocode:
```typescript
if (step.claim_state) {
  const [claim_into, revert_to] = step.claim_state;
  // CAS: source status → claim_into, using the same optimistic-lock mechanism
  const claimed = await tryClaim(entity, claim_into);
  if (!claimed) {
    return { claim_failed: true }; // concurrent caller already has it
  }
  try {
    const result = await executeSingleStep(context);
    if (result.success) return result;
    // Validation rejected or action returned failure
    await tryRelease(entity, claim_into, revert_to);
    return result;
  } catch (e) {
    await tryRelease(entity, claim_into, revert_to);
    throw e;
  }
}
```
The tryClaim flip happens as a versioned write — the same @VersionColumn CAS the engine already uses. The first caller wins, flips the entity into closing, and proceeds into the action body. The second caller's tryClaim fails because the entity is no longer in approved (status check) or because the version has moved (version check). It returns claim_failed: true and the API responds with "another operation is already in progress." No second action body. No duplicate emails.
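One way to realize tryClaim and tryRelease is a single conditional write that checks status and version together. Below is a minimal in-memory sketch of those semantics — the tryClaim/tryRelease names come from the pseudocode above, but the table and entity shape are hypothetical; in SQL the acquire would be one conditional UPDATE whose affected-row count decides the outcome:

```typescript
// In-memory sketch of the claim CAS. The SQL equivalent (illustrative, not the
// real schema) is:
//   UPDATE invoice SET status = $claim_into, version = version + 1
//   WHERE id = $id AND status = $expected_status AND version = $expected_version
// where "claimed" means exactly one row was affected.
type Entity = { id: number; status: string; version: number };

const table = new Map<number, Entity>();

function tryClaim(entity: Entity, claimInto: string): boolean {
  const row = table.get(entity.id)!;
  // CAS: the status check and the version check must both pass.
  if (row.status !== entity.status || row.version !== entity.version) return false;
  table.set(entity.id, { ...row, status: claimInto, version: row.version + 1 });
  return true;
}

function tryRelease(entity: Entity, claimInto: string, revertTo: string): boolean {
  const row = table.get(entity.id)!;
  if (row.status !== claimInto) return false; // something else already moved it
  table.set(entity.id, { ...row, status: revertTo, version: row.version + 1 });
  return true;
}

// Two callers hold the same snapshot; the first claim wins, the second fails
// the CAS because both status and version have moved.
table.set(7, { id: 7, status: "approved", version: 1 });
const snapshot = { ...table.get(7)! };
const first = tryClaim(snapshot, "closing");   // true
const second = tryClaim(snapshot, "closing");  // false
```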
If the action succeeds, the engine's normal persist flips the entity from closing to closed. If the action returns failure or is rejected, tryRelease reverts the entity from closing back to approved. If the action throws unexpectedly, the catch block does the same release best-effort before re-raising.
The crash case the wrapper can't catch
Three failure modes are handled by the wrapper. A fourth isn't:
| Failure mode | Handled by |
|---|---|
| Action returns Result.failure | Wrapper releases immediately |
| Action returns is_invalid: true (validation rejection) | Wrapper releases immediately |
| Action throws an exception | Wrapper releases best-effort, then re-raises |
| Process crash / pod killed / DB unreachable mid-action | Nothing in application code can catch this |
The fourth case leaves the entity stuck in the transient status. No request can move it forward (no step is configured at closing) and no application code is alive to release it. Out-of-band recovery is the only option.
The recovery contract
The choice was where to put the janitor. Three options:
- Engine-side janitor in common/. Would have to know entity table names and DataSource. Couples the engine to modules. Rejected — the engine's MS-readiness depends on it holding only strings.
- pg_cron. Cleanest schema-level answer. Adds a Postgres extension dependency. Hides recovery from app developers reading TypeScript. Rejected — the recovery logic should be visible in the same language as the rest of the system.
- Per-owning-module janitor. Each module that uses claim_state owns its own janitor cron. Follows the existing pattern in the codebase. Accepted.
The janitor is a simple cron: every 2 minutes, query the entity table for rows where status = claim_into and updated_at < NOW() - INTERVAL '5 minutes', and flip them back to revert_to. The 5-minute window is the user-visible blast radius: in the worst case, an entity is stuck in a transient status for at most 5 minutes after a crash before the janitor releases it.
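The janitor's core selection-and-release logic can be sketched in a few lines, assuming the 5-minute staleness window above (names are hypothetical; in production this would be one conditional UPDATE fired by a cron):

```typescript
// Sketch of the per-module janitor. The SQL equivalent (illustrative) is:
//   UPDATE invoice SET status = $revert_to
//   WHERE status = $claim_into AND updated_at < NOW() - INTERVAL '5 minutes'
type StuckRow = { id: number; status: string; updatedAt: Date };

const STUCK_AFTER_MS = 5 * 60 * 1000;

function releaseStuck(rows: StuckRow[], claimInto: string, revertTo: string, now: Date): number {
  let released = 0;
  for (const row of rows) {
    const stuck =
      row.status === claimInto &&
      now.getTime() - row.updatedAt.getTime() > STUCK_AFTER_MS;
    if (stuck) {
      row.status = revertTo; // crash recovery: flip back to the source status
      released++;
    }
  }
  return released;
}

const now = new Date("2024-01-01T12:00:00Z");
const rows: StuckRow[] = [
  { id: 1, status: "closing", updatedAt: new Date("2024-01-01T11:50:00Z") }, // stuck 10 min
  { id: 2, status: "closing", updatedAt: new Date("2024-01-01T11:59:00Z") }, // still in flight
  { id: 3, status: "approved", updatedAt: new Date("2024-01-01T11:00:00Z") }, // not claimed
];
const releasedCount = releaseStuck(rows, "closing", "approved", now);
```

The staleness check is what distinguishes a crashed claim from one that is still in flight — an entity updated one minute ago is left alone.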
The boot-time coverage check
Per-module janitors are great until someone forgets to write one. A new workflow declares a claim_state in its config, ships, and the corresponding module never adds a janitor. The first time a process crashes mid-action, the entity is stuck forever.
The fix is a registry. Two phases at boot:
- The engine eagerly loads every workflow config, scans every step's claim_state across every requester, and registers each unique claim_into value into a "declared" map, along with its owning module and entity table (taken from the workflow's meta-config).
- Each module's janitor, in its own onModuleInit, reads the declared map filtered to its owning module and registers itself as the cover for those entries.
After every module's onModuleInit resolves, NestJS fires the registry's onApplicationBootstrap. The registry diffs declared vs. covered. Any uncovered claim state throws an error naming the workflow and the missing module. The application refuses to start. Adding a new claim_state to a workflow that lacks a janitor is no longer a "discovered in production" bug. It's a discovered-at-boot bug.
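The declared-vs-covered diff can be sketched as follows (class and method names are hypothetical; in the real system the declared side is filled while loading configs, the covered side by each module's janitor during onModuleInit, and the diff runs once at onApplicationBootstrap):

```typescript
// Sketch of the boot-time coverage registry. An uncovered claim state fails
// the boot with an error naming the workflow and the missing module.
type ClaimEntry = { claimInto: string; module: string; workflow: string };

class ClaimStateRegistry {
  private declared: ClaimEntry[] = [];
  private covered = new Set<string>();

  declare(entry: ClaimEntry) { this.declared.push(entry); }
  cover(claimInto: string) { this.covered.add(claimInto); }

  // Diff declared vs covered; throw if any claim state lacks a janitor.
  assertCovered(): void {
    const missing = this.declared.filter(e => !this.covered.has(e.claimInto));
    if (missing.length > 0) {
      const detail = missing
        .map(e => `${e.claimInto} (workflow ${e.workflow}, module ${e.module})`)
        .join(", ");
      throw new Error(`Uncovered claim states: ${detail}`);
    }
  }
}

const registry = new ClaimStateRegistry();
registry.declare({ claimInto: "closing", module: "billing", workflow: "invoice" });
registry.cover("closing"); // the billing janitor registers itself as cover
```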
The validator rules
Six rules at config validation time, all enforced before the application accepts the config:
1. claim_state tuple is a 2-element string array
2. Both items are in the workflow's status_types
3. claim_into !== revert_to (no-op tuple)
4. claim_into !== step.status (the CAS would be a no-op — the second caller would pass too)
5. claim_into !== successful_state (the engine's final persist would be a no-op too)
6. Cross-step: every step that uses the same claim_into value must declare the same revert_to. Without this, the janitor can't disambiguate — it would see entities stuck in applying_payment and not know whether to revert them to sent or overdue.
Rule 6 has a consequence: claim states proliferate. If apply_payment can fire from both sent and overdue, you can't share a single applying_payment transient. You end up with applying_payment_from_sent and applying_payment_from_overdue. The configs and the status_type enum grow. This is the visible cost of the pattern. I accept it; the alternative is ambiguity in recovery, which is worse.
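The six rules can be sketched as a single validation pass (function and type names are hypothetical; the real validator runs before the application accepts the config):

```typescript
// Sketch of the claim_state validator. Returns a list of violations; an empty
// list means the config is accepted.
type Step = {
  status: string;
  successful_state: string;
  claim_state?: string[];
};

function validateClaimStates(steps: Step[], statusTypes: string[]): string[] {
  const errors: string[] = [];
  const revertBy = new Map<string, string>(); // rule 6: claim_into → revert_to
  for (const step of steps) {
    if (!step.claim_state) continue;
    if (step.claim_state.length !== 2) {            // rule 1
      errors.push("claim_state must be a 2-element tuple");
      continue;
    }
    const [claimInto, revertTo] = step.claim_state;
    if (!statusTypes.includes(claimInto) || !statusTypes.includes(revertTo))
      errors.push("claim_state items must be declared status types"); // rule 2
    if (claimInto === revertTo)
      errors.push("no-op tuple");                                     // rule 3
    if (claimInto === step.status)
      errors.push("claim_into equals step status: CAS would be a no-op"); // rule 4
    if (claimInto === step.successful_state)
      errors.push("claim_into equals successful_state");              // rule 5
    const seen = revertBy.get(claimInto);                             // rule 6
    if (seen !== undefined && seen !== revertTo)
      errors.push(`ambiguous recovery for ${claimInto}: ${seen} vs ${revertTo}`);
    revertBy.set(claimInto, revertTo);
  }
  return errors;
}
```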
What you give up, what you get
The pattern adds:
- Two transient statuses per claim site (acquire + release transitions in the value object)
- One janitor cron per owning module (~30 lines of TypeScript)
- A registry singleton in common/ (~100 lines)
- Six validator rules at startup (~50 lines)
And it eliminates:
- Duplicate emails on concurrent billing close (observed in production before the primitive shipped)
- Duplicate clones on concurrent invoice regenerate (genuine integrity hole — two invoices with the same invoice_number is a production-data bug, not a cosmetic one)
- The class of "non-idempotent action + concurrent caller" bugs in general, for any step configured with claim_state
It does not eliminate non-idempotent steps that don't declare claim_state. The validator could be extended to require claim_state for any step whose action is in a registered "non-idempotent" list, but I haven't pulled that trigger. For now, declaring claim_state is opt-in, and code review catches the cases that need it.
The general shape
The pattern generalizes beyond workflow engines. Any time you have:
- A stateful entity whose state column you control
- A primitive that does compare-and-swap on that column (optimistic lock, conditional UPDATE, advisory lock — anything atomic)
- An out-of-band recovery mechanism for the case where you crash between acquire and release
...you can build a CAS-based coordination primitive without external coordination services. The cost is honest: extra statuses, recovery infrastructure, validation overhead. The gain is that side-effect duplication becomes structurally impossible for protected operations, not a thing you have to remember to defend against in code review.
I find this trade favorable in regulated markets, where "send each invoice once" and "credit each payment once" are not nice-to-haves. Your mileage will vary by domain.