Claim-state CAS: solving non-idempotent actions in a workflow engine

TL;DR

Optimistic locking via @VersionColumn catches double-write. It does not catch double-action: the case where two concurrent requests both run the action body before either commits, producing duplicate emails, duplicate clones, double payment credits. The fix is a config-driven CAS primitive: a step declares a transient "claiming" status, the engine flips the entity into it via the optimistic lock before running the action, and concurrent callers fail the CAS and abort. A per-module janitor handles the only case the wrapper can't catch: process crash mid-action.

The narrow gap optimistic locking misses

If you use TypeORM's @VersionColumn, every entity has a version field that increments on save, and a save against an entity whose in-memory version is stale throws an OptimisticLockVersionMismatchError. This is the well-known compare-and-swap for entity state.
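For concreteness, a minimal sketch of the entity shape the rest of this post assumes (the Invoice name and columns are illustrative, not our actual schema):

import { Column, Entity, PrimaryGeneratedColumn, UpdateDateColumn, VersionColumn } from "typeorm";

@Entity()
export class Invoice {
  @PrimaryGeneratedColumn()
  id: number;

  // Workflow status: approved, closing, closed, ...
  @Column()
  status: string;

  // Lets the janitor detect stale claims (see recovery below)
  @UpdateDateColumn()
  updated_at: Date;

  // Incremented on every save; a stale in-memory version fails the CAS
  @VersionColumn()
  version: number;
}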

Two concurrent requests arrive, both reading version N of an invoice in status approved. Both run validation. Both run the action. Both attempt to persist with version N. One wins on the CAS and writes version N+1. The other gets OptimisticLockVersionMismatchError and fails cleanly.

For idempotent steps this is enough. The losing request did wasted work, but the entity's state is correct. Running mark_as_overdue twice has the same observable effect as running it once.

For non-idempotent steps it's a leaky guard. Two concrete examples from our codebase where CAS is genuinely required: a clone step, where two concurrent callers each produce a copy, and apply_payment, where two concurrent callers credit the same payment twice.

(One earlier example — close (billing) enqueuing one email per invoice in the period — was originally a CAS site too. With after_effect shipped a few days after the original ADR-031, the emails moved to a success_after_effect use case that fires after commit. Optimistic locking catches the concurrent status flip; only one commit succeeds; only one batch of emails fires. CAS turns out to be the wrong tool for pure-side-effect actions; the right tool there is after-effect. The criterion that stabilized in production: if the side effects can move after the commit without changing correctness, prefer after-effect; if they are the writes that constitute the transition, you need CAS.)

The optimistic lock fires on the final persist. By that point, the side effects of the action body have already happened. The lock keeps the database state consistent; it doesn't keep the side effects single.

What we needed

A way for a workflow step to declare "this action has non-idempotent side effects; serialize callers before the action body runs." Three properties:

  1. Config-driven. The serialization should be visible in the workflow JSON, not buried inside the action's code. A reader of the step config should see that it's protected.
  2. No new infrastructure. Adding pg_cron, Redlock, or an external coordination service for this would be a big hammer for a narrow problem. Reuse what's already there.
  3. Self-contained recovery. If a process crashes between "I claimed it" and "I released it," some out-of-band mechanism has to free the entity. Not the engine's job to know how — but the contract must be explicit.

The primitive

A step can declare a claim_state field. The value is a two-element array: the transient status to flip into when claiming, and the source status to revert to if the action fails or is rejected. Per-requester granularity is supported (the same as every other workflow field), so admin clicks can claim while system cron — which is already serialized elsewhere — can skip claiming.

{
  "step_id": 20,
  "step_key": "close",
  "status": "approved",
  "successful_claim_state": { "admin": ["closing", "approved"] },
  "successful_state":       { "admin": "closed" },
  "action":                 { "admin": "CloseBillingActionUseCase" }
}

The engine wraps the action body. Pseudocode:

const claim = step.claim_state?.[context.requester];  // per-requester, like every other field
if (claim) {
  const [claim_into, revert_to] = claim;

  // CAS: source status → claim_into, using the same optimistic-lock mechanism
  const claimed = await tryClaim(entity, claim_into);
  if (!claimed) {
    return { claim_failed: true };  // concurrent caller already has it
  }

  try {
    const result = await executeSingleStep(context);
    if (result.success) return result;
    // Validation rejected or action returned failure
    await tryRelease(entity, claim_into, revert_to);
    return result;
  } catch (e) {
    await tryRelease(entity, claim_into, revert_to);
    throw e;
  }
}

The tryClaim flip happens as a versioned write — the same @VersionColumn CAS the engine already uses. The first caller wins, flips the entity into closing, and proceeds into the action body. The second caller's tryClaim fails because the entity is no longer in approved (status check) or because the version has moved (version check). It returns claim_failed: true and the API responds with "another operation is already in progress." No second action body. No duplicate emails.
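Here is what tryClaim can look like, sketched with TypeORM's query builder. The DataSource wiring and names are assumptions; the shape of the CAS, a single conditional UPDATE guarded by status and version, is the point:

import { DataSource } from "typeorm";

// Assumes the Invoice entity sketched earlier. The WHERE clause is the CAS:
// the UPDATE only lands if the row is still in the source status at the
// version this caller read.
async function tryClaim(ds: DataSource, invoice: Invoice, claimInto: string): Promise<boolean> {
  const result = await ds
    .createQueryBuilder()
    .update(Invoice)
    // @VersionColumn only auto-increments through save(), so bump it by hand
    .set({ status: claimInto, version: () => "version + 1" })
    .where("id = :id AND status = :status AND version = :version", {
      id: invoice.id,
      status: invoice.status,   // the source status, e.g. approved
      version: invoice.version, // the version this caller loaded
    })
    .execute();
  // affected === 0: another caller won the CAS, or the entity moved on
  return result.affected === 1;
}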

If the action succeeds, the engine's normal persist flips the entity from closing to closed. If the action returns failure or is rejected, tryRelease reverts the entity from closing back to approved. If the action throws unexpectedly, the catch block does the same release best-effort before re-raising.
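tryRelease is the same write in reverse, sketched under the same assumptions:

import { DataSource } from "typeorm";

// Matching on the transient status (rather than the version) means a late
// or duplicate release can never clobber a state the entity has since
// legitimately moved to.
async function tryRelease(
  ds: DataSource,
  invoice: Invoice,
  claimInto: string,
  revertTo: string,
): Promise<boolean> {
  const result = await ds
    .createQueryBuilder()
    .update(Invoice)
    .set({ status: revertTo, version: () => "version + 1" })
    .where("id = :id AND status = :claimInto", { id: invoice.id, claimInto })
    .execute();
  return result.affected === 1;
}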

The crash case the wrapper can't catch

Three failure modes are handled by the wrapper. A fourth isn't:

Failure mode                                              Handled by
Action returns Result.failure                             Wrapper releases immediately
Action returns is_invalid: true (validation rejection)    Wrapper releases immediately
Action throws an exception                                Wrapper releases best-effort, then re-raises
Process crash / pod killed / DB unreachable mid-action    Nothing in application code can catch this

The fourth case leaves the entity stuck in the transient status. No request can move it forward (no step is configured at closing) and no application code is alive to release it. Out-of-band recovery is the only option.

The recovery contract

The open question was where to put the janitor. We landed on per-module janitors: the module that owns the entity table ships the recovery cron, since it knows the table, the transient statuses, and the revert targets.

The janitor is a simple cron: every 2 minutes, query the entity table for rows where status = claim_into and updated_at < NOW() - INTERVAL '5 minutes', and flip them back to revert_to. The 5-minute window is the user-visible blast radius: in the worst case, an entity is stuck in a transient status for at most 5 minutes after a crash before the janitor releases it.
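A sketch of the billing module's janitor, assuming @nestjs/schedule for the cron and a Postgres invoices table; the hard-coded closing/approved pair stands in for however the module enumerates its claim_into → revert_to tuples:

import { Injectable } from "@nestjs/common";
import { Cron } from "@nestjs/schedule";
import { DataSource } from "typeorm";

@Injectable()
export class BillingClaimJanitor {
  constructor(private readonly ds: DataSource) {}

  @Cron("*/2 * * * *") // every 2 minutes
  async releaseStaleClaims(): Promise<void> {
    // One statement per claim_into → revert_to pair this module covers.
    // The version bump keeps the optimistic lock honest for any reader
    // that loaded the row while it was stuck.
    await this.ds.query(
      `UPDATE invoices
          SET status = $1, version = version + 1
        WHERE status = $2
          AND updated_at < NOW() - INTERVAL '5 minutes'`,
      ["approved", "closing"],
    );
  }
}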

The boot-time coverage check

Per-module janitors are great until someone forgets to write one. A new workflow declares a claim_state in its config, ships, and the corresponding module never adds a janitor. The first time a process crashes mid-action, the entity is stuck forever.

The fix is a registry. Two phases at boot:

  1. The engine eagerly loads every workflow config, scans every step's claim_state across every requester, and registers each unique claim_into value into a "declared" map, along with its owning module and entity table (taken from the workflow's meta-config).
  2. Each module's janitor, in its own onModuleInit, reads the declared map filtered to its owning module and registers itself as the cover for those entries.

After every module's onModuleInit resolves, NestJS fires the registry's onApplicationBootstrap. The registry diffs declared against covered; any uncovered claim state makes it throw an error naming the workflow and the missing module, and the application refuses to start. Adding a claim_state to a workflow whose module lacks a janitor is no longer a "discovered in production" bug. It's a discovered-at-boot bug.
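A sketch of the registry under those lifecycle hooks; the class and method names are illustrative:

import { Injectable, OnApplicationBootstrap } from "@nestjs/common";

interface DeclaredClaim {
  workflow: string;
  module: string; // owning module, from the workflow's meta-config
}

@Injectable()
export class ClaimStateRegistry implements OnApplicationBootstrap {
  private declared = new Map<string, DeclaredClaim>(); // claim_into → declaration
  private covered = new Set<string>();                 // claim_into values a janitor covers

  // Phase 1: the engine calls this for each unique claim_into while
  // eagerly loading workflow configs.
  declare(claimInto: string, decl: DeclaredClaim): void {
    this.declared.set(claimInto, decl);
  }

  // Phase 2: each module's janitor calls this from its own onModuleInit.
  cover(claimInto: string): void {
    this.covered.add(claimInto);
  }

  // Fires after every module's onModuleInit has resolved.
  onApplicationBootstrap(): void {
    for (const [claimInto, decl] of this.declared) {
      if (!this.covered.has(claimInto)) {
        throw new Error(
          `claim state "${claimInto}" (workflow "${decl.workflow}") has no janitor in module "${decl.module}"`,
        );
      }
    }
  }
}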

The validator rules

Six rules at config validation time, all enforced before the application accepts the config:

  1. claim_state tuple is a 2-element string array
  2. Both items are in the workflow's status_types
  3. claim_into !== revert_to (no-op tuple)
  4. claim_into !== step.status (the CAS would be a no-op — second caller would pass too)
  5. claim_into !== successful_state (engine's final persist would be a no-op too)
  6. Cross-step: every step that uses the same claim_into value must declare the same revert_to. Without this, the janitor can't disambiguate — it would see entities stuck in applying_payment and not know whether to revert them to sent or overdue.

Rule 6 has a consequence: claim states proliferate. If apply_payment can fire from both sent and overdue, you can't share a single applying_payment transient. You end up with applying_payment_from_sent and applying_payment_from_overdue. The configs and the status_type enum grow. This is the visible cost of the pattern. I accept it; the alternative is ambiguity in recovery, which is worse.
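A sketch of the rule 6 check; the config shape mirrors the JSON above, and the names are illustrative:

type ClaimTuple = [claimInto: string, revertTo: string];

interface Step {
  step_key: string;
  claim_state?: Record<string, ClaimTuple>; // per-requester, as in the JSON above
}

// Rule 6: every step that claims into the same transient status must
// declare the same revert target, or the janitor cannot disambiguate.
function validateClaimRevertsAgree(steps: Step[]): void {
  const revertBy = new Map<string, string>(); // claim_into → revert_to
  for (const step of steps) {
    for (const [claimInto, revertTo] of Object.values(step.claim_state ?? {})) {
      const seen = revertBy.get(claimInto);
      if (seen !== undefined && seen !== revertTo) {
        throw new Error(
          `step ${step.step_key}: claim state "${claimInto}" reverts to both "${seen}" and "${revertTo}"`,
        );
      }
      revertBy.set(claimInto, revertTo);
    }
  }
}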

What you give up, what you get

The pattern adds:

  1. Extra transient statuses in every protected workflow, and more of them under rule 6's no-sharing constraint.
  2. A per-module janitor plus the boot-time coverage registry.
  3. Config and validation overhead: six rules per claim_state declaration.

And it eliminates:

  1. Double execution of non-idempotent action bodies: duplicate emails, duplicate clones, double payment credits.
  2. Entities stuck forever in a transient status after a mid-action crash; the janitor bounds the stuck window to minutes.

It does not eliminate non-idempotent steps that don't declare claim_state. The validator could be extended to require claim_state for any step whose action is in a registered "non-idempotent" list, but I haven't pulled that trigger. For now, declaring claim_state is opt-in, and code review catches the cases that need it.

The general shape

The pattern generalizes beyond workflow engines. Any time you have:

  1. an entity with an optimistic-lock version column,
  2. a status field with room for transient values, and
  3. somewhere to run a periodic out-of-band sweep,

...you can build a CAS-based coordination primitive without external coordination services. The cost is honest: extra statuses, recovery infrastructure, validation overhead. The gain is that side-effect duplication becomes structurally impossible for protected operations, not a thing you have to remember to defend against in code review.

I find this trade favorable in regulated markets, where "send each invoice once" and "credit each payment once" are not nice-to-haves. Your mileage will vary by domain.