Crash Recovery & Checkpoints

Durable BLOGE executions survive process crashes. When a node has already produced output, that output is checkpointed; when a worker dies mid-flight, another instance can pick the execution back up and resume from where it left off — without re-running completed work.

Recovery configuration

Crash recovery is opt-in. A RecoveryConfig renews execution leases while a flow runs, scans for expired RUNNING executions, and resumes the ones that are still eligible.

yaml

spring:
  bloge:
    recovery:
      enabled: true
      lease-duration: 30s
      scan-interval: 10s
      version-mismatch-policy: WARN   # WARN | RERUN | FAIL

Lease renewal keeps a healthy worker's claim alive.
Expiry scan finds executions whose lease lapsed (the worker crashed or stalled).
Resume restores durable state and continues from the next pending node.
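The expiry scan described above can be sketched as a single pass over the RUNNING executions. This is a minimal illustration, not BLOGE's implementation: the `LeaseScanSketch` class, the `Lease` record, and both method names are hypothetical, and it assumes a lease counts as expired once `lease-duration` has elapsed since its last renewal.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class LeaseScanSketch {
    // Hypothetical view of a RUNNING execution and the time its lease was last renewed.
    public record Lease(String executionId, Instant lastRenewedAt) {}

    // Assumption: a lease is expired once leaseDuration has passed since the last renewal.
    public static boolean isExpired(Lease lease, Duration leaseDuration, Instant now) {
        return lease.lastRenewedAt().plus(leaseDuration).isBefore(now);
    }

    // One scan pass: collect the IDs of executions whose lease has lapsed.
    public static List<String> findExpired(List<Lease> running, Duration leaseDuration, Instant now) {
        return running.stream()
                .filter(lease -> isExpired(lease, leaseDuration, now))
                .map(Lease::executionId)
                .collect(Collectors.toList());
    }
}
```

With the configuration shown earlier (a 30s lease scanned every 10s), a worker that stops renewing would be picked up by a scan some time after its lease lapses, provided the scan runs on schedule.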

Checkpoint types

BLOGE persists several kinds of checkpoint so resume is precise rather than all-or-nothing:

Checkpoint	Purpose
`NODE_OUTPUT`	A node's produced output, so it is not re-executed on resume
`FOREACH_PROGRESS`	Per-item progress inside a sequential `foreach`

Sequential foreach recovery

Each completed item in a sequential ForEachOperator is checkpointed under CheckpointType.FOREACH_PROGRESS. On resume, the engine restores the longest contiguous prefix of matching items and re-executes from the first missing or changed item. No schema migration is required: foreach progress reuses the existing bd_execution_checkpoint table and iteration_key.

bloge

foreach order in ctx.orders {
  node settle : SettleOrderOperator {
    input { order = item }
  }
}

If the process dies after settling items 0 and 1, a resumed execution restores that prefix and continues from item 2.
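The "longest contiguous prefix" rule can be illustrated with a short sketch. It is not the engine's code: `ForeachPrefixSketch` and `firstItemToRun` are hypothetical names, and it assumes an item "matches" when the fingerprint stored with its checkpoint equals the fingerprint of the current item at the same index.

```java
import java.util.List;
import java.util.Map;

public class ForeachPrefixSketch {
    // Returns the index of the first item that must be executed on resume.
    // currentFingerprints: fingerprint of each item in the collection as it is now.
    // checkpoints: item index -> fingerprint saved when that item completed (assumed layout).
    public static int firstItemToRun(List<String> currentFingerprints,
                                     Map<Integer, String> checkpoints) {
        int index = 0;
        // Advance while a checkpoint exists for this index and still matches the item.
        while (index < currentFingerprints.size()
                && currentFingerprints.get(index).equals(checkpoints.get(index))) {
            index++;
        }
        // Everything before index is restored; index and later items run again.
        return index;
    }
}
```

Under these assumptions, checkpoints for items 0 and 1 give a resume index of 2. A checkpoint for item 2 without one for item 1 does not help, because the prefix stops at the first gap.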

Operator-fingerprint safety

A node's operator fingerprint is embedded in NodeSpec and persisted with each NODE_OUTPUT checkpoint. On resume, a mismatch between the saved and current fingerprint means the operator's definition changed since the checkpoint was written. The VersionMismatchPolicy decides what to do:

Policy	Behavior
`WARN` (default)	Log a warning and reuse the checkpoint
`RERUN`	Discard the checkpoint and re-execute the node
`FAIL`	Abort resume with an error
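The policy table can be read as a small decision function. The sketch below is an assumption-laden illustration: the enum here only mirrors the three documented values and is not BLOGE's own type, and `MismatchPolicySketch`, `Decision`, `decide`, and the message texts are all hypothetical.

```java
public class MismatchPolicySketch {
    // Mirrors the documented policy values; declared locally for illustration only.
    public enum VersionMismatchPolicy { WARN, RERUN, FAIL }

    // Hypothetical outcome of checking one NODE_OUTPUT checkpoint.
    public enum Decision { REUSE, REEXECUTE }

    public static Decision decide(String savedFingerprint, String currentFingerprint,
                                  VersionMismatchPolicy policy) {
        // Matching fingerprints: the operator is unchanged, so the checkpoint is safe to reuse.
        if (savedFingerprint.equals(currentFingerprint)) {
            return Decision.REUSE;
        }
        switch (policy) {
            case WARN:
                // Illustrative message only, not an actual BLOGE log line.
                System.err.println("operator fingerprint changed; reusing checkpoint");
                return Decision.REUSE;
            case RERUN:
                return Decision.REEXECUTE;
            case FAIL:
            default:
                throw new IllegalStateException("operator fingerprint changed; aborting resume");
        }
    }
}
```

WARN favors progress over strictness, which may suit operators whose changes are backward compatible. RERUN or FAIL is the safer choice when a changed operator could make a stored output invalid.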

Nested checkpoint resume

Sub-graph-based nodes derive deterministic child execution IDs so that inner NODE_OUTPUT checkpoints survive outer-execution retries:

Plain subgraphs — scoped by parent execution ID + node ID.
Foreach item bodies — scoped by parent execution ID + foreach node ID + item index + item fingerprint.
Loop iteration bodies — scoped by parent execution ID + loop node ID + iteration number.

This means a retried outer execution does not throw away the work an inner sub-graph already completed.
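One way to picture a deterministic child ID is as a name-based UUID computed from the scoping fields listed above. This is a sketch of the idea only. BLOGE's actual derivation is not described here, and the class name, method names, separator, and the use of `UUID.nameUUIDFromBytes` are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class ChildExecutionIdSketch {
    // Joins the scope parts and hashes them into a name-based UUID.
    // The same parts always produce the same ID, which is what makes resume possible.
    private static String derive(String... parts) {
        String scope = String.join(":", parts);
        return UUID.nameUUIDFromBytes(scope.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Plain subgraph: parent execution ID + node ID.
    public static String subgraph(String parentExecutionId, String nodeId) {
        return derive("subgraph", parentExecutionId, nodeId);
    }

    // Foreach item body: parent execution ID + foreach node ID + item index + item fingerprint.
    public static String foreachItem(String parentExecutionId, String foreachNodeId,
                                     int itemIndex, String itemFingerprint) {
        return derive("foreach", parentExecutionId, foreachNodeId,
                Integer.toString(itemIndex), itemFingerprint);
    }

    // Loop iteration body: parent execution ID + loop node ID + iteration number.
    public static String loopIteration(String parentExecutionId, String loopNodeId, int iteration) {
        return derive("loop", parentExecutionId, loopNodeId, Integer.toString(iteration));
    }
}
```

Because the ID depends only on its inputs, a retried outer execution that reaches the same node, item, or iteration computes the same child ID and can find the checkpoints written earlier. A changed item fingerprint yields a different ID, so stale inner checkpoints are not picked up for a different item.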

Grep-friendly recovery logs

Recovery and fallback paths emit structured log tokens instead of swallowing failures silently — for example [TIMER_RESTORE_FAILED], [FINGERPRINT_SKIPPED], and [SCHEMA_ENRICH_SKIPPED] — so you can alert on them in production.

Next steps

Set up the durable store in Durable Flows.
Combine recovery with Saga & Compensation for rollback on permanent failure.
Watch resumes land in the Event Journal & Ops Console.