
Crash Recovery & Checkpoints

Durable BLOGE executions survive process crashes. When a node has already produced output, that output is checkpointed; when a worker dies mid-flight, another instance can pick the execution back up and resume from where it left off — without re-running completed work.

Recovery configuration

Crash recovery is opt-in. When enabled, a RecoveryConfig does three things: it renews execution leases while a flow runs, scans for RUNNING executions whose lease has expired, and resumes the ones that are still eligible.

yaml
spring:
  bloge:
    recovery:
      enabled: true
      lease-duration: 30s
      scan-interval: 10s
      version-mismatch-policy: WARN   # WARN | RERUN | FAIL

  • Lease renewal keeps a healthy worker's claim alive.
  • Expiry scan finds executions whose lease lapsed (the worker crashed or stalled).
  • Resume restores durable state and continues from the next pending node.
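
The expiry scan can be pictured as a filter over stored leases. The following is a minimal Java sketch, not BLOGE's implementation: ExecutionLease, LeaseScanSketch, and the lastRenewedAt field are hypothetical names, and the rule that a lease lapses one lease duration after its last renewal is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model for illustration; not BLOGE's actual classes.
record ExecutionLease(String executionId, String status, Instant lastRenewedAt) {}

final class LeaseScanSketch {
    // Assumption: a lease has lapsed once a full lease duration
    // has passed without a renewal.
    static boolean isExpired(ExecutionLease lease, Duration leaseDuration, Instant now) {
        return lease.lastRenewedAt().plus(leaseDuration).isBefore(now);
    }

    // One scan pass: only RUNNING executions with a lapsed lease are resume candidates.
    static List<String> findResumable(List<ExecutionLease> leases,
                                      Duration leaseDuration, Instant now) {
        return leases.stream()
                .filter(l -> "RUNNING".equals(l.status()))
                .filter(l -> isExpired(l, leaseDuration, now))
                .map(ExecutionLease::executionId)
                .collect(Collectors.toList());
    }
}
```

Under these assumptions, with lease-duration: 30s and scan-interval: 10s, a crashed worker's execution would be noticed roughly 30 to 40 seconds after its last renewal, provided the scan runs on schedule.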

Checkpoint types

BLOGE persists several kinds of checkpoint so resume is precise rather than all-or-nothing:

| Checkpoint | Purpose |
| --- | --- |
| NODE_OUTPUT | A node's produced output, so it is not re-executed on resume |
| FOREACH_PROGRESS | Per-item progress inside a sequential foreach |

Sequential foreach recovery

Each completed item in a sequential ForEachOperator is checkpointed under CheckpointType.FOREACH_PROGRESS. On resume, the engine restores the longest contiguous prefix of items whose checkpoints still match, then re-executes from the first missing or changed item. No schema migration is required: the feature reuses the existing bd_execution_checkpoint table and iteration_key.

bloge
foreach order in ctx.orders {
  node settle : SettleOrderOperator {
    input { order = item }
  }
}

If the process dies after settling items 0 and 1, a resumed execution restores that prefix and continues from item 2.
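
The prefix rule can be sketched in a few lines of Java. This is an illustration only: ItemCheckpoint and ForeachPrefixSketch are hypothetical names, and comparing a per-item fingerprint string to decide whether an item "matches" is an assumption about how the engine detects changed items.

```java
import java.util.List;

// Hypothetical shape of one FOREACH_PROGRESS entry; field names are assumptions.
record ItemCheckpoint(int index, String itemFingerprint) {}

final class ForeachPrefixSketch {
    // Returns the index of the first item that must be re-executed.
    // Saved checkpoints are assumed to be sorted by index.
    static int firstItemToRun(List<String> currentFingerprints, List<ItemCheckpoint> saved) {
        int next = 0;
        for (ItemCheckpoint cp : saved) {
            boolean contiguous = cp.index() == next;
            boolean inRange = next < currentFingerprints.size();
            if (!contiguous || !inRange
                    || !cp.itemFingerprint().equals(currentFingerprints.get(next))) {
                break; // first missing or changed item ends the restorable prefix
            }
            next++;
        }
        return next;
    }
}
```

With checkpoints saved for items 0 and 1 and an unchanged input list, firstItemToRun returns 2. A gap or a changed item earlier in the list shortens the prefix, even if later items were checkpointed.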

Operator-fingerprint safety

A node's operator fingerprint is embedded in NodeSpec and persisted with each NODE_OUTPUT checkpoint. On resume, a mismatch between the saved and current fingerprint means the operator's definition changed since the checkpoint was written. The VersionMismatchPolicy decides what to do:

| Policy | Behavior |
| --- | --- |
| WARN (default) | Log a warning and reuse the checkpoint |
| RERUN | Discard the checkpoint and re-execute the node |
| FAIL | Abort resume with an error |
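
The three policies amount to a small decision function. The Java sketch below is illustrative rather than BLOGE's code: the enum here is a stand-in that only mirrors the policy names in the table, and ResumeAction, FingerprintCheckSketch, and the exception type are hypothetical.

```java
// Stand-in that mirrors the policy names above; not the library's own type.
enum VersionMismatchPolicy { WARN, RERUN, FAIL }

// Hypothetical outcome of the check for a single NODE_OUTPUT checkpoint.
enum ResumeAction { REUSE_CHECKPOINT, RE_EXECUTE }

final class FingerprintCheckSketch {
    static ResumeAction decide(String savedFingerprint, String currentFingerprint,
                               VersionMismatchPolicy policy) {
        if (savedFingerprint.equals(currentFingerprint)) {
            return ResumeAction.REUSE_CHECKPOINT; // operator unchanged since the checkpoint
        }
        switch (policy) {
            case WARN:
                System.err.println("operator fingerprint changed; reusing checkpoint");
                return ResumeAction.REUSE_CHECKPOINT;
            case RERUN:
                return ResumeAction.RE_EXECUTE; // checkpoint is discarded by the caller
            case FAIL:
            default:
                throw new IllegalStateException("operator fingerprint changed; aborting resume");
        }
    }
}
```

WARN favors availability, since a changed operator silently inherits output produced by its older definition. RERUN and FAIL trade extra work or a manual step for stricter consistency.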

Nested checkpoint resume

Subgraph-based nodes derive deterministic child execution IDs, so inner NODE_OUTPUT checkpoints survive outer-execution retries:

  • Plain subgraphs — scoped by parent execution ID + node ID.
  • Foreach item bodies — scoped by parent execution ID + foreach node ID + item index + item fingerprint.
  • Loop iteration bodies — scoped by parent execution ID + loop node ID + iteration number.

This means a retried outer execution does not throw away the work an inner subgraph has already completed.
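
One way to obtain deterministic child IDs from the scopes listed above is a name-based UUID over the scope parts. The sketch below assumes that approach for illustration; BLOGE's actual derivation is not described here, and ChildExecutionIdSketch, its method names, and the "/" separator are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class ChildExecutionIdSketch {
    // Name-based (type 3) UUID: identical scope parts always yield the identical ID.
    // Assumes the parts themselves never contain the "/" separator.
    static String derive(String... scopeParts) {
        String scope = String.join("/", scopeParts);
        return UUID.nameUUIDFromBytes(scope.getBytes(StandardCharsets.UTF_8)).toString();
    }

    static String forSubgraph(String parentExecutionId, String nodeId) {
        return derive("subgraph", parentExecutionId, nodeId);
    }

    static String forForeachItem(String parentExecutionId, String foreachNodeId,
                                 int itemIndex, String itemFingerprint) {
        return derive("foreach", parentExecutionId, foreachNodeId,
                Integer.toString(itemIndex), itemFingerprint);
    }

    static String forLoopIteration(String parentExecutionId, String loopNodeId, int iteration) {
        return derive("loop", parentExecutionId, loopNodeId, Integer.toString(iteration));
    }
}
```

Because the ID depends only on its inputs, a retry of the same parent execution computes the same child ID and finds the inner checkpoints again. A changed item fingerprint yields a different ID, so stale inner checkpoints are not picked up.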

Grep-friendly recovery logs

Recovery and fallback paths emit structured log tokens instead of swallowing failures silently — for example [TIMER_RESTORE_FAILED], [FINGERPRINT_SKIPPED], and [SCHEMA_ENRICH_SKIPPED] — so you can alert on them in production.
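
Because the tokens are bracketed and upper-case, they are easy to count from raw log lines. The Java sketch below shows one way to do that; RecoveryTokenCounter and the regular expression are assumptions for illustration, and the real log line layout may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecoveryTokenCounter {
    // Matches bracketed upper-case tokens with at least one underscore,
    // such as [TIMER_RESTORE_FAILED]; plain tags like [main] or [WARN] are ignored.
    private static final Pattern TOKEN = Pattern.compile("\\[([A-Z]+(?:_[A-Z]+)+)\\]");

    // Counts how often each token appears across the given log lines.
    static Map<String, Integer> count(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = TOKEN.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The resulting counts could feed an alert threshold, for example paging when a token's count within a time window rises above zero.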

Next steps