Crash Recovery & Checkpoints
Durable BLOGE executions survive process crashes. Once a node has produced output, that output is checkpointed; if a worker dies mid-flight, another instance can pick the execution back up and resume from where it left off, without re-running completed work.
Recovery configuration
Crash recovery is opt-in. A RecoveryConfig renews execution leases while a flow runs, scans for expired RUNNING executions, and resumes the ones that are still eligible.
spring:
  bloge:
    recovery:
      enabled: true
      lease-duration: 30s
      scan-interval: 10s
      version-mismatch-policy: WARN # WARN | RERUN | FAIL

- Lease renewal keeps a healthy worker's claim alive.
- Expiry scan finds executions whose lease lapsed (the worker crashed or stalled).
- Resume restores durable state and continues from the next pending node.
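
The lease cycle described above can be sketched in plain Java. This is a minimal illustration, not BLOGE's implementation: `LeaseSketch`, `Execution`, `renew`, and `findExpired` are hypothetical names, and the only ideas taken from the text are that a healthy worker keeps extending its lease and that the scan picks up RUNNING executions whose lease has lapsed. It assumes Java 16 or later for records.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical model of the lease cycle; none of these names are BLOGE APIs.
final class LeaseSketch {
    enum Status { RUNNING, COMPLETED }

    // An execution plus the instant at which its current lease lapses.
    record Execution(String id, Status status, Instant leaseExpiresAt) {}

    // Lease renewal: a healthy worker pushes the expiry forward by one lease duration.
    static Execution renew(Execution e, Instant now, Duration leaseDuration) {
        return new Execution(e.id(), e.status(), now.plus(leaseDuration));
    }

    // Expiry scan: RUNNING executions whose lease has lapsed are resume candidates.
    static List<String> findExpired(List<Execution> all, Instant now) {
        return all.stream()
                .filter(e -> e.status() == Status.RUNNING)
                .filter(e -> e.leaseExpiresAt().isBefore(now))
                .map(Execution::id)
                .toList();
    }
}
```

In this model, a worker that stops calling `renew` simply lets its lease run out, and the next scan lists its execution. Keeping the scan interval shorter than the lease duration, as in the configuration above, bounds how long a lapsed execution waits before it is noticed.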
Checkpoint types
BLOGE persists several kinds of checkpoint so resume is precise rather than all-or-nothing:
| Checkpoint | Purpose |
|---|---|
| NODE_OUTPUT | A node's produced output, so it is not re-executed on resume |
| FOREACH_PROGRESS | Per-item progress inside a sequential foreach |
Sequential foreach recovery
Each completed item in a sequential ForEachOperator is checkpointed under CheckpointType.FOREACH_PROGRESS. On resume, the engine restores the longest contiguous prefix of matching items and re-executes from the first missing or changed item. No schema migration is required — it reuses the existing bd_execution_checkpoint table and iteration_key.
foreach order in ctx.orders {
  node settle : SettleOrderOperator {
    input { order = item }
  }
}

If the process dies after settling items 0 and 1, a resumed execution restores that prefix and continues from item 2.
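
The "longest contiguous prefix" rule can be illustrated with a short Java sketch. Everything here is an assumption made for illustration: `ForeachPrefixSketch`, `ItemCheckpoint`, and `restorablePrefix` are hypothetical names, and the sketch assumes that an item "matches" when a fingerprint stored with its checkpoint equals the fingerprint of the item currently at that index. BLOGE's actual matching logic may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of prefix restore; not the engine's actual code.
final class ForeachPrefixSketch {
    // A saved per-item checkpoint: the item's index and the fingerprint of the item it was computed for.
    record ItemCheckpoint(int index, String itemFingerprint) {}

    // Returns how many leading items can be restored. The count stops at the first
    // index whose checkpoint is missing or whose fingerprint no longer matches.
    static int restorablePrefix(List<String> currentFingerprints, List<ItemCheckpoint> saved) {
        Map<Integer, String> byIndex = new HashMap<>();
        for (ItemCheckpoint cp : saved) {
            byIndex.put(cp.index(), cp.itemFingerprint());
        }
        int i = 0;
        while (i < currentFingerprints.size()
                && currentFingerprints.get(i).equals(byIndex.get(i))) {
            i++;
        }
        return i; // execution resumes at item i
    }
}
```

Note that a checkpoint beyond the first gap is ignored in this model: if items 0, 1, and 3 were saved but item 2 was not, only items 0 and 1 are restored and execution continues from item 2.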
Operator-fingerprint safety
A node's operator fingerprint is embedded in NodeSpec and persisted with each NODE_OUTPUT checkpoint. On resume, a mismatch between the saved and current fingerprint means the operator's definition changed since the checkpoint was written. The VersionMismatchPolicy decides what to do:
| Policy | Behavior |
|---|---|
| WARN (default) | Log a warning and reuse the checkpoint |
| RERUN | Discard the checkpoint and re-execute the node |
| FAIL | Abort resume with an error |
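
The policy table can be read as a small decision function. The Java sketch below is a hypothetical model of it, not the library's code: the enum here is a local stand-in that only borrows the `VersionMismatchPolicy` name and its three values from the text, while `Decision` and `decide` are invented for illustration. Logging is left to the caller.

```java
// Hypothetical model of the fingerprint check; not BLOGE's implementation.
final class FingerprintPolicySketch {
    // Local stand-in mirroring the three documented policy values.
    enum VersionMismatchPolicy { WARN, RERUN, FAIL }

    // What the resume path should do with a saved NODE_OUTPUT checkpoint.
    enum Decision { REUSE, REUSE_WITH_WARNING, RERUN }

    static Decision decide(String savedFingerprint, String currentFingerprint,
                           VersionMismatchPolicy policy) {
        // Matching fingerprints: the operator is unchanged, so the policy is not consulted.
        if (savedFingerprint.equals(currentFingerprint)) {
            return Decision.REUSE;
        }
        return switch (policy) {
            case WARN -> Decision.REUSE_WITH_WARNING; // caller logs a warning, then reuses
            case RERUN -> Decision.RERUN;             // caller discards the checkpoint and re-executes
            case FAIL -> throw new IllegalStateException(
                    "Operator fingerprint changed since checkpoint: "
                            + savedFingerprint + " -> " + currentFingerprint);
        };
    }
}
```

WARN favors availability and suits operator changes that do not alter output; RERUN and FAIL are the safer choices when a changed operator could make a saved output stale.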
Nested checkpoint resume
Sub-graph based nodes derive deterministic child execution IDs so inner NODE_OUTPUT checkpoints survive across outer-execution retries:
- Plain subgraphs — scoped by parent execution ID + node ID.
- Foreach item bodies — scoped by parent execution ID + foreach node ID + item index + item fingerprint.
- Loop iteration bodies — scoped by parent execution ID + loop node ID + iteration number.
This means a retried outer execution does not throw away the work an inner sub-graph already completed.
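
One way to picture a deterministic child execution ID is as a name-based UUID over the scoping fields listed above. The sketch below is an assumption about how such an ID could be derived, not a description of BLOGE's actual scheme: the class, the method names, the `/` separator, and the kind prefixes are all hypothetical. Only `java.util.UUID.nameUUIDFromBytes` is a real JDK call.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical derivation of child execution IDs; the real format is not documented here.
final class ChildExecutionIdSketch {
    // Name-based (type 3) UUID: the same scope string always yields the same ID.
    // Assumes the parts themselves never contain the "/" separator.
    private static String derive(String... scopeParts) {
        String scope = String.join("/", scopeParts);
        return UUID.nameUUIDFromBytes(scope.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Plain subgraph: parent execution ID + node ID.
    static String forSubgraph(String parentExecutionId, String nodeId) {
        return derive("subgraph", parentExecutionId, nodeId);
    }

    // Foreach item body: parent execution ID + foreach node ID + item index + item fingerprint.
    static String forForeachItem(String parentExecutionId, String foreachNodeId,
                                 int itemIndex, String itemFingerprint) {
        return derive("foreach", parentExecutionId, foreachNodeId,
                Integer.toString(itemIndex), itemFingerprint);
    }

    // Loop iteration body: parent execution ID + loop node ID + iteration number.
    static String forLoopIteration(String parentExecutionId, String loopNodeId, int iteration) {
        return derive("loop", parentExecutionId, loopNodeId, Integer.toString(iteration));
    }
}
```

Because the ID is a pure function of its scope, a retry of the outer execution computes the same child ID and finds the inner checkpoints again. In this model, a foreach item whose content changed gets a different fingerprint, and therefore a different child ID, so its earlier checkpoints are not reused.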
Grep-friendly recovery logs
Recovery and fallback paths emit structured log tokens instead of swallowing failures silently — for example [TIMER_RESTORE_FAILED], [FINGERPRINT_SKIPPED], and [SCHEMA_ENRICH_SKIPPED] — so you can alert on them in production.
Next steps
- Set up the durable store in Durable Flows.
- Combine recovery with Saga & Compensation for rollback on permanent failure.
- Watch resumes land in the Event Journal & Ops Console.