Operations and Troubleshooting
Production workflow operations are mostly about proving why Crab chose to run, skip, restore, or fail a stage.
First Commands to Run
crab run --validate
crab workflow status
crab workflow dag
crab run --dry --explain-miss
These commands are read-oriented and safe to run before changing the workspace.
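The four commands above can be wrapped in a small preflight function. This is a minimal sketch, assuming `crab` is on `PATH` and that each command exits non-zero on failure (exit codes are not documented here). The function name `crab_preflight` is hypothetical.

```shell
# Hypothetical helper: run the read-oriented checks in order and stop at the
# first failure. Assumes crab exits non-zero on failure (not confirmed above).
crab_preflight() {
    crab run --validate || { echo "preflight: validate failed" >&2; return 1; }
    crab workflow status || { echo "preflight: status failed" >&2; return 1; }
    crab workflow dag || { echo "preflight: dag failed" >&2; return 1; }
    crab run --dry --explain-miss || { echo "preflight: dry run failed" >&2; return 1; }
    echo "preflight: ok"
}
```

Calling `crab_preflight` before any workspace change keeps the four checks in one place.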
Stage Keeps Missing Cache
Run:
crab run --dry --explain-miss train
Check:
- A dep file changed content or path.
- A declared param changed.
- The command string changed.
- An allowlisted env var changed.
- nondeterministic or always_changed is set.
- The stage was run with --force or --no-run-cache.
Fix by updating the intended input, correcting the stage declaration, or
accepting the new cache entry with crab run --cache-push.
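To see why any single change in the list above causes a miss, it can help to picture the cache key as a fingerprint over every declared input. The sketch below is an illustration only and is not Crab's actual key derivation: the function name `stage_fingerprint` and its argument order are hypothetical, and POSIX `cksum` stands in for whatever hash Crab uses.

```shell
# Illustration only: NOT Crab's real algorithm.
# Usage: stage_fingerprint CMD PARAMS ENV_VALUE DEP_FILE...
# Combines the command string, declared params, one allowlisted env value,
# and each dependency's path and content into a single checksum.
# The body runs in a subshell so its variables do not leak into the caller.
stage_fingerprint() (
    cmd=$1
    params=$2
    env_value=$3
    shift 3
    {
        printf 'cmd=%s\nparams=%s\nenv=%s\n' "$cmd" "$params" "$env_value"
        for dep in "$@"; do
            # Path and content both feed the key, so a rename also misses.
            printf 'dep=%s\n' "$dep"
            cat "$dep"
        done
    } | cksum | cut -d ' ' -f 1
)
```

Two calls with identical arguments and file contents print the same value. Changing the command, a param, the env value, or a dep file changes it, which is the behavior `--explain-miss` reports on.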
Missing Dependency
If a dependency is absent from the workspace:
crab run --pull
If the lockfile proves the missing dependency is unchanged and the stage can be skipped:
crab run --pull --allow-missing
If the dependency is an output of another stage, check the DAG:
crab workflow dag
Dirty or Conflicting Outputs
If a cache hit would overwrite local work, use:
crab run --no-overwrite
Then inspect the changed output:
git status --short
git diff -- path/to/output
Commit, move, or discard the local output before replaying cache.
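The check-then-replay order above can be sketched as a guard function. This is a minimal sketch: the name `safe_replay` is hypothetical, and it relies only on `git status --short` and the `crab run --no-overwrite` command shown above.

```shell
# Hypothetical helper: refuse to replay cache while an output has local changes.
# Usage: safe_replay PATH_TO_OUTPUT
safe_replay() {
    output_path=$1
    # `git status --short -- PATH` prints nothing when PATH is clean.
    # Note: paths ignored by git are not reported, so they pass this check.
    if [ -n "$(git status --short -- "$output_path")" ]; then
        echo "local changes in $output_path; commit, move, or discard them first" >&2
        return 1
    fi
    crab run --no-overwrite
}
```

For example, `safe_replay path/to/output` runs Crab only when git reports no modified or untracked content at that path.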
Scheduler Lock Is Held
Another crab run may be active:
crab run --no-wait
crab run --lock-timeout 30
If a previous run crashed, inspect journals:
crab workflow journal ls
crab workflow journal show <run-id>
Abandon a stuck run:
crab run --abandon <run-id>
Only abandon runs you know are no longer active.
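The sequence above (try to run, then inspect journals on failure) can be sketched as one function. This is a minimal sketch under the assumption that `crab run --lock-timeout 30` exits non-zero when the lock is not acquired. The name `run_or_inspect` is hypothetical, and the function deliberately never calls `--abandon`, since that decision needs a human.

```shell
# Hypothetical helper: attempt a run with the lock timeout shown above and,
# on any failure, list journals so a crashed run can be identified by hand.
# Assumes a non-zero exit when the lock is not acquired (not confirmed above).
run_or_inspect() {
    if crab run --lock-timeout 30; then
        return 0
    fi
    echo "run did not complete; recent journals:" >&2
    crab workflow journal ls >&2
    return 1
}
```

A non-zero exit here may also come from a failing stage rather than a held lock, so read the journal listing before abandoning anything.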
Interrupted Run
Crab journals stage progress so interrupted runs can be inspected and resumed. If outputs exist but the journal did not record final hashes, use:
crab run --resume-trust-outputs
Use this only when you trust the existing outputs. Otherwise delete or move the outputs and rerun.
Lockfile Merge Conflict
Resolve conflict markers:
crab workflow lockfile resolve crab.lock
crab run --validate
crab run --dry
Commit the resolved lockfile after a successful validation or rerun.
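After the resolve command has run, a quick guard can confirm that no Git conflict markers remain before validating. This is a minimal sketch: the name `check_lockfile` is hypothetical, and the patterns are the standard Git conflict markers, not anything specific to Crab.

```shell
# Hypothetical helper: fail if Git conflict markers remain in the lockfile,
# otherwise run the validation and dry-run commands shown above.
# Usage: check_lockfile [LOCKFILE]   (defaults to crab.lock)
check_lockfile() {
    lockfile=${1:-crab.lock}
    # Standard Git markers start a line: <<<<<<<, =======, >>>>>>>
    if grep -Eq '^(<<<<<<<|=======|>>>>>>>)( |$)' "$lockfile"; then
        echo "$lockfile still contains conflict markers" >&2
        return 1
    fi
    crab run --validate && crab run --dry
}
```

Running `check_lockfile crab.lock` before committing catches a half-resolved merge early.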
Experiment Cleanup
List experiments:
crab exp ls
Clean temporary worktrees and stale queue state:
crab exp clean
Prune old metadata:
crab exp gc --keep 100 --dry-run
crab exp gc --keep 100
Journal Cleanup
crab workflow journal gc --keep 50
Keep enough journals to debug recent CI failures. Long-term reproducibility should come from committed workflow files, lockfiles, metrics, and remote cache, not local journals.
Production Checklist
- crab run --validate passes.
- crab workflow dag shows the expected producer and consumer edges.
- crab run --dry --explain-miss explains every stale stage.
- crab.lock or split lockfiles are committed.
- CI either proves --cache-only or publishes with --cache-push.
- Experiment metadata that needs sharing is pushed with crab exp push.
- Credentials live in the environment or local config, not in committed docs or workflow files.
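The "lockfiles are committed" item can be checked mechanically with git. This is a minimal sketch: the name `lockfile_committed` is hypothetical, and it checks one file at a time, so split lockfiles would each need their own call.

```shell
# Hypothetical helper: succeed only if the lockfile is tracked by git and
# matches the last commit.
# Usage: lockfile_committed [LOCKFILE]   (defaults to crab.lock)
lockfile_committed() {
    lockfile=${1:-crab.lock}
    # `git ls-files --error-unmatch` fails when the path is not tracked.
    git ls-files --error-unmatch -- "$lockfile" >/dev/null 2>&1 || {
        echo "$lockfile is not tracked by git" >&2
        return 1
    }
    # `git diff --quiet HEAD` exits non-zero when the working tree copy
    # differs from HEAD (staged or unstaged changes).
    git diff --quiet HEAD -- "$lockfile" || {
        echo "$lockfile has uncommitted changes" >&2
        return 1
    }
}
```

In CI this can run right after the validation step, for example `lockfile_committed crab.lock || exit 1`.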