Crab Workflow
Crab Workflow is the production pipeline layer for repositories that already use Crab for large files. It lets you define stages, track dependencies and outputs, cache stage results, compare metrics, and run experiments without adding a separate data server.
Use Workflow when a repository has commands that should be reproducible: data preparation, model training, evaluation, report generation, asset builds, or any expensive step where "same inputs, same outputs" means a previous result can be reused instead of recomputed.
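To make the "same inputs, same outputs" idea concrete, here is a minimal Python sketch of a stage cache keyed by a fingerprint of the command, dependency hashes, and params. It is an illustration of the general technique only: the function names, the in-memory `cache` dict, and the choice of SHA-256 over a JSON payload are assumptions, not Crab's actual cache format.

```python
import hashlib
import json


def stage_fingerprint(command, dep_hashes, params):
    """Return a stable digest for one stage invocation (illustrative only)."""
    payload = {
        "cmd": command,
        # Sort so that declaration order does not change the key.
        "deps": sorted(dep_hashes.items()),
        "params": sorted(params.items()),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


cache = {}  # fingerprint -> recorded outputs (hypothetical in-memory cache)


def run_stage(command, dep_hashes, params, execute):
    """Call `execute` only when no cached result exists for these inputs."""
    key = stage_fingerprint(command, dep_hashes, params)
    if key in cache:
        return cache[key], True  # cache hit: reuse the earlier outputs
    outputs = execute()
    cache[key] = outputs
    return outputs, False  # cache miss: the stage actually ran
```

Calling `run_stage` twice with identical arguments runs the stage once and reports a hit the second time; changing any dependency hash or param value produces a new fingerprint and a fresh run.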
What Workflow Adds
Crab's file layer handles large content in git by storing pointers in commits and bytes in object storage. Workflow sits above that layer:
| Layer | What it does | Typical commands |
|---|---|---|
| Git | Tracks source, workflow files, params, lockfiles, and pointer blobs | `git commit`, `git diff` |
| Crab file storage | Chunks, deduplicates, hydrates, dehydrates, pushes, and pulls large files | `crab add`, `crab push`, `crab hydrate` |
| Crab Workflow | Runs reproducible stages, caches outputs, compares params and metrics, and manages experiments | `crab run`, `crab exp`, `crab metrics`, `crab plots` |
The workflow layer is additive. A repo can use Crab for large files without Workflow, and a Workflow repo still uses normal git commits and branches.
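The split between pointers in commits and bytes in object storage can be sketched in a few lines of Python. This is a conceptual model, not Crab's on-disk format: the fixed chunk size, the pointer layout, and the `store_file` and `hydrate` helper names are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow


def store_file(data, object_store):
    """Split bytes into fixed-size chunks, store each once, return a pointer."""
    chunk_ids = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        # Identical chunks hash to the same id, so they are stored only once.
        object_store.setdefault(chunk_id, chunk)
        chunk_ids.append(chunk_id)
    # The pointer is small enough to commit; the bytes stay in object storage.
    return {"size": len(data), "chunks": chunk_ids}


def hydrate(pointer, object_store):
    """Rebuild the original bytes from a pointer and the object store."""
    return b"".join(object_store[chunk_id] for chunk_id in pointer["chunks"])
```

In this model git only ever sees the small pointer dictionary, which is why a workflow stage can declare a large output without bloating the repository history.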
Production Model
A workflow repo usually commits these files:
- `crab.yaml` and optional `*.workflow.yaml` files declare stages.
- `params.yaml` or other params files hold values used by stages.
- `crab.lock` and optional `*.workflow.lock` files record the exact state that ran successfully.
- Source files, scripts, configs, and small metrics live in git.
- Large stage outputs can be Crab-managed files backed by object storage.
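One way to picture what a lockfile buys you is a drift check: compare the hashes recorded for a stage with the current ones and report anything that changed. The sketch below assumes a lock entry shaped as a dict with `deps` and `params` maps; that shape and the helper names are hypothetical and are not taken from the real `crab.lock` schema.

```python
def changed_entries(locked, current):
    """Return names whose recorded value differs from, or is missing in, `current`."""
    names = set(locked) | set(current)
    return sorted(name for name in names if locked.get(name) != current.get(name))


def is_up_to_date(lock_entry, current_deps, current_params):
    """A stage is up to date when neither deps nor params drifted from the lock."""
    return (
        not changed_entries(lock_entry["deps"], current_deps)
        and not changed_entries(lock_entry["params"], current_params)
    )


# Hypothetical lock entry for a single training stage.
lock_entry = {
    "deps": {"data/train.csv": "9f2c", "train.py": "41aa"},
    "params": {"lr": 0.1, "epochs": 5},
}
```

Because the lockfile is committed, checking out an older commit restores the recorded state alongside the source, so the same comparison works for any point in history.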
Local runtime state stays out of git:
- `.crab/cache/stages/` stores local stage cache entries.
- `.crab/workflow/runs/` stores run journals for crash recovery and audit.
- `.crab/workflow/exp/` stores local experiment metadata.
- Queued experiment logs and temporary worktrees are local until pushed.
DVC-Style, Crab-Native
Workflow is inspired by DVC's pipeline and experiment model, but it is designed to live inside Crab's serverless git remote:
- Stages use familiar `deps`, `outs`, `params`, `metrics`, and `plots`.
- `crab repro` is an alias for `crab run`.
- `crab stage add` authors DVC-style stage definitions.
- `crab exp` runs experiments in temporary worktrees.
- `crab queue` and `crab exp queue` run batches of experiment overrides.
- `crab workflow push-cache` shares stage cache through the configured Crab remote.
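A batch of experiment overrides is essentially a grid of parameter combinations, each applied on top of the committed params. The following Python sketch shows that expansion in isolation. It does not call Crab; `expand_overrides` and `apply_overrides` are hypothetical helpers, and flat (non-nested) param keys are assumed for brevity.

```python
import itertools


def expand_overrides(grid):
    """Yield one override dict per combination of the listed values."""
    keys = sorted(grid)  # fixed key order keeps the queue order deterministic
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def apply_overrides(base_params, overrides):
    """Return a copy of the base params with the overrides applied."""
    merged = dict(base_params)
    merged.update(overrides)
    return merged


base = {"lr": 0.5, "seed": 1}
queue = [
    apply_overrides(base, overrides)
    for overrides in expand_overrides({"lr": [0.1, 0.01], "epochs": [5, 10]})
]
```

Each entry in `queue` is an independent set of params, which matches the idea of running every experiment in its own temporary worktree so that runs cannot interfere with each other.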
If you are moving from DVC, start with Migrating from DVC.
Recommended Learning Path
- Quickstart - enable Workflow and run a small pipeline.
- Concepts - understand stages, cache, lockfiles, journals, and experiments.
- Authoring Stages - write production `crab.yaml` files.
- Running Pipelines - use target modes, cache controls, and JSON output.
- Experiments and Hydra - compare parameterized runs.
- Remote Cache and CI - make cache reuse work for teams and automation.
- Operations and Troubleshooting - recover from common failures.
Command-level details remain in Automation & Pipelines and CLI Reference.