Authoring Stages
You can author stages with crab stage add or by editing crab.yaml directly.
Use the helper when you want DVC-style command ergonomics. Edit YAML directly
when the stage uses advanced fields, shared defaults, foreach, or matrix.
Use crab stage add
    crab stage add -n prepare \
      -d src/prepare.py \
      -d data/raw.csv \
      -o data/prepared.csv \
      python src/prepare.py

Add metrics and plots:
    crab stage add -n train \
      -d src/train.py \
      -d data/prepared.csv \
      -p train.lr \
      -p train.epochs \
      -o models/model.pkl \
      -m metrics/train.json \
      --plots metrics/loss.csv \
      python src/train.py

Use --force to replace an existing stage definition. Use --run when the
stage should run immediately after being written.
Write YAML Directly
    params:
      - params.yaml
    stages:
      train:
        desc: Train the model used by batch scoring
        cmd: python src/train.py --config params.yaml
        deps:
          - src/train.py
          - data/features.parquet
        params:
          - train.lr
          - train.epochs
        outs:
          - models/model.pkl
        metrics:
          - metrics/train.json
        plots:
          - metrics/loss.csv:
              x: epoch
              y: loss

Keep stages narrow. One stage should produce one logical result. If a command downloads data, trains a model, and publishes a report, split it unless those steps must always be invalidated together.
Command Forms
Use a shell string for common pipelines:
    cmd: python src/train.py && python src/evaluate.py

Use a list when you want to preserve separate command lines:

    cmd:
      - python src/train.py
      - python src/evaluate.py

Prefer explicit scripts over long inline shell blocks. That keeps workflow diffs small and makes stage hashes easier to reason about.
Working Directories
wdir runs the command from a subdirectory. Relative deps, outs, metrics, and
stage-local params are resolved from that working directory.
    stages:
      train:
        wdir: pipelines/training
        cmd: python train.py
        deps:
          - train.py
          - ../../data/features.parquet
        outs:
          - model.pkl

Use wdir for subprojects. Do not use it to hide cross-project dependencies;
declare those paths explicitly.
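
To make the path resolution concrete, here is a minimal Python sketch of how the relative paths in the wdir example map back to project-root paths. It assumes resolution is a plain join onto wdir followed by lexical normalization; the helper name `resolve_from_wdir` is hypothetical and is not part of crab.

```python
import posixpath


def resolve_from_wdir(wdir: str, path: str) -> str:
    """Join a stage-relative path onto wdir and normalize it (illustrative only)."""
    return posixpath.normpath(posixpath.join(wdir, path))


wdir = "pipelines/training"
stage_paths = ["train.py", "../../data/features.parquet", "model.pkl"]

# Map each path as written in the stage to its project-root equivalent.
resolved = {p: resolve_from_wdir(wdir, p) for p in stage_paths}
for written, full in resolved.items():
    print(f"{written} -> {full}")
```

Under this assumption, `../../data/features.parquet` points at `data/features.parquet` in the project root, which is why the cross-project dependency stays visible in the stage definition.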
Templating
Use ${...} to reuse values from vars and params files:
    vars:
      - codedir: src
      - params.yaml: [paths, train]
    stages:
      prepare:
        cmd: python ${codedir}/prepare.py --out ${paths.prepared}
        deps:
          - ${codedir}/prepare.py
          - ${paths.raw}
        outs:
          - ${paths.prepared}

Template values should describe stable project structure, not per-run choices. Use params and experiments for per-run choices.
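
As a rough illustration of what ${...} substitution does, the following Python sketch renders the prepare command from a nested context. The values under `paths` are assumed for the example (the real ones would come from params.yaml), and `lookup` and `render` are hypothetical helpers; crab's actual resolver may handle escaping, lists, and errors differently.

```python
import re

# Assumed context: what vars and the selected params.yaml sections might provide.
context = {
    "codedir": "src",
    "paths": {"raw": "data/raw.csv", "prepared": "data/prepared.csv"},
}

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(dotted: str, values: dict):
    """Walk a dotted key such as 'paths.prepared' through nested dicts."""
    current = values
    for part in dotted.split("."):
        current = current[part]
    return current


def render(template: str, values: dict) -> str:
    """Replace every ${key} placeholder with its value from the context."""
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1).strip(), values)), template)


rendered = render("python ${codedir}/prepare.py --out ${paths.prepared}", context)
print(rendered)
```

With the assumed values, the command renders to `python src/prepare.py --out data/prepared.csv`. A missing key raises `KeyError` in this sketch, which mirrors the general idea that an unresolved template should fail loudly rather than run with a literal placeholder.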
Foreach and Matrix
Use foreach when one stage shape repeats over a list or map:
    stages:
      featurize:
        foreach:
          small:
            window: 8
          large:
            window: 32
        do:
          cmd: python src/featurize.py --window ${item.window} --out data/${key}.parquet
          deps:
            - src/featurize.py
            - data/raw.csv
          outs:
            - data/${key}.parquet

Use matrix for cartesian products:
    stages:
      train:
        matrix:
          model: [linear, forest]
          seed: [1, 2, 3]
        cmd: python src/train.py --model ${item.model} --seed ${item.seed}
        deps:
          - src/train.py
          - data/features.parquet
        outs:
          - models/${item.model}-${item.seed}.pkl

Expanded stage names appear in DAG and status output with @ suffixes.
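
The cartesian product in the matrix example can be sketched in Python with `itertools.product`. This only shows which combinations exist and what the command and output path look like for each one; how crab orders the combinations and names the expanded stages is not covered here.

```python
from itertools import product

# The two axes from the matrix example.
matrix = {"model": ["linear", "forest"], "seed": [1, 2, 3]}

# Build one dict per combination, iterating axes in declaration order.
combinations = [dict(zip(matrix, values)) for values in product(*matrix.values())]

for item in combinations:
    cmd = f"python src/train.py --model {item['model']} --seed {item['seed']}"
    out = f"models/{item['model']}-{item['seed']}.pkl"
    print(cmd, "->", out)
```

Two models and three seeds give six combinations, so each value added to an axis multiplies the number of stages. Keep axes short when the command is expensive.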
Frozen and Nondeterministic Stages
Freeze a stage when upstream changes should not trigger it:
    crab freeze train
    crab unfreeze train

YAML form:
    stages:
      train:
        frozen: true

Use nondeterministic: true or DVC-compatible always_changed: true for
commands that should run every time, such as live data imports or timestamped
reports:
    stages:
      ingest:
        nondeterministic: true

Prefer deterministic stages for production pipelines. Nondeterministic stages are useful at boundaries, but they reduce cache reuse.
External Dependencies and Outputs
Stages can depend on local paths, upstream outs, URLs, and configured workflow remote aliases. Pin remote data with digests when reproducibility matters.
    stages:
      train:
        deps:
          - data/features.parquet
          - url: https://example.com/schema.json
            digest: b3:0000000000000000000000000000000000000000000000000000000000000000
        outs:
          - models/model.pkl

For object-store outputs, configure workflow remotes in .crab/config.toml and
use them consistently across stages. Keep credentials out of committed files.
Validate Before Running
    crab run --validate
    crab workflow dag

Validation catches duplicate outputs, cycles, invalid names, and malformed stage fields before an expensive command starts.
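
To illustrate two of these checks, here is a minimal Python sketch that detects duplicate outputs and dependency cycles in a small stage table. The `stages` dict, `duplicate_outputs`, and `has_cycle` are hypothetical and exist only for this example; crab's internal representation and its actual validation logic are not documented here.

```python
from collections import Counter
from graphlib import CycleError, TopologicalSorter

# Hypothetical in-memory stage table built from the stage add examples.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/prepared.csv"]},
    "train": {"deps": ["data/prepared.csv"], "outs": ["models/model.pkl"]},
}


def duplicate_outputs(stages: dict) -> list[str]:
    """Return output paths declared by more than one stage."""
    counts = Counter(out for s in stages.values() for out in s["outs"])
    return sorted(path for path, n in counts.items() if n > 1)


def has_cycle(stages: dict) -> bool:
    """Link each stage to the stages producing its deps, then look for a cycle."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    try:
        # prepare() raises CycleError when the graph cannot be ordered.
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


print(duplicate_outputs(stages), has_cycle(stages))
```

Both checks need only the declared deps and outs, which is why they can run before any stage command starts.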
For command-level details, see Running Commands and Workflow Pipelines.