foreach / matrix Reference
Stage expansion lets you define a stage template once and expand it into
multiple concrete stages — one per item (foreach) or one per
combination (matrix). Expanded stages are regular DAG nodes: they
participate in scheduling, caching, and the lockfile individually.
Table of Contents
- foreach syntax
- Available variables
- Expanded stage naming
- matrix syntax
- matrix variables
- Combining with templating
- Interaction with DAG, lockfile, and --validate
foreach syntax
A stage with foreach: and do: expands into one stage per item.
Three forms are supported.
List of scalars
stages:
  preprocess:
    foreach:
      - raw_a
      - raw_b
      - raw_c
    do:
      cmd: "python clean.py ${item}"
      deps:
        - ${item}.csv
      outs:
        - ${item}_clean.csv

Produces three stages: preprocess@raw_a, preprocess@raw_b,
preprocess@raw_c.
List of dicts
stages:
  train:
    foreach:
      - {name: small, lr: 0.01, epochs: 10}
      - {name: large, lr: 0.001, epochs: 100}
    do:
      cmd: "python train.py --lr ${item.lr} --epochs ${item.epochs}"
      deps:
        - train.py
      outs:
        - models/${item.name}.pkl

Produces: train@0, train@1. Access fields via ${item.field}.
Dict form
stages:
  build:
    foreach:
      uk:
        region: eu-west-2
        bucket: data-uk
      us:
        region: us-east-1
        bucket: data-us
    do:
      cmd: "python build.py --region ${item.region} --bucket ${item.bucket}"
      deps:
        - build.py
      outs:
        - output/${key}/

Produces: build@uk, build@us. The dict key is available as ${key}.
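The three forms above differ only in how the suffix and the item-local variables are derived. The following is a minimal illustrative sketch of that mapping, not crab's actual implementation; the function name expand_foreach and its return shape are assumptions made for this example.

```python
def expand_foreach(base, foreach):
    """Return (stage_name, variables) pairs for one foreach template.

    Hypothetical helper: mirrors the naming rules described above.
    """
    expanded = []
    if isinstance(foreach, dict):
        # Dict form: the suffix is the key, ${item} is the value mapping.
        for index, (key, item) in enumerate(foreach.items()):
            variables = {"item": item, "key": key, "index": index}
            expanded.append((f"{base}@{key}", variables))
    else:
        for index, item in enumerate(foreach):
            # List of dicts: suffix is the index. List of scalars: the value.
            suffix = index if isinstance(item, dict) else item
            variables = {"item": item, "index": index}
            expanded.append((f"{base}@{suffix}", variables))
    return expanded


for name, _ in expand_foreach("preprocess", ["raw_a", "raw_b", "raw_c"]):
    print(name)
```

The sketch relies on Python dicts preserving insertion order, so dict-form stages come out in the order they are declared.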
Available variables
Inside the do: block, these variables are available for substitution:
| Variable | Available when | Value |
|---|---|---|
| ${item} | List of scalars | The scalar value itself |
| ${item} | Dict form | The value mapping for the current key |
| ${item.field} | List of dicts or dict form | Nested field access |
| ${key} | Dict form | The dict key (e.g., uk, us) |
| ${index} | All forms | Zero-based iteration index |
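To make the table concrete, here is a small sketch of how ${name} and ${name.field} references could be substituted into a template string. It is an assumption-laden illustration rather than crab's resolver: the substitute function and the exact reference grammar accepted by the regular expression are hypothetical.

```python
import re

# Matches ${item}, ${key}, ${index}, and dotted forms such as ${item.lr}.
_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def substitute(template, variables):
    """Replace each ${...} reference in `template` using `variables`."""
    def lookup(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]  # raises if the reference is undefined
        return str(value)
    return _REF.sub(lookup, template)


print(substitute("python clean.py ${item}", {"item": "raw_a", "index": 0}))
print(substitute("output/${key}/", {"item": {"region": "eu-west-2"}, "key": "uk", "index": 0}))
```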
Examples
# List of scalars: ${item} = "raw_a", ${index} = 0
foreach: [raw_a, raw_b]

# List of dicts: ${item.name} = "small", ${index} = 0
foreach:
  - {name: small, lr: 0.01}

# Dict form: ${key} = "uk", ${item.region} = "eu-west-2", ${index} = 0
foreach:
  uk: {region: eu-west-2}
  us: {region: us-east-1}

Expanded stage naming
Expanded stages follow the pattern base@suffix:
| Source form | Suffix | Example names |
|---|---|---|
| List of scalars | @value | preprocess@raw_a, preprocess@raw_b |
| List of dicts | @index | train@0, train@1 |
| Dict form | @key | build@uk, build@us |
| Matrix | @val1-val2 | train@resnet-imagenet |
The @ separator is chosen because:
- It's not valid in the base stage-name grammar (no collisions with regular stages).
- It's URL-safe and shell-safe.
- DVC uses the same convention (familiar to migrating users).
Both the base name and the suffix must individually satisfy the stage
name grammar: [a-zA-Z_][a-zA-Z0-9_-]{0,63}. If an item value
produces an invalid suffix (e.g., contains spaces), the parser rejects
it with a clear error.
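The grammar check can be reproduced with a full-match against the pattern quoted above. This is only a sketch of the rule as stated; the helper name is_valid_name_part is hypothetical, and crab's parser may report the failure differently.

```python
import re

# Pattern copied from the grammar above: one leading letter or underscore,
# then up to 63 letters, digits, underscores, or hyphens.
NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]{0,63}")


def is_valid_name_part(text):
    """Return True if `text` satisfies the grammar in full."""
    return NAME_RE.fullmatch(text) is not None


print(is_valid_name_part("raw_a"))   # True
print(is_valid_name_part("raw a"))   # False: contains a space
```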
Name collisions
If expansion produces a stage name that collides with another stage (expanded or not), the parser fails:
Error: WorkflowExpandedNameCollision
name: "train@resnet"
sources: ["train (foreach item 0)", "train_resnet (explicit stage)"]

matrix syntax
matrix: expands a stage over the Cartesian product of multiple
variables. Use it for hyperparameter sweeps and multi-dimensional
exploration.
stages:
  train:
    matrix:
      model: [resnet, vgg, efficientnet]
      dataset: [imagenet, cifar10]
    cmd: "python train.py --model ${item.model} --data ${item.dataset}"
    deps:
      - train.py
      - data/${item.dataset}/
    outs:
      - models/${item.model}-${item.dataset}.pkl
    params:
      - model.lr
    metrics:
      - metrics/${item.model}-${item.dataset}.json

This produces 3 × 2 = 6 stages:
- train@resnet-imagenet
- train@resnet-cifar10
- train@vgg-imagenet
- train@vgg-cifar10
- train@efficientnet-imagenet
- train@efficientnet-cifar10
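The Cartesian product behind this expansion can be sketched with itertools.product. The expand_matrix helper below is hypothetical, and the assumption that combinations are enumerated with the last variable varying fastest is taken from the order of the stage names listed above rather than from crab's source.

```python
from itertools import product


def expand_matrix(base, matrix):
    """Return (stage_name, variables) pairs, one per combination."""
    names = list(matrix)  # variable names in declaration order
    expanded = []
    for values in product(*(matrix[name] for name in names)):
        item = dict(zip(names, values))
        key = "-".join(str(value) for value in values)  # hyphen-joined suffix
        expanded.append((f"{base}@{key}", {"item": item, "key": key}))
    return expanded


stages = expand_matrix("train", {
    "model": ["resnet", "vgg", "efficientnet"],
    "dataset": ["imagenet", "cifar10"],
})
print(len(stages))  # 6
```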
Matrix with more dimensions
stages:
  sweep:
    matrix:
      arch: [resnet, vgg]
      optimizer: [adam, sgd]
      scheduler: [cosine, step]
    cmd: "python train.py --arch ${item.arch} --opt ${item.optimizer} --sched ${item.scheduler}"
    outs:
      - results/${item.arch}-${item.optimizer}-${item.scheduler}.json

Produces 2 × 2 × 2 = 8 stages. The suffix joins all values with
hyphens: sweep@resnet-adam-cosine, sweep@vgg-sgd-step, etc.
matrix variables
Inside a matrix stage, these variables are available:
| Variable | Value |
|---|---|
| ${item.var} | The value of matrix variable var for this combination |
| ${key} | Hyphen-joined combination (e.g., resnet-imagenet) |
matrix:
  model: [resnet, vgg]
  dataset: [imagenet, cifar10]
# ${item.model} = "resnet"
# ${item.dataset} = "imagenet"
# ${key} = "resnet-imagenet"

Dynamic matrix values
Matrix values can reference ${param} expressions that resolve before
expansion:
vars:
  - models_to_test: [resnet, vgg]
stages:
  train:
    matrix:
      model: ${models_to_test}
      dataset: [imagenet, cifar10]
    cmd: "python train.py --model ${item.model} --data ${item.dataset}"

The ${models_to_test} reference is resolved first (from vars or
params), then the matrix is expanded over the resulting list.
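The two-phase behavior (resolve references, then expand) might look like the sketch below. It assumes a matrix value is either a literal list or a single whole-value ${name} reference; resolve_matrix and that restriction are illustrative assumptions, not documented crab behavior.

```python
import re

# A matrix value that consists of exactly one ${name} reference.
_WHOLE_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_matrix(matrix, context):
    """Replace whole-value references with lists from `context`."""
    resolved = {}
    for name, values in matrix.items():
        if isinstance(values, str):
            match = _WHOLE_REF.fullmatch(values)
            if match:
                values = context[match.group(1)]
        if not isinstance(values, list):
            raise TypeError(f"matrix variable {name!r} did not resolve to a list")
        resolved[name] = values
    return resolved


matrix = {"model": "${models_to_test}", "dataset": ["imagenet", "cifar10"]}
print(resolve_matrix(matrix, {"models_to_test": ["resnet", "vgg"]}))
```

Only after this step does the matrix contain plain lists that can be fed to a Cartesian product.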
Combining with templating
Global vars: and params are available inside foreach and matrix
stage templates. All substitution sources combine:
vars:
  - script_dir: src/ml
  - output_base: results
stages:
  train:
    matrix:
      model: [resnet, vgg]
      dataset: [imagenet, cifar10]
    cmd: "python ${script_dir}/train.py --model ${item.model} --data ${item.dataset}"
    deps:
      - ${script_dir}/train.py
      - data/${item.dataset}/
    params:
      - model.lr
      - model.epochs
    outs:
      - ${output_base}/${item.model}-${item.dataset}.pkl

Resolution order within an expanded stage:
- Item-local variables (${item}, ${key}, ${index})
- Global vars:
- Params files
- Environment (${env.VAR})
Item-local variables take precedence. If your params file has a key
named item, it's shadowed inside foreach/matrix templates.
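One way to picture this precedence is a layered lookup in which the first layer that defines a name wins. The sketch below uses collections.ChainMap for that; build_scope and the choice to expose the environment under an env key are assumptions for illustration only.

```python
from collections import ChainMap


def build_scope(item_local, global_vars, params, env):
    """Layer the four sources; earlier mappings shadow later ones."""
    return ChainMap(item_local, global_vars, params, {"env": env})


scope = build_scope(
    item_local={"item": "raw_a", "index": 0},
    global_vars={"script_dir": "src/ml"},
    params={"item": "from-params", "model": {"lr": 0.1}},
    env={"HOME": "/home/user"},
)
print(scope["item"])        # raw_a: the params key named item is shadowed
print(scope["script_dir"])  # src/ml
```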
Interaction with DAG, lockfile, and --validate
DAG
Expanded stages appear as individual nodes in the DAG. Dependencies between expanded stages and other stages work normally:
stages:
  preprocess:
    foreach: [raw_a, raw_b]
    do:
      cmd: "python clean.py ${item}"
      outs:
        - ${item}_clean.csv
  merge:
    cmd: "python merge.py"
    deps:
      - raw_a_clean.csv
      - raw_b_clean.csv
    outs:
      - merged.csv

The DAG infers edges: preprocess@raw_a → merge and
preprocess@raw_b → merge (via matching output/dep paths).
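Edge inference by path matching can be sketched as a lookup from each output path to the stage that produces it. This simplified version assumes exact string equality between an output and a dependency; infer_edges is a hypothetical helper, and crab may also handle directories or normalized paths.

```python
def infer_edges(stages):
    """Return (upstream, downstream) pairs where an out matches a dep."""
    producer = {
        out: name
        for name, spec in stages.items()
        for out in spec.get("outs", [])
    }
    edges = set()
    for name, spec in stages.items():
        for dep in spec.get("deps", []):
            upstream = producer.get(dep)
            if upstream is not None and upstream != name:
                edges.add((upstream, name))
    return edges


expanded = {
    "preprocess@raw_a": {"outs": ["raw_a_clean.csv"]},
    "preprocess@raw_b": {"outs": ["raw_b_clean.csv"]},
    "merge": {"deps": ["raw_a_clean.csv", "raw_b_clean.csv"], "outs": ["merged.csv"]},
}
for upstream, downstream in sorted(infer_edges(expanded)):
    print(f"{upstream} -> {downstream}")
```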
View the expanded DAG:
crab dag
# Shows: preprocess@raw_a, preprocess@raw_b, merge
# with edges from both preprocess stages to merge

Lockfile
The lockfile records expanded stages individually. No foreach or
matrix structure is preserved:
# crab.lock (excerpt)
stages:
  preprocess@raw_a:
    stage_hash: "b3:abc..."
    cmd:
      shell: "python clean.py raw_a"
    deps:
      - path: "raw_a.csv"
        hash: "b3:..."
    outs:
      - path: "raw_a_clean.csv"
        hash: "b3:..."
  preprocess@raw_b:
    stage_hash: "b3:def..."
    cmd:
      shell: "python clean.py raw_b"
    # ...

Each expanded stage has its own hash and is cached independently. Changing one item's input only invalidates that item's stage.
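The independence of per-stage caching can be illustrated with a toy fingerprint. Everything here is an assumption: the inputs that feed the real stage_hash are not specified above, and SHA-256 from the standard library stands in for whatever function the b3: prefix denotes.

```python
import hashlib
import json


def stage_fingerprint(cmd, dep_hashes):
    """Hash the resolved command together with the dependency hashes."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a_before = stage_fingerprint("python clean.py raw_a", {"raw_a.csv": "h1"})
b_before = stage_fingerprint("python clean.py raw_b", {"raw_b.csv": "h2"})

# Only raw_b.csv changes, so only preprocess@raw_b gets a new fingerprint.
a_after = stage_fingerprint("python clean.py raw_a", {"raw_a.csv": "h1"})
b_after = stage_fingerprint("python clean.py raw_b", {"raw_b.csv": "h3"})

print(a_before == a_after)  # True
print(b_before == b_after)  # False
```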
--validate
Validation expands all foreach and matrix stages and reports them
individually:
$ crab run --validate
Pipeline validated successfully.
Stages (5):
preprocess@raw_a ✓
preprocess@raw_b ✓
preprocess@raw_c ✓
merge ✓
report ✓
No undefined references.
No name collisions.

Validation catches:
- Undefined ${item.field} references (field doesn't exist on the item)
- Name collisions between expanded stages
- Empty foreach lists or matrix variables
- Invalid suffixes that violate the stage name grammar
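Two of the checks listed above, empty foreach lists and undefined ${item.field} references, are simple enough to sketch. The check_foreach helper and its message wording are hypothetical and do not reflect crab's real diagnostics.

```python
import re

_ITEM_FIELD = re.compile(r"\$\{item\.([A-Za-z_][A-Za-z0-9_]*)\}")


def check_foreach(stage, items, template):
    """Return a list of problem descriptions for one foreach stage."""
    if not items:
        return [f"{stage}: foreach list is empty"]
    problems = []
    fields = set(_ITEM_FIELD.findall(template))
    for index, item in enumerate(items):
        # Scalars expose no fields, so any ${item.field} is undefined.
        available = set(item) if isinstance(item, dict) else set()
        for field in sorted(fields - available):
            problems.append(f"{stage}: item {index} has no field '{field}'")
    return problems


items = [{"name": "small", "lr": 0.01}, {"name": "large"}]
print(check_foreach("train", items, "python train.py --lr ${item.lr}"))
```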
crab status
Status reports expanded stages individually:
$ crab status
preprocess@raw_a: up-to-date
preprocess@raw_b: outdated
changed deps:
- raw_b.csv (content modified)
preprocess@raw_c: up-to-date
merge: outdated (upstream: preprocess@raw_b)

Selective execution
Run a single expanded stage by its full name:
crab run preprocess@raw_b

Or use glob patterns to target a subset:
crab run --stages 'preprocess@*'
crab run --stages 'train@resnet-*'
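To preview which stages a pattern would select, shell-style matching can be approximated with Python's fnmatch module. This assumes crab's glob semantics resemble fnmatch, which is not confirmed above; select_stages is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_stages(names, pattern):
    """Return the stage names matching a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


all_stages = [
    "preprocess@raw_a", "preprocess@raw_b",
    "train@resnet-imagenet", "train@resnet-cifar10", "train@vgg-imagenet",
    "merge",
]
print(select_stages(all_stages, "preprocess@*"))
print(select_stages(all_stages, "train@resnet-*"))
```

Quoting the pattern on the command line, as in the examples above, keeps the shell from expanding * before crab sees it.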