Workflow Quickstart
This quickstart creates a two-stage pipeline, runs it, records metrics, and pushes the stage cache so another clone or CI job can reuse the result.
Prerequisites
You need a Crab repository with a configured remote:
mkdir ml-project
cd ml-project
git init
crab init crab://my-bucket/ml-project
crab config set workflow.enabled true
If this is an existing repository, run the same crab config set command from
the repo root. Workflow commands are disabled until workflow.enabled is true.
Create a Small Project
mkdir -p src data models metrics
cat > data/raw.csv <<'EOF'
x,y
1,2
2,4
3,6
EOF
cat > params.yaml <<'EOF'
train:
  multiplier: 2
EOF
Add tiny scripts for the example:
cat > src/prepare.py <<'EOF'
from pathlib import Path
raw = Path("data/raw.csv").read_text()
Path("data/prepared.csv").write_text(raw.replace(",", "\t"))
EOF
cat > src/train.py <<'EOF'
import json
from pathlib import Path
Path("models/model.txt").write_text("trained\n")
Path("metrics/train.json").write_text(json.dumps({"accuracy": 0.91}))
EOF
Define the Pipeline
Create crab.yaml:
params:
  - params.yaml
metrics:
  - metrics/train.json
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared.csv
    params:
      - train.multiplier
    outs:
      - models/model.txt
    metrics:
      - metrics/train.json
Validate before running:
crab run --validate
Run and Inspect
crab run
crab workflow status
crab metrics show
crab workflow dag --format mermaid
The first run executes both stages and writes crab.lock. A second run should
skip unchanged stages:
crab run --dry
Use --explain-miss when a stage runs but you expected a cache hit:
crab run --dry --explain-miss
Commit the Reproducible State
Commit the workflow definition, params, lockfile, scripts, and small metrics:
git add crab.yaml crab.lock params.yaml src data/raw.csv metrics/train.json
git commit -m "add reproducible workflow"
If your project produces large outputs such as models, datasets, or reports that should live in object storage, track them with Crab before committing:
crab track "models/**"
crab add models/model.txt
git add .gitattributes models/model.txt
git commit -m "track model output"
Share the Stage Cache
Push git state and stage cache:
crab push
crab workflow push-cache --all
For day-to-day runs, publish cache entries as each stage completes:
crab run --cache-push
A clone or CI job can reuse remote cache entries:
crab clone crab://my-bucket/ml-project
cd ml-project
crab config set workflow.enabled true
crab run --cache-only
With --cache-only, crab run exits with code 3 on a cache miss. That is useful in CI when a
job should prove the committed lockfile and remote cache are complete.
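The exit-code behavior above can be turned into a CI gate. The following is a minimal sketch, not part of Crab itself: the helper names classify_exit, check_cache, and main are hypothetical, and it assumes that exit code 0 means every stage was served from the cache and that any code other than 0 or 3 is a generic failure. Only the meaning of code 3 comes from this quickstart.

```python
import subprocess
import sys

# Exit code this quickstart documents for a cache miss under --cache-only.
CACHE_MISS_EXIT_CODE = 3


def classify_exit(returncode: int) -> str:
    """Map a process exit code to a short label for CI logs."""
    if returncode == 0:
        # Assumption: 0 means the run completed entirely from cache.
        return "cache-complete"
    if returncode == CACHE_MISS_EXIT_CODE:
        return "cache-miss"
    # Assumption: every other code is an unrelated failure.
    return "error"


def check_cache(command=("crab", "run", "--cache-only")) -> str:
    """Run the command and classify its exit code (hypothetical helper)."""
    completed = subprocess.run(list(command), check=False)
    return classify_exit(completed.returncode)


def main() -> None:
    """Fail the CI job unless the remote cache covers the lockfile."""
    status = check_cache()
    print(f"stage cache check: {status}")
    sys.exit(0 if status == "cache-complete" else 1)
```

A CI step could call main() from a small script after cloning and enabling workflows. Keeping "cache-miss" separate from "error" lets the job log say whether someone forgot to push the cache or whether the run failed for another reason.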
Next Steps
- Learn the mental model in Concepts.
- Add richer stages in Authoring Stages.
- Publish cache from CI with Remote Cache and CI.