CI/CD Integration
This guide walks through configuring Crab in CI/CD pipelines so your automation can clone repositories with large files, hydrate them for builds or training, and push results back to object storage — all without manual intervention.
Overview
Crab works in CI the same way it works locally: a single binary handles the git remote protocol and large-file materialization. The main differences in CI are:
- You install Crab as a build step rather than from a package manager
- Credentials come from environment variables or OIDC federation
- You typically hydrate only the files needed for the current job
- Workflow caching (crab run --cache-only) can skip expensive stages
Prerequisites
- A Crab-enabled repository (initialized with crab init)
- Cloud credentials (AWS, GCP, or Azure) available in your CI environment
- The crab binary accessible on $PATH
Installing Crab in CI
Download the latest release binary and place it on $PATH. The binary
is self-contained — no runtime dependencies beyond libc.
# Download and install (Linux x86_64)
curl -fsSL https://github.com/CrabBuild/crab-release/releases/latest/download/crab-linux-x86_64.tar.gz \
| tar -xz -C /usr/local/bin
# Create the git remote helper symlink
ln -sf /usr/local/bin/crab /usr/local/bin/git-remote-crab
# Verify
crab version
For macOS runners, replace linux-x86_64 with darwin-aarch64 or darwin-x86_64.
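For pipelines that run on more than one runner type, the asset name can be derived from uname instead of being hard-coded. The following is a minimal sketch under stated assumptions: crab_asset_suffix is a hypothetical helper (not part of Crab), and it covers only the three asset names mentioned in this guide.

```shell
# Hypothetical helper: map an OS name and CPU architecture (as reported by
# `uname -s` and `uname -m`) to a release asset suffix named in this guide.
crab_asset_suffix() {
  os="$1"
  arch="$2"
  case "$os/$arch" in
    Linux/x86_64)  echo "linux-x86_64" ;;
    Darwin/arm64)  echo "darwin-aarch64" ;;  # Apple silicon reports arm64
    Darwin/x86_64) echo "darwin-x86_64" ;;
    *)
      echo "unsupported platform: $os/$arch" >&2
      return 1
      ;;
  esac
}

# Usage (commented out so the snippet has no side effects):
# suffix="$(crab_asset_suffix "$(uname -s)" "$(uname -m)")" || exit 1
# curl -fsSL "https://github.com/CrabBuild/crab-release/releases/latest/download/crab-${suffix}.tar.gz" \
#   | tar -xz -C /usr/local/bin
```

Failing on an unknown platform, rather than falling back to a default, surfaces a misconfigured runner at install time instead of at the first crab command.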
Configuring Credentials
Crab reads cloud credentials from standard environment variables. Set these as secrets in your CI platform.
AWS (S3)
export AWS_ACCESS_KEY_ID="$SECRET_AWS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="$SECRET_AWS_SECRET"
export AWS_DEFAULT_REGION="us-east-1"
For OIDC-based authentication (recommended for GitHub Actions), use the aws-actions/configure-aws-credentials action to assume a role without long-lived keys.
GCP (GCS)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
Alternatively, use Workload Identity Federation for keyless authentication in GitHub Actions.
Azure (Blob Storage)
export AZURE_STORAGE_ACCOUNT="$SECRET_STORAGE_ACCOUNT"
export AZURE_STORAGE_KEY="$SECRET_STORAGE_KEY"
See the AWS credentials, GCP credentials, and Azure credentials pages for detailed provider setup.
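A missing secret usually shows up as an opaque authentication error deep inside a job. A preflight check can report it up front. The sketch below is one way to do that in plain POSIX shell; require_env is a hypothetical helper, not a Crab command, and the variable names in the usage line are the AWS ones listed above.

```shell
# Hypothetical helper: fail with a clear message if any named variable is
# unset or empty. Variable names are passed as arguments.
require_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required environment variables:$missing" >&2
    return 1
  fi
}

# Usage:
# require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || exit 1
```

The helper prints only variable names, never values, so it is safe to leave in job logs.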
Cloning and Hydrating
A typical CI job clones the repository and hydrates the files needed for the build:
# Clone the repository (pointer files only — fast)
git clone crab://my-bucket/my-repo .
# Hydrate only the files this job needs
crab hydrate 'data/inputs/**'
crab hydrate 'models/base.pkl'
Selective hydration keeps CI fast by downloading only what the current job requires. For jobs that need everything:
crab hydrate --all
See crab clone and crab hydrate for all available options.
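Downloads from object storage can fail transiently on shared runners. A generic retry wrapper around crab hydrate is one way to absorb that. This is a sketch only: retry is a hypothetical shell function, not a Crab feature, and this guide does not say whether Crab already retries downloads internally.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while true; do
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      echo "command failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage:
# retry 3 5 crab hydrate 'data/inputs/**'
```

A fixed delay keeps the sketch short. Exponential backoff would be a reasonable refinement when many jobs hit the same bucket at once.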
Pushing Results
After a build or training run produces new artifacts, stage and push them:
# Stage new outputs
crab add models/trained.pkl metrics/results.json
# Commit
git commit -m "CI: training run $(date -u +%Y%m%dT%H%M%S)"
# Push to object storage
git push
See crab add and crab push for detailed options.
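When a run produces no new artifacts (for example, every stage was served from cache), git commit exits non-zero because nothing is staged, and the job fails. A guard built on git diff --cached --quiet avoids that. The sketch below assumes crab add stages changes in the git index, as the flow above implies; commit_if_staged is a hypothetical helper.

```shell
# Hypothetical helper: commit with the given message only when the index
# contains staged changes. `git diff --cached --quiet` exits 0 when the
# index matches HEAD and 1 when there are staged differences.
commit_if_staged() {
  if git diff --cached --quiet; then
    echo "nothing staged; skipping commit"
    return 0
  fi
  git commit -q -m "$1"
}

# Usage:
# crab add models/trained.pkl metrics/results.json
# commit_if_staged "CI: training run $(date -u +%Y%m%dT%H%M%S)" && git push
```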
Using Workflow Caching in CI
If your repository uses crab.yaml pipelines, CI can replay cached
stages instead of re-running expensive computations:
# Fail if any stage is not cached (reproducibility check)
crab run --cache-only
# Or run normally and push new cache entries for teammates
crab run --cache-push
See crab run for the full option reference.
Example: GitHub Actions
name: Train and Push
on:
  push:
    branches: [main]
jobs:
  train:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # Required for OIDC
      contents: read
    steps:
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-crab-role
          aws-region: us-east-1
      - name: Install Crab
        run: |
          curl -fsSL https://github.com/CrabBuild/crab-release/releases/latest/download/crab-linux-x86_64.tar.gz \
            | tar -xz -C /usr/local/bin
          ln -sf /usr/local/bin/crab /usr/local/bin/git-remote-crab
          crab version
      - name: Clone repository
        run: git clone crab://my-bucket/my-repo .
      - name: Hydrate training data
        run: crab hydrate 'data/**' 'models/base.pkl'
      - name: Run training pipeline
        run: crab run --cache-push
      - name: Push results
        run: |
          git config user.email "ci@example.com"
          git config user.name "CI Bot"
          crab add models/ metrics/
          git commit -m "CI: training run ${{ github.sha }}"
          git push
Example: GitLab CI
stages:
  - train
variables:
  AWS_DEFAULT_REGION: us-east-1
train:
  stage: train
  image: ubuntu:22.04
  before_script:
    # Install Crab
    - apt-get update && apt-get install -y curl git
    - curl -fsSL https://github.com/CrabBuild/crab-release/releases/latest/download/crab-linux-x86_64.tar.gz
      | tar -xz -C /usr/local/bin
    - ln -sf /usr/local/bin/crab /usr/local/bin/git-remote-crab
    - crab version
    # Configure git
    - git config --global user.email "ci@example.com"
    - git config --global user.name "CI Bot"
  script:
    # Clone and hydrate
    - git clone crab://$CRAB_BUCKET/$CRAB_REPO .
    - crab hydrate 'data/**'
    # Run pipeline with cache
    - crab run --cache-push
    # Push results
    - crab add models/ metrics/
    - git commit -m "CI training run $CI_COMMIT_SHORT_SHA"
    - git push
  variables:
    AWS_ACCESS_KEY_ID: $CI_AWS_KEY_ID
    AWS_SECRET_ACCESS_KEY: $CI_AWS_SECRET
    CRAB_BUCKET: my-bucket
    CRAB_REPO: my-repo
Tips for CI Pipelines
Keep hydration selective. Only hydrate the files your job actually needs. This reduces download time and egress costs.
Use --cache-only for reproducibility checks. In PR pipelines, run
crab run --cache-only to verify that all stages have cached results
without re-executing anything. This catches cases where someone forgot
to push their cache.
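One way to apply this tip is to pick the crab run flag from the event that triggered the pipeline. The sketch below is an illustration, not a Crab feature: crab_run_flag is a hypothetical helper, and it relies on the event names that GitHub Actions (GITHUB_EVENT_NAME=pull_request) and GitLab CI (CI_PIPELINE_SOURCE=merge_request_event) set for PR and merge request pipelines.

```shell
# Hypothetical helper: choose the `crab run` flag from the CI event name.
# Pull/merge request pipelines get the read-only cache check; everything
# else runs normally and pushes new cache entries.
crab_run_flag() {
  case "$1" in
    pull_request|merge_request_event) echo "--cache-only" ;;
    *)                                echo "--cache-push" ;;
  esac
}

# Usage on GitHub Actions:
# crab run "$(crab_run_flag "${GITHUB_EVENT_NAME:-}")"
# Usage on GitLab CI:
# crab run "$(crab_run_flag "${CI_PIPELINE_SOURCE:-}")"
```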
Set --lock-timeout for parallel jobs. If multiple CI jobs push to
the same repository concurrently, increase the lock timeout to avoid
transient failures:
git push  # Crab handles ref CAS retries automatically
Dehydrate before pushing. If your job hydrated files that it did not modify, dehydrate them before committing to keep the git index clean:
crab dehydrate --all
crab add models/output.pkl
git commit -m "CI: new model"
git push
Cache the Crab binary. On GitHub Actions, cache the downloaded binary to avoid re-downloading on every run:
- name: Cache Crab binary
  uses: actions/cache@v4
  with:
    path: /usr/local/bin/crab
    key: crab-${{ runner.os }}-latest
Related Commands
- crab clone: clone a Crab repository
- crab hydrate: materialize file content from object storage
- crab dehydrate: replace files with pointer blobs
- crab add: stage files for Crab tracking
- crab push: push objects to the remote
- crab run: execute workflow stages with caching
- crab auth status: verify credential configuration