Importing Existing Buckets

If you already have files in cloud storage — datasets, models, media assets — you don't need to download them locally and re-upload through Crab. The import command reads objects directly from your bucket and creates a Crab-managed git repository in place.

What Import Does

Figure: Import flow. crab import reads files from an existing bucket and creates a Crab-managed git repository with pointers and metadata.

Import reads your existing files, chunks them, builds the Crab metadata (xorbs, shards, file indices), and creates a git repository with pointer blobs. The source files stay untouched — they're read, not moved.
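To make the idea of chunks and pointer blobs concrete, here is a minimal sketch in Python. It is not Crab's implementation: the fixed chunk size, the SHA-256 hashing, and the pointer fields (`sha256`, `size`, `chunks`) are all assumptions chosen for illustration, and Crab's real chunking and its xorb/shard layout are not described on this page.

```python
import hashlib
import json

CHUNK_SIZE = 4  # hypothetical and tiny on purpose, so the example is easy to follow


def chunk_bytes(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def make_pointer(data: bytes) -> dict:
    """Build a small pointer record that stands in for the full content."""
    chunks = chunk_bytes(data)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # identity of the whole file
        "size": len(data),                           # original size in bytes
        "chunks": [hashlib.sha256(c).hexdigest() for c in chunks],
    }


pointer = make_pointer(b"hello crab!")
print(json.dumps(pointer, indent=2))
```

The point of the sketch is that the pointer is small and derived purely from reading the source bytes, which is why the source objects can stay untouched.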

After import, you have a fully functional Crab repository that you can clone, hydrate, push to, and collaborate on.

Same-Bucket Import

In the most common case, your files are already in S3 and you want to add Crab management without moving data:

crab import \
  --from s3://my-bucket/datasets/v2/ \
  --to   s3://my-bucket/repos/v2

When source and target are in the same bucket, Crab avoids re-uploading bytes — it creates the metadata layout alongside your existing objects.

Cross-Bucket Import

Move data management from one bucket to another:

crab import \
  --from s3://data-lake-prod/models/ \
  --to   s3://crab-repos/models-v1

Versioned Bucket History

If your source bucket has S3 versioning enabled, Crab can reconstruct git history from object versions. Each time window (default: 1 hour) becomes one git commit, and delete markers surface as git deletions:

# Auto-detect versioning, one commit per hour
crab import \
  --from s3://versioned-bucket/prod/ \
  --to   s3://versioned-bucket/repos/prod

# Coarser commits — one per day
crab import \
  --from s3://versioned-bucket/prod/ \
  --to   s3://versioned-bucket/repos/prod \
  --window 24h
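The windowing rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Crab's code: windows here are aligned to the Unix epoch, the last event per key within a window wins, and the function name `group_into_commits` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def group_into_commits(versions, window=timedelta(hours=1)):
    """Group (key, timestamp, is_delete_marker) tuples into per-window commits."""
    buckets = defaultdict(list)
    for key, ts, is_delete in versions:
        # Index of the fixed-size window (counted from the Unix epoch) containing ts.
        index = int(ts.timestamp() // window.total_seconds())
        buckets[index].append((ts, key, is_delete))

    commits = []
    for index in sorted(buckets):
        changes = {}
        # Apply events oldest-first so the last event per key in the window wins.
        for ts, key, is_delete in sorted(buckets[index]):
            changes[key] = "delete" if is_delete else "write"
        start = datetime.fromtimestamp(index * window.total_seconds(), tz=timezone.utc)
        commits.append({"window_start": start, "changes": changes})
    return commits


utc = timezone.utc
versions = [
    ("a.bin", datetime(2024, 1, 1, 10, 5, tzinfo=utc), False),
    ("b.bin", datetime(2024, 1, 1, 10, 40, tzinfo=utc), False),
    ("a.bin", datetime(2024, 1, 1, 12, 15, tzinfo=utc), True),  # delete marker
]
hourly = group_into_commits(versions)
daily = group_into_commits(versions, window=timedelta(hours=24))
print(len(hourly), len(daily))
```

With a one-hour window, the two writes at 10:05 and 10:40 share a commit and the delete marker at 12:15 gets its own. With a 24-hour window, all three events collapse into a single commit in which `a.bin` ends up deleted.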

Filtering What Gets Imported

You don't have to import everything:

# Only model files
crab import \
  --from s3://my-bucket/data/ \
  --to   s3://my-bucket/repos/models \
  --include '*.safetensors' --include '*.bin'

# Everything except logs
crab import \
  --from s3://my-bucket/data/ \
  --to   s3://my-bucket/repos/data \
  --exclude 'logs/**'
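How include and exclude patterns might combine can be sketched as follows. The exact glob semantics of `--include` and `--exclude` are defined in the crab import reference, not here. This sketch assumes that a key must match at least one include pattern (when any are given) and no exclude pattern, and it uses Python's `fnmatch` rules, where `*` also matches `/`.

```python
from fnmatch import fnmatchcase


def is_selected(key, include=(), exclude=()):
    """Return True if key passes the include list and matches no exclude pattern."""
    if include and not any(fnmatchcase(key, p) for p in include):
        return False
    return not any(fnmatchcase(key, p) for p in exclude)


keys = ["models/a.safetensors", "models/b.bin", "logs/2024/run.txt", "README.md"]

# Mirrors: --include '*.safetensors' --include '*.bin'
only_models = [k for k in keys if is_selected(k, include=["*.safetensors", "*.bin"])]

# Mirrors: --exclude 'logs/**'
no_logs = [k for k in keys if is_selected(k, exclude=["logs/**"])]

print(only_models)
print(no_logs)
```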

Preview Before Importing

See what would be imported without actually doing it:

crab import \
  --from s3://my-bucket/datasets/ \
  --to   s3://my-bucket/repos/datasets \
  --dry-run

The dry run shows the file count, total bytes, extension breakdown, and planned commit count.
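As a rough picture of what such a summary involves, the sketch below computes a file count, total bytes, and an extension breakdown from a listing of `(key, size)` pairs. It assumes a listing is all that is needed, and the `summarize` function and its output fields are hypothetical. The planned commit count is left out because it depends on version timestamps.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize(objects):
    """Summarize (key, size_in_bytes) pairs without reading any object content."""
    by_extension = Counter()
    total_bytes = 0
    for key, size in objects:
        # Keys without a suffix are grouped under a "(none)" label.
        suffix = PurePosixPath(key).suffix or "(none)"
        by_extension[suffix] += 1
        total_bytes += size
    return {
        "file_count": sum(by_extension.values()),
        "total_bytes": total_bytes,
        "by_extension": dict(by_extension),
    }


report = summarize([
    ("train/part-0.parquet", 1_000),
    ("train/part-1.parquet", 2_500),
    ("README", 120),
])
print(report)
```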

Resumable Imports

Large imports (millions of files, terabytes of data) can take hours. If interrupted, resume from where you left off:

# First run (interrupted)
crab import --from s3://bucket/data/ --to s3://bucket/repos/data --into ./data

# Resume after interruption
crab import --into ./data --resume

The import journal (.crab/import-journal.db) tracks progress per object. On resume, only pending and failed objects are retried.
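A per-object journal of this kind can be sketched with SQLite. The schema below (an `objects` table with `key` and `state` columns, and the states `pending`, `done`, and `failed`) is an assumption made for illustration; the actual layout of the journal file is not documented on this page.

```python
import sqlite3


def open_journal(path=":memory:"):
    """Open (or create) a journal with one row per source object."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " key TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def add_objects(db, keys):
    # INSERT OR IGNORE keeps the recorded state of keys already in the journal.
    db.executemany("INSERT OR IGNORE INTO objects (key) VALUES (?)", [(k,) for k in keys])
    db.commit()


def mark(db, key, state):
    """Record the outcome for one object."""
    db.execute("UPDATE objects SET state = ? WHERE key = ?", (state, key))
    db.commit()


def to_retry(db):
    """Keys that a resumed run still has to process."""
    rows = db.execute(
        "SELECT key FROM objects WHERE state IN ('pending', 'failed') ORDER BY key"
    )
    return [key for (key,) in rows]


journal = open_journal()
add_objects(journal, ["a.bin", "b.bin", "c.bin"])
mark(journal, "a.bin", "done")
mark(journal, "b.bin", "failed")
remaining = to_retry(journal)
print(remaining)
```

Because completed objects are recorded as they finish, a resumed run in this sketch only revisits `b.bin` (failed) and `c.bin` (never started).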

After Import

Once import completes, you have a standard Crab repository:

cd imported-repo
crab status              # see tracked files
crab hydrate --all       # materialize content
git remote -v            # origin points to your target URL

Collaborators can clone it:

crab clone crab://my-bucket/repos/data

Supported Sources

Provider	URL Format
AWS S3	`s3://bucket/prefix/`
Google Cloud Storage	`gs://bucket/prefix/`
Azure Blob Storage	`az://container/prefix/`
Local filesystem	`file:///path/to/dir/`
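The URL formats in the table above all follow the same scheme://container/prefix shape, so they can be split with Python's standard `urllib.parse`. This is a sketch for scripting around imports, not part of Crab. The `parse_source` helper and the `PROVIDERS` mapping are hypothetical names.

```python
from urllib.parse import urlsplit

PROVIDERS = {
    "s3": "AWS S3",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
    "file": "Local filesystem",
}


def parse_source(url):
    """Split a source URL into (provider, container, prefix)."""
    parts = urlsplit(url)
    if parts.scheme not in PROVIDERS:
        raise ValueError(f"unsupported source scheme: {parts.scheme!r}")
    if parts.scheme == "file":
        # file:///path has an empty host, so the whole path is the location.
        return PROVIDERS["file"], "", parts.path
    # For object stores, the host part is the bucket or container name.
    return PROVIDERS[parts.scheme], parts.netloc, parts.path.lstrip("/")


print(parse_source("s3://bucket/prefix/"))
print(parse_source("file:///path/to/dir/"))
```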

CLI Reference

For complete command syntax, all options (versioning, time ranges, author templates, error codes), and JSON output format, see the crab import reference.
