Use Cases

One tool for every large-file workflow

From multi-gigabyte model weights to game asset libraries to CI fixtures, Crab fits anywhere teams already use Git. No servers, no LFS endpoints — just your cloud bucket.

Read the Docs Contact Us

ML & AI Teams

Version a 70 GB checkpoint like it's source code

Track model weights, datasets, and adapters with the same git workflow that ships your training code — and re-upload only the bytes that actually changed.

The problem. Every fine-tune produces a fresh multi-gigabyte checkpoint. Git LFS uploads the whole file every time. Custom S3 scripts drift away from the commit graph, so reproducing an experiment six months later means reverse-engineering filenames in a shared bucket.

The Crab solution. Content-defined chunking splits weights into variable-sized blocks, so a fine-tune that touches a few transformer layers re-uploads only the changed chunks. The 3-tier dedup pipeline (session → shard → DB index) shares chunks across branches and forks. Lazy checkout lets researchers clone a 200 GB experiment repo in seconds and hydrate only the artifacts their evaluation needs.

70%: Storage savings on incremental fine-tunes
10×: Faster clone vs. Git LFS for multi-GB repos
$0: SaaS fees — pay only your cloud storage

Model versioning

Branch, tag, and roll back model weights with the same Git workflow you already use for code. git checkout v0.4.2 — get the whole stack.

Dataset pinning

Lock experiments to immutable dataset commits. Reproduce any run by checking out one ref — no separate data registry to keep in sync.

Chunk-level reuse

A 1 GB checkpoint that shares 95% of its bytes with the previous epoch uploads ~50 MB. LoRA adapters cost almost nothing to ship.

Data Science

One commit per dataset, forever reproducible

Stop emailing S3 keys. Datasets live in the same repo as the notebook, with chunk-level dedup that handles append-only growth efficiently.

The problem. Notebooks reference Parquet, CSV, and feather files that change week-over-week. Teams email links to S3 keys, copy data into personal scratch buckets, and spend hours diffing “v3 final FINAL” folders. Reproducibility breaks the moment a colleague overwrites a path.

The Crab solution. Datasets and analysis sit in the same Git repo. Content-defined chunking handles append-only growth — adding a million rows to a 50 GB Parquet file uploads only the new tail chunks. 3-tier dedup shares chunks across feature-engineering branches, so exploring 10 variants doesn't cost 10× the storage. Every notebook is pinned to an exact dataset commit.

85%: Storage savings on append-only Parquet datasets
60%: Less time reconciling data versions across teammates
1:1: Notebook-to-dataset commit traceability

Dataset versioning

Roll back to any prior dataset revision with git checkout. No external metadata store. No drift between code and data.

Shared datasets

git clone crab://bucket/datasets and hydrate on demand — no copying gigabytes across machines just to read one column.

Reproducible experiments

Every analysis links to an immutable dataset commit. Re-run a notebook six months later and get identical results.

Game Development

500 GB project, 30-second clone

Texture atlases, audio banks, FBX meshes, and baked lighting all live in Git. Re-exports upload deltas. The editor opens before the disk fills.

The problem. Game projects accumulate hundreds of gigabytes of binary assets. Re-exporting a single texture re-uploads the whole file under Git LFS. A fresh clone takes hours, fills the disk, and the artist still hasn't opened the editor.

The Crab solution. Content-defined chunking detects that a re-exported texture shares most of its bytes with the previous version and uploads only the deltas. The optional FUSE mount presents the full asset tree as a virtual filesystem — the editor opens immediately and chunks stream in on first read. History stays cheap because every duplicate chunk across the entire project is stored once.

80%: Reduction in upload size on iterative asset re-exports
50×: Faster initial clone via FUSE-backed lazy checkout
90%: Cloud storage savings on multi-year asset history

Asset deduplication

Texture atlases, audio banks, and meshes are chunked. Re-exports upload only the changed bytes — not the whole asset.

Full history, low cost

Keep every revision of every asset in Git without ballooning your S3 bill. Old chunks survive once, not per-revision.

FUSE mount

Browse a 500 GB project as if it were on disk. Hydrate only the chunks the editor opens. No more 'wait for clone' Slack pings.

Large Binary Assets

No per-file ceiling. No SaaS in the middle.

Scientific archives, medical imaging, geospatial tiles, CAD assemblies — store any file at any size, byte-identical, directly in object storage.

The problem. Scientific archives, medical imaging, geospatial tiles, and CAD assemblies push individual files into the tens of gigabytes. Git LFS chokes above 5 GB per file, requires a managed server, and never deduplicates within a single blob. Plain S3 uploads lose all version history and branching.

The Crab solution. Crab has no per-file size ceiling — files are stored as xorbs (compressed chunk packs) directly in object storage. Resumable uploads survive flaky connections. Lazy checkout lets a workstation pull only the slices of a 200 GB volume that a given task needs. Verification is byte-identical via Blake3 hashes, end-to-end.

∞: No per-file size cap (vs. 5 GB on Git LFS)
75%: Cost ratio vs. duplicated full-file storage
100%: Byte-identical reconstruction, verified by Blake3

Crab vs. Git LFS vs. DVC vs. Hugging Face Hub

Feature	Crab	Git LFS	DVC	Hugging Face Hub
Maximum file size	Unlimited	5 GB (GitHub)	Unlimited	50 GB per file
Deduplication method	Content-defined chunking + 3-tier dedup	None (whole-file)	Whole-file content hashing	Xet chunk-level dedup (newer repos)
Server infrastructure required	None — object storage only	LFS server (self-hosted or SaaS)	None — object storage only	Hugging Face hosted SaaS
Supported storage backends	S3, GCS, Azure Blob (any object store)	Git host's LFS service	S3, GCS, Azure, SSH, local	Hugging Face Hub only
Lazy / partial checkout	Yes — pointer blobs + FUSE mount	No — full file on checkout	Partial via dvc pull <path>	Yes — hf hub download per file
Git compatibility level	Native — git remote helper	Native — git extension	Sidecar — separate dvc CLI	Git-compatible mirror (LFS/Xet)
Cloud-native auth (IAM/SA/MI)
No SaaS dependency

Enterprise

Zero new vendors. Your VPC, your IAM, your bucket.

Crab is a single binary that talks to object storage with the credentials your team already manages. No SaaS, no separate control plane, nothing for security review to gate.

The problem. Security and platform teams resist adding another SaaS vendor to the supply chain. Existing LFS solutions require provisioning servers, managing certificates, rotating credentials, and re-running SOC 2 review for yet another hosted service — all to push a few gigabytes through a bucket the organization already owns.

The Crab solution. Crab is a single binary with cloud-native authentication — IAM roles on AWS, service accounts on GCP, managed identities on Azure. There is no Crab server, no separate control plane, and no data egress outside your VPC. The 3-tier deduplication runs entirely on the developer's machine and the bucket; audit logs flow through the cloud provider's existing tooling.

0: New servers, services, or SaaS vendors to onboard
100%: Of data stays inside your existing VPC and IAM perimeter
1: Binary to deploy across developer machines and CI runners

Cloud-native auth

IAM roles, service accounts, and managed identities. Reuse the security posture you already audit — no new credentials to rotate.

Single binary

One static binary on developer machines and CI runners. No control plane to deploy, scale, or patch.

Zero attack surface

No third-party servers, no proxy, no data egress outside your VPC. The bucket you already own is the entire dependency.

DevOps & CI/CD

Cut CI clone time by 90%, on every PR

Lazy checkout pulls only the fixtures a job actually reads. Runner-local chunk cache means repeat builds skip the network entirely.

The problem. CI runners spend most of their wall-clock time cloning repos that contain trained models, signed artifacts, container base layers, or test fixtures measured in gigabytes. Cache misses on a self-hosted runner translate directly into minutes of paid compute and slower feedback on every PR.

The Crab solution. CI jobs use lazy checkout to clone only pointer blobs and hydrate on demand — a test that needs one fixture pulls one fixture, not the whole 80 GB suite. The chunk cache on a runner is reused across jobs, so subsequent builds hit local disk instead of S3. Resumable uploads mean a flaky runner doesn't restart a 5 GB push from scratch.

90%: Reduction in CI clone time on large-asset repos
65%: Lower compute cost per pipeline run
0: Wasted bytes from re-pushing after a transient failure

Lazy checkout in CI

Pull only the artifacts a job actually reads. A 100 GB repo behaves like a 100 MB clone — and feedback lands faster.

Runner-local chunk cache

Warm cache across pipeline runs. Repeat builds hit local disk instead of object storage. Same fixtures, near-zero pull time.

Resumable uploads

Network blips don't restart a multi-gigabyte push. Resume from the last completed xorb — flaky runners stop costing money.

Ready to ship large files like source code?

Pick the plan that fits, or jump straight into the CLI guide.

View Pricing Contact Us

Feature

Crab

Git LFS

DVC

Hugging Face Hub

Maximum file size

Unlimited

5 GB (GitHub)

Unlimited

50 GB per file

Deduplication method

Content-defined chunking + 3-tier dedup

None (whole-file)

Whole-file content hashing

Xet chunk-level dedup (newer repos)

Server infrastructure required

None — object storage only

LFS server (self-hosted or SaaS)

None — object storage only

Hugging Face hosted SaaS

Supported storage backends

S3, GCS, Azure Blob (any object store)

Git host's LFS service

S3, GCS, Azure, SSH, local

Hugging Face Hub only

Lazy / partial checkout

Yes — pointer blobs + FUSE mount

No — full file on checkout

Partial via dvc pull <path>

Yes — hf hub download per file

Git compatibility level

Native — git remote helper

Native — git extension

Sidecar — separate dvc CLI

Git-compatible mirror (LFS/Xet)

Cloud-native auth (IAM/SA/MI)

No SaaS dependency

One tool for every large-file workflow

Version a 70 GB checkpoint like it's source code

ML and AI Teams

One commit per dataset, forever reproducible

Data Science

500 GB project, 30-second clone

Game Development

No per-file ceiling. No SaaS in the middle.

Large Binary Assets

Crab vs. Git LFS vs. DVC vs. Hugging Face Hub

Zero new vendors. Your VPC, your IAM, your bucket.

Enterprise

Cut CI clone time by 90%, on every PR

DevOps and CI/CD

Ready to ship large files like source code?

One tool for every large-file workflow

Version a 70 GB checkpoint like it's source code

ML and AI Teams

One commit per dataset, forever reproducible

Data Science

500 GB project, 30-second clone

Game Development

No per-file ceiling. No SaaS in the middle.

Large Binary Assets

Crab vs. Git LFS vs. DVC vs. Hugging Face Hub

Zero new vendors. Your VPC, your IAM, your bucket.

Enterprise

Cut CI clone time by 90%, on every PR

DevOps and CI/CD

Ready to ship large files like source code?