One tool for every large-file workflow
From multi-gigabyte model weights to game asset libraries to CI fixtures, Crab fits anywhere teams already use Git. No servers, no LFS endpoints — just your cloud bucket.
Version a 70 GB checkpoint like it's source code
Track model weights, datasets, and adapters with the same git workflow that ships your training code — and re-upload only the bytes that actually changed.
ML and AI Teams
The problem. Every fine-tune produces a fresh multi-gigabyte checkpoint. Git LFS uploads the whole file every time. Custom S3 scripts drift away from the commit graph, so reproducing an experiment six months later means reverse-engineering filenames in a shared bucket.
The Crab solution. Content-defined chunking splits weights into variable-sized blocks, so a fine-tune that touches a few transformer layers re-uploads only the changed chunks. The 3-tier dedup pipeline (session → shard → DB index) shares chunks across branches and forks. Lazy checkout lets researchers clone a 200 GB experiment repo in seconds and hydrate only the artifacts their evaluation needs.
- 70% storage savings on incremental fine-tunes
- 10× faster clone vs. Git LFS for multi-GB repos
- $0 SaaS fees; pay only for your cloud storage
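To see why content-defined chunking keeps re-uploads small, here is a minimal sketch. It is an illustration only, not Crab's actual chunker: the gear-style rolling hash, the `MASK`, `MIN_SIZE`, and `MAX_SIZE` values, and the use of SHA-256 for chunk IDs are all assumptions chosen for the example.

```python
import hashlib
import random

# All parameters below are illustrative assumptions, not Crab's real settings.
MASK = (1 << 11) - 1          # on average, a cut point about 2 KiB past MIN_SIZE
MIN_SIZE, MAX_SIZE = 256, 8192

_rng = random.Random(0)       # fixed seed so the gear table is reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> list[str]:
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]


def bytes_to_upload(old: bytes, new: bytes) -> int:
    """Total size of the chunks in `new` whose IDs do not appear in `old`."""
    known = set(chunk_ids(old))
    return sum(len(c) for c in chunk(new)
               if hashlib.sha256(c).hexdigest() not in known)


if __name__ == "__main__":
    rng = random.Random(1)
    base = bytes(rng.getrandbits(8) for _ in range(200_000))
    edited = base[:100_000] + b"fine-tuned!" + base[100_000:]
    print(bytes_to_upload(base, edited), "of", len(edited), "bytes need uploading")
```

Because cut points depend on nearby bytes rather than on fixed offsets, every chunk that ends before the edit keeps its ID, and the chunks after the edit typically realign within a chunk or two. A fixed-size splitter would shift every block after the insertion and re-upload all of them.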
One commit per dataset, forever reproducible
Stop emailing S3 keys. Datasets live in the same repo as the notebook, with chunk-level dedup that handles append-only growth efficiently.
Data Science
The problem. Notebooks reference Parquet, CSV, and feather files that change week-over-week. Teams email links to S3 keys, copy data into personal scratch buckets, and spend hours diffing “v3 final FINAL” folders. Reproducibility breaks the moment a colleague overwrites a path.
The Crab solution. Datasets and analysis sit in the same Git repo. Content-defined chunking handles append-only growth — adding a million rows to a 50 GB Parquet file uploads only the new tail chunks. 3-tier dedup shares chunks across feature-engineering branches, so exploring 10 variants doesn't cost 10× the storage. Every notebook is pinned to an exact dataset commit.
- 85% storage savings on append-only Parquet datasets
- 60% less time reconciling data versions across teammates
- 1:1 notebook-to-dataset commit traceability
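The session → shard → DB index lookup can be pictured as a cascade that tries the cheapest tier first. The sketch below is hypothetical: what each tier holds, the `TieredDedup` class, and the SHA-256 chunk IDs are assumptions made for illustration, not a description of Crab's internals.

```python
import hashlib


def cid(chunk: bytes) -> str:
    """Chunk ID used in this sketch (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


class TieredDedup:
    """Look up a chunk in three tiers, cheapest first.

    Assumed tier semantics, for illustration only:
      session - chunks already seen during the current push (in memory)
      shard   - chunk IDs listed in locally cached shard files
      db      - a repository-wide index of every stored chunk
    """

    def __init__(self, shard_index: set[str], db_index: set[str]):
        self.session: set[str] = set()
        self.shard_index = shard_index
        self.db_index = db_index
        self.uploaded: list[str] = []

    def add(self, chunk: bytes) -> str:
        """Return the tier that satisfied the lookup, or 'uploaded' on a miss."""
        key = cid(chunk)
        if key in self.session:
            return "session"
        self.session.add(key)          # remember it for the rest of this push
        if key in self.shard_index:
            return "shard"
        if key in self.db_index:
            return "db"
        self.uploaded.append(key)      # new everywhere: the only real upload
        return "uploaded"


if __name__ == "__main__":
    dedup = TieredDedup(shard_index={cid(b"rows-2023")},
                        db_index={cid(b"rows-2022")})
    for part in (b"rows-2022", b"rows-2023", b"rows-2024", b"rows-2024"):
        print(part.decode(), "->", dedup.add(part))
```

In this toy run only `rows-2024` is uploaded, and only once. A branch that re-derives the same chunks resolves them in one of the three indexes instead of storing them again.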
500 GB project, 30-second clone
Texture atlases, audio banks, FBX meshes, and baked lighting all live in Git. Re-exports upload deltas. The editor opens before the disk fills.
Game Development
The problem. Game projects accumulate hundreds of gigabytes of binary assets. Re-exporting a single texture re-uploads the whole file under Git LFS. A fresh clone takes hours, fills the disk, and the artist still hasn't opened the editor.
The Crab solution. Content-defined chunking detects that a re-exported texture shares most of its bytes with the previous version and uploads only the deltas. The optional FUSE mount presents the full asset tree as a virtual filesystem — the editor opens immediately and chunks stream in on first read. History stays cheap because every duplicate chunk across the entire project is stored once.
- 80% reduction in upload size on iterative asset re-exports
- 50× faster initial clone via FUSE-backed lazy checkout
- 90% cloud storage savings on multi-year asset history
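The "stream in on first read" behavior can be sketched without any filesystem machinery. This is not a FUSE implementation and not Crab's code: `LazyFile`, `store`, and the dict standing in for object storage are hypothetical names used only to show the idea of a pointer that hydrates on demand.

```python
import hashlib


class LazyFile:
    """A file represented by a pointer (ordered chunk IDs) until it is read."""

    def __init__(self, chunk_ids, remote, cache):
        self.chunk_ids = chunk_ids   # the "pointer": small, available immediately
        self.remote = remote         # dict standing in for the object store
        self.cache = cache           # local chunk cache shared across files
        self.fetches = 0             # remote reads performed by this file

    def read(self) -> bytes:
        parts = []
        for chunk_id in self.chunk_ids:
            if chunk_id not in self.cache:        # hydrate only on a cache miss
                self.cache[chunk_id] = self.remote[chunk_id]
                self.fetches += 1
            parts.append(self.cache[chunk_id])
        return b"".join(parts)


def store(remote: dict, chunks: list[bytes]) -> list[str]:
    """Put chunks into the fake remote and return the pointer (chunk IDs)."""
    ids = []
    for c in chunks:
        chunk_id = hashlib.sha256(c).hexdigest()
        remote[chunk_id] = c
        ids.append(chunk_id)
    return ids


if __name__ == "__main__":
    remote, cache = {}, {}
    v1 = LazyFile(store(remote, [b"header", b"mip0", b"mip1"]), remote, cache)
    v2 = LazyFile(store(remote, [b"header", b"mip0-edited", b"mip1"]), remote, cache)
    v1.read()
    v2.read()
    print("v1 fetched", v1.fetches, "chunks; v2 fetched", v2.fetches)
```

Creating the two `LazyFile` objects transfers nothing. Reading the first version fetches its three chunks, and reading the re-exported version fetches only the one chunk that differs.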
No per-file ceiling. No SaaS in the middle.
Scientific archives, medical imaging, geospatial tiles, CAD assemblies — store any file at any size, byte-identical, directly in object storage.
Large Binary Assets
The problem. Scientific archives, medical imaging, geospatial tiles, and CAD assemblies push individual files into the tens of gigabytes. Git LFS depends on a managed server whose host caps file size (5 GB on GitHub), and it never deduplicates within a single blob. Plain S3 uploads lose all version history and branching.
The Crab solution. Crab has no per-file size ceiling: files are stored as xorbs (compressed chunk packs) directly in object storage. Resumable uploads survive flaky connections. Lazy checkout lets a workstation pull only the slices of a 200 GB volume that a given task needs. Every file is verified end to end with Blake3 hashes, so reconstruction is byte-identical.
- ∞: no per-file size cap (vs. 5 GB for Git LFS on GitHub)
- 75% cost ratio vs. duplicated full-file storage
- 100% byte-identical reconstruction, verified by Blake3
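Hash-verified reconstruction can be sketched as two checks: one per chunk and one for the reassembled file. The example is an assumption-laden illustration rather than Crab's verifier. In particular, BLAKE2b from Python's standard library stands in for Blake3 (which needs a third-party package), and `reconstruct` and its arguments are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # BLAKE2b is a stand-in here; Blake3 is not in Python's standard library.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def reconstruct(chunk_ids: list[str], store: dict[str, bytes], expected: str) -> bytes:
    """Reassemble a file, checking every chunk and then the whole-file digest."""
    parts = []
    for chunk_id in chunk_ids:
        chunk = store[chunk_id]
        if digest(chunk) != chunk_id:
            raise ValueError(f"chunk {chunk_id[:12]} is corrupted")
        parts.append(chunk)
    data = b"".join(parts)
    if digest(data) != expected:
        raise ValueError("reassembled file does not match its recorded digest")
    return data


if __name__ == "__main__":
    original = b"slice-A" + b"slice-B"
    store = {digest(b"slice-A"): b"slice-A", digest(b"slice-B"): b"slice-B"}
    ids = [digest(b"slice-A"), digest(b"slice-B")]
    assert reconstruct(ids, store, digest(original)) == original
    print("reconstruction verified")
```

A single flipped bit in any stored chunk fails the per-chunk check, and a missing or reordered chunk fails the whole-file check, so a successful return means the bytes match what was committed.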
Crab vs. Git LFS vs. DVC vs. Hugging Face Hub
| Feature | Crab | Git LFS | DVC | Hugging Face Hub |
|---|---|---|---|---|
| Maximum file size | Unlimited | 5 GB (GitHub) | Unlimited | 50 GB per file |
| Deduplication method | Content-defined chunking + 3-tier dedup | None (whole-file) | Whole-file content hashing | Xet chunk-level dedup (newer repos) |
| Server infrastructure required | None — object storage only | LFS server (self-hosted or SaaS) | None — object storage only | Hugging Face hosted SaaS |
| Supported storage backends | S3, GCS, Azure Blob (any object store) | Git host's LFS service | S3, GCS, Azure, SSH, local | Hugging Face Hub only |
| Lazy / partial checkout | Yes: pointer blobs + FUSE mount | No: full file on checkout | Partial via `dvc pull <path>` | Yes: `hf_hub_download` per file |
| Git compatibility level | Native — git remote helper | Native — git extension | Sidecar — separate dvc CLI | Git-compatible mirror (LFS/Xet) |
| Cloud-native auth (IAM/SA/MI) | | | | |
| No SaaS dependency | | | | |
Zero new vendors. Your VPC, your IAM, your bucket.
Crab is a single binary that talks to object storage with the credentials your team already manages. No SaaS, no separate control plane, nothing for security review to gate.
Enterprise
The problem. Security and platform teams resist adding another SaaS vendor to the supply chain. Existing LFS solutions require provisioning servers, managing certificates, rotating credentials, and re-running SOC 2 review for yet another hosted service — all to push a few gigabytes through a bucket the organization already owns.
The Crab solution. Crab is a single binary with cloud-native authentication — IAM roles on AWS, service accounts on GCP, managed identities on Azure. There is no Crab server, no separate control plane, and no data egress outside your VPC. The 3-tier deduplication runs entirely on the developer's machine and the bucket; audit logs flow through the cloud provider's existing tooling.
- 0 new servers, services, or SaaS vendors to onboard
- 100% of data stays inside your existing VPC and IAM perimeter
- 1 binary to deploy across developer machines and CI runners
Cut CI clone time by 90%, on every PR
Lazy checkout pulls only the fixtures a job actually reads. A runner-local chunk cache lets repeat builds skip the network for chunks already on disk.
DevOps and CI/CD
The problem. CI runners spend most of their wall-clock time cloning repos that contain trained models, signed artifacts, container base layers, or test fixtures measured in gigabytes. Cache misses on a self-hosted runner translate directly into minutes of paid compute and slower feedback on every PR.
The Crab solution. CI jobs use lazy checkout to clone only pointer blobs and hydrate on demand — a test that needs one fixture pulls one fixture, not the whole 80 GB suite. The chunk cache on a runner is reused across jobs, so subsequent builds hit local disk instead of S3. Resumable uploads mean a flaky runner doesn't restart a 5 GB push from scratch.
- 90% reduction in CI clone time on large-asset repos
- 65% lower compute cost per pipeline run
- 0 wasted bytes from re-pushing after a transient failure
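One way a push can resume instead of restarting is to make each chunk upload idempotent and skip whatever the bucket already holds. The sketch below assumes that design; the `push` function, the `fail_after` switch that simulates a dropped connection, and the dict standing in for the bucket are all hypothetical.

```python
def push(chunks: dict[str, bytes], bucket: dict[str, bytes], fail_after=None) -> int:
    """Upload chunks missing from the bucket and return the bytes sent.

    `bucket` is a dict standing in for object storage. `fail_after` simulates
    a connection that drops after that many chunk uploads in this attempt.
    """
    sent = uploads = 0
    for chunk_id, data in chunks.items():
        if chunk_id in bucket:          # stored by an earlier attempt: skip
            continue
        if fail_after is not None and uploads == fail_after:
            raise ConnectionError(f"connection dropped after {sent} bytes")
        bucket[chunk_id] = data
        sent += len(data)
        uploads += 1
    return sent


if __name__ == "__main__":
    chunks = {"a": b"x" * 100, "b": b"y" * 100, "c": b"z" * 100}
    bucket = {}
    try:
        push(chunks, bucket, fail_after=2)   # flaky runner: dies after 2 chunks
    except ConnectionError as err:
        print("first attempt failed:", err)
    print("retry sent", push(chunks, bucket), "bytes")
```

The retry sends only the third chunk, because the first two are already in the bucket under their IDs.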
Ready to ship large files like source code?
Pick the plan that fits, or jump straight into the CLI guide.