How Crab Avoids Storing Duplicate Data
Learn how Crab splits files at content-determined boundaries, deduplicates chunks across local and remote indexes, and packs new chunks into compressed xorbs.
1. Split each file at content-defined boundaries.
2. Hash every chunk with BLAKE3 so identical bytes get the same identity.
3. Check local session, shard metadata, and global indexes for existing chunks.
4. Pack only new chunks into xorbs and upload those objects.
5. Record metadata so future pushes and hydrations can reuse the same chunks.
Why Deduplication Matters
Every time you modify a large file and push it, traditional Git stores the entire file again. Change one layer in a 4 GB model? That's another 4 GB on your storage bill. Multiply that across a team pushing daily, and costs spiral fast.
Crab takes a different approach. Instead of storing whole files, it breaks them into smaller pieces, checks which pieces already exist, and only uploads the genuinely new ones. For iterative work where most bytes stay stable, that can turn a large file update into a much smaller upload.
What You'll Learn
- How Crab finds smart split points in your files (content-defined chunking)
- Why small edits don't cause massive re-uploads
- How three layers of dedup checking eliminate redundant storage
- How chunks get packed and addressed for fast retrieval
Key Takeaways
- Only new data gets uploaded. Crab checks multiple dedup indexes before packing chunks for upload.
- Smart chunking survives edits. Content-defined boundaries mean inserting or deleting bytes only affects nearby chunks — the rest deduplicates perfectly.
- Storage costs scale with unique content. An edit to a 500 MB file that overlaps heavily with the previous version might upload only the changed slice.
- It's streaming. The pipeline processes files incrementally instead of loading whole artifacts into memory.
How Crab Finds Natural Split Points
Think of content-defined chunking like finding natural paragraph breaks in a book. Instead of splitting every 64 pages regardless of content, you split at sentence endings — so if someone inserts a paragraph in chapter 3, only that chapter's splits change. The rest of the book stays identical.
Crab uses a rolling hash (called Gearhash) that slides over your file byte by byte. When the hash hits a specific pattern, that's a chunk boundary. Because boundaries depend on the content itself, small edits only shift nearby boundaries — everything else stays the same and deduplicates against the previous version.
The target chunk size is 64 KB, with a minimum of 32 KB and a maximum of 128 KB. For a 1 GB file, that's roughly 16,000 chunks, each independently addressable and dedupable.
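To make the rolling-hash idea concrete, here is a minimal sketch of gear-style content-defined chunking. It is not Crab's implementation: the gear table is seeded pseudo-randomly for illustration, the names `split_chunks` and `chunk_id` are hypothetical, and `hashlib.blake2b` stands in for BLAKE3 because BLAKE3 is not in the Python standard library. The defaults mirror the 32 KB / 64 KB / 128 KB sizes mentioned above, but the sketch does not try to reproduce Crab's exact size distribution.

```python
import hashlib
import random

# Illustrative gear table: 256 pseudo-random 64-bit values, one per byte value.
# A real Gearhash table uses different constants; this one is an assumption.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
U64 = (1 << 64) - 1


def split_chunks(data, min_size=32 * 1024, avg_bits=16, max_size=128 * 1024):
    """Split bytes into chunks whose boundaries depend on the content itself."""
    # Test the top `avg_bits` bits of the hash: they mix the most recent bytes,
    # and they are all zero about once every 2**avg_bits positions.
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear update: shift out old influence, mix in the new byte.
        h = ((h << 1) + GEAR[byte]) & U64
        length = i - start + 1
        at_pattern = length >= min_size and (h & mask) == 0
        if at_pattern or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_id(chunk):
    """Content identity for a chunk (BLAKE2b here as a stand-in for BLAKE3)."""
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()
```

If you chunk a buffer, insert a few bytes in the middle, and chunk it again, every chunk before the edit keeps its identity, and the chunks after it realign once both versions hit the same boundary pattern. Comparing the two sets of `chunk_id` values shows how little actually changed.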
Three Layers of Dedup Checking
Once a file is chunked, Crab checks each chunk against three tiers of previously stored content — from cheapest to most expensive:
Tier 1 — Session memory: A hash set of chunks seen during this push. Catches duplicates within the same operation (like copied assets). Cost: zero — just a hash lookup.
Tier 2 — Shard index: A compact remote index mapping chunk hashes to storage locations. Catches data already in this repository without listing the object store.
Tier 3 — Global registry: Covers all chunks across all repos sharing the same storage backend. Catches cross-repo duplicates. Queries are batched to amortize latency.
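The tiered lookup can be sketched as a single pass that tries the cheapest check first and defers the most expensive one to batched queries. Everything here is hypothetical and only illustrates the ordering: `classify_chunks`, the `shard_index` container, and the `query_global` callback (assumed to take a list of hashes and return the subset that already exists) are not Crab's actual API.

```python
def classify_chunks(chunk_hashes, shard_index, query_global, batch_size=256):
    """Label each chunk as 'new', 'session_duplicate', or 'already_stored'."""
    labels = [None] * len(chunk_hashes)
    seen_this_push = set()   # Tier 1: in-memory set for the current push
    pending = []             # (position, hash) pairs that still need Tier 3

    for pos, h in enumerate(chunk_hashes):
        if h in seen_this_push:
            labels[pos] = "session_duplicate"
            continue
        seen_this_push.add(h)
        if h in shard_index:  # Tier 2: repository-level shard index
            labels[pos] = "already_stored"
        else:
            pending.append((pos, h))

    # Tier 3: ask the global registry in batches to amortize round trips.
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        found = query_global([h for _, h in batch])
        for pos, h in batch:
            labels[pos] = "already_stored" if h in found else "new"
    return labels
```

Only the chunks labeled `new` would move on to packing. The first occurrence of a chunk within a push is the one that gets checked against the remote tiers; later copies are resolved by the session set alone.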
In practice, a typical push looks like this:
Chunk Classification (push of 847 chunks):
─────────────────────────────────────────────
New (will upload): 142 chunks (16.8%)
Session duplicate: 89 chunks (10.5%)
Already stored: 616 chunks (72.7%)
─────────────────────────────────────────────
Upload savings: 83.2% of chunks skipped
For iterative development, many chunks are already stored remotely. The more overlap a new version has with previous versions, the less data Crab needs to upload.
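The percentages in the sample output follow directly from the three chunk counts. As a quick check, the short sketch below recomputes them; the `summarize` helper and the label names are hypothetical and exist only for this illustration.

```python
def summarize(counts):
    """Return (total, per-label share in percent, percent of chunks skipped)."""
    total = sum(counts.values())
    shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
    # Everything that is not new is skipped at upload time.
    skipped = round(100 * (total - counts["new"]) / total, 1)
    return total, shares, skipped


counts = {"new": 142, "session_duplicate": 89, "already_stored": 616}
total, shares, skipped = summarize(counts)
print(total, shares, skipped)
```

Note that the savings figure counts chunks, not bytes. Because chunk sizes vary between the minimum and maximum, the share of bytes skipped can differ somewhat from the share of chunks skipped.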
How New Chunks Get Packed and Stored
Chunks that pass all three dedup tiers — the genuinely new data — get grouped into larger archives called xorbs (roughly 64 MB each). This avoids creating millions of tiny objects in your cloud bucket.
Each xorb is compressed with zstd and addressed by its BLAKE3 hash, a fast cryptographic content hash. This means you can always verify data integrity on download, and identical content always maps to the same address.
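The packing step can be sketched as follows. This is an illustration under several assumptions, not Crab's on-disk format: the length-prefixed layout is invented for the example, `zlib` stands in for zstd and `hashlib.blake2b` for BLAKE3 (neither original is available in older Python standard libraries), and the address is computed over the uncompressed payload, which the text above does not specify. The names `pack_xorbs` and `unpack_xorb` are hypothetical.

```python
import hashlib
import zlib

XORB_TARGET = 64 * 1024 * 1024  # roughly 64 MB per xorb, as described above


def _seal(batch):
    """Serialize, address, and compress one group of chunks."""
    # Length-prefix each chunk so the payload can be split again on read.
    raw = b"".join(len(c).to_bytes(4, "big") + c for c in batch)
    address = hashlib.blake2b(raw, digest_size=32).hexdigest()
    return address, zlib.compress(raw)


def pack_xorbs(chunks, target_size=XORB_TARGET):
    """Group new chunks into xorb-like blobs, yielding (address, compressed)."""
    batch, size = [], 0
    for chunk in chunks:
        if batch and size + len(chunk) > target_size:
            yield _seal(batch)
            batch, size = [], 0
        batch.append(chunk)
        size += len(chunk)
    if batch:
        yield _seal(batch)


def unpack_xorb(address, compressed):
    """Decompress a blob, verify it against its address, and return its chunks."""
    raw = zlib.decompress(compressed)
    if hashlib.blake2b(raw, digest_size=32).hexdigest() != address:
        raise ValueError("xorb content does not match its address")
    chunks, pos = [], 0
    while pos < len(raw):
        n = int.from_bytes(raw[pos:pos + 4], "big")
        chunks.append(raw[pos + 4:pos + 4 + n])
        pos += 4 + n
    return chunks
```

Because the address is derived from the content, packing the same chunks in the same order always produces the same address, and a download can be checked by recomputing the hash before any chunk is used.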
What This Means for Your Workflow
The combination of content-defined chunking and tiered deduplication means:
- Faster pushes. Less data to upload means shorter wait times. The win depends on how much of the new file overlaps existing chunks and how fast your network is.
- Lower storage costs. You pay for unique content, not total file size. Teams working with large binary assets (ML models, game textures, video) see the biggest savings.
- Edits stay cheap. Changing a few layers in a model file doesn't re-upload the whole thing. Only the chunks near your edit are new.
- It works across repos. The global registry catches duplicates even across different repositories sharing the same bucket — useful for teams with shared assets.
The chunking and dedup pipeline streams through your files. Crab does not need to hold the whole file in memory before it can classify and upload chunks.
Example Overlap Patterns
| Workload | What Usually Deduplicates |
|---|---|
| ML model weights with a few layers changed | Most unchanged layers |
| Game texture atlases with minor edits | Unchanged texture regions |
| Re-exported datasets with stable rows | Chunks around unchanged records |
| First push of entirely new content | Nothing yet; there is no history to dedup against |
The dedup ratio improves as the shard index sees more of your repository's history. The workloads that benefit most are the ones where each new version shares long stretches of bytes with earlier versions.
Summary
Crab's deduplication pipeline ensures your storage costs and push times scale with the amount of genuinely new content — not the total size of files you touched. Content-defined chunking finds stable split points that survive edits, three tiers of dedup checking eliminate redundant uploads, and xorb packing keeps your cloud bucket organized and efficient. For teams working with large binary assets, this translates directly into faster iteration and lower bills.