How Crab Avoids Storing Duplicate Data

Deduplication in one pass

1. Split each file at content-defined boundaries.
2. Hash every chunk with BLAKE3 so identical bytes get the same identity.
3. Check local session, shard metadata, and global indexes for existing chunks.
4. Pack only new chunks into xorbs and upload those objects.
5. Record metadata so future pushes and hydrations can reuse the same chunks.

Why Deduplication Matters

Every time you modify a large file and push it, traditional Git stores the entire file again. Change one layer in a 4 GB model? That's another 4 GB on your storage bill. Multiply that across a team pushing daily, and costs spiral fast.

Crab takes a different approach. Instead of storing whole files, it breaks them into smaller pieces, checks which pieces already exist, and only uploads the genuinely new ones. For iterative work where most bytes stay stable, that can turn a large file update into a much smaller upload.

What You'll Learn

How Crab finds smart split points in your files (content-defined chunking)
Why small edits don't cause massive re-uploads
How three layers of dedup checking eliminate redundant storage
How chunks get packed and addressed for fast retrieval

Key Takeaways

Only new data gets uploaded. Crab checks multiple dedup indexes before packing chunks for upload.
Smart chunking survives edits. Content-defined boundaries mean inserting or deleting bytes only affects nearby chunks — the rest deduplicates perfectly.
Storage costs scale with unique content. Pushing a new version of a 500 MB file that overlaps heavily with the previous version might upload only the changed slice.
It's streaming. The pipeline processes files incrementally instead of loading whole artifacts into memory.

How Crab Finds Natural Split Points

Think of content-defined chunking like finding natural paragraph breaks in a book. Instead of splitting every 64 pages regardless of content, you split at paragraph breaks, so if someone inserts a paragraph in chapter 3, only the splits near the insertion change. The rest of the book stays identical.

Crab uses a rolling hash (called Gearhash) that slides over your file byte by byte. When the hash hits a specific pattern, that's a chunk boundary. Because boundaries depend on the content itself, small edits only shift nearby boundaries — everything else stays the same and deduplicates against the previous version.
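
To make the boundary rule concrete, here is a minimal Python sketch of gear-style content-defined chunking. It is not Crab's implementation: the gear table seed, the boundary mask, and the function name `chunk_boundaries` are assumptions chosen for illustration, and only the 32/64/128 KB sizes come from this guide.

```python
import random

# Sizes follow the figures quoted in this guide; everything else is hypothetical.
MIN_SIZE = 32 * 1024
MAX_SIZE = 128 * 1024
# 15 high bits must be zero: a boundary fires about once per 32 KB of eligible
# bytes, so chunks average roughly MIN_SIZE + 32 KB = 64 KB.
MASK = ((1 << 15) - 1) << 48

# Illustrative gear table: one pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offset of every chunk in data."""
    cuts = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear update: shifting ages out old bytes, so the hash depends only
        # on the most recent 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        if (h & MASK) == 0 or size >= MAX_SIZE:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # trailing chunk may be shorter than MIN_SIZE
    return cuts
```

Because a cut depends only on the bytes just before it, inserting a few bytes near the start of a file changes the first chunk or two, after which the boundaries line up with the old version again and the remaining chunks hash identically.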

The target chunk size is 64 KB, with a minimum of 32 KB and maximum of 128 KB. For a 1 GB file, that's roughly 16,000 chunks — each independently addressable and dedupable.
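
The chunk-count figure can be checked with a few lines of arithmetic. This sketch assumes binary units (1 GB = 1024³ bytes, 1 KB = 1024 bytes), and the helper name `chunk_count_estimate` is made up for the example; the real count for a given file depends on where boundaries actually land.

```python
def chunk_count_estimate(file_size: int,
                         target: int = 64 * 1024,
                         minimum: int = 32 * 1024,
                         maximum: int = 128 * 1024) -> dict:
    """Bound and estimate the number of chunks for a file of file_size bytes."""
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    return {
        "fewest": ceil_div(file_size, maximum),    # every chunk at the maximum size
        "expected": ceil_div(file_size, target),   # every chunk at the target size
        "most": ceil_div(file_size, minimum),      # every chunk at the minimum size
    }


print(chunk_count_estimate(1024 ** 3))
# {'fewest': 8192, 'expected': 16384, 'most': 32768}
```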

Three Layers of Dedup Checking

Once a file is chunked, Crab checks each chunk against three tiers of previously stored content — from cheapest to most expensive:

Tier 1 (Session memory): A hash set of chunks seen during this push. Catches duplicates within the same operation (like copied assets). Cost: effectively zero, just an in-memory hash lookup.

Tier 2 — Shard index: A compact remote index mapping chunk hashes to storage locations. Catches data already in this repository without listing the object store.

Tier 3 — Global registry: Covers all chunks across all repos sharing the same storage backend. Catches cross-repo duplicates. Queries are batched to amortize latency.
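
The tier ordering can be sketched as a small classifier. This is an illustration under assumptions, not Crab's code: `classify`, `Verdict`, the `shard_index` mapping, and the `global_lookup` callable (which takes a set of hashes and returns the subset it knows) are all hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    NEW = "new"
    SESSION_DUPLICATE = "session duplicate"
    ALREADY_STORED = "already stored"


def classify(chunk_hashes, shard_index, global_lookup):
    """Return one Verdict per chunk, consulting the cheapest tier first."""
    session = set()                      # Tier 1: hashes seen in this push
    verdicts = [None] * len(chunk_hashes)
    pending = []                         # positions that still need Tier 3

    for pos, h in enumerate(chunk_hashes):
        if h in session:
            verdicts[pos] = Verdict.SESSION_DUPLICATE
            continue
        session.add(h)
        if h in shard_index:             # Tier 2: this repository's index
            verdicts[pos] = Verdict.ALREADY_STORED
        else:
            pending.append(pos)

    # Tier 3: a single batched query for everything the cheaper tiers missed.
    known = global_lookup({chunk_hashes[p] for p in pending}) if pending else set()
    for pos in pending:
        found = chunk_hashes[pos] in known
        verdicts[pos] = Verdict.ALREADY_STORED if found else Verdict.NEW
    return verdicts
```

Deferring the global lookup until the end is what lets the most expensive tier be queried once per batch rather than once per chunk.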

In practice, a typical push looks like this:

Chunk Classification (push of 847 chunks):
─────────────────────────────────────────────
New (will upload):         142 chunks  (16.8%)
Session duplicate:          89 chunks  (10.5%)
Already stored:            616 chunks  (72.7%)
─────────────────────────────────────────────
Upload savings: 83.2% of chunks skipped
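
The percentages in this report follow directly from the three counts. A short check, using the numbers shown above (the dictionary keys are illustrative names):

```python
counts = {"new": 142, "session_duplicate": 89, "already_stored": 616}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
skipped = round(100 * (total - counts["new"]) / total, 1)

print(total)    # 847
print(shares)   # {'new': 16.8, 'session_duplicate': 10.5, 'already_stored': 72.7}
print(skipped)  # 83.2
```

Note that these are shares of chunks, not bytes. Because chunk sizes vary between the minimum and maximum, the share of bytes skipped can differ somewhat from the share of chunks skipped.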

Real-world impact

For iterative development, many chunks are already stored remotely. The more overlap a new version has with previous versions, the less data Crab needs to upload.

How New Chunks Get Packed and Stored

Chunks that miss in all three dedup tiers (the genuinely new data) get grouped into larger archives called xorbs (roughly 64 MB each). This avoids creating millions of tiny objects in your cloud bucket.
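
One simple way to picture the grouping is a greedy packer that closes a batch once adding the next chunk would exceed the size budget. This is a sketch under assumptions: the function name `pack_into_xorbs`, the exact 64 MB limit, and the greedy policy are illustrative, and the guide does not describe the real on-disk xorb layout.

```python
XORB_TARGET = 64 * 1024 * 1024  # assumed budget, from the "roughly 64 MB" figure


def pack_into_xorbs(chunks, target=XORB_TARGET):
    """Group (hash, payload) pairs into batches of at most target payload bytes.

    A single payload larger than target still gets a batch of its own.
    """
    xorbs, current, size = [], [], 0
    for chunk_hash, payload in chunks:
        if current and size + len(payload) > target:
            xorbs.append(current)
            current, size = [], 0
        current.append((chunk_hash, payload))
        size += len(payload)
    if current:
        xorbs.append(current)
    return xorbs
```

With 64 KB chunks and a 64 MB budget, each batch holds on the order of a thousand chunks, which is where the reduction in object count comes from.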

Each xorb is compressed with zstd and addressed by its BLAKE3 hash, a fast cryptographic content hash. This means you can always verify data integrity on download, and identical content always maps to the same address.
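
The content-addressing idea can be shown with the Python standard library alone. Two substitutions are made so the example runs without extra packages: `hashlib.blake2b` stands in for BLAKE3 and `zlib` stands in for zstd. Hashing the uncompressed payload (rather than the compressed bytes) is also an assumption, and `store` and `load` are hypothetical helper names.

```python
import hashlib
import zlib


def _address(payload: bytes) -> str:
    # Stand-in hash: BLAKE2b with a 32-byte digest, NOT the BLAKE3 Crab uses.
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


def store(objects: dict, payload: bytes) -> str:
    """Store payload under its content hash and return the address."""
    address = _address(payload)
    # Identical content maps to the same address, so a repeat store is a no-op.
    objects.setdefault(address, zlib.compress(payload))  # zlib stands in for zstd
    return address


def load(objects: dict, address: str) -> bytes:
    """Fetch and decompress an object, verifying it against its address."""
    payload = zlib.decompress(objects[address])
    if _address(payload) != address:
        raise ValueError("integrity check failed for " + address)
    return payload
```

The verification in `load` costs one extra hash pass, but it means a corrupted or substituted object is detected before its bytes are used.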

What This Means for Your Workflow

The combination of content-defined chunking and tiered deduplication means:

Faster pushes. Less data to upload means shorter wait times. The win depends on how much of the new file overlaps existing chunks and how fast your network is.
Lower storage costs. You pay for unique content, not total file size. Teams working with large binary assets (ML models, game textures, video) see the biggest savings.
Edits stay cheap. Changing a few layers in a model file doesn't re-upload the whole thing. Only the chunks near your edit are new.
It works across repos. The global registry catches duplicates even across different repositories sharing the same bucket — useful for teams with shared assets.

Bounded resources

The chunking and dedup pipeline streams through your files. Crab does not need to hold the whole file in memory before it can classify and upload chunks.
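
A streaming loop of this kind can be sketched as a generator that keeps only a small buffer. Everything here is illustrative: `stream_chunks` is a hypothetical name, and `find_cut` is an assumed callback that returns the length of the next complete chunk in the buffer, or `None` when more bytes are needed.

```python
def stream_chunks(reader, find_cut, block_size=1 << 20):
    """Yield chunks from a file-like reader without loading the whole file.

    Peak buffer size is roughly block_size plus one maximum-size chunk,
    provided find_cut always returns a cut once the buffer reaches that size.
    """
    buf = bytearray()
    while True:
        block = reader.read(block_size)
        if not block:
            break
        buf += block
        while True:
            cut = find_cut(buf)
            if cut is None:
                break              # need more input before the next boundary
            yield bytes(buf[:cut])
            del buf[:cut]
    if buf:
        yield bytes(buf)           # final partial chunk at end of input
```

Each yielded chunk can be hashed, classified, and uploaded before the next block is read, so memory use stays flat no matter how large the file is.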

Example Overlap Patterns

Workload	What Usually Deduplicates
ML model weights with a few layers changed	Most unchanged layers
Game texture atlases with minor edits	Unchanged texture regions
Re-exported datasets with stable rows	Chunks around unchanged records
First push of entirely new content	Nothing yet; there is no history to dedup against

The dedup ratio improves as the shard index sees more of your repository's history. The best workloads are the ones where each new version shares long stretches of bytes with earlier versions.

Summary

Crab's deduplication pipeline ensures your storage costs and push times scale with the amount of genuinely new content — not the total size of files you touched. Content-defined chunking finds stable split points that survive edits, three tiers of dedup checking eliminate redundant uploads, and xorb packing keeps your cloud bucket organized and efficient. For teams working with large binary assets, this translates directly into faster iteration and lower bills.
