Inside Crab's 14-Step Push Pipeline
Crab's push pipeline breaks into local staging, dedup-aware upload, metadata publication, and atomic finalization. Here's where recovery and consistency boundaries live.
1. Stage large-file content locally and replace Git blobs with lightweight pointers.
2. Classify chunks against known content so only new bytes need upload.
3. Pack and upload new xorbs with resumable progress tracking.
4. Publish metadata before refs so readers never observe missing data.
5. Finalize with an atomic ref update or fail cleanly with nothing half-visible.
What Happens When You Push — and Why It's Safe
Every time you git push a Crab repository, your files travel through a pipeline that chunks, deduplicates, uploads, and atomically publishes your changes to cloud storage. There's no server coordinating any of it — just your machine talking directly to S3, GCS, or Azure Blob.
Internally, the pipeline has 14 steps. You don't need to remember any of them. What matters is the design promise: nothing can go wrong in a way that loses data or corrupts your repository. If your laptop reboots mid-push, you pick up exactly where you left off. If two teammates push at the same time, one wins cleanly and the other rebases. There is no in-between state for anyone to observe.
This post groups those 14 steps into three phases you can actually picture, then walks through what each phase is responsible for and why it's safe to fail in the middle of any of them.
What You'll Learn
- How Crab breaks a push into three phases that map to local prep, network upload, and atomic publish
- Why most of your data never needs uploading at all (deduplication)
- How content-addressed storage and compare-and-swap make every failure mode recoverable
TL;DR
- Phase 1 (Local): crab add chunks your files and stores them on disk. No network calls, so a 4 GB file stages in seconds.
- Phase 2 (Upload): Only new, unique chunks go up. Typical pushes ship 5–20% of the raw bytes you added.
- Phase 3 (Finalize): Metadata and refs are published atomically. A push is either fully visible or invisible — never half-applied.
- Crash-safe: A write journal tracks progress so an interrupted push resumes without re-uploading what already made it.
The Three Phases at a Glance
Think of a Crab push like mailing a package. First you pack the box on your kitchen counter (local staging). Then you ship only the items the recipient doesn't already have (smart upload). Finally the recipient signs for delivery and the order is officially complete (atomic finalization). If anything goes wrong, the box never disappears — you simply re-attempt delivery.
Phase 1: Stage Locally — No Network Needed
When: During crab add and git commit. Entirely on your machine.
When you run crab add large-model.bin, Crab splits the file into variable-size chunks using content-defined chunking (CDC). Think of it like finding natural paragraph breaks in a book — boundaries are determined by the bytes themselves, not fixed offsets. The practical benefit: a small edit only changes one or two chunks instead of shifting every chunk after the edit point.
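To make the "natural paragraph breaks" idea concrete, here is a minimal sketch of content-defined chunking with a gear-style rolling hash. The size limits, the hash, and the function name `chunk` are illustrative assumptions; this post does not state which CDC algorithm or parameters Crab actually uses.

```python
import hashlib

# Hypothetical parameters for illustration only (not Crab's real values).
MIN_SIZE = 256            # never cut a chunk shorter than this
AVG_MASK = (1 << 10) - 1  # cut when the low 10 hash bits are zero (~1 KiB spacing)
MAX_SIZE = 8192           # force a cut if no boundary is found by this length

# One pseudo-random 64-bit value per byte value, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]

def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by the bytes themselves."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the low bits, so the boundary test
        # depends only on the most recent bytes, not on absolute offsets.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & AVG_MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries follow content, inserting a few bytes near the start of a file changes the chunks around the edit, after which the cut points typically fall back onto the same positions as before.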
These chunks get hashed and stored in a local staging database (a SlateDB instance under .crab/staging/). Crab then writes a tiny pointer blob — under 1 KB — that records the file's chunk list and overall content hash. Git commits this pointer instead of the multi-gigabyte original, which is why your repo stays small.
The key insight is that nothing leaves your machine yet. A 4 GB checkpoint stages in seconds because there's no upload to wait for. You can stage a hundred files, change your mind, run crab add again, and never burn a single byte of bandwidth.
Phase 2: Upload Only What's Actually New
When: During git push, handled by the git-remote-crab helper that git invokes for crab:// URLs.
This is where deduplication earns its keep. Before uploading anything, Crab pulls down a compact index of chunks already stored remotely and walks every local chunk against it. Each chunk falls into one of three buckets:
Push Classification Summary
════════════════════════════════════════════════════════════
Chunks scanned: 2,341
────────────────────────────────────────────────────────
Category A (new): 187 (8.0%) → upload
Category B (session-dedup): 54 (2.3%) → skip
Category C (global-dedup): 2,100 (89.7%) → skip
────────────────────────────────────────────────────────
Upload required: 187 chunks (~11.7 MB)
Upload avoided: 2,154 chunks (~134.6 MB)
════════════════════════════════════════════════════════════
In this real-world example, 92% of the chunks (2,154 of 2,341) needed no upload: most were already in the cloud from previous pushes (Category C), and the rest were deduplicated within the session (Category B). Only 11.7 MB actually traveled over the network, versus about 146 MB if every chunk had been uploaded blindly.
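The three-way classification can be sketched as a single pass over the local chunks. This is an illustration under assumptions: SHA-256 stands in for whatever hash Crab uses, the remote index is modeled as a plain set of digests, and "session-dedup" is taken to mean a chunk that already appeared earlier in the same push. The function name `classify` is hypothetical.

```python
import hashlib

def classify(local_chunks: list[bytes], remote_index: set[str]) -> dict[str, list[str]]:
    """Sort chunk digests into upload / skip buckets (hypothetical helper)."""
    buckets: dict[str, list[str]] = {
        "A_new": [],            # must be uploaded
        "B_session_dedup": [],  # duplicate of a chunk seen earlier in this push
        "C_global_dedup": [],   # already present in remote storage
    }
    seen_this_push: set[str] = set()
    for data in local_chunks:
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        if digest in remote_index:
            buckets["C_global_dedup"].append(digest)
        elif digest in seen_this_push:
            buckets["B_session_dedup"].append(digest)
        else:
            buckets["A_new"].append(digest)
            seen_this_push.add(digest)
    return buckets
```

Only the digests in the `A_new` bucket correspond to bytes that need to cross the network.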
Category A chunks (genuinely new bytes) get packed into compressed bundles called xorbs, each around 64 MB, and uploaded to object storage in parallel. If a previous push was interrupted partway through, Crab's write journal already knows which xorbs landed successfully. Those get skipped on the retry without you doing anything.
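A resumable upload loop along these lines might look like the sketch below. The journal format (one JSON object per line), the file layout, and the names `load_journal` and `upload_all` are assumptions for illustration; the post does not describe how Crab's write journal is actually stored.

```python
import json
import os
from typing import Callable

def load_journal(path: str) -> set[str]:
    """Return the IDs of xorbs recorded as fully uploaded."""
    done: set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["xorb"])
            except (json.JSONDecodeError, KeyError):
                continue  # ignore a torn final line left by a crash
    return done

def upload_all(xorbs: dict[str, bytes], journal_path: str,
               upload: Callable[[str, bytes], None]) -> list[str]:
    """Upload every xorb not yet journaled; return the IDs uploaded now."""
    done = load_journal(journal_path)
    uploaded_now: list[str] = []
    with open(journal_path, "a", encoding="utf-8") as journal:
        for xorb_id, payload in xorbs.items():
            if xorb_id in done:
                continue  # landed during an earlier attempt
            upload(xorb_id, payload)  # may raise; nothing is journaled then
            journal.write(json.dumps({"xorb": xorb_id}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # make the record durable before moving on
            uploaded_now.append(xorb_id)
    return uploaded_now
```

The ordering is the important design choice: an entry is written only after the upload call returns, so a crash can at worst cause one xorb to be uploaded again, never cause one to be skipped.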
Phase 3: Publish Atomically So Readers Never See Half a Push
When: After every xorb is safely persisted in object storage.
With your data uploaded, Crab builds two pieces of metadata: a shard that maps files to their chunks, and a file index that lets readers look up files by name. Both are uploaded as immutable, content-addressed objects. So far, nothing visible has changed — readers still see the previous repository state.
The push only becomes visible at the very last step: a compare-and-swap (CAS) update of the ref. CAS is the same primitive databases use for safe concurrent writes. Crab reads the current ref, prepares the new value, and writes it only if nobody else changed it in the meantime:
if current_ref == expected_old_value:
    current_ref = new_value   ← push succeeds, changes go live
else:
    reject                    ← someone else pushed first; you rebase
Before this single instant, the world sees the old state. After it, the world sees your complete changes. There is no observable in-between, which is what "atomic" actually means in practice.
If the CAS is rejected, your uploaded xorbs are still valid and content-addressed. After you git pull --rebase and try again, the next push deduplicates against everything you already uploaded. Only the metadata and ref need to be redone, which usually takes seconds.
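The accept-or-reject behavior of the ref update can be modeled in a few lines. This is an in-memory stand-in, not Crab's implementation: `RefStore` and `try_push` are hypothetical names, and a lock plays the role that a conditional write would play against real object storage.

```python
import threading

class RefStore:
    """In-memory stand-in for a ref kept in object storage (hypothetical)."""

    def __init__(self, initial: str) -> None:
        self._value = initial
        self._lock = threading.Lock()

    def read(self) -> str:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: str, new: str) -> bool:
        # The check and the write happen under one lock, so no other
        # writer can slip in between them.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

def try_push(refs: RefStore, expected_old: str, new_commit: str) -> str:
    """Attempt to move the ref; report whether the push went live."""
    if refs.compare_and_swap(expected_old, new_commit):
        return "accepted"
    return "rejected: pull --rebase and retry"
```

If two pushes both start from the same old value, the first call moves the ref and the second sees a mismatch and is rejected, leaving the ref at the winner's value.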
Why Every Failure Mode Is Recoverable
The pipeline is designed so a crash at any point leaves the system in a state you can simply retry from. Each phase has a clear recovery story:
- Crash during Phase 1? Local staging is just files in a SlateDB instance. Re-run crab add and Crab picks up where it stopped. Chunking is deterministic, so identical inputs produce identical chunks.
- Crash during Phase 2? The write journal records each xorb upload as it completes. The next push reads the journal, skips finished xorbs, and resumes from the first unfinished one.
- Crash during Phase 3? Xorbs are immutable and content-addressed, so they're harmless sitting in storage. The ref hasn't moved yet, which means no reader is seeing partial state. Retrying simply redoes the metadata and CAS.
Every operation along the way is idempotent. Hashing the same bytes always produces the same chunk hash. Uploading the same xorb twice is a content-equivalent no-op. CAS either commits cleanly or fails cleanly — there is no fuzzy middle.
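The idempotence of content-addressed uploads is easy to see in a toy model. The sketch below assumes the storage key is the SHA-256 of the payload and uses a dictionary in place of a bucket; `ObjectStore` and `put` are hypothetical names, not part of Crab or any cloud SDK.

```python
import hashlib

class ObjectStore:
    """Dictionary-backed stand-in for a storage bucket (hypothetical)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.new_objects = 0  # counts writes that actually created an object

    def put(self, payload: bytes) -> str:
        # The key is derived from the bytes, so identical payloads
        # always map to the same key (assumed hash: SHA-256).
        key = hashlib.sha256(payload).hexdigest()
        if key not in self.objects:
            self.objects[key] = payload
            self.new_objects += 1
        return key
```

Calling `put` twice with the same bytes returns the same key and leaves the store unchanged the second time, which is why a retried or duplicated upload is harmless.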
How Concurrent Pushes Stay Consistent
Multiple people can push to the same repository simultaneously without coordinating through a server. The pipeline handles the race naturally because every contested step uses content-addressed objects or CAS, never blind overwrites.
- Xorb uploads never conflict. They're content-addressed, so two people uploading identical bytes simply land the same object at the same key, and the stored result is identical whichever write arrives last.
- Metadata updates use CAS. If two pushes race, exactly one wins the metadata round and the other retries against the merged state.
- Ref updates use CAS. The losing push pulls, rebases, and re-pushes. Because all its uploaded xorbs are still valid, the retry is fast — just rebuild metadata and re-attempt the ref CAS.
The worst case is a rejected push that asks you to rebase. Even in that scenario, no bandwidth is wasted: the next push deduplicates against everything already uploaded, including your own previous attempt.
What This Means for You
You don't need to memorize 14 steps to use Crab safely. The mental model that fits in your head is the three phases above: stage locally, upload only what's new, publish atomically. The rest is the pipeline keeping its promises behind the scenes.
Local-First
Staging is instant and offline. Nothing crosses the network during crab add or git commit.
Minimal Uploads
High-overlap changes skip known chunks. Only genuinely new data crosses the wire.
Crash-Safe
The write journal lets interrupted pushes resume in seconds without re-uploading completed xorbs.
Atomic Visibility
The ref CAS is the single point a push becomes visible. Readers never see partial state.