How Crab Plugs Into Your Git Workflow
Crab uses two Git extension points — remote helpers and filter processes — to keep large files transparent. One binary fills both roles and shares the same staging/config path.
Crab Doesn't Replace Git — It Extends It
You already know git push, git pull, and git add. Crab doesn't change any of those. It plugs into git's built-in extension system so large files get chunked, deduplicated, and stored in cloud storage automatically. You keep your existing workflow; Crab handles the heavy lifting in the background.
What You'll Learn
- How Crab hooks into git without changing your commands
- The two extension points that make transparent large-file handling possible
- Why a single binary serving both roles avoids subtle version-skew bugs
- Why the filter process stays fast even with thousands of files
Key Takeaways
- Zero workflow changes. You keep using git push, git pull, and git add; Crab handles the rest behind the scenes.
- Two roles, one binary. A remote helper (transport) and a filter process (file transformation) ship as the same compiled binary.
- No version skew. Both roles share one binary, one staging area, and one config — so they can't drift out of sync.
- Fast at scale. The filter stays running for the entire git operation, so its startup cost is paid once whether you add 1 large file or 500.
How One Binary Serves Two Roles
Git was designed to be extended without forking. Crab plugs into two of those extension points:
- Remote helper. When you git push or git pull, git spawns a helper program to talk to the remote. Crab's helper speaks to S3, GCS, or Azure instead of a git server.
- Filter process. When you git add a large file, the filter swaps it for a tiny pointer. On checkout, it reconstructs the original file from cloud storage.
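What's unusual about Crab is that both roles are the same compiled binary. The binary itself, not git, decides which mode to enter based on how it was invoked: run under the name git-remote-crab it acts as the remote helper, and run as crab filter-process it acts as the filter. Choosing behavior from the invocation name is a Unix trick called argv[0] dispatch. The same idea powers BusyBox, where a single executable provides dozens of utilities.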
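A minimal sketch of that dispatch, written in Python for readability. The function name select_mode and the "cli" fallback are hypothetical; only the git-remote-crab name and the filter-process argument come from this page, and Crab's real entry point may be organized differently.

```python
import os


def select_mode(argv):
    """Choose a role from the invocation name first, then from the first argument."""
    invoked_as = os.path.basename(argv[0])
    if invoked_as == "git-remote-crab":
        # Git starts remote helpers under the name git-remote-<scheme>.
        return "remote-helper"
    if len(argv) > 1 and argv[1] == "filter-process":
        # Matches the `process = crab filter-process` git config entry.
        return "filter-process"
    # Hypothetical fallback: any other invocation is treated as the plain CLI.
    return "cli"
```

Because the symlink and the binary are the same file on disk, both branches always run the same build.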
The single-binary design has practical consequences. There's no version skew between transport and filter — they always run identical code. They share the same staging area, configuration, and chunk cache, so chunks staged during git add are immediately available during git push. And installation is one command instead of three.
How Push and Fetch Work Behind the Scenes
When you run git push crab://bucket/repo, git doesn't know how to talk to S3. It looks for a program called git-remote-crab on your PATH, spawns it, and communicates over a simple text protocol on stdin and stdout.
The conversation looks roughly like this:
- Handshake. Git asks "what can you do?" and the helper replies with its capabilities.
- List refs. Git asks "what branches exist on the remote?" The helper reads the ref manifest from cloud storage.
- Push. Git sends the refs to update. The helper runs Crab's push pipeline: read staged chunks, pack them into compressed archives (called xorbs), upload to S3, update the manifest.
- Report. The helper tells git whether each ref update succeeded or failed.
It's the same pattern that git-remote-https uses for GitHub — just pointed at object storage instead of a git server. From git's perspective, S3 is the remote.
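The conversation above can be sketched as a small command loop. The command names (capabilities, list, push) and the ok/error replies follow git's documented remote-helper protocol; everything else is an assumption for illustration. In particular, run_helper, the remote_refs dictionary standing in for the ref manifest, and the push_fn callback standing in for Crab's push pipeline are hypothetical, and the capabilities Crab actually advertises are not stated here.

```python
def run_helper(stdin, stdout, remote_refs, push_fn):
    """Answer remote-helper commands read from stdin until git signals it is done.

    remote_refs: dict of ref name -> object id (stand-in for the ref manifest).
    push_fn(src, dst): returns None on success or an error message string.
    """
    while True:
        line = stdin.readline()
        cmd = line.rstrip("\n")
        if cmd == "":
            break  # EOF or a blank line outside a batch: nothing more to do
        if cmd == "capabilities":
            # Handshake: one capability per line, terminated by a blank line.
            stdout.write("push\nfetch\n\n")
        elif cmd in ("list", "list for-push"):
            # List refs: "<object id> <ref name>" per line, then a blank line.
            for name, oid in sorted(remote_refs.items()):
                stdout.write(f"{oid} {name}\n")
            stdout.write("\n")
        elif cmd.startswith("push "):
            # Push commands arrive as a batch terminated by a blank line.
            specs = [cmd[len("push "):]]
            while True:
                nxt = stdin.readline().rstrip("\n")
                if nxt == "":
                    break
                specs.append(nxt[len("push "):])
            # Report: one "ok" or "error" line per destination ref.
            for spec in specs:
                src, dst = spec.lstrip("+").split(":", 1)
                err = push_fn(src, dst)
                stdout.write(f"ok {dst}\n" if err is None else f"error {dst} {err}\n")
            stdout.write("\n")
        stdout.flush()
```

A real helper would be wired to sys.stdin and sys.stdout; passing the streams in keeps the loop easy to exercise with in-memory buffers.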
Type git push, git pull, git clone — everything works as expected. Crab handles the heavy lifting behind the scenes. No new commands to learn for everyday work.
How Large Files Stay Transparent
The remote helper handles transport. The filter process handles content transformation — it's what makes large files appear normal in your working tree while only tiny pointers actually live in git's object database.
Two operations make this work:
- Clean runs on git add. It chunks and deduplicates your 4 GB model file, stages the chunks locally, and gives git a small text pointer to commit.
- Smudge runs on git checkout. It takes the pointer and reconstructs the original file from local cache or cloud storage.
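To make the clean and smudge halves concrete, here is a toy sketch of chunk-level deduplication. It is not Crab's algorithm: it assumes fixed-size chunks (a real chunker may pick boundaries from content), uses BLAKE2b from Python's standard library as a stand-in for Blake3, and keeps the staging area in a plain dictionary. The names stage_chunks and rebuild are hypothetical, and how Crab maps a pointer to its chunk list is not described here.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow (assumption)


def stage_chunks(data: bytes, store: dict) -> list:
    """Clean side: split data into chunks, store each distinct chunk once."""
    chunk_ids = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        # BLAKE2b is a stand-in here; the pointer format on this page uses Blake3.
        chunk_id = hashlib.blake2b(chunk, digest_size=32).hexdigest()
        store.setdefault(chunk_id, chunk)  # identical chunks are kept only once
        chunk_ids.append(chunk_id)
    return chunk_ids


def rebuild(chunk_ids: list, store: dict) -> bytes:
    """Smudge side: reassemble the original bytes from the stored chunks."""
    return b"".join(store[chunk_id] for chunk_id in chunk_ids)
```

Staging b"abcdabcdxyz" yields three chunk ids but only two stored chunks, because the repeated "abcd" block is deduplicated.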
The performance trick is git's long-running filter protocol. Older filter drivers spawn a fresh process per file — fine for ten files, catastrophic for a thousand. The long-running protocol spawns the filter once and streams every operation over a persistent pipe, so the expensive setup (opening the staging database, loading the chunk cache, warming the dedup bloom filter) only happens once.
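Messages on that persistent pipe are framed in git's pkt-line format: four hex digits giving the total length (including those four digits), then the payload, with 0000 acting as a flush marker. The sketch below shows the framing and the opening handshake reply. The handshake strings come from git's documentation of the long-running filter protocol; the function names are hypothetical, and this is not Crab's implementation.

```python
FLUSH_PKT = b"0000"
MAX_PAYLOAD = 65516  # largest payload git allows in a single pkt-line


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one pkt-line")
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def read_until_flush(stream) -> list:
    """Read pkt-lines from a binary stream until a flush packet arrives."""
    packets = []
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("stream ended before a flush packet")
        length = int(header.decode("ascii"), 16)
        if length == 0:
            return packets  # flush packet: end of this message
        if length < 4:
            raise ValueError("invalid pkt-line length")
        packets.append(stream.read(length - 4))


def server_welcome() -> bytes:
    """The filter's reply to git's 'git-filter-client' / 'version=2' greeting."""
    return pkt_line(b"git-filter-server\n") + pkt_line(b"version=2\n") + FLUSH_PKT
```

After this exchange and a capability negotiation, git sends one request per file over the same pipe, which is why the setup work is not repeated.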
In practice, this means the fixed setup cost of staging 500 large files is the same as for staging 1; only the per-file chunking work grows with the number of files. The dedup bloom filter, staging database handle, and chunk cache all live across the entire git add command.
What the Pointer Looks Like
When the filter cleans a file, it produces a small text pointer that git actually commits:
version https://crab.io/spec/v1
oid blake3:a7f3b2c1d4e5f6789012345678901234567890123456789012345678901234ab
size 1073741824
The pointer stays under 200 bytes regardless of whether the original was 100 MB or 100 GB. The oid is a Blake3 hash that uniquely identifies the file's content; size records the original byte count (here 1073741824 bytes, or 1 GiB). On checkout, the smudge filter uses this pointer to reconstruct the exact original file from cloud storage (or local cache if it's already been downloaded).
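As an illustration of how little there is to a pointer, the following sketch parses the three fields shown above. The parse_pointer function and its validation rules are assumptions based only on this example (a "key value" line per field, a 64-hex-digit blake3 oid); the actual spec may allow or require more.

```python
POINTER_VERSION = "https://crab.io/spec/v1"
HEX_DIGITS = set("0123456789abcdef")


def parse_pointer(text: str) -> dict:
    """Parse a pointer in the three-line form shown above (assumed format)."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if fields.get("version") != POINTER_VERSION:
        raise ValueError("not a Crab v1 pointer")
    algo, _, digest = fields.get("oid", "").partition(":")
    if algo != "blake3" or len(digest) != 64 or not set(digest) <= HEX_DIGITS:
        raise ValueError("unexpected oid format")
    # int() raises ValueError if the size field is missing or not a number.
    return {"algo": algo, "digest": digest, "size": int(fields.get("size", ""))}
```

A filter could use a check like this to tell pointer blobs apart from ordinary small text files before attempting reconstruction.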
How Installation Connects Everything
When you run make install, three things happen:
- The release binary builds.
- It's installed to ~/.cargo/bin/crab.
- A symlink is created: ~/.cargo/bin/git-remote-crab → crab.
Then your git config ties it together:
[filter "crab"]
process = crab filter-process
required = true
The required = true flag matters. Without it, git would silently commit raw file content if the filter exits unexpectedly, which is easy to miss until you accidentally push a 4 GB file directly into git's object database. With required, git aborts on filter failure and you get a clear error instead of a corrupted commit.
Don't copy the binary manually or use cargo install. The Makefile keeps the binary and git-remote-crab symlink in sync. A stale symlink pointing at an old binary is the most common source of hard-to-debug failures.
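If you suspect the helper and the binary have drifted apart, a quick check is to confirm that both names resolve to the same file. The sketch below is a hypothetical diagnostic, not part of Crab; it assumes the install layout described above, with crab and git-remote-crab in the same directory.

```python
import os


def helper_points_at_binary(bin_dir: str) -> bool:
    """Return True if git-remote-crab resolves to the crab binary in bin_dir."""
    binary = os.path.join(bin_dir, "crab")
    helper = os.path.join(bin_dir, "git-remote-crab")
    if not os.path.exists(binary) or not os.path.exists(helper):
        return False  # missing binary, missing helper, or a dangling symlink
    # realpath follows symlinks, so both names must end at the same file.
    return os.path.realpath(helper) == os.path.realpath(binary)
```

For the layout above you would call it as helper_points_at_binary(os.path.expanduser("~/.cargo/bin")); a False result suggests rerunning make install.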
The Complete Flow: Add, Commit, Push
Putting it all together, here's what happens end-to-end when you push a large file:
- git add large-model.bin → the filter chunks the file, stages the chunks locally, and hands git a pointer blob.
- git commit → git records the pointer blob (not the 4 GB file) in the commit.
- git push origin main → git spawns git-remote-crab.
- The remote helper reads staged chunks, classifies them as new vs. already-uploaded, packs new ones into xorbs, uploads to S3, updates the manifest, and reports success.
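The classify-and-pack step in the last bullet can be sketched as follows. This is an illustration under stated assumptions: plan_push, the size-based xorb_limit, and the greedy grouping are hypothetical, and the sketch only decides which chunks go into which xorb without compressing or uploading anything.

```python
def plan_push(staged: dict, already_uploaded: set, xorb_limit: int) -> list:
    """Group new chunks into xorbs.

    staged: chunk id -> size in bytes, in staging order.
    already_uploaded: chunk ids the remote is known to hold.
    xorb_limit: assumed maximum total chunk size per xorb, in bytes.
    Returns a list of xorbs, each a list of chunk ids.
    """
    xorbs, current, current_size = [], [], 0
    for chunk_id, size in staged.items():
        if chunk_id in already_uploaded:
            continue  # classified as already-uploaded: nothing to send
        if current and current_size + size > xorb_limit:
            xorbs.append(current)  # current xorb is full, start a new one
            current, current_size = [], 0
        current.append(chunk_id)
        current_size += size
    if current:
        xorbs.append(current)
    return xorbs
```

Skipping chunks the remote already holds is what keeps a push small when only part of a large file has changed.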
When a collaborator clones or pulls, the same machinery runs in reverse. Git spawns git-remote-crab to fetch refs and pack data. On checkout, the filter sees pointer blobs and reconstructs files from cloud storage or local cache.
Your team uses standard git commands, large files live efficiently in cloud storage, and nobody has to run a server.
What This Means for Your Workflow
The single-binary approach gives you three things that matter day-to-day.
No version skew. Multi-binary git extensions are notorious for subtle bugs when components drift out of sync after a partial upgrade. Crab's transport and filter always run the same code by construction.
Shared state. Both roles read the same staging area, configuration, and chunk cache. Chunks staged during git add are immediately available during git push — no redundant work, no second pass.
Simple installation. One make install and you're done. No package managers, no separate services, no daemon to keep running. Your repo URL just looks like crab://bucket/repo instead of https://github.com/....
The deeper point: Crab doesn't reinvent git. It uses git's own well-defined extension points to add capabilities git doesn't have on its own. Your existing tooling, scripts, IDE integrations, and CI jobs keep working — they're still talking to git the same way. Crab just makes large-file storage cheap, fast, and serverless underneath.