Garbage Collection in a Serverless World

Crab Cleans Up Old Data Automatically — and Safely

Every push you make stores compressed file data in cloud storage. Over time, as you overwrite files and delete branches, some of that data becomes orphaned — nothing references it anymore, but it still costs you money sitting in S3.

Most version control systems solve this with a long-running server that sweeps unused data in the background. Crab doesn't have a server. Each command is a short-lived CLI process that runs and exits.

So how does cleanup work without a daemon? Crab finds unused data, waits a safety period, then deletes it — like a recycle bin for your cloud bucket. Items are not deleted the moment you "delete" them. By default, recent objects are protected for 24 hours so in-flight operations have time to publish their metadata.

TL;DR

Recycle-bin model: unused data is marked, held past the grace period, then permanently deleted
No server needed: GC runs as a regular CLI command (crab gc) — no daemon, no cron job required
Safe by design: the grace period keeps in-flight pushes and clones from having their data swept out from under them
Worst case is wasted storage: GC never deletes referenced data, even if interrupted mid-run

How Crab Decides What's Garbage

Think of your repository as a family tree. At the top sit your branches — refs like main and release-1.2. Each branch points to commits, which point to files, which point to chunks stored inside compressed archives called xorbs (think of a xorb as a small zip file holding many deduplicated chunks).

A xorb is live if any branch still needs chunks inside it. GC's only job is to find xorbs nothing depends on anymore — orphans hanging off no branch. Deleted feature branches, force-pushed history, files removed in later commits: all common sources of orphan xorbs that quietly add up over months.

Crab uses a classic two-pass algorithm called mark-sweep, the same approach used in language runtimes like Java and Go. The mark phase walks the reference graph from your branches outward, building a set of every xorb still in use. The sweep phase compares that set against what's actually in cloud storage. Anything not in the live set is a candidate for deletion — but only if it's old enough.

Imagine cleaning out a self-storage unit. First, you make a list of which boxes are still labeled and needed (mark). Then you check when the unlabeled boxes were last touched. If they are past the configured grace window, they are safe to toss (sweep). Anything recent stays because someone might still be in the middle of labeling it.
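The two passes can be sketched in a few lines of Python. This is only an illustration of the mark-sweep idea described above, not Crab's actual code: the dictionaries standing in for refs, commits, files, and stored xorbs are hypothetical, and so are the function names.

```python
from datetime import datetime, timedelta, timezone

def mark(refs, commits, files):
    """Mark phase: walk refs -> commits -> files -> xorbs, collecting live xorb IDs."""
    live = set()
    for commit_ids in refs.values():
        for commit_id in commit_ids:
            for file_id in commits[commit_id]:
                live.update(files[file_id])
    return live

def sweep_candidates(stored, live, now, grace):
    """Sweep phase: keep only xorbs that are unreferenced AND older than the grace window."""
    return sorted(
        xorb for xorb, uploaded_at in stored.items()
        if xorb not in live and now - uploaded_at > grace
    )

# Hypothetical repository state, for illustration only.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
refs = {"main": ["c1", "c2"]}
commits = {"c1": ["f1"], "c2": ["f1", "f2"]}
files = {"f1": ["X1"], "f2": ["X2"]}
stored = {
    "X1": now - timedelta(days=30),   # live
    "X2": now - timedelta(days=2),    # live
    "X3": now - timedelta(days=5),    # orphan, past the grace window
    "X7": now - timedelta(hours=2),   # orphan, but still recent
}

live = mark(refs, commits, files)
candidates = sweep_candidates(stored, live, now, timedelta(hours=24))
print(candidates)  # ['X3']
```

X7 is unreferenced but only two hours old, so it stays. Only X3 is both orphaned and old enough to be a deletion candidate.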

Why the Wait Matters

You might wonder: why not delete unused data immediately? The answer is concurrency.

Multiple people and CI systems can work with the same repository simultaneously. Their operations overlap in ways that are hard to predict, especially when uploads take minutes or hours.

Picture this race: you start a push. During that push, Crab discovers that a chunk it needs already lives in xorb X7. It plans to reference X7 in its new metadata. Meanwhile, on another machine, someone else's crab gc run sees X7 as unreferenced — your push hasn't finished writing its metadata yet — and deletes it. Your push completes, but now points to data that no longer exists. Hydration fails permanently.

The grace period prevents this race. Every xorb is timestamped at upload time, and objects inside the grace window are retained regardless of whether they appear in the live set yet. The default is 24 hours, and custom non-force grace periods are clamped to a one-hour minimum.
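The rules in the paragraph above (24-hour default, one-hour minimum for custom values, force mode skipping the filter) can be written down as a small sketch. The function and constant names are made up for illustration, and treating an object exactly at the boundary as still protected is an assumption, not something the text specifies.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GRACE = timedelta(hours=24)  # default stated in the article
MIN_GRACE = timedelta(hours=1)       # clamp for custom, non-force values

def effective_grace(requested=None, force=False):
    """Return the grace window a run would use (hypothetical helper)."""
    if force:
        return timedelta(0)          # force mode bypasses the grace filter
    if requested is None:
        return DEFAULT_GRACE
    return max(requested, MIN_GRACE)

def is_protected(last_modified, now, grace):
    """An object inside the grace window is retained, referenced or not."""
    return now - last_modified <= grace

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
uploaded = now - timedelta(minutes=30)

print(effective_grace(timedelta(minutes=10)))                 # 1:00:00
print(is_protected(uploaded, now, effective_grace()))         # True
print(is_protected(uploaded, now, effective_grace(force=True)))  # False
```

A requested ten-minute window is raised to one hour, and a thirty-minute-old object is retained unless force mode removes the window entirely.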

With the default window, this covers the common concurrent operations:

Operation	Typical Duration	Safe?
Typical push	Minutes	✓ 24 hours >> minutes
Clone of massive repo	Hours	✓ Covered by the default window
Suspended laptop push	Longer than expected	✓ Retained if still inside the configured grace window

The timestamp comes from S3's Last-Modified metadata, set by the storage backend at upload time. Clients can't spoof it.

Running GC When You Need It

You trigger GC explicitly, either by hand or from a scheduled job. Running it by hand gives you full control: start with a dry run to see what would be cleaned up, then run it for real:

# See what would be deleted (safe preview)
crab gc --dry-run

# Actually collect garbage
crab gc

# Force mode bypasses the grace filter after confirmation.
# Use only when you know no concurrent pushes are running.
crab gc --force

For predictable cleanup, many teams run crab gc --dry-run before a scheduled crab gc in CI. The command stays explicit, which keeps deletion timing under your control.

How Two People Don't Collide

When multiple people share a repository, only one GC can run at a time. Crab uses a lock file in cloud storage (.crab/locks/gc.lock) with a one-hour expiry. The lock is acquired with a conditional write — create-if-not-exists — so two simultaneous GC attempts can't both succeed. If a process crashes, the lock expires automatically and the next run picks up safely.
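To make the create-if-not-exists idea concrete, here is a minimal sketch that uses an in-memory dictionary in place of cloud storage. The InMemoryStore class, the put_if_absent method, and the lock record layout are all hypothetical. Only the lock path and the one-hour expiry come from the description above.

```python
import time

LOCK_KEY = ".crab/locks/gc.lock"
LOCK_TTL_SECONDS = 3600  # one-hour expiry, as described in the article

class InMemoryStore:
    """Stand-in for an object store that supports conditional writes."""
    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, value):
        """Create the object only if the key does not exist yet."""
        if key in self.objects:
            return False
        self.objects[key] = value
        return True

def try_acquire_gc_lock(store, owner, now=None):
    """Return True if this process now holds the GC lock."""
    now = time.time() if now is None else now
    existing = store.objects.get(LOCK_KEY)
    if existing is not None and existing["expires_at"] <= now:
        # Assumption: an expired lock may be cleared by the next run.
        del store.objects[LOCK_KEY]
    record = {"owner": owner, "expires_at": now + LOCK_TTL_SECONDS}
    return store.put_if_absent(LOCK_KEY, record)

store = InMemoryStore()
print(try_acquire_gc_lock(store, "alice", now=0))     # True
print(try_acquire_gc_lock(store, "bob", now=10))      # False
print(try_acquire_gc_lock(store, "bob", now=3600))    # True
```

The stale-lock takeover here is simplified. On a real object store, clearing an expired lock would itself need to be conditional (for example, tied to the version of the lock that was read) so that two processes cannot both take over at once.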

The elegant part: GC doesn't need to coordinate with pushes beyond the grace period. New xorbs from in-flight pushes are recent, so a normal GC run keeps them even if their final metadata has not appeared yet. The grace period decouples routine cleanup from push coordination.

What Makes It Safe

Crab's GC is designed so the worst possible outcome is wasted storage — never data loss. That guarantee comes from layering several conservative checks on top of the grace period:

Grace period: recent objects are retained even if they appear unreferenced right now
Double-check: each xorb targeted for deletion is verified against the live set one final time during the sweep
Conservative on errors: if listing storage objects fails partway through, GC only deletes what it can definitively prove is unreferenced — partial information means partial cleanup, not risky guesses
Cancellation-safe: you can Ctrl+C mid-sweep without worry — each deletion is independent and idempotent, so a partial run leaves the repository fully consistent
Resumable: running GC twice produces the same result, and interrupted runs can be re-run with no special recovery steps

If anything looks off (a network blip, an unexpected response from cloud storage, a missing shard), GC stops rather than guessing whether something is safe to delete. The trade-off is straightforward: occasionally Crab keeps a bit of garbage longer than strictly necessary, in exchange for the certainty that referenced data never disappears.
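The double-check and the idempotent, resumable deletes can be illustrated with a short sketch. It assumes storage is a plain dictionary mapping xorb IDs to sizes, and the sweep function is hypothetical. The report keys mirror the summary format shown in this article, but counting would-be deletions during a dry run is an assumption.

```python
def sweep(objects, candidates, live, dry_run=False):
    """Delete candidate xorbs, re-checking liveness right before each delete."""
    report = {"xorbs_deleted": 0, "bytes_reclaimed": 0, "dry_run": dry_run}
    for xorb in candidates:
        if xorb in live:
            continue              # final check: never delete a referenced xorb
        size = objects.get(xorb)
        if size is None:
            continue              # already gone, e.g. from an interrupted run
        if not dry_run:
            del objects[xorb]     # each delete is independent of the others
        report["xorbs_deleted"] += 1
        report["bytes_reclaimed"] += size
    return report

# Hypothetical bucket contents: xorb ID -> size in bytes.
objects = {"X3": 100, "X4": 250, "X9": 50}
live = {"X9"}
candidates = ["X3", "X4", "X9"]   # X9 slipped in but is still live

first = sweep(objects, candidates, live)
second = sweep(objects, candidates, live)   # re-running is harmless
print(first)
print(second)
print(objects)  # {'X9': 50}
```

The first run removes X3 and X4 and skips X9 because it is still in the live set. The second run finds nothing left to do, which is what makes an interrupted sweep safe to repeat.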

After a successful sweep, you get a summary report:

{
  "xorbs_deleted": 47,
  "shards_deleted": 3,
  "bytes_reclaimed": 3124019200,
  "dry_run": false
}

That's about 3.1 GB reclaimed in this run, without a server, without coordination overhead, and without risk to anyone's in-flight work.
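If you want to turn the byte count into a human-readable figure in a script, a conversion like the following works. It only parses the example report shown above. How you capture that JSON from the command is left out, since that depends on your setup.

```python
import json

# The example report from this article, pasted in as a string.
raw = """
{
  "xorbs_deleted": 47,
  "shards_deleted": 3,
  "bytes_reclaimed": 3124019200,
  "dry_run": false
}
"""

report = json.loads(raw)
gb = report["bytes_reclaimed"] / 10**9    # decimal gigabytes
gib = report["bytes_reclaimed"] / 2**30   # binary gibibytes

print(f"{gb:.2f} GB ({gib:.2f} GiB) reclaimed")  # 3.12 GB (2.91 GiB) reclaimed
```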

What This Means for You

GC is the final piece of Crab's storage lifecycle. When you push, data flows in: chunked, deduplicated, packed into xorbs. When data becomes obsolete (deleted branches, overwritten files, squashed history), GC removes it again. Together, these two flows keep your storage costs proportional to what you actually use.

The key insight is that time solves most of the coordination problem. Instead of synchronizing GC with every normal operation, Crab waits past the grace window before deleting unreferenced objects. If an operation has not published its metadata by then, conservative checks still make the worst case wasted storage rather than guessed deletion.

You get automatic cost control with zero operational overhead. No servers to maintain, no databases to tune, no coordination services to monitor. Just a CLI command that safely reclaims storage whenever you're ready — exactly like emptying a recycle bin you trust.
