Tracking & Hydration

The Large Files view (labeled "Large Files" in the activity bar) is the control center for Crab's large-file management. It handles tracking, hydration, dehydration, and storage inspection for files managed by the Crab chunking engine.

Accessing the Large Files View

Click the Package icon in the activity bar
This view is specific to Crab — it manages files that are too large for standard git (models, datasets, binaries, media)

Concepts

Pointer Files

When a large file is tracked by Crab, git stores a small pointer blob instead of the full content. The pointer contains:

Content hash (BLAKE3)
Original file size
Chunk references

Hydration States

Hydrated — full file content is available locally
Dehydrated — only the pointer blob exists; content is in cloud storage
Partial — some chunks are cached locally
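The three states can be modeled as a small enum. The sketch below is illustrative only: the `classify` helper and its inputs (`materialized`, `cached_chunks`) are hypothetical names, and the rule that any cached chunk without full content counts as Partial is an assumption based on the descriptions above.

```python
from enum import Enum


class HydrationState(Enum):
    HYDRATED = "hydrated"      # full file content is available locally
    DEHYDRATED = "dehydrated"  # only the pointer blob exists locally
    PARTIAL = "partial"        # some chunks are cached locally


def classify(materialized: bool, cached_chunks: int) -> HydrationState:
    """Classify a tracked file (hypothetical helper, not Crab's API).

    materialized: True when the working-tree file holds the full content.
    cached_chunks: number of the file's chunks present in the local cache.
    """
    if materialized:
        return HydrationState.HYDRATED
    if cached_chunks > 0:
        return HydrationState.PARTIAL
    return HydrationState.DEHYDRATED
```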

Tracking

Files must be explicitly tracked by Crab before they're managed as large files. Tracking tells the clean/smudge filter to handle them.

Layout

The view has four tabs: Files, Storage Map, Hydrate, and Datasets.

Files Tab

The Files tab shows all Crab-tracked files with:

View modes — List, Tree (hierarchical), or Card (grid with previews)
File type icons — distinct icons for archives, images, video, audio, data files, ML models
Size display — human-readable sizes
Hydration status — per-file indicator (Hydrated, Pointer, or Partial); Pointer corresponds to the Dehydrated state
Untracked section — large files not yet tracked by Crab, with a one-click Track button
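As a rough illustration of the size display and the untracked section, the sketch below formats byte counts and scans a directory for files at or above a size threshold that are not in a set of tracked paths. The function names, the `tracked` set, and the threshold value are assumptions for the example; this page does not state the threshold Crab uses or how it scans the working tree.

```python
import os


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary units, e.g. 4294967296 -> '4.0 GiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024


def find_large_untracked(root, tracked, threshold_bytes):
    """Return sorted (relative_path, size) pairs for untracked files
    whose size is at or above threshold_bytes (hypothetical helper)."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in (".git", ".crab")]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            size = os.path.getsize(full)
            if size >= threshold_bytes and rel not in tracked:
                found.append((rel, size))
    return sorted(found)
```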

Storage Map Tab

Visual representation of storage usage:

Xorb distribution across cloud storage
Deduplication savings
Storage cost estimates

Hydrate Tab

Bulk hydration controls:

Hydrate all — download all pointer files
Hydrate by pattern — glob-based selection (e.g., *.safetensors)
Dehydrate all — replace all hydrated files with pointers
Selective hydrate — checkbox selection for individual files

Progress is shown inline with per-file download status.
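Pattern-based selection can be pictured with Python's standard `fnmatch` module. This is a sketch under assumptions: Crab's actual glob semantics are not documented here, and the rule used below (a pattern without a slash matches the file name, a pattern with a slash matches the whole path) is a common convention chosen for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_by_pattern(paths, pattern):
    """Return the paths matching a glob pattern, preserving input order.

    Assumed convention: patterns without '/' are matched against the
    file name only; patterns with '/' are matched against the full path.
    """
    selected = []
    for path in paths:
        target = path if "/" in pattern else PurePosixPath(path).name
        if fnmatchcase(target, pattern):
            selected.append(path)
    return selected


tracked = ["models/a.safetensors", "data/b.parquet", "c.safetensors"]
to_hydrate = select_by_pattern(tracked, "*.safetensors")
```

Here `to_hydrate` contains the two `.safetensors` paths and leaves the Parquet file as a pointer.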

Datasets Tab

For repositories containing structured datasets:

Parquet file previews (schema + sample rows)
Dataset statistics
Column-level metadata

Operations

Tracking a File

The "Untracked Large Files" section shows files above the size threshold that aren't tracked
Click Track next to a file (or select multiple + batch track)
Crab runs crab add, which:
- Chunks the file using content-defined chunking (gearhash)
- Deduplicates chunks against the session/shard/DB index
- Stages chunks in .crab/staging/
- Writes a pointer blob to the git index

Hydrating Files

Select files to hydrate (or use "Hydrate All")
Click Hydrate — the agent runs crab hydrate
Progress shows per-file download status
Each pointer blob in the working tree is replaced with the full file content, so the file is readable by any tool

Dehydrating Files

Select hydrated files (or use "Dehydrate All")
Click Dehydrate — replaces content with pointer blobs
Frees local disk space while keeping files in cloud storage
Dehydrate skips dirty files (uncommitted changes)
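The skip rule for dirty files can be sketched as a planning step that runs before any content is replaced. The `TrackedFile` record and `plan_dehydrate` function are hypothetical names used for illustration; how Crab detects uncommitted changes is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class TrackedFile:
    path: str
    hydrated: bool
    dirty: bool  # True when the file has uncommitted changes


def plan_dehydrate(files):
    """Split hydrated files into those safe to dehydrate and those skipped.

    Files that are already pointers are ignored; dirty files are skipped
    so that uncommitted changes are never replaced by a pointer blob.
    """
    to_dehydrate, skipped = [], []
    for f in files:
        if not f.hydrated:
            continue
        (skipped if f.dirty else to_dehydrate).append(f.path)
    return to_dehydrate, skipped
```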

Push Flow

After tracking new files:

The Push Flow panel appears, showing the staged chunks
It displays the upload size estimate and xorb count
Click Push to upload chunks to cloud storage
Progress shows upload status with retry on failure
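Retry on failure can be pictured as a bounded retry loop around each upload. The sketch below is an assumption-laden illustration: the `upload` callable, the attempt limit, and the exponential backoff delays are all hypothetical, since this page only states that failed uploads are retried.

```python
import time


def upload_with_retry(upload, xorb_id, max_attempts=3, base_delay=0.5,
                      sleep=time.sleep):
    """Call upload(xorb_id), retrying on OSError with exponential backoff.

    Returns the attempt number that succeeded; re-raises the last error
    once max_attempts is exhausted. `sleep` is injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upload(xorb_id)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Wait 0.5 s, 1 s, 2 s, ... between attempts (assumed policy).
            sleep(base_delay * 2 ** (attempt - 1))
```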

Staging Stats

The view shows staging area statistics:

Number of staged chunks
Total staged size
Segment file count
Option to clean abandoned staging data

Context Menu

Right-click a file for:

Hydrate / Dehydrate
Preview (opens format-specific preview)
Copy pointer hash
Show in Explorer
Remove tracking

Integration

The Explorer view shows hydration decorations on pointer files
The Changes view shows Crab-staged files in a separate section
The Dashboard shows overall hydration progress
Push operations coordinate between git push and Crab push

Understanding the Data Model

Content-Defined Chunking

When you track a file with Crab, it is split into variable-size chunks using a gearhash algorithm. This means:

Small edits to a large file only produce new chunks for the changed regions
Identical content across files is stored only once (deduplication)
Chunks are compressed and stored as "xorbs" in cloud storage
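To make the idea concrete, here is a minimal gear-hash chunker. It is a sketch of the general technique, not Crab's implementation: the gear table, the minimum, average, and maximum chunk sizes, and the boundary mask are all illustrative assumptions.

```python
import random

# Hypothetical gear table: 256 pseudo-random 64-bit values from a fixed seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk(data: bytes, min_size=2048, avg_bits=13, max_size=65536):
    """Split data at content-defined boundaries and return the chunks.

    A boundary is declared when the top avg_bits bits of the rolling
    gear hash are zero (after min_size bytes), or at max_size bytes.
    """
    boundary_mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear hash update: shift left by one bit and add the table entry,
        # so only the most recent 64 bytes influence the hash value.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & boundary_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend on nearby content rather than on byte offsets, inserting a few bytes near the start of a file typically changes only the first chunk; later boundaries realign and the remaining chunks are byte-identical to the originals.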

Three-Tier Deduplication

Crab deduplicates at three levels:

Session — within the current add operation
Shard — against recently pushed data
DB Index — against the entire repository history

This means re-adding a slightly modified 10 GB file might only upload a few megabytes of new chunks.
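The tiered lookup can be sketched as three sets consulted in order. Everything here is an assumption for illustration: the `DedupIndex` class is hypothetical, plain in-memory sets stand in for the real shard and database indexes, and `hashlib.blake2b` is used as a stand-in because BLAKE3 is not in the Python standard library.

```python
import hashlib


def chunk_hash(chunk: bytes) -> str:
    # Stand-in digest; Crab's pointers use BLAKE3 according to this page.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()


class DedupIndex:
    """Hypothetical three-tier index: session, then shard, then DB."""

    def __init__(self, shard_hashes=(), db_hashes=()):
        self.session = set()            # chunks seen in the current add
        self.shard = set(shard_hashes)  # recently pushed data
        self.db = set(db_hashes)        # entire repository history

    def lookup(self, digest):
        """Return the first tier that already holds the chunk, or None."""
        tiers = (("session", self.session), ("shard", self.shard), ("db", self.db))
        for name, tier in tiers:
            if digest in tier:
                return name
        return None

    def add_chunks(self, chunks):
        """Return only the chunks that still need to be uploaded."""
        new = []
        for c in chunks:
            digest = chunk_hash(c)
            if self.lookup(digest) is None:
                self.session.add(digest)
                new.append(c)
        return new
```

Checking the cheapest tier first means a chunk repeated within one add operation never triggers a lookup against the larger indexes.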

Pointer Blob Format

A pointer blob stored in git looks like:

crab/v1
hash: blake3:abc123...
size: 4294967296
chunks: 1024

This tiny blob (< 200 bytes) replaces the multi-gigabyte file in git, keeping the repository fast to clone and browse.
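A pointer in the format shown above can be parsed with a few lines of code. This sketch assumes the blob is exactly a version line followed by `key: value` lines with the three fields from the example; the `Pointer` class and `parse_pointer` function are hypothetical names, and real pointer blobs may carry additional fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    version: str    # e.g. "v1" from the "crab/v1" header line
    algorithm: str  # e.g. "blake3"
    digest: str
    size: int       # original file size in bytes
    chunks: int


def parse_pointer(text: str) -> Pointer:
    """Parse a pointer blob; raise ValueError if it is not well-formed."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("crab/"):
        raise ValueError("not a Crab pointer blob")
    fields = {}
    for ln in lines[1:]:
        key, sep, value = ln.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {ln!r}")
        fields[key.strip()] = value.strip()
    try:
        algorithm, _, digest = fields["hash"].partition(":")
        return Pointer(lines[0].split("/", 1)[1], algorithm, digest,
                       int(fields["size"]), int(fields["chunks"]))
    except KeyError as exc:
        raise ValueError(f"missing field: {exc}") from None


example = "crab/v1\nhash: blake3:abc123...\nsize: 4294967296\nchunks: 1024\n"
pointer = parse_pointer(example)
```

For the example blob, `pointer.size` is 4294967296 bytes (4 GiB) and `pointer.chunks` is 1024.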

Tips and Best Practices

Track early — track large files before your first commit to avoid bloating the git history
Hydrate selectively — only hydrate files you need to work with; keep the rest as pointers to save disk space
Dehydrate before pulling — dehydrate all files before git pull to avoid merge conflicts on hydrated content
Use patterns — hydrate by glob pattern (e.g., *.safetensors) to get just the files relevant to your current task
Check staging stats — monitor the staging area size before pushing to estimate upload time
