Phase 4: Storage Deduplication
2026-02-15
When multiple tesseras share the same photo, the same audio clip, or the same fragment data, the old storage layer kept separate copies of each. On a node storing thousands of tesseras for the network, this duplication adds up fast. Phase 4 continues with storage deduplication: a content-addressable store (CAS) that ensures every unique piece of data is stored exactly once on disk, regardless of how many tesseras reference it.
The design is simple and proven: hash the content with BLAKE3, use the hash as the filename, and maintain a reference count in SQLite. When two tesseras include the same 5 MB photo, one file exists on disk with a refcount of 2. When one tessera is deleted, the refcount drops to 1 and the file stays. When the last reference is released, the file is deleted; a periodic sweep cleans up anything the normal path misses.
What was built
CAS schema migration (`tesseras-storage/migrations/004_dedup.sql`) — Three new tables:

- `cas_objects` — tracks every object in the store: BLAKE3 hash (primary key), byte size, reference count, and creation timestamp
- `blob_refs` — maps logical blob identifiers (tessera hash + memory hash + filename) to CAS hashes, replacing the old filesystem path convention
- `fragment_refs` — maps logical fragment identifiers (tessera hash + fragment index) to CAS hashes, replacing the old `fragments/` directory layout

Indexes on the hash columns keep lookups fast during reads and reference counting.
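A rough sketch of the shape of that migration, assuming rusqlite and hypothetical column names (the real `004_dedup.sql` may differ in detail):

```rust
use rusqlite::{Connection, Result};

// Hypothetical DDL approximating 004_dedup.sql; actual column names may differ.
const MIGRATION_004: &str = r#"
CREATE TABLE IF NOT EXISTS cas_objects (
    hash        TEXT PRIMARY KEY,            -- BLAKE3 hash, hex-encoded
    size_bytes  INTEGER NOT NULL,            -- byte size of the object
    ref_count   INTEGER NOT NULL DEFAULT 0,  -- outstanding references
    created_at  INTEGER NOT NULL             -- creation timestamp (unix seconds)
);

CREATE TABLE IF NOT EXISTS blob_refs (
    tessera_hash TEXT NOT NULL,
    memory_hash  TEXT NOT NULL,
    filename     TEXT NOT NULL,
    cas_hash     TEXT NOT NULL REFERENCES cas_objects(hash),
    PRIMARY KEY (tessera_hash, memory_hash, filename)
);

CREATE TABLE IF NOT EXISTS fragment_refs (
    tessera_hash   TEXT NOT NULL,
    fragment_index INTEGER NOT NULL,
    cas_hash       TEXT NOT NULL REFERENCES cas_objects(hash),
    PRIMARY KEY (tessera_hash, fragment_index)
);

-- Hash-column indexes for fast lookups during reads and reference counting.
CREATE INDEX IF NOT EXISTS idx_blob_refs_cas ON blob_refs(cas_hash);
CREATE INDEX IF NOT EXISTS idx_fragment_refs_cas ON fragment_refs(cas_hash);
"#;

fn apply_dedup_migration(conn: &Connection) -> Result<()> {
    conn.execute_batch(MIGRATION_004)
}
```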
CasStore (`tesseras-storage/src/cas.rs`) — The core content-addressable storage engine. Files are stored under a two-level prefix directory: `<root>/<2-char-hex-prefix>/<full-hash>.blob`. The store provides five operations:

- `put(hash, data)` — writes data to disk if not already present and increments the refcount. Returns whether a dedup hit occurred.
- `get(hash)` — reads data from disk by hash
- `release(hash)` — decrements the refcount. If it reaches zero, the on-disk file is deleted immediately.
- `contains(hash)` — checks existence without reading
- `ref_count(hash)` — returns the current reference count

All operations are atomic within a single SQLite transaction. The refcount is the source of truth — if the refcount says the object exists, the file must be on disk.
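A minimal sketch of how `put` can work under those rules, using rusqlite and anyhow; `object_path`, the function signature, and the column names are hypothetical stand-ins for the real `CasStore` internals:

```rust
use rusqlite::{params, Connection, OptionalExtension};
use std::{fs, path::{Path, PathBuf}};

/// Two-level prefix path: <root>/<2-char-hex-prefix>/<full-hash>.blob
fn object_path(root: &Path, hash_hex: &str) -> PathBuf {
    root.join(&hash_hex[..2]).join(format!("{hash_hex}.blob"))
}

/// Writes `data` if absent and bumps the refcount; returns true on a dedup hit.
fn put(conn: &mut Connection, root: &Path, hash_hex: &str, data: &[u8]) -> anyhow::Result<bool> {
    let tx = conn.transaction()?;
    let existing: Option<i64> = tx
        .query_row(
            "SELECT ref_count FROM cas_objects WHERE hash = ?1",
            params![hash_hex],
            |row| row.get(0),
        )
        .optional()?;

    let dedup_hit = if existing.is_some() {
        // Dedup hit: the bytes are already on disk, only the refcount moves.
        tx.execute(
            "UPDATE cas_objects SET ref_count = ref_count + 1 WHERE hash = ?1",
            params![hash_hex],
        )?;
        true
    } else {
        // New object: write the file, then record it. A crash between these
        // two steps leaves an orphan file for the sweep GC to reclaim.
        let path = object_path(root, hash_hex);
        fs::create_dir_all(path.parent().expect("prefix dir"))?;
        fs::write(&path, data)?;
        tx.execute(
            "INSERT INTO cas_objects (hash, size_bytes, ref_count, created_at)
             VALUES (?1, ?2, 1, strftime('%s','now'))",
            params![hash_hex, data.len() as i64],
        )?;
        false
    };
    tx.commit()?;
    Ok(dedup_hit)
}
```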
CAS-backed FsBlobStore (`tesseras-storage/src/blob.rs`) — Rewritten to delegate all storage to the CAS. When a blob is written, its BLAKE3 hash is computed and passed to `cas.put()`. A row in `blob_refs` maps the logical path (tessera + memory + filename) to the CAS hash. Reads look up the CAS hash via `blob_refs` and fetch from `cas.get()`. Deleting a tessera releases all its blob references in a single transaction.
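A sketch of that write path, reusing the `put` sketch above; the function name and signature are hypothetical, not the real `FsBlobStore` API:

```rust
/// Hypothetical write path: hash, dedup into the CAS, record the mapping.
fn put_blob(
    conn: &mut rusqlite::Connection,
    cas_root: &std::path::Path,
    tessera: &str,
    memory: &str,
    filename: &str,
    data: &[u8],
) -> anyhow::Result<()> {
    // The BLAKE3 content hash doubles as the CAS address.
    let hash_hex = blake3::hash(data).to_hex().to_string();

    // Store the bytes (or just bump the refcount on a dedup hit).
    put(conn, cas_root, &hash_hex, data)?;

    // Map the logical path (tessera + memory + filename) to the CAS hash.
    conn.execute(
        "INSERT INTO blob_refs (tessera_hash, memory_hash, filename, cas_hash)
         VALUES (?1, ?2, ?3, ?4)",
        rusqlite::params![tessera, memory, filename, hash_hex],
    )?;
    Ok(())
}
```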
CAS-backed FsFragmentStore (`tesseras-storage/src/fragment.rs`) — Same pattern for erasure-coded fragments. Each fragment's BLAKE3 checksum is already computed during Reed-Solomon encoding, so it is used directly as the CAS key. Fragment verification now checks the CAS hash instead of recomputing from scratch — if the CAS says the data is intact, it is.
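In sketch form, verification then reduces to a CAS lookup (a hypothetical query against the `cas_objects` table above):

```rust
/// Returns true if the fragment's checksum is a live object in the CAS.
fn verify_fragment(conn: &rusqlite::Connection, checksum_hex: &str) -> rusqlite::Result<bool> {
    conn.query_row(
        "SELECT EXISTS(SELECT 1 FROM cas_objects WHERE hash = ?1 AND ref_count > 0)",
        rusqlite::params![checksum_hex],
        |row| row.get(0),
    )
}
```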
Sweep garbage collector (`cas.rs:sweep()`) — A periodic GC pass that covers the edge cases the normal refcount path can't:

- Orphan files — files on disk with no corresponding row in `cas_objects`. Can happen after a crash mid-write. Files younger than 1 hour are skipped (grace period for in-flight writes); older orphans are deleted.
- Leaked refcounts — rows in `cas_objects` with refcount zero that weren't cleaned up (e.g., if the process died between decrementing and deleting). These rows are removed.
- Idempotent — running sweep twice produces the same result.
The sweep is wired into the existing repair loop in `tesseras-replication`, so it runs automatically every 24 hours alongside fragment health checks.
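A condensed sketch of what such a sweep pass can look like, assuming the layout and tables from the sketches above (the real `sweep()` also records the metrics listed later in this post):

```rust
use std::{fs, path::Path, time::{Duration, SystemTime}};

/// Grace period: orphans younger than this may be in-flight writes.
const GRACE: Duration = Duration::from_secs(60 * 60);

fn sweep(conn: &rusqlite::Connection, root: &Path) -> anyhow::Result<()> {
    // Leaked refcounts: rows stuck at zero are removed.
    conn.execute("DELETE FROM cas_objects WHERE ref_count = 0", [])?;

    // Orphan files: on disk but unknown to cas_objects.
    for prefix in fs::read_dir(root)? {
        for entry in fs::read_dir(prefix?.path())? {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().into_owned();
            let hash_hex = name.trim_end_matches(".blob");
            let known: bool = conn.query_row(
                "SELECT EXISTS(SELECT 1 FROM cas_objects WHERE hash = ?1)",
                rusqlite::params![hash_hex],
                |row| row.get(0),
            )?;
            if known {
                continue;
            }
            let age = SystemTime::now()
                .duration_since(entry.metadata()?.modified()?)
                .unwrap_or_default(); // clock skew counts as "young"
            if age >= GRACE {
                fs::remove_file(entry.path())?; // old orphan: delete
            }
        }
    }
    // Both passes are idempotent: a second run finds nothing left to clean.
    Ok(())
}
```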
Migration from old layout (`tesseras-storage/src/migration.rs`) — A copy-first migration strategy that moves data from the old directory-based layout (`blobs/<tessera>/<memory>/<file>` and `fragments/<tessera>/<index>.shard`) into the CAS. The migration:

- Checks the storage version in `storage_meta` (version 1 = old layout, version 2 = CAS)
- Walks the old `blobs/` and `fragments/` directories
- Computes BLAKE3 hashes and inserts into the CAS via `put()` — duplicates are automatically deduplicated
- Creates corresponding `blob_refs`/`fragment_refs` entries
- Removes the old directories only after all data is safely in the CAS
- Updates the storage version to 2

The migration runs on daemon startup, is idempotent (safe to re-run), and reports statistics: files migrated, duplicates found, bytes saved.
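A sketch of the overall flow, assuming the walkdir crate and a hypothetical key/value `storage_meta` table; the real `migration.rs` also has to reconstruct each logical path for the `blob_refs`/`fragment_refs` rows:

```rust
fn migrate_if_needed(
    conn: &mut rusqlite::Connection,
    data_dir: &std::path::Path,
) -> anyhow::Result<()> {
    // Step 1: check the storage version; version 2 means CAS, nothing to do.
    let version: i64 = conn.query_row(
        "SELECT value FROM storage_meta WHERE key = 'version'",
        [],
        |row| row.get(0),
    )?;
    if version >= 2 {
        return Ok(()); // idempotent: safe to call on every startup
    }

    // Steps 2-4: walk the old blobs/ tree, deduplicating into the CAS.
    let blobs_dir = data_dir.join("blobs");
    for entry in walkdir::WalkDir::new(&blobs_dir) {
        let entry = entry?;
        if !entry.file_type().is_file() {
            continue;
        }
        let data = std::fs::read(entry.path())?;
        let hash_hex = blake3::hash(&data).to_hex().to_string();
        // put() dedups automatically; a duplicate costs one refcount bump.
        put(conn, &data_dir.join("cas"), &hash_hex, &data)?;
        // (Insert the matching blob_refs row here, parsed from the old path.)
    }
    // (Same walk for fragments/.)

    // Step 5: remove the old tree only once everything is in the CAS.
    std::fs::remove_dir_all(&blobs_dir)?;

    // Step 6: bump the storage version.
    conn.execute("UPDATE storage_meta SET value = 2 WHERE key = 'version'", [])?;
    Ok(())
}
```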
Prometheus metrics (`tesseras-storage/src/metrics.rs`) — Nine new metrics for observability:

| Metric | Description |
|---|---|
| `cas_objects_total` | Total unique objects in the CAS |
| `cas_bytes_total` | Total bytes stored |
| `cas_dedup_hits_total` | Number of writes that found an existing object |
| `cas_bytes_saved_total` | Bytes saved by deduplication |
| `cas_gc_refcount_deletions_total` | Objects deleted when the refcount reached zero |
| `cas_gc_sweep_orphans_cleaned_total` | Orphan files removed by sweep |
| `cas_gc_sweep_leaked_refs_cleaned_total` | Leaked refcount rows cleaned |
| `cas_gc_sweep_skipped_young_total` | Young orphans skipped (grace period) |
| `cas_gc_sweep_duration_seconds` | Time spent in sweep GC |
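For illustration, counters like these are typically registered once and bumped from the hot paths; a sketch using the prometheus and once_cell crates, not the project's actual `metrics.rs`:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

static CAS_DEDUP_HITS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "cas_dedup_hits_total",
        "Number of writes that found an existing object"
    )
    .unwrap()
});

static CAS_SWEEP_DURATION: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!("cas_gc_sweep_duration_seconds", "Time spent in sweep GC").unwrap()
});

fn record_put(dedup_hit: bool) {
    if dedup_hit {
        CAS_DEDUP_HITS.inc(); // bumped whenever put() finds an existing object
    }
}

fn timed_sweep(run_sweep: impl FnOnce()) {
    // The timer observes the elapsed seconds into the histogram on drop.
    let _timer = CAS_SWEEP_DURATION.start_timer();
    run_sweep();
}
```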
Property-based tests — Two proptest tests verify CAS invariants under random inputs:

- `refcount_matches_actual_refs` — after N random put/release operations, the refcount always matches the actual number of outstanding references
- `cas_path_is_deterministic` — the same hash always produces the same filesystem path
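A sketch of the second property's shape, using proptest and the hypothetical `object_path` helper from the CasStore sketch above; the project's real tests exercise the actual store:

```rust
use proptest::prelude::*;

proptest! {
    // Same content must always map to the same on-disk location.
    #[test]
    fn cas_path_is_deterministic(data in proptest::collection::vec(any::<u8>(), 0..1024)) {
        let root = std::path::Path::new("/var/lib/tesseras/cas"); // hypothetical root
        let a = blake3::hash(&data).to_hex().to_string();
        let b = blake3::hash(&data).to_hex().to_string();
        prop_assert_eq!(object_path(root, &a), object_path(root, &b));
    }
}
```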
Integration test updates — All integration tests across `tesseras-core`, `tesseras-replication`, `tesseras-embedded`, and `tesseras-cli` updated for the new CAS-backed constructors. Tamper-detection tests updated to work with the CAS directory layout.
347 tests pass across the workspace. Clippy clean with `-D warnings`.
Architecture decisions
- BLAKE3 as CAS key: the content hash we already compute for integrity verification doubles as the deduplication key. No additional hashing step — the hash computed during `create` or `replicate` is reused as the CAS address.
- SQLite refcount over filesystem reflinks: we considered using filesystem-level copy-on-write (reflinks on btrfs/XFS), but that would tie Tesseras to specific filesystems. SQLite refcounting works on any filesystem, including FAT32 on cheap USB drives and ext4 on Raspberry Pis.
- Two-level hex prefix directories: storing all CAS objects in a flat directory would slow down filesystems with millions of entries. The `<2-char prefix>/` split limits any single directory to ~65k entries before a second prefix level is needed. This matches the approach used by Git's object store.
- Grace period for orphan files: the sweep GC skips files younger than 1 hour to avoid deleting objects that are being written by a concurrent operation. This is a pragmatic choice — it trades a small window of potential orphans for crash safety without requiring fsync or two-phase commit.
- Copy-first migration: the migration copies data to the CAS before removing old directories. If the process is interrupted, the old data is still intact and migration can be re-run. This is slower than moving files but guarantees no data loss.
- Sweep in repair loop: rather than adding a separate GC timer, the CAS sweep piggybacks on the existing 24-hour repair loop. This keeps the daemon simple — one background maintenance cycle handles both fragment health and storage cleanup.
What comes next
- Phase 4 continued — security audits, OS packaging (Alpine, Arch, Debian, OpenBSD, FreeBSD)
- Phase 5: Exploration and Culture — public tessera browser by era/location/theme/language, institutional curation, genealogy integration (FamilySearch, Ancestry), physical media export (M-DISC, microfilm, acid-free paper with QR), AI-assisted context
Storage deduplication completes the storage efficiency story for Tesseras. A node that stores fragments for thousands of users — common for institutional nodes and always-on full nodes — now pays the disk cost of unique data only. Combined with Reed-Solomon erasure coding (which already minimizes redundancy at the network level), the system achieves efficient storage at both the local and distributed layers.