author  murilo ijanc  2026-03-24 21:41:06 -0300
committer  murilo ijanc  2026-03-24 21:41:06 -0300
commit  f186b71ca51e83837db60de13322394bb5e6d348 (patch)
tree  cd7940eaa16b83d2cde7b18123411bfb161f7ebb /news/phase4-storage-deduplication/index.html
download  website-f186b71ca51e83837db60de13322394bb5e6d348.tar.gz
Initial commit
Import existing tesseras.net website content.
Diffstat (limited to 'news/phase4-storage-deduplication/index.html')
-rw-r--r--  news/phase4-storage-deduplication/index.html  217
1 file changed, 217 insertions, 0 deletions
diff --git a/news/phase4-storage-deduplication/index.html b/news/phase4-storage-deduplication/index.html
new file mode 100644
index 0000000..d499b4a
--- /dev/null
+++ b/news/phase4-storage-deduplication/index.html
@@ -0,0 +1,217 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <title>Phase 4: Storage Deduplication — Tesseras</title>
+ <meta name="description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
+ <!-- Open Graph -->
+ <meta property="og:type" content="article">
+ <meta property="og:title" content="Phase 4: Storage Deduplication">
+ <meta property="og:description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
+ <meta property="og:image" content="https://tesseras.net/images/social.jpg">
+ <meta property="og:image:width" content="1200">
+ <meta property="og:image:height" content="630">
+ <meta property="og:site_name" content="Tesseras">
+ <!-- Twitter Card -->
+ <meta name="twitter:card" content="summary_large_image">
+ <meta name="twitter:title" content="Phase 4: Storage Deduplication">
+ <meta name="twitter:description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
+ <meta name="twitter:image" content="https://tesseras.net/images/social.jpg">
+ <link rel="stylesheet" href="https://tesseras.net/style.css?h=21f0f32121928ee5c690">
+
+
+ <link rel="alternate" type="application/atom+xml" title="Tesseras" href="https://tesseras.net/atom.xml">
+
+
+ <link rel="icon" type="image/png" sizes="32x32" href="https://tesseras.net/images/favicon.png?h=be4e123a23393b1a027d">
+
+</head>
+<body>
+ <header>
+ <h1>
+ <a href="https:&#x2F;&#x2F;tesseras.net/">
+ <img src="https://tesseras.net/images/logo-64.png?h=c1b8d0c4c5f93b49d40b" alt="Tesseras" width="40" height="40" class="logo">
+ Tesseras
+ </a>
+ </h1>
+ <nav>
+
+ <a href="https://tesseras.net/about/">About</a>
+ <a href="https://tesseras.net/news/">News</a>
+ <a href="https://tesseras.net/releases/">Releases</a>
+ <a href="https://tesseras.net/faq/">FAQ</a>
+ <a href="https://tesseras.net/subscriptions/">Subscriptions</a>
+ <a href="https://tesseras.net/contact/">Contact</a>
+
+ </nav>
+ <nav class="lang-switch">
+
+ <strong>English</strong> | <a href="/pt-br/news/phase4-storage-deduplication/">Português</a>
+
+ </nav>
+ </header>
+
+ <main>
+
+<article>
+ <h2>Phase 4: Storage Deduplication</h2>
+ <p class="news-date">2026-02-15</p>
+ <p>When multiple tesseras share the same photo, the same audio clip, or the same
+fragment data, the old storage layer kept separate copies of each. On a node
+storing thousands of tesseras for the network, this duplication adds up fast.
+Phase 4 continues with storage deduplication: a content-addressable store (CAS)
+that ensures every unique piece of data is stored exactly once on disk,
+regardless of how many tesseras reference it.</p>
+<p>The design is simple and proven: hash the content with BLAKE3, use the hash as
+the filename, and maintain a reference count in SQLite. When two tesseras
+include the same 5 MB photo, one file exists on disk with a refcount of 2. When
+one tessera is deleted, the refcount drops to 1 and the file stays. When the
+last reference is released, a periodic sweep cleans up the orphan.</p>
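<p>The put/release discipline can be sketched in a few lines of Rust. This is an
illustrative in-memory model, not the real <code>CasStore</code>: <code>DefaultHasher</code> stands in
for BLAKE3, and a <code>HashMap</code> stands in for the SQLite-backed refcount table.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for the BLAKE3 digest the real store uses.
fn content_hash(data: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// In-memory model of the CAS: hash -> (data, refcount).
#[derive(Default)]
struct Cas {
    objects: HashMap<String, (Vec<u8>, u64)>,
}

impl Cas {
    /// Store `data` once; a repeated put only bumps the refcount.
    /// Returns the hash and whether this was a dedup hit.
    fn put(&mut self, data: &[u8]) -> (String, bool) {
        let hash = content_hash(data);
        if let Some((_, rc)) = self.objects.get_mut(&hash) {
            *rc += 1;
            (hash, true)
        } else {
            self.objects.insert(hash.clone(), (data.to_vec(), 1));
            (hash, false)
        }
    }

    /// Drop one reference; the object disappears when the count hits zero.
    fn release(&mut self, hash: &str) {
        let depleted = match self.objects.get_mut(hash) {
            Some((_, rc)) => {
                *rc -= 1;
                *rc == 0
            }
            None => false,
        };
        if depleted {
            self.objects.remove(hash);
        }
    }

    fn ref_count(&self, hash: &str) -> u64 {
        self.objects.get(hash).map_or(0, |(_, rc)| *rc)
    }
}
```

<p>Two puts of the same bytes yield one stored object with a refcount of 2; two
releases later, the object is gone.</p>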
+<h2 id="what-was-built">What was built</h2>
+<p><strong>CAS schema migration</strong> (<code>tesseras-storage/migrations/004_dedup.sql</code>) — Three
+new tables:</p>
+<ul>
+<li><code>cas_objects</code> — tracks every object in the store: BLAKE3 hash (primary key),
+byte size, reference count, and creation timestamp</li>
+<li><code>blob_refs</code> — maps logical blob identifiers (tessera hash + memory hash +
+filename) to CAS hashes, replacing the old filesystem path convention</li>
+<li><code>fragment_refs</code> — maps logical fragment identifiers (tessera hash + fragment
+index) to CAS hashes, replacing the old <code>fragments/</code> directory layout</li>
+</ul>
+<p>Indexes on the hash columns keep lookups fast during reads and reference
+counting.</p>
+<p><strong>CasStore</strong> (<code>tesseras-storage/src/cas.rs</code>) — The core content-addressable
+storage engine. Files are stored under a two-level prefix directory:
+<code>&lt;root&gt;/&lt;2-char-hex-prefix&gt;/&lt;full-hash&gt;.blob</code>. The store provides five
+operations:</p>
+<ul>
+<li><code>put(hash, data)</code> — writes data to disk if not already present, increments
+refcount. Returns whether a dedup hit occurred.</li>
+<li><code>get(hash)</code> — reads data from disk by hash</li>
+<li><code>release(hash)</code> — decrements refcount. If it reaches zero, the on-disk file is
+deleted immediately.</li>
+<li><code>contains(hash)</code> — checks existence without reading</li>
+<li><code>ref_count(hash)</code> — returns the current reference count</li>
+</ul>
+<p>All operations are atomic within a single SQLite transaction. The refcount is
+the source of truth — if the refcount says the object exists, the file must be
+on disk.</p>
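<p>Deriving an object's on-disk location is a pure function of its digest. A
sketch of the two-level layout described above (the function name is
hypothetical):</p>

```rust
use std::path::{Path, PathBuf};

/// Map a hex digest to its location under the two-level prefix layout:
/// <root>/<2-char-hex-prefix>/<full-hash>.blob
fn object_path(root: &Path, hex_hash: &str) -> PathBuf {
    root.join(&hex_hash[..2]).join(format!("{hex_hash}.blob"))
}
```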
+<p><strong>CAS-backed FsBlobStore</strong> (<code>tesseras-storage/src/blob.rs</code>) — Rewritten to
+delegate all storage to the CAS. When a blob is written, its BLAKE3 hash is
+computed and passed to <code>cas.put()</code>. A row in <code>blob_refs</code> maps the logical path
+(tessera + memory + filename) to the CAS hash. Reads look up the CAS hash via
+<code>blob_refs</code> and fetch from <code>cas.get()</code>. Deleting a tessera releases all its blob
+references in a single transaction.</p>
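<p>The <code>blob_refs</code> indirection can be modeled with a small in-memory map. Types
and method names here are hypothetical, not the actual <code>FsBlobStore</code> API; the
point is that two logical blobs with identical content resolve to one CAS hash.</p>

```rust
use std::collections::HashMap;

/// Logical blob identifier: tessera hash + memory hash + filename.
type BlobKey = (String, String, String);

/// Sketch of the blob_refs table: logical keys map to CAS hashes.
#[derive(Default)]
struct BlobRefs {
    refs: HashMap<BlobKey, String>,
}

impl BlobRefs {
    fn put(&mut self, key: BlobKey, cas_hash: String) {
        self.refs.insert(key, cas_hash);
    }

    fn resolve(&self, key: &BlobKey) -> Option<&String> {
        self.refs.get(key)
    }

    /// Deleting a tessera drops all of its logical references; each returned
    /// CAS hash would get a release() in the same transaction.
    fn delete_tessera(&mut self, tessera: &str) -> Vec<String> {
        let doomed: Vec<BlobKey> = self
            .refs
            .keys()
            .filter(|(t, _, _)| t.as_str() == tessera)
            .cloned()
            .collect();
        doomed.into_iter().filter_map(|k| self.refs.remove(&k)).collect()
    }
}
```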
+<p><strong>CAS-backed FsFragmentStore</strong> (<code>tesseras-storage/src/fragment.rs</code>) — Same
+pattern for erasure-coded fragments. Each fragment's BLAKE3 checksum is already
+computed during Reed-Solomon encoding, so it's used directly as the CAS key.
+Fragment verification now checks the CAS hash instead of recomputing from
+scratch — if the CAS says the data is intact, it is.</p>
+<p><strong>Sweep garbage collector</strong> (<code>cas.rs:sweep()</code>) — A periodic GC pass that handles
+two edge cases the normal refcount path can't:</p>
+<ol>
+<li><strong>Orphan files</strong> — files on disk with no corresponding row in <code>cas_objects</code>,
+which can appear after a crash mid-write. Files younger than 1 hour are skipped
+(a grace period for in-flight writes); older orphans are deleted.</li>
+<li><strong>Leaked refcounts</strong> — rows in <code>cas_objects</code> with a refcount of zero that were never
+cleaned up (e.g., the process died between decrementing and deleting).
+These rows are removed.</li>
+</ol>
+<p>The sweep is idempotent: running it twice produces the same result.</p>
+<p>The sweep is wired into the existing repair loop in <code>tesseras-replication</code>, so
+it runs automatically every 24 hours alongside fragment health checks.</p>
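<p>The sweep's decision logic, reduced to an in-memory sketch (types and names
are illustrative; the real pass works against the filesystem and SQLite):</p>

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

struct DiskFile {
    hash: String,
    created: SystemTime,
}

struct CasRow {
    hash: String,
    refcount: u64,
}

/// Decide what one sweep pass would clean: (orphan files to delete, leaked
/// zero-refcount rows to remove). Running it again on the cleaned state
/// yields empty lists, which is what makes the pass idempotent.
fn sweep(
    files: &[DiskFile],
    rows: &[CasRow],
    now: SystemTime,
    grace: Duration,
) -> (Vec<String>, Vec<String>) {
    let tracked: HashSet<&str> = rows.iter().map(|r| r.hash.as_str()).collect();
    // 1. Orphans: on disk, no cas_objects row, and older than the grace period.
    let orphans = files
        .iter()
        .filter(|f| !tracked.contains(f.hash.as_str()))
        .filter(|f| now.duration_since(f.created).unwrap_or_default() >= grace)
        .map(|f| f.hash.clone())
        .collect();
    // 2. Leaked refcounts: rows whose count already reached zero.
    let leaked = rows
        .iter()
        .filter(|r| r.refcount == 0)
        .map(|r| r.hash.clone())
        .collect();
    (orphans, leaked)
}
```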
+<p><strong>Migration from old layout</strong> (<code>tesseras-storage/src/migration.rs</code>) — A
+copy-first migration strategy that moves data from the old directory-based
+layout (<code>blobs/&lt;tessera&gt;/&lt;memory&gt;/&lt;file&gt;</code> and
+<code>fragments/&lt;tessera&gt;/&lt;index&gt;.shard</code>) into the CAS. The migration:</p>
+<ol>
+<li>Checks the storage version in <code>storage_meta</code> (version 1 = old layout, version
+2 = CAS)</li>
+<li>Walks the old <code>blobs/</code> and <code>fragments/</code> directories</li>
+<li>Computes BLAKE3 hashes and inserts into CAS via <code>put()</code> — duplicates are
+automatically deduplicated</li>
+<li>Creates corresponding <code>blob_refs</code> / <code>fragment_refs</code> entries</li>
+<li>Removes old directories only after all data is safely in CAS</li>
+<li>Updates the storage version to 2</li>
+</ol>
+<p>The migration runs on daemon startup, is idempotent (safe to re-run), and
+reports statistics: files migrated, duplicates found, bytes saved.</p>
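<p>The migration's dedup accounting can be modeled in miniature. The stats
fields mirror what the article reports, but the structure is hypothetical and
<code>hash_of</code> stands in for BLAKE3:</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3.
fn hash_of(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Illustrative migration statistics (field names are hypothetical).
#[derive(Debug, Default, PartialEq)]
struct MigrationStats {
    files_migrated: u64,
    duplicates_found: u64,
    bytes_saved: u64,
}

/// Copy every old (path, data) pair into the CAS map; a repeated hash is a
/// dedup hit, so only one copy of the bytes is kept.
fn migrate(
    old_files: &[(String, Vec<u8>)],
    cas: &mut HashMap<u64, (Vec<u8>, u64)>,
) -> MigrationStats {
    let mut stats = MigrationStats::default();
    for (_path, data) in old_files {
        stats.files_migrated += 1;
        let hash = hash_of(data);
        if let Some((_, rc)) = cas.get_mut(&hash) {
            *rc += 1;
            stats.duplicates_found += 1;
            stats.bytes_saved += data.len() as u64;
        } else {
            cas.insert(hash, (data.clone(), 1));
        }
    }
    stats
}
```

<p>Three old files containing two unique payloads end up as two CAS objects, with
the duplicate's size counted as bytes saved.</p>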
+<p><strong>Prometheus metrics</strong> (<code>tesseras-storage/src/metrics.rs</code>) — Nine new metrics for
+observability:</p>
+<table><thead><tr><th>Metric</th><th>Description</th></tr></thead><tbody>
+<tr><td><code>cas_objects_total</code></td><td>Total unique objects in the CAS</td></tr>
+<tr><td><code>cas_bytes_total</code></td><td>Total bytes stored</td></tr>
+<tr><td><code>cas_dedup_hits_total</code></td><td>Number of writes that found an existing object</td></tr>
+<tr><td><code>cas_bytes_saved_total</code></td><td>Bytes saved by deduplication</td></tr>
+<tr><td><code>cas_gc_refcount_deletions_total</code></td><td>Objects deleted when refcount reached zero</td></tr>
+<tr><td><code>cas_gc_sweep_orphans_cleaned_total</code></td><td>Orphan files removed by sweep</td></tr>
+<tr><td><code>cas_gc_sweep_leaked_refs_cleaned_total</code></td><td>Leaked refcount rows cleaned</td></tr>
+<tr><td><code>cas_gc_sweep_skipped_young_total</code></td><td>Young orphans skipped (grace period)</td></tr>
+<tr><td><code>cas_gc_sweep_duration_seconds</code></td><td>Time spent in sweep GC</td></tr>
+</tbody></table>
+<p><strong>Property-based tests</strong> — Two proptest tests verify CAS invariants under random
+inputs:</p>
+<ul>
+<li><code>refcount_matches_actual_refs</code> — after N random put/release operations, the
+refcount always matches the actual number of outstanding references</li>
+<li><code>cas_path_is_deterministic</code> — the same hash always produces the same
+filesystem path</li>
+</ul>
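<p>The first invariant can be demonstrated even without the proptest crate that
the real tests use; a self-contained sketch with a tiny deterministic PRNG (not
the project's actual test):</p>

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG so the run is reproducible without a crate.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Apply `n` random put/release operations, logging each one, then replay
/// the log and check that outstanding references match the live refcounts.
fn refcount_matches_actual_refs(n: u64, seed: u64) -> bool {
    let mut rng = seed;
    let mut refcounts: HashMap<u64, i64> = HashMap::new();
    let mut log: Vec<(u64, i64)> = Vec::new();
    for _ in 0..n {
        let key = lcg(&mut rng) % 8; // small key space to force collisions
        if lcg(&mut rng) % 2 == 0 {
            *refcounts.entry(key).or_insert(0) += 1; // put
            log.push((key, 1));
        } else if refcounts.contains_key(&key) {
            let depleted = {
                let rc = refcounts.get_mut(&key).unwrap();
                *rc -= 1; // release
                *rc == 0
            };
            log.push((key, -1));
            if depleted {
                refcounts.remove(&key);
            }
        }
    }
    // Outstanding refs per key, reconstructed independently from the log.
    let mut outstanding: HashMap<u64, i64> = HashMap::new();
    for (key, delta) in log {
        *outstanding.entry(key).or_insert(0) += delta;
    }
    outstanding.retain(|_, v| *v != 0);
    refcounts == outstanding
}
```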
+<p><strong>Integration test updates</strong> — All integration tests across <code>tesseras-core</code>,
+<code>tesseras-replication</code>, <code>tesseras-embedded</code>, and <code>tesseras-cli</code> updated for the
+new CAS-backed constructors. Tamper-detection tests updated to work with the CAS
+directory layout.</p>
+<p>347 tests pass across the workspace. Clippy clean with <code>-D warnings</code>.</p>
+<h2 id="architecture-decisions">Architecture decisions</h2>
+<ul>
+<li><strong>BLAKE3 as CAS key</strong>: the content hash we already compute for integrity
+verification doubles as the deduplication key. No additional hashing step —
+the hash computed during <code>create</code> or <code>replicate</code> is reused as the CAS address.</li>
+<li><strong>SQLite refcount over filesystem reflinks</strong>: we considered using
+filesystem-level copy-on-write (reflinks on btrfs/XFS), but that would tie
+Tesseras to specific filesystems. SQLite refcounting works on any filesystem,
+including FAT32 on cheap USB drives and ext4 on Raspberry Pis.</li>
+<li><strong>Two-level hex prefix directories</strong>: storing all CAS objects in a flat
+directory would slow down filesystems with millions of entries. The
+<code>&lt;2-char prefix&gt;/</code> split limits any single directory to ~65k entries before a
+second prefix level is needed. This matches the approach used by Git's object
+store.</li>
+<li><strong>Grace period for orphan files</strong>: the sweep GC skips files younger than 1
+hour to avoid deleting objects that are being written by a concurrent
+operation. This is a pragmatic choice — it trades a small window of potential
+orphans for crash safety without requiring fsync or two-phase commit.</li>
+<li><strong>Copy-first migration</strong>: the migration copies data to CAS before removing old
+directories. If the process is interrupted, the old data is still intact and
+migration can be re-run. This is slower than moving files but guarantees no
+data loss.</li>
+<li><strong>Sweep in repair loop</strong>: rather than adding a separate GC timer, the CAS
+sweep piggybacks on the existing 24-hour repair loop. This keeps the daemon
+simple — one background maintenance cycle handles both fragment health and
+storage cleanup.</li>
+</ul>
+<h2 id="what-comes-next">What comes next</h2>
+<ul>
+<li><strong>Phase 4 continued</strong> — security audits, OS packaging (Alpine, Arch, Debian,
+OpenBSD, FreeBSD)</li>
+<li><strong>Phase 5: Exploration and Culture</strong> — public tessera browser by
+era/location/theme/language, institutional curation, genealogy integration
+(FamilySearch, Ancestry), physical media export (M-DISC, microfilm, acid-free
+paper with QR), AI-assisted context</li>
+</ul>
+<p>Storage deduplication completes the storage efficiency story for Tesseras. A
+node that stores fragments for thousands of users — common for institutional
+nodes and always-on full nodes — now pays the disk cost of unique data only.
+Combined with Reed-Solomon erasure coding (which already minimizes redundancy at
+the network level), the system achieves efficient storage at both the local and
+distributed layers.</p>
+
+</article>
+
+ </main>
+
+ <footer>
+ <p>&copy; 2026 Tesseras Project. <a href="/atom.xml">News Feed</a> · <a href="https://git.sr.ht/~ijanc/tesseras">Source</a></p>
+ </footer>
+</body>
+</html>