<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Phase 4: Storage Deduplication — Tesseras</title>
    <meta name="description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
    <!-- Open Graph -->
    <meta property="og:type" content="article">
    <meta property="og:title" content="Phase 4: Storage Deduplication">
    <meta property="og:description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
    <meta property="og:image" content="https://tesseras.net/images/social.jpg">
    <meta property="og:image:width" content="1200">
    <meta property="og:image:height" content="630">
    <meta property="og:site_name" content="Tesseras">
    <!-- Twitter Card -->
    <meta name="twitter:card" content="summary_large_image">
    <meta name="twitter:title" content="Phase 4: Storage Deduplication">
    <meta name="twitter:description" content="A new content-addressable storage layer eliminates duplicate data across tesseras, reducing disk usage and enabling automatic garbage collection.">
    <meta name="twitter:image" content="https://tesseras.net/images/social.jpg">
    <link rel="stylesheet" href="https://tesseras.net/style.css?h=21f0f32121928ee5c690">
    
        
            <link rel="alternate" type="application/atom+xml" title="Tesseras" href="https://tesseras.net/atom.xml">
        
    
    <link rel="icon" type="image/png" sizes="32x32" href="https://tesseras.net/images/favicon.png?h=be4e123a23393b1a027d">
    
</head>
<body>
    <header>
        <h1>
            <a href="https:&#x2F;&#x2F;tesseras.net/">
                <img src="https://tesseras.net/images/logo-64.png?h=c1b8d0c4c5f93b49d40b" alt="Tesseras" width="40" height="40" class="logo">
                Tesseras
            </a>
        </h1>
        <nav>
            
                <a href="https://tesseras.net/about/">About</a>
                <a href="https://tesseras.net/news/">News</a>
                <a href="https://tesseras.net/releases/">Releases</a>
                <a href="https://tesseras.net/faq/">FAQ</a>
                <a href="https://tesseras.net/subscriptions/">Subscriptions</a>
                <a href="https://tesseras.net/contact/">Contact</a>
            
        </nav>
        <nav class="lang-switch">
            
                <strong>English</strong> | <a href="/pt-br&#x2F;news&#x2F;phase4-storage-deduplication&#x2F;">Português</a>
            
        </nav>
    </header>

    <main>
        
<article>
    <h2>Phase 4: Storage Deduplication</h2>
    <p class="news-date">2026-02-15</p>
    <p>When multiple tesseras share the same photo, the same audio clip, or the same
fragment data, the old storage layer kept separate copies of each. On a node
storing thousands of tesseras for the network, this duplication adds up fast.
Phase 4 continues with storage deduplication: a content-addressable store (CAS)
that ensures every unique piece of data is stored exactly once on disk,
regardless of how many tesseras reference it.</p>
<p>The design is simple and proven: hash the content with BLAKE3, use the hash as
the filename, and maintain a reference count in SQLite. When two tesseras
include the same 5 MB photo, one file exists on disk with a refcount of 2. When
one tessera is deleted, the refcount drops to 1 and the file stays. When the
last reference is released, a periodic sweep cleans up the orphan.</p>
<h2 id="what-was-built">What was built</h2>
<p><strong>CAS schema migration</strong> (<code>tesseras-storage/migrations/004_dedup.sql</code>) — Three
new tables:</p>
<ul>
<li><code>cas_objects</code> — tracks every object in the store: BLAKE3 hash (primary key),
byte size, reference count, and creation timestamp</li>
<li><code>blob_refs</code> — maps logical blob identifiers (tessera hash + memory hash +
filename) to CAS hashes, replacing the old filesystem path convention</li>
<li><code>fragment_refs</code> — maps logical fragment identifiers (tessera hash + fragment
index) to CAS hashes, replacing the old <code>fragments/</code> directory layout</li>
</ul>
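<p>As a rough illustration, the three tables might be declared as follows. This is
a hedged sketch: the actual DDL lives in <code>004_dedup.sql</code>, and the column names
here are guesses based on the fields listed above.</p>

```rust
// Hypothetical DDL matching the description above; the real
// 004_dedup.sql may use different names and types.
const DEDUP_SCHEMA: &str = r#"
CREATE TABLE cas_objects (
    hash       BLOB PRIMARY KEY,   -- BLAKE3 digest of the content
    size       INTEGER NOT NULL,   -- byte size
    ref_count  INTEGER NOT NULL,   -- live references from *_refs tables
    created_at INTEGER NOT NULL    -- unix timestamp
);
CREATE TABLE blob_refs (
    tessera_hash BLOB NOT NULL,
    memory_hash  BLOB NOT NULL,
    filename     TEXT NOT NULL,
    cas_hash     BLOB NOT NULL REFERENCES cas_objects(hash),
    PRIMARY KEY (tessera_hash, memory_hash, filename)
);
CREATE TABLE fragment_refs (
    tessera_hash   BLOB NOT NULL,
    fragment_index INTEGER NOT NULL,
    cas_hash       BLOB NOT NULL REFERENCES cas_objects(hash),
    PRIMARY KEY (tessera_hash, fragment_index)
);
CREATE INDEX blob_refs_cas_idx ON blob_refs(cas_hash);
CREATE INDEX fragment_refs_cas_idx ON fragment_refs(cas_hash);
"#;
```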
<p>Indexes on the hash columns keep lookups fast during reads and reference
counting.</p>
<p><strong>CasStore</strong> (<code>tesseras-storage/src/cas.rs</code>) — The core content-addressable
storage engine. Files are stored under a two-level prefix directory:
<code>&lt;root&gt;/&lt;2-char-hex-prefix&gt;/&lt;full-hash&gt;.blob</code>. The store provides five
operations:</p>
<ul>
<li><code>put(hash, data)</code> — writes data to disk if not already present, increments
refcount. Returns whether a dedup hit occurred.</li>
<li><code>get(hash)</code> — reads data from disk by hash</li>
<li><code>release(hash)</code> — decrements refcount. If it reaches zero, the on-disk file is
deleted immediately.</li>
<li><code>contains(hash)</code> — checks existence without reading</li>
<li><code>ref_count(hash)</code> — returns the current reference count</li>
</ul>
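<p>To make the semantics concrete, here is a minimal in-memory sketch of the five
operations, with a <code>HashMap</code> standing in for SQLite and the filesystem, and hex
strings standing in for BLAKE3 digests. The real <code>CasStore</code> persists blobs on
disk and wraps each operation in a transaction.</p>

```rust
use std::collections::HashMap;

/// In-memory sketch of the store; keys are hex strings standing in
/// for BLAKE3 digests, values are (data, refcount).
#[derive(Default)]
struct CasStore {
    objects: HashMap<String, (Vec<u8>, u64)>,
}

impl CasStore {
    /// Write data if absent, bump the refcount either way.
    /// Returns true when the object already existed (a dedup hit).
    fn put(&mut self, hash: &str, data: &[u8]) -> bool {
        match self.objects.get_mut(hash) {
            Some((_, rc)) => {
                *rc += 1;
                true
            }
            None => {
                self.objects.insert(hash.to_string(), (data.to_vec(), 1));
                false
            }
        }
    }

    fn get(&self, hash: &str) -> Option<&[u8]> {
        self.objects.get(hash).map(|(data, _)| data.as_slice())
    }

    /// Drop one reference; the object is deleted when the count hits zero.
    fn release(&mut self, hash: &str) {
        if let Some((_, rc)) = self.objects.get_mut(hash) {
            *rc -= 1;
            if *rc == 0 {
                self.objects.remove(hash);
            }
        }
    }

    fn contains(&self, hash: &str) -> bool {
        self.objects.contains_key(hash)
    }

    fn ref_count(&self, hash: &str) -> u64 {
        self.objects.get(hash).map(|(_, rc)| *rc).unwrap_or(0)
    }
}

/// Two-level prefix path: <root>/<2-char-hex-prefix>/<full-hash>.blob
fn cas_path(root: &str, hash: &str) -> String {
    format!("{root}/{}/{hash}.blob", &hash[..2])
}
```

<p>Running the photo example from the introduction against this sketch: putting the
same content twice yields one stored object with a refcount of 2, the first
release leaves the object in place, and the second removes it.</p>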
<p>All operations are atomic within a single SQLite transaction. The refcount is
the source of truth — if the refcount says the object exists, the file must be
on disk.</p>
<p><strong>CAS-backed FsBlobStore</strong> (<code>tesseras-storage/src/blob.rs</code>) — Rewritten to
delegate all storage to the CAS. When a blob is written, its BLAKE3 hash is
computed and passed to <code>cas.put()</code>. A row in <code>blob_refs</code> maps the logical path
(tessera + memory + filename) to the CAS hash. Reads look up the CAS hash via
<code>blob_refs</code> and fetch from <code>cas.get()</code>. Deleting a tessera releases all its blob
references in a single transaction.</p>
<p><strong>CAS-backed FsFragmentStore</strong> (<code>tesseras-storage/src/fragment.rs</code>) — Same
pattern for erasure-coded fragments. Each fragment's BLAKE3 checksum is already
computed during Reed-Solomon encoding, so it's used directly as the CAS key.
Fragment verification now checks the CAS hash instead of recomputing from
scratch — if the CAS says the data is intact, it is.</p>
<p><strong>Sweep garbage collector</strong> (<code>cas.rs:sweep()</code>) — A periodic GC pass that handles
the edge cases the normal refcount path can't:</p>
<ol>
<li><strong>Orphan files</strong> — files on disk with no corresponding row in <code>cas_objects</code>,
typically left behind by a crash mid-write. Files younger than 1 hour are
skipped (a grace period for in-flight writes); older orphans are deleted.</li>
<li><strong>Leaked refcounts</strong> — rows in <code>cas_objects</code> with a refcount of zero that were
never cleaned up (e.g., because the process died between decrementing and
deleting). These rows are removed.</li>
</ol>
<p>The sweep is also idempotent: running it twice produces the same result, so a
crashed or repeated pass is harmless.</p>
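<p>Sketching the two cleanup rules with in-memory maps in place of the filesystem
and SQLite (the names here are illustrative, not the actual <code>sweep()</code>
signature):</p>

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

const GRACE: Duration = Duration::from_secs(3600); // 1-hour grace period

/// One sweep pass over a modelled store: `files` maps hash -> mtime
/// for what is on disk, `rows` maps hash -> refcount for `cas_objects`.
fn sweep(
    files: &mut HashMap<String, SystemTime>,
    rows: &mut HashMap<String, u64>,
    now: SystemTime,
) {
    // Leaked refcounts: rows stuck at zero (e.g. a crash between
    // decrementing and deleting). Drop the row and its file.
    let leaked: Vec<String> = rows
        .iter()
        .filter(|(_, rc)| **rc == 0)
        .map(|(h, _)| h.clone())
        .collect();
    for h in leaked {
        rows.remove(&h);
        files.remove(&h);
    }
    // Orphan files: on disk with no `cas_objects` row. Files younger
    // than the grace period are kept (possible in-flight writes).
    files.retain(|hash, mtime| {
        rows.contains_key(hash)
            || now.duration_since(*mtime).map(|age| age < GRACE).unwrap_or(true)
    });
}
```

<p>Because each rule only removes entries that are already dead, a second pass over
the same state finds nothing to do, which is the idempotence property above.</p>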
<p>The sweep is wired into the existing repair loop in <code>tesseras-replication</code>, so
it runs automatically every 24 hours alongside fragment health checks.</p>
<p><strong>Migration from old layout</strong> (<code>tesseras-storage/src/migration.rs</code>) — A
copy-first migration strategy that moves data from the old directory-based
layout (<code>blobs/&lt;tessera&gt;/&lt;memory&gt;/&lt;file&gt;</code> and
<code>fragments/&lt;tessera&gt;/&lt;index&gt;.shard</code>) into the CAS. The migration:</p>
<ol>
<li>Checks the storage version in <code>storage_meta</code> (version 1 = old layout, version
2 = CAS)</li>
<li>Walks the old <code>blobs/</code> and <code>fragments/</code> directories</li>
<li>Computes BLAKE3 hashes and inserts into CAS via <code>put()</code> — duplicates are
automatically deduplicated</li>
<li>Creates corresponding <code>blob_refs</code> / <code>fragment_refs</code> entries</li>
<li>Removes old directories only after all data is safely in CAS</li>
<li>Updates the storage version to 2</li>
</ol>
<p>The migration runs on daemon startup, is idempotent (safe to re-run), and
reports statistics: files migrated, duplicates found, bytes saved.</p>
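<p>The control flow reduces to a version-gated, copy-first loop. A sketch with
in-memory stand-ins (std's <code>DefaultHasher</code> substitutes for BLAKE3 here, and
plain maps substitute for the CAS and SQLite):</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// DefaultHasher stands in for BLAKE3; the real migration reuses the
// project's content hashes.
fn content_hash(data: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    format!("{:016x}", h.finish())
}

struct Stats {
    migrated: usize,
    duplicates: usize,
    bytes_saved: u64,
}

/// Copy-first migration sketch: every old entry is inserted into the
/// CAS (a hash -> (data, refcount) map here) before anything is
/// removed, and the version check makes a re-run a no-op.
fn migrate(
    version: &mut u32,
    old_blobs: &mut Vec<(String, Vec<u8>)>, // (logical path, data)
    cas: &mut HashMap<String, (Vec<u8>, u64)>,
    refs: &mut HashMap<String, String>, // logical path -> CAS hash
) -> Stats {
    let mut stats = Stats { migrated: 0, duplicates: 0, bytes_saved: 0 };
    if *version >= 2 {
        return stats; // already on the CAS layout
    }
    for (path, data) in old_blobs.iter() {
        let h = content_hash(data);
        match cas.get_mut(&h) {
            Some((_, rc)) => {
                // Dedup hit: count the bytes we did not have to copy.
                *rc += 1;
                stats.duplicates += 1;
                stats.bytes_saved += data.len() as u64;
            }
            None => {
                cas.insert(h.clone(), (data.clone(), 1));
            }
        }
        refs.insert(path.clone(), h);
        stats.migrated += 1;
    }
    old_blobs.clear(); // remove the old layout only after everything is in the CAS
    *version = 2;
    stats
}
```

<p>Clearing the old layout only after the loop completes mirrors step 5: an
interruption at any earlier point leaves the old data intact and the version at
1, so the migration simply runs again. The returned fields match the reported
statistics: files migrated, duplicates found, bytes saved.</p>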
<p><strong>Prometheus metrics</strong> (<code>tesseras-storage/src/metrics.rs</code>) — Nine new metrics
for observability:</p>
<table><thead><tr><th>Metric</th><th>Description</th></tr></thead><tbody>
<tr><td><code>cas_objects_total</code></td><td>Total unique objects in the CAS</td></tr>
<tr><td><code>cas_bytes_total</code></td><td>Total bytes stored</td></tr>
<tr><td><code>cas_dedup_hits_total</code></td><td>Number of writes that found an existing object</td></tr>
<tr><td><code>cas_bytes_saved_total</code></td><td>Bytes saved by deduplication</td></tr>
<tr><td><code>cas_gc_refcount_deletions_total</code></td><td>Objects deleted when refcount reached zero</td></tr>
<tr><td><code>cas_gc_sweep_orphans_cleaned_total</code></td><td>Orphan files removed by sweep</td></tr>
<tr><td><code>cas_gc_sweep_leaked_refs_cleaned_total</code></td><td>Leaked refcount rows cleaned</td></tr>
<tr><td><code>cas_gc_sweep_skipped_young_total</code></td><td>Young orphans skipped (grace period)</td></tr>
<tr><td><code>cas_gc_sweep_duration_seconds</code></td><td>Time spent in sweep GC</td></tr>
</tbody></table>
<p><strong>Property-based tests</strong> — Two proptest tests verify CAS invariants under random
inputs:</p>
<ul>
<li><code>refcount_matches_actual_refs</code> — after N random put/release operations, the
refcount always matches the actual number of outstanding references</li>
<li><code>cas_path_is_deterministic</code> — the same hash always produces the same
filesystem path</li>
</ul>
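<p>The first invariant can be sketched without the proptest machinery: drive a
refcounted map with deterministic pseudo-random put/release operations and
check the count after every step. (The LCG below stands in for proptest's input
generation.)</p>

```rust
use std::collections::HashMap;

/// Deterministic LCG standing in for proptest's generators.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Drive a refcounted map with random put/release operations and
/// check, after every step, that the stored refcount matches the
/// number of outstanding references. Panics on the first violation.
fn check_refcount_invariant(ops: usize, seed: u64) {
    let mut rng = seed;
    // The "store": entries vanish when the refcount hits zero,
    // mirroring CAS object deletion.
    let mut store: HashMap<u64, u64> = HashMap::new();
    // Ground truth tracked independently: puts minus releases per key.
    let mut outstanding: HashMap<u64, i64> = HashMap::new();
    for _ in 0..ops {
        let hash = lcg(&mut rng) % 8; // tiny keyspace to force collisions
        if lcg(&mut rng) % 2 == 0 {
            // put: insert or bump
            *store.entry(hash).or_insert(0) += 1;
            *outstanding.entry(hash).or_insert(0) += 1;
        } else if store.contains_key(&hash) {
            // release: dropping to zero deletes the object
            let rc = store.get_mut(&hash).unwrap();
            *rc -= 1;
            *outstanding.get_mut(&hash).unwrap() -= 1;
            if *rc == 0 {
                store.remove(&hash);
            }
        }
        // Invariant: refcount == outstanding refs (absent entries are 0).
        for (h, n) in &outstanding {
            assert_eq!(store.get(h).copied().unwrap_or(0) as i64, *n);
        }
    }
}
```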
<p><strong>Integration test updates</strong> — All integration tests across <code>tesseras-core</code>,
<code>tesseras-replication</code>, <code>tesseras-embedded</code>, and <code>tesseras-cli</code> updated for the
new CAS-backed constructors. Tamper-detection tests updated to work with the CAS
directory layout.</p>
<p>347 tests pass across the workspace. Clippy clean with <code>-D warnings</code>.</p>
<h2 id="architecture-decisions">Architecture decisions</h2>
<ul>
<li><strong>BLAKE3 as CAS key</strong>: the content hash we already compute for integrity
verification doubles as the deduplication key. No additional hashing step —
the hash computed during <code>create</code> or <code>replicate</code> is reused as the CAS address.</li>
<li><strong>SQLite refcount over filesystem reflinks</strong>: we considered using
filesystem-level copy-on-write (reflinks on btrfs/XFS), but that would tie
Tesseras to specific filesystems. SQLite refcounting works on any filesystem,
including FAT32 on cheap USB drives and ext4 on Raspberry Pis.</li>
<li><strong>Two-level hex prefix directories</strong>: storing all CAS objects in a flat
directory would slow down filesystems with millions of entries. The
<code>&lt;2-char prefix&gt;/</code> split fans objects out across 256 subdirectories, keeping
each one a manageable size until a second prefix level becomes necessary. This
matches the approach used by Git's object store.</li>
<li><strong>Grace period for orphan files</strong>: the sweep GC skips files younger than 1
hour to avoid deleting objects that are being written by a concurrent
operation. This is a pragmatic choice — it trades a small window of potential
orphans for crash safety without requiring fsync or two-phase commit.</li>
<li><strong>Copy-first migration</strong>: the migration copies data to CAS before removing old
directories. If the process is interrupted, the old data is still intact and
migration can be re-run. This is slower than moving files but guarantees no
data loss.</li>
<li><strong>Sweep in repair loop</strong>: rather than adding a separate GC timer, the CAS
sweep piggybacks on the existing 24-hour repair loop. This keeps the daemon
simple — one background maintenance cycle handles both fragment health and
storage cleanup.</li>
</ul>
<h2 id="what-comes-next">What comes next</h2>
<ul>
<li><strong>Phase 4 continued</strong> — security audits, OS packaging (Alpine, Arch, Debian,
OpenBSD, FreeBSD)</li>
<li><strong>Phase 5: Exploration and Culture</strong> — public tessera browser by
era/location/theme/language, institutional curation, genealogy integration
(FamilySearch, Ancestry), physical media export (M-DISC, microfilm, acid-free
paper with QR), AI-assisted context</li>
</ul>
<p>Storage deduplication completes the storage efficiency story for Tesseras. A
node that stores fragments for thousands of users — common for institutional
nodes and always-on full nodes — now pays the disk cost of unique data only.
Combined with Reed-Solomon erasure coding (which already provides network-level
redundancy far more cheaply than full replication), the system stores data
efficiently at both the local and distributed layers.</p>

</article>

    </main>

    <footer>
        <p>&copy; 2026 Tesseras Project. <a href="/atom.xml">News Feed</a> · <a href="https://git.sr.ht/~ijanc/tesseras">Source</a></p>
    </footer>
</body>
</html>