Storage Engine: Segments and Extents

ZeroFS stores file contents in immutable segment objects written directly to the object store. An LSM tree on the same object store holds only metadata: inodes, directory entries, and one 32-byte pointer per extent. This page documents the on-object format, the write and read paths, and space reclamation.

Overview

File data is split into 32 KiB extents. Each extent is compressed, then encrypted into a self-describing frame. Frames are packed into segments: immutable objects written once to the object store and removed only by garbage collection. The two stores divide the work:

Store | Holds
Segment objects (segments/…) | File data: compressed, encrypted frames plus a directory and footer
Metadata LSM tree (on the same object store) | Inodes, directory entries, tombstones, stats; one 32-byte FrameLoc per extent; per-segment live-byte counters

Segment object keys are segments/{shard}/{epoch}/{counter}:

  • epoch: the writer epoch, a fencing counter bumped on every writer open, so two writer terms can never collide on an object key.
  • counter: a per-process sequence number that restarts at 0 on each open.
  • shard: the low byte of the counter, as two hex digits. Consecutive segments round-robin across 256 prefixes, avoiding object-store hot-prefix throttling on a monotonic key tail. Reads resolve exact keys from the FrameLoc, so sharding costs the read path nothing.
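
The key scheme above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the function name segment_key is hypothetical, and rendering epoch and counter as plain decimal integers is an assumption (the page specifies only that the shard is two hex digits).

```python
def segment_key(epoch: int, counter: int) -> str:
    """Build an object key of the form segments/{shard}/{epoch}/{counter}."""
    shard = counter & 0xFF  # low byte of the counter
    # Assumption: epoch and counter are rendered as plain decimal integers.
    return f"segments/{shard:02x}/{epoch}/{counter}"


# Consecutive counters land on consecutive shard prefixes and wrap after 256.
keys = [segment_key(7, c) for c in (0, 1, 255, 256)]
```

Because the shard is derived from the counter, a reader that knows the epoch and counter can rebuild the exact key without listing any prefix.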

A write touches only RAM and the metadata transaction. Object-store requests happen at segment seal and metadata flush, not per write.

On-Object Format

A segment has three parts, written in order:

[frame_0][frame_1]...[frame_k-1]   packed, no padding
[directory]                        AEAD-sealed reverse map (GC and recovery only)
[footer]                           64 bytes, plaintext, written last

Each frame on the wire is [len: u32 LE][sealed_frame]. The sealed frame is [nonce: 24 bytes][ciphertext][tag: 16 bytes]: the extent is compressed (per the compression setting), then encrypted with XChaCha20-Poly1305. The key is derived from the master key via HKDF-SHA256 with the label zerofs-v1-segment, domain-separated from the metadata SST-block key. The AEAD associated data binds each frame to its segment ID, frame index, inode, and extent index, so a frame cannot be moved to another segment, slot, or logical block without failing authentication.
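
As a rough sketch of the wire layout just described, the following splits one frame into its nonce, ciphertext, and tag without decrypting it. It assumes the length prefix counts only the sealed frame (not the prefix itself) and that the 16-byte tag trails the ciphertext; split_frame is a hypothetical helper name.

```python
import struct

NONCE_LEN = 24   # XChaCha20 nonce
TAG_LEN = 16     # Poly1305 tag
LEN_PREFIX = 4   # u32, little-endian


def split_frame(buf: bytes, offset: int = 0):
    """Return (nonce, ciphertext, tag, next_offset) for the frame at offset."""
    # Assumption: the prefix holds the sealed-frame length, excluding itself.
    (sealed_len,) = struct.unpack_from("<I", buf, offset)
    if sealed_len < NONCE_LEN + TAG_LEN:
        raise ValueError("sealed frame shorter than nonce + tag")
    start = offset + LEN_PREFIX
    end = start + sealed_len
    if end > len(buf):
        raise ValueError("frame runs past end of buffer")
    sealed = buf[start:end]
    nonce = sealed[:NONCE_LEN]
    ciphertext = sealed[NONCE_LEN:len(sealed) - TAG_LEN]
    tag = sealed[len(sealed) - TAG_LEN:]
    return nonce, ciphertext, tag, end
```

Since frames are packed without padding, the returned next_offset is where the following frame's length prefix would begin.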

The FrameLoc is the LSM value of an extent key: 32 bytes, little-endian.

Offset (bytes) | Size (bytes) | Field
0 | 8 | Segment epoch
8 | 8 | Segment counter
16 | 4 | Frame index
20 | 8 | Byte offset of the frame's length prefix within the segment
28 | 4 | Byte length: length prefix plus sealed frame
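
The 32-byte layout above maps directly onto a packed little-endian struct. The sketch below encodes and decodes it with Python's struct module; the FrameLoc class, its field names, and the byte_range helper are illustrative, not the project's own types.

```python
import struct
from typing import NamedTuple

# u64 epoch, u64 counter, u32 frame index, u64 offset, u32 length:
# 8 + 8 + 4 + 8 + 4 = 32 bytes; "<" means little-endian with no padding.
FRAMELOC = struct.Struct("<QQIQI")


class FrameLoc(NamedTuple):
    epoch: int
    counter: int
    frame_index: int
    offset: int   # byte offset of the frame's length prefix
    length: int   # length prefix plus sealed frame

    def encode(self) -> bytes:
        return FRAMELOC.pack(*self)

    @classmethod
    def decode(cls, raw: bytes) -> "FrameLoc":
        return cls(*FRAMELOC.unpack(raw))

    def byte_range(self):
        """Inclusive (first, last) byte positions for a ranged GET."""
        return self.offset, self.offset + self.length - 1
```

The offset and length fields are all a point read needs to fetch exactly one frame.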

The footer is 64 bytes, fixed layout:

Offset (bytes) | Size (bytes) | Field
0 | 4 | Magic ZSEG
4 | 4 | Version (1)
8 | 4 | Frame count
12 | 8 | Directory offset
20 | 4 | Directory length
24 | 8 | Sealed epoch
32 | 8 | Counter
40 | 8 | Sealed sequence number
48 | 8 | Total object length
56 | 4 | CRC32C over directory and footer prefix
60 | 4 | Reserved, zero

The CRC32C is a keyless torn-write detector for the directory and footer only; frames carry their own AEAD tags. The directory is a reverse map — one 28-byte entry per frame (byte offset, length, inode, extent index) — sealed as a single AEAD frame. Normal reads never touch it; only garbage collection, segment packing, and standby recovery do.
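
To make the footer layout concrete, here is a minimal parsing sketch. Field names are illustrative, the CRC32C check is deliberately left out (the exact bytes it covers are not spelled out on this page), and the check that the directory ends where the footer begins is inferred from the three-part layout rather than stated as a rule.

```python
import struct
from typing import NamedTuple

# 4s magic, u32 version, u32 frames, u64 dir offset, u32 dir length,
# u64 epoch, u64 counter, u64 sequence, u64 total length, u32 crc, u32 reserved
FOOTER = struct.Struct("<4sIIQIQQQQII")  # 64 bytes, little-endian, unpadded


class Footer(NamedTuple):
    magic: bytes
    version: int
    frame_count: int
    dir_offset: int
    dir_len: int
    sealed_epoch: int
    counter: int
    sealed_seq: int
    total_len: int
    crc32c: int
    reserved: int


def parse_footer(tail: bytes) -> Footer:
    """Decode and sanity-check the last 64 bytes of a segment object."""
    if len(tail) != FOOTER.size:
        raise ValueError("footer must be exactly 64 bytes")
    f = Footer(*FOOTER.unpack(tail))
    if f.magic != b"ZSEG" or f.version != 1 or f.reserved != 0:
        raise ValueError("not a version-1 segment footer")
    # Inferred from the layout: frames, then directory, then footer.
    if f.dir_offset + f.dir_len + FOOTER.size != f.total_len:
        raise ValueError("directory span does not end at the footer")
    return f  # CRC32C verification is omitted from this sketch
```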

Write Path

  1. A file write splits into 32 KiB extents. Partially overwritten extents are read back first, up to 20 in parallel; full overwrites and writes beyond EOF skip the read. A 32 MiB tail cache holds each inode's most recently written extent, so sequential appends do not re-read the tail. All-zero extents become key deletions (holes).
  2. Each non-zero extent is sealed — compressed, encrypted — and appended to an in-RAM open segment buffer. The FrameLoc put and the segment live-byte credits and debits commit in the same transaction as the inode update. No object-store request is issued on the write path.
  3. When the open buffer crosses 256 MiB, it rotates and uploads in the background as one multipart PUT with 8 concurrent parts. At most 4 seals run in flight; this bounds un-uploaded segment RAM to 4 × 256 MiB and backpressures writers. A failed background upload is retried at the next flush.
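
Step 1 above can be illustrated with a small planner that decides, per extent, whether the old contents must be read back before the write. This is a simplified sketch: plan_write is a hypothetical name, and it models only the three cases named in the text (partial overwrite, full overwrite, write beyond EOF), ignoring the tail cache and hole detection.

```python
EXTENT = 32 * 1024  # extent size in bytes


def plan_write(offset: int, length: int, file_size: int):
    """Yield (extent_index, needs_read) for each extent a write touches."""
    if length <= 0:
        return
    end = offset + length
    first = offset // EXTENT
    last = (end - 1) // EXTENT
    for idx in range(first, last + 1):
        ext_start = idx * EXTENT
        ext_end = ext_start + EXTENT
        fully_covered = offset <= ext_start and end >= ext_end
        beyond_eof = ext_start >= file_size
        # Only a partial overwrite of existing data needs the old extent.
        yield idx, not (fully_covered or beyond_eof)
```

For an unaligned write in the middle of a file, only the first and last extents can need a read; every extent in between is a full overwrite.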

The fsync barrier

fsync (and every other flush) runs a fixed sequence: drain in-flight seals, re-upload any failed seal, seal the current open buffer, then flush the metadata LSM. If sealing fails, the metadata flush does not run. The invariant: a durable metadata manifest never references an un-uploaded segment. A crash between segment upload and metadata flush leaves an orphaned segment object, which garbage collection reclaims — never a durable pointer to a missing object. This invariant is exercised by failpoint crash tests.
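
The ordering rule can be sketched as a function that takes the four steps as callables. The names are hypothetical and the real steps are asynchronous; the only point illustrated is that an exception in any seal step prevents the metadata flush from running.

```python
def flush_barrier(drain_inflight_seals, reupload_failed_seals,
                  seal_open_buffer, flush_metadata):
    """Run the barrier steps in order; any exception aborts before metadata."""
    drain_inflight_seals()    # wait for background uploads already started
    reupload_failed_seals()   # retry anything that failed in the background
    seal_open_buffer()        # upload the current open segment
    flush_metadata()          # reached only if every seal step succeeded


# Usage: record the order, then show that a failed seal blocks the flush.
calls = []
flush_barrier(lambda: calls.append("drain"),
              lambda: calls.append("reupload"),
              lambda: calls.append("seal"),
              lambda: calls.append("flush"))


def failing_seal():
    raise IOError("upload failed")


blocked = []
try:
    flush_barrier(lambda: None, lambda: None, failing_seal,
                  lambda: blocked.append("flush"))
except IOError:
    pass  # the metadata flush was never attempted; blocked stays empty
```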

The LSM tree's write-ahead log is permanently off and not configurable: a WAL flush would not be gated by the seal barrier and could persist a FrameLoc that points to a segment that was never uploaded.

Flushes are triggered by:

  • Client fsync, NFS COMMIT, NBD flush
  • The periodic flush every flush_interval_secs (default 30 s)
  • Every commit batch when sync_writes = true
  • Each segment garbage-collection pass
  • Shutdown

Between flushes, up to 256 MiB of committed writes exist only in the open RAM buffer, plus up to 4 × 256 MiB in sealing buffers. Under replication, the standby also holds these writes: a frame still in leader RAM ships with the replicated operation, and a promoted standby rebuilds any missing segment objects at takeover. See Durability & Consistency for the guarantee model.

Read Path

A point read looks up the extent key, decodes the FrameLoc, and checks the open buffer and sealing map first (read-your-writes without a network round trip). On a miss it issues one ranged GET of exactly the frame's byte span, then decrypts and decompresses. Frames are self-describing; the segment directory is never consulted on reads.

Multi-extent reads scan the extent keyspace and coalesce every maximal run that is contiguous in one segment — same segment, consecutive frame indices, adjacent byte ranges — into a single ranged GET. Missing extents are holes and return zeros. Segment packing re-sorts a file's live extents by (inode, extent index), so a whole-file read of a packed file can be one GET.
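
The coalescing rule (same segment, consecutive frame indices, adjacent byte ranges) can be sketched as a single pass over FrameLocs in extent order. The Loc type and the coalesce function are illustrative names; the output uses end-exclusive byte ranges.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    epoch: int
    counter: int
    frame_index: int
    offset: int
    length: int


def coalesce(locs):
    """Merge locs into (epoch, counter, start, end_exclusive) GET ranges."""
    runs = []
    prev = None
    for loc in locs:
        contiguous = (
            prev is not None
            and (loc.epoch, loc.counter) == (prev.epoch, prev.counter)
            and loc.frame_index == prev.frame_index + 1
            and loc.offset == prev.offset + prev.length
        )
        if contiguous:
            epoch, counter, start, _ = runs[-1]
            runs[-1] = (epoch, counter, start, loc.offset + loc.length)
        else:
            runs.append((loc.epoch, loc.counter,
                         loc.offset, loc.offset + loc.length))
        prev = loc
    return runs
```

After packing has sorted a file's frames by (inode, extent index), every loc extends the previous run, which is how a whole-file read collapses to one range.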

Two prefetch layers apply:

  • Logical read-ahead: after 2 consecutive sequential reads, an 8 MiB window of upcoming extents is prefetched, up to 16 fetches in parallel. It fires only when the window crosses into a different segment object; within one segment the physical layer covers it.
  • Physical prefetch: objects are cached in 128 KiB parts; a sequential stream's fetch window ramps from 128 KiB to 8 MiB, kept up to 4 windows deep. Concurrent misses on the same window share one GET.
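
As a small illustration of the 128 KiB part granularity, the sketch below maps a byte range within an object to the cache parts it touches. The helper name is hypothetical, and the window ramp policy is not modeled because this page does not specify how the window grows.

```python
PART = 128 * 1024  # physical cache part size in bytes


def parts_for_range(offset: int, length: int) -> range:
    """Return the part indices covering [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // PART
    last = (offset + length - 1) // PART
    return range(first, last + 1)
```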

The parts cache stores raw object bytes, so extent data stays compressed and encrypted on the local cache disk. Just-sealed segments are inserted into the parts cache from the bytes in hand, so a read after write is served locally. Cache sizing is documented on the Caching page.

Space Reclamation

The metadata LSM tracks live bytes per segment. Transactions carry signed deltas; the single-writer commit worker folds them into absolute counters. Overwriting or deleting an extent debits the segment that held its old frame.
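
A minimal sketch of the delta-folding idea follows, assuming segments are identified by an (epoch, counter) pair and that each transaction supplies a list of signed byte deltas. The function name and the negative-counter check are illustrative additions.

```python
def fold_deltas(counters, deltas):
    """Apply signed (segment_id, delta) pairs to absolute live-byte counters."""
    for seg, delta in deltas:
        live = counters.get(seg, 0) + delta
        if live < 0:
            raise ValueError(f"live bytes for {seg} went negative")
        counters[seg] = live
    return counters


counters = {}
# Transaction 1: a new extent lands in segment (1, 0) as a 1000-byte frame.
fold_deltas(counters, [((1, 0), +1000)])
# Transaction 2: the extent is overwritten; the new 900-byte frame lands in
# segment (1, 1) and the old frame's bytes are debited from (1, 0).
fold_deltas(counters, [((1, 1), +900), ((1, 0), -1000)])
```

After the second transaction, segment (1, 0) holds zero live bytes and becomes a candidate for deletion.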

A garbage-collection pass runs every 60 s on the writer:

  1. Seal and flush under the barrier, so the durable view matches memory.
  2. Scan the per-segment counters — O(number of segments), not extents — and the streamed segments/ listing.
  3. Delete dead segments: a segment with zero live bytes gets a delete horizon on first sight (at least 60 s in the future, extended 30 s past any ephemeral checkpoint expiry) and is deleted on a later pass, at most 1024 per pass. Every delete is gated by a fail-closed verification that re-reads the segment directory and confirms no extent still points at it; any error skips the delete (a leak, never data loss). Persistent checkpoints protect the segments they can reference.
  4. Pack fragmented segments: segments under 50% live or smaller than 1 MiB are repacked into fresh segments targeting 256 MiB, at most 64 sources and 256 MiB per round. Live frames are fetched via coalesced ranged GETs, re-sorted by (inode, extent index) for physical contiguity, sealed durably, then repointed with a conditional swap under the per-inode lock. Drained sources are not deleted in the same pass — they become dead and are removed later, so in-flight reads stay valid. Dense full-size segments are never rewritten.
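
The candidate rule in step 4 can be sketched as a filter over the per-segment counters. This is illustrative only: the function name is hypothetical, and treating the 256 MiB per-round cap as a budget on live bytes copied is an assumption, since the text does not say whether the cap applies to live bytes or source size.

```python
MIB = 1024 * 1024
MAX_SOURCES = 64
ROUND_BUDGET = 256 * MIB


def pick_pack_sources(segments):
    """segments: iterable of (segment_id, size_bytes, live_bytes)."""
    picked = []
    budget = ROUND_BUDGET  # assumption: the cap counts live bytes copied
    for seg_id, size, live in segments:
        if len(picked) >= MAX_SOURCES:
            break
        if live == 0:
            continue  # dead segments go through the delete step instead
        fragmented = live * 2 < size   # under 50% live
        small = size < MIB             # smaller than 1 MiB
        if not (fragmented or small):
            continue                   # dense full-size: never rewritten
        if live > budget:
            continue
        picked.append(seg_id)
        budget -= live
    return picked
```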

Directory reads during GC are two ranged GETs — the 64-byte footer suffix, then the directory span — never the whole object.
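
If the object store is reached over HTTP, the two requests could be expressed with standard Range headers as sketched below. The helper names are hypothetical, and whether ZeroFS builds headers directly or goes through a client library is not stated here.

```python
FOOTER_LEN = 64


def footer_range_header() -> str:
    """Suffix range: the last 64 bytes, with no need to know the object size."""
    return f"bytes=-{FOOTER_LEN}"


def directory_range_header(dir_offset: int, dir_len: int) -> str:
    """Inclusive byte range covering the sealed directory."""
    if dir_len <= 0:
        raise ValueError("directory length must be positive")
    return f"bytes={dir_offset}-{dir_offset + dir_len - 1}"
```

The second header is built from the directory offset and length decoded from the footer returned by the first request.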

File data no longer rides LSM compaction. The embedded compactor still runs in zerofs run, but rewrites only metadata SSTs. Deleted-file extent pointers are removed by the tombstone sweep described on the Garbage Collection page; the segment bytes are then reclaimed by the pass above.

Design Constants

The data plane has no tuning knobs. These values are compile-time constants:

Constant | Value
Extent size | 32 KiB
FrameLoc size | 32 bytes
Segment seal threshold | 256 MiB
Max in-flight seals | 4
Segment upload concurrency | 8 multipart parts
Parallel extent reads per operation | 20
Tail cache | 32 MiB
Logical read-ahead window | 8 MiB, after 2 sequential reads, 16 fetches max
Physical part size | 128 KiB
Physical fetch window | 128 KiB to 8 MiB, 4 windows deep
Segment GC interval | 60 s
Dead-segment delete horizon | ≥ 60 s; checkpoint expiry + 30 s; persistent checkpoint margin 300 s
Segment deletes per GC pass | ≤ 1024
Packing candidates | live < 50%, or size < 1 MiB
Packing target / per-round cap | 256 MiB / 64 segments, 256 MiB

Configuration

The engine responds to these configuration keys:

Key | Default | Effect
[filesystem] compression | zstd-3 | Per-frame codec: lz4 or zstd-1 through zstd-22. The algorithm is auto-detected on read, so changing it on an existing store is safe; old and new frames coexist.
[filesystem] ignore_fsync | false | Client fsync returns without a seal and flush. Intended for HA, where the standby already holds the write. Cannot be combined with sync_writes.
[lsm] flush_interval_secs | 30 (min 5) | Interval of the periodic barrier-gated flush: seal the open segment, then flush metadata.
[lsm] sync_writes | false | Forces a segment seal and metadata flush after every coalesced commit batch. Expensive: the WAL is off, so each batch waits for a full seal plus memtable flush.
[lsm] l0_max_ssts | 256 (min 4) | Level-0 SST backlog for the metadata LSM; also applied per key.
[lsm] max_concurrent_compactions | 8 (min 1) | Concurrency of the embedded metadata compactor.

[lsm] wal_enabled and [lsm] max_unflushed_gb no longer exist; a configuration that sets either fails to parse. Cache sizing ([cache] disk_size_gb, memory_size_gb, warm_metadata) is covered on the Caching page.
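
For orientation, the keys listed in this section might appear in a TOML-style configuration file roughly as follows. Section and key names come from the table; the file format and the value syntax (for example, compression as a quoted string) are assumptions, and every value shown is the documented default.

```toml
# Hypothetical fragment: value syntax is assumed, defaults are from the table.
[filesystem]
compression = "zstd-3"            # lz4 or zstd-1 through zstd-22
ignore_fsync = false              # cannot be combined with sync_writes

[lsm]
flush_interval_secs = 30          # minimum 5
sync_writes = false
l0_max_ssts = 256                 # minimum 4
max_concurrent_compactions = 8    # minimum 1
```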
