Garbage Collection

Deleting a file is immediate at the filesystem level: the unlink transaction commits and the file is gone from the namespace. The bytes in the bucket are reclaimed later, by background stages that each run on their own schedule. File contents live in immutable segment objects under segments/; reclamation ends when segment GC deletes those objects. A bucket that shrinks minutes after a deletion is operating normally.

What Deletion Does

Unlink removes the inode in a single transaction. What happens to the file's extents depends on the file's size:

Files of 10 extents or fewer (320 KiB at the 32 KiB extent size) have their extent keys deleted in the same transaction.
Larger files leave a tombstone, a record of the file's remaining size. The extents stay in place until the garbage collector processes the tombstone.

If the file is still open, the unlink only detaches the inode from the namespace and records it as an orphan; the same small-file/tombstone decision runs when the last handle closes.

Deleting an extent key removes a 32-byte pointer (a FrameLoc) from the metadata store and debits the live-byte counter of the segment that holds the extent's frame. It does not touch the segment object; the data bytes stay in the bucket until segment GC removes or repacks the segment.
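The bookkeeping described above can be modelled with two dictionaries. This is a minimal sketch, not ZeroFS code: the `FrameLoc` fields, the `(inode, extent)` key shape, the `segments/1` object name, and the assumption that the debit equals the frame's length are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameLoc:
    """Stand-in for the 32-byte pointer; the real layout is not shown here."""
    segment_id: int
    frame_len: int


def delete_extent_key(extents, live_bytes, key):
    """Drop one extent pointer and debit its segment's live-byte counter."""
    loc = extents.pop(key)
    # Assumption: the debit is the length of the frame that held the extent.
    live_bytes[loc.segment_id] -= loc.frame_len


# Hypothetical state: two extents of inode 7, both stored in segment 1.
extents = {
    (7, 0): FrameLoc(segment_id=1, frame_len=4096),
    (7, 1): FrameLoc(segment_id=1, frame_len=1024),
}
live_bytes = {1: 5120}
bucket_objects = {"segments/1"}  # never touched by an extent-key delete

delete_extent_key(extents, live_bytes, (7, 0))
```

After the call the counter for segment 1 reads 1024 and the pointer is gone, while `bucket_objects` is unchanged: the bucket does not shrink at this step.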

Renaming over an existing file deletes the replaced file through the same path. Whether a file is removed by unlink or by rename, the operation returns before any space is reclaimed.

Deletion decision

file size <= 10 extents (320 KiB)
  -> extent keys deleted in the
     unlink transaction

file size  > 10 extents
  -> tombstone written;
     extents remain until GC

file still open at unlink
  -> inode orphaned; decision
     deferred to last close
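The same decision, written as a small Python function. The names and return labels are hypothetical, and the sketch assumes a file's extent count is its size divided by the 32 KiB extent size, rounded up.

```python
EXTENT_SIZE = 32 * 1024          # 32 KiB extent size, from the text
INLINE_DELETE_MAX_EXTENTS = 10   # threshold from the text


def extent_count(size_bytes):
    """Extents needed for size_bytes (assumption: ceiling division)."""
    return -(-size_bytes // EXTENT_SIZE)


def unlink_action(size_bytes, open_handles):
    """Return what unlink does with the file's extents (illustrative labels)."""
    if open_handles > 0:
        # Still open: the inode is orphaned; the decision runs at last close.
        return "orphan_until_last_close"
    if extent_count(size_bytes) <= INLINE_DELETE_MAX_EXTENTS:
        return "delete_extents_in_unlink_txn"
    return "write_tombstone"
```

Under these assumptions a closed file of exactly 320 KiB is deleted inline, and one byte more pushes it onto the tombstone path.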

The Reclamation Pipeline

File contents are packed as encrypted frames into immutable segment objects; the metadata LSM holds one 32-byte pointer per extent and a live-byte counter per segment. Space returns in two stages for file data and one for metadata. Each is eventual, and none is triggered by the deletion itself:

Tombstone GC deletes the file's extent pointer keys and debits each source segment's live-byte counter. No object is deleted; the segment objects still hold the frames.
Segment GC deletes segment objects whose live-byte counter has reached zero and repacks fragmented ones. This is where the bucket's billed size for file data drops.
Metadata compaction and object GC rewrite and delete metadata SSTs. The SSTs hold only metadata (inodes, directories, extent pointers, counters), so this stage reclaims metadata space, not file data.

A delay between deleting files and the bucket shrinking is normal operation, not a fault. Each stage runs in the background on its own schedule.

Stage 1: Tombstone GC

The read-write instance runs a continuous background task that drains the tombstone queue. Each pass processes the queue in rounds of up to 10,000 extents across up to 10,000 tombstones; rounds repeat back to back until the queue is empty, and the task then sleeps 10 seconds before the next pass.

Large files are collected tail-first: each round deletes extents from the end of the file and updates the tombstone's remaining size. A partially collected file is a valid intermediate state, and an interrupted pass resumes from the recorded remaining size.

Each delete runs under the per-inode write lock and, in the same transaction, debits the live-byte counter of every removed frame's source segment. Those counters are how the next stage finds dead and fragmented segments.

After this stage, the extent pointers are gone from the metadata store. The bucket holds the same segment objects it held before.

GC pass

loop:
  round:
    process up to 10,000 tombstones
    delete up to 10,000 extent keys
    (tail-first per file)
    debit segment live-byte counters
  if queue not empty: next round
  else: sleep 10 s, start next pass
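A minimal sketch of tail-first collection for a single tombstoned file. It simplifies the real round, which spans many tombstones at once, and it assumes the tombstone's remaining size maps to extents by ceiling division; the function names are illustrative.

```python
EXTENT_SIZE = 32 * 1024
ROUND_EXTENT_BUDGET = 10_000  # per-round extent limit from the text


def collect_round(remaining_size, budget=ROUND_EXTENT_BUDGET):
    """Delete up to `budget` extents from the tail of one file.

    Returns (deleted extent indices, new remaining size in bytes).
    """
    total = -(-remaining_size // EXTENT_SIZE)
    first_kept_end = max(0, total - budget)
    # Highest extent index first, so the file shrinks from its end.
    deleted = list(range(total - 1, first_kept_end - 1, -1))
    return deleted, first_kept_end * EXTENT_SIZE


def rounds_to_drain(remaining_size):
    """Count rounds until the tombstone's remaining size reaches zero."""
    rounds = 0
    while remaining_size > 0:
        _, remaining_size = collect_round(remaining_size)
        rounds += 1
    return rounds
```

Because each round stores the new remaining size, a pass interrupted between rounds can pick up from that value without rescanning the file.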

Stage 2: Segment GC

The writer runs a dedicated segment GC task: one pass at startup, then one per interval. The interval adapts across three configurable tiers:

A 60-second base.
A 15-second drain tier while the store is active but its dead backlog is large (dead space at or above 20%, or reserve-deferred seams remaining).
A 5-second fast tier while the previous pass hit a work budget with actionable backlog remaining and the store has been completely idle since that pass (zero bytes read or written, zero mutations, no seal in flight).

An error or a drained backlog returns the cadence to the base interval, and the loop never sleeps longer than the base interval. Each pass logs its chosen plan (tier, next interval, batch floor, and the deciding reason).

Each pass repeats its selection and compaction step back to back. A configurable floor of batches (4 by default) runs regardless of client activity, so a loaded store drains several batches per pass instead of one. Past the floor, the pass continues only while the store stays idle, up to 32 batches; it re-checks between batches and yields within one batch of any arriving operation. The barrier and counter scan run once per pass, so batching amortizes them across the drain rather than paying them per batch.

A pass begins by sealing the open segment buffer and flushing the metadata store under the flush barrier, so the durable view matches the in-memory view; a segment GC pass is therefore also a flush trigger. On a fast idle pass the barrier is a near no-op, because the idle test admitted no writes since the last one. The cost of such a pass is the reclamation work itself, which can be a full round of gathers, segment PUTs, and deletes, plus about two small bookkeeping PUTs of fixed overhead (the flush of the previous pass's own commits).
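The tier choice and the batch loop can be sketched as two small functions. The `PassOutcome` fields are hypothetical, and the order in which the conditions are tested is an assumption; the text gives the conditions but not their precedence.

```python
from dataclasses import dataclass

BASE_S, DRAIN_S, FAST_S = 60, 15, 5   # default tiers from the text
DEAD_RATIO_TRIGGER = 0.20             # busy-backlog trigger from the text
BATCH_FLOOR, BATCH_CAP = 4, 32


@dataclass
class PassOutcome:
    """Hypothetical summary of the previous pass."""
    error: bool
    backlog_remaining: bool
    hit_budget: bool
    idle_since_pass: bool
    dead_ratio: float
    deferred_seams: int


def next_interval(o):
    """Return (seconds until the next pass, tier name)."""
    if o.error or not o.backlog_remaining:
        return BASE_S, "base"
    if o.hit_budget and o.idle_since_pass:
        return FAST_S, "fast"
    backlog_large = o.dead_ratio >= DEAD_RATIO_TRIGGER or o.deferred_seams > 0
    if not o.idle_since_pass and backlog_large:
        return DRAIN_S, "drain"
    return BASE_S, "base"


def batches_to_run(idle_checks):
    """The floor runs regardless of load; later batches run only while idle.

    `idle_checks` yields one idleness reading per check between batches.
    """
    ran = BATCH_FLOOR
    for idle in idle_checks:
        if ran >= BATCH_CAP or not idle:
            break
        ran += 1
    return ran
```

Checking idleness between batches is what bounds the yield latency: an arriving operation waits for at most the batch already in progress.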

The pass then reads every per-segment live-byte counter in one scan; it lists nothing on the object store. Only segments sealed before the pass are eligible. Two actions follow:

Dead-segment deletion. A segment whose counter is zero is a delete candidate. It is deleted only after a per-segment delete horizon has passed, and only after a verification step reads the segment's directory and confirms that no extent pointer still references it. Verification is fail-closed: any read error keeps the segment, since a leaked segment is preferred to lost data. The object DELETE is the point where the bucket's billed size drops.

Segment compaction. Segments that are fragmented (live bytes below 50% of the object size) or small (below 1 MiB) are repacked: their live frames are fetched with coalesced ranged GETs, sorted by (inode, extent) so that a file's data becomes physically contiguous, and sealed into new segments of up to 256 MiB. Drained sources are not deleted in the same pass: they become dead and a later pass removes them, which keeps in-flight reads of the old locations valid. Candidates that recent reads had to fan out across are selected first, up to half the per-pass budget; the rest of the budget is filled most-fragmented-first. A dense segment with no dead bytes is rewritten by exactly one path, a hot-seam chain repack, described under Read-Directed Compaction; a dense segment carrying dead bytes above the scrub floor falls to the tail scrub described below once it is write-cold.
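The two actions reduce to a classification over the counter scan plus a fail-closed gate on deletion. This is a sketch under stated assumptions: "small" is taken to mean object size below 1 MiB, `count_references` is a hypothetical callable standing in for the directory verification, and frames are modelled as plain dictionaries.

```python
MIB = 1024 * 1024
FRAGMENTED_BELOW = 0.50   # live bytes below 50% of object size
SMALL_BELOW = 1 * MIB


def classify(object_size, live_bytes):
    """Label one sealed segment from its size and live-byte counter."""
    if live_bytes == 0:
        return "delete_candidate"
    if live_bytes < FRAGMENTED_BELOW * object_size or object_size < SMALL_BELOW:
        return "compaction_candidate"
    return "keep"


def may_delete(counter, now, horizon, count_references):
    """Fail-closed delete gate: any doubt keeps the segment."""
    if counter != 0 or now < horizon:
        return False
    try:
        return count_references() == 0
    except Exception:
        # A read error keeps the segment: a leak is preferred to lost data.
        return False


def repack_order(frames):
    """Sort live frames so each file's data lands contiguously."""
    return sorted(frames, key=lambda f: (f["inode"], f["extent"]))
```

A segment at exactly 50% live is not fragmented under this rule, which is why the tail-scrub band is closed at 50% dead.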

Parameter	Value
Pass interval	adaptive: 60 s base / 15 s while active with a large dead backlog / 5 s while saturated and idle (all configurable)
Delete-horizon floor	now + 60 s
Fragmentation threshold	live bytes < 50% of object size
Small-segment threshold	< 1 MiB
Tail-scrub band	dead fraction in (floor, 50%], write-cold only; floor default 5%, configurable
Repack target size	256 MiB
Compaction budget	64 segments / 256 MiB live per batch (byte budget configurable via `compact_round_max_mib`, 64–4096 MiB); a floor of 4 batches per pass under load, up to 32 while the store stays idle
Minimum payoff to compact	1 MiB freed, or 64 MiB of live bytes gathered (a quarter of the seal threshold), or at least one chain packed
Segment deletes per pass	1,024
Orphan sweep interval	24 h wall-clock, persisted across restarts
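The budget and payoff rows of the table can be read as two predicates. The function names are illustrative; the thresholds are the defaults listed above, with the byte budget passed in MiB the way `compact_round_max_mib` expresses it.

```python
MIB = 1024 * 1024


def within_batch_budget(n_segments, live_bytes, max_mib=256):
    """A batch takes at most 64 segments and `max_mib` MiB of live bytes."""
    return n_segments <= 64 and live_bytes <= max_mib * MIB


def worth_compacting(freed_bytes, live_gathered_bytes, chains_packed):
    """Minimum payoff: any one of the three conditions is enough."""
    return (
        freed_bytes >= 1 * MIB
        or live_gathered_bytes >= 64 * MIB   # a quarter of the 256 MiB seal threshold
        or chains_packed >= 1
    )
```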

A pass issues a 64-byte footer GET plus a directory GET per segment it verifies or repacks, and one DELETE per dead segment; a pass that repacks additionally issues the coalesced ranged GETs for the live frames and one PUT per packed segment. No pass lists the object store, so steady-state GC traffic is proportional to the garbage produced, not to the data owned.
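As a rough tally, the per-pass request counts follow directly from that description. This sketch counts only the requests named above; it leaves out the flush barrier's own writes, and the number of coalesced ranged GETs is passed in because it depends on how the live frames are laid out.

```python
def pass_requests(segments_inspected, dead_deleted, ranged_gets=0, packed_segments=0):
    """Estimate object-store requests for one segment GC pass.

    `segments_inspected` counts every segment verified or repacked,
    including the dead ones that are verified before deletion.
    """
    return {
        "GET": 2 * segments_inspected + ranged_gets,  # footer + directory each
        "PUT": packed_segments,
        "DELETE": dead_deleted,
        "LIST": 0,  # the counter scan replaces any object listing
    }
```

For example, a pass that verifies and deletes 4 dead segments and repacks 6 others into one new segment with 12 ranged reads would issue 32 GETs, 1 PUT, and 4 DELETEs under this model.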

Tail scrubbing. The two candidacy rules above leave a band of dead space unreclaimed. A segment that is between the scrub floor (default 5%) and 50% dead and is never written again stays above the fragmentation threshold indefinitely, so a store in which every file was about 20% deleted would never get that space back. Pass budget left over after deletion and compaction is spent repacking such segments, most-dead-first, but only when they are write-cold. A segment is write-cold when it was sealed at least 30 seal rotations before the current open segment within the current process's lifetime, or when the whole store has been write-quiescent for 5 minutes (the only path by which data sealed before the last restart qualifies). The scrub consumes only budget that deletion and compaction did not use, so it runs mostly during idle periods; after an upgrade, a store with an accumulated tail works it off over the following passes.
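The band and the write-cold test can be expressed as two predicates. This is a sketch: the rotation numbers are a hypothetical per-process counter, and `sealed_rotation=None` models a segment sealed before the last restart.

```python
SCRUB_FLOOR = 0.05        # default floor, configurable
SCRUB_CEILING = 0.50
COLD_ROTATIONS = 30
QUIESCENT_S = 5 * 60


def in_scrub_band(object_size, live_bytes, floor=SCRUB_FLOOR):
    """True when the dead fraction lies in (floor, 50%]."""
    dead_bytes = object_size - live_bytes
    return floor * object_size < dead_bytes <= SCRUB_CEILING * object_size


def write_cold(sealed_rotation, current_rotation, store_quiescent_s):
    """True when the segment is old enough in rotations, or the store is quiet."""
    if store_quiescent_s >= QUIESCENT_S:
        return True
    if sealed_rotation is None:
        # Sealed before the last restart: only store quiescence qualifies it.
        return False
    return current_rotation - sealed_rotation >= COLD_ROTATIONS
```

A segment that is 20% dead falls inside the band but not under the fragmentation rule, which is exactly the case the scrub exists to cover.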

Orphan sweep. A segment object with no counter at all (from a crash between sealing an object and the commit that would have credited it, or a repack whose relocated extents were all overwritten before the repoint) is invisible to the counter scan. A slow sweep reclaims these: once every 24 hours of wall-clock time (the last-run timestamp is persisted, so frequent restarts cannot postpone it), it streams one LIST of segments/, issued per shard prefix with bounded concurrency, and point-checks each listed object's counter. Counter-less objects pass the same directory verification as dead segments and are deleted after the same style of horizon, capped at 1,024 per sweep. This is the only object listing in the system, and orphans are rare.
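A minimal sketch of the sweep's scheduling check and its candidate selection. The names are illustrative, object keys and counter keys are assumed to be directly comparable, and the candidates it returns would still have to pass directory verification and wait out the horizon before deletion.

```python
SWEEP_INTERVAL_S = 24 * 3600
SWEEP_DELETE_CAP = 1024


def sweep_due(last_run_ts, now_ts):
    """The last-run timestamp is persisted, so restarts do not reset it."""
    return last_run_ts is None or now_ts - last_run_ts >= SWEEP_INTERVAL_S


def find_orphans(listed_keys, counters, cap=SWEEP_DELETE_CAP):
    """Point-check each listed object; keep those with no counter at all."""
    orphans = []
    for key in listed_keys:
        if key not in counters:
            orphans.append(key)
            if len(orphans) == cap:
                break
    return orphans
```

A segment whose counter is zero is not an orphan in this sense: it has a counter, so the regular pass handles it.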

Checkpoints Delay Reclamation

Checkpoints pin segment objects:

An ephemeral checkpoint (including a read replica's auto-renewed reader checkpoint) pushes the delete horizon to its expiry time plus a 30-second clock-skew margin, never below the 60-second floor.
A persistent checkpoint cannot be timed out. While one exists, segment deletion and compaction are paused entirely; the counters keep tracking garbage and reclamation catches up once the checkpoint is deleted. The orphan sweep is unaffected, because no manifest, and therefore no checkpoint, ever references an orphan.
If the checkpoint list cannot be read, the pass is skipped. Reclamation fails closed.
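The three rules combine into one horizon calculation. This sketch computes a single store-wide bound rather than the per-segment horizon, and its input encoding is an assumption: a list of expiry timestamps in which `None` marks a persistent checkpoint, or `None` for the whole list when it could not be read.

```python
FLOOR_S = 60          # delete-horizon floor
SKEW_MARGIN_S = 30    # clock-skew margin


def delete_horizon(now, checkpoint_expiries):
    """Return the earliest time deletion may proceed, or None to skip."""
    if checkpoint_expiries is None:
        return None                      # list unreadable: fail closed
    if any(expiry is None for expiry in checkpoint_expiries):
        return None                      # persistent checkpoint: paused
    horizon = now + FLOOR_S
    for expiry in checkpoint_expiries:
        horizon = max(horizon, expiry + SKEW_MARGIN_S)
    return horizon
```

With `now = 1000`, a checkpoint expiring at 1100 pushes the horizon to 1130, while one expiring at 1010 leaves it at the 1060 floor.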

Stage 3: Metadata Compaction and Object GC

The metadata LSM holds only inodes, directory entries, tombstones, extent pointers, and segment counters. Its compaction merges metadata SSTs: a coordinator in the writer process polls every 5 seconds and schedules size-tiered work, which an embedded worker executes. There is no standalone compactor process; metadata compaction always runs inside `zerofs run`.

The metadata store's object garbage collector, also in the writer process, scans four object directories (WAL, manifest, compacted SSTs, and compaction state) once a minute. It deletes objects that the current manifest no longer references and that are at least 1 minute old. These deletions reclaim metadata space only; file data is reclaimed by segment GC.
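The eligibility rule for metadata objects is a two-part filter. In this sketch the object names and the creation-time map are hypothetical; only the two conditions come from the text.

```python
MIN_AGE_S = 60  # objects younger than this are never collected


def collectable(objects, manifest_refs, now):
    """Return names of unreferenced objects that are old enough to delete.

    `objects` maps object name -> creation timestamp (seconds).
    """
    return sorted(
        name
        for name, created in objects.items()
        if name not in manifest_refs and now - created >= MIN_AGE_S
    )
```

The age check gives a freshly written object time to be referenced by a manifest before the collector can consider it.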

Where Reclamation Runs

Instance	Tombstone GC	Segment GC	Metadata compaction + object GC
Read-write	yes	yes	yes
Read-only mount	no	no	no
Checkpoint mount	no	no	no

All reclamation runs in the writer process. Read-only mounts and checkpoint mounts open the database through a reader and run no stage, but their checkpoints delay segment reclamation as described above. The bucket shrinks only while the read-write instance is running.

Crash Safety

A crash after a compaction seal but before the repoint leaves the source segments in place and readable; the packed segment is an orphan with no references, and a later pass deletes it.
A crash between a segment DELETE and the drop of its counter key leaks one stale counter key, nothing else.
A counter that under-counts leads at worst to a skipped delete, because the fail-closed directory verification finds the remaining references; it never leads to the deletion of a live segment.

These paths are exercised by failpoint crash tests that abort the operation at each injection point, reopen the filesystem, and verify consistency.

Observability

The Prometheus endpoint exports four counters for tombstone GC:

Metric	Meaning
`zerofs_tombstones_created_total`	Tombstones created by deletions
`zerofs_tombstones_processed_total`	Tombstones fully collected
`zerofs_gc_extents_deleted_total`	Extent keys deleted by GC
`zerofs_gc_runs_total`	GC passes started

All four counters count from process start. Within one run, `zerofs_tombstones_created_total` minus `zerofs_tombstones_processed_total` approximates the tombstone backlog; a gap that grows over time means deletions are outpacing collection. Files of 10 extents or fewer never create tombstones, so a small-file workload shows no tombstone activity. The monitor dashboard (`zerofs monitor`) and the web UI display the same counters.
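The backlog estimate can be computed from a scrape of the metrics endpoint. This is a minimal sketch that assumes the counters are exposed as unlabelled samples without timestamps in the Prometheus text format; the sample values are made up for illustration.

```python
SAMPLE = """\
# TYPE zerofs_tombstones_created_total counter
zerofs_tombstones_created_total 150
# TYPE zerofs_tombstones_processed_total counter
zerofs_tombstones_processed_total 120
"""


def parse_samples(text):
    """Parse 'name value' lines, skipping comments and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values


def tombstone_backlog(values):
    """Approximate backlog within one process run."""
    return (
        values["zerofs_tombstones_created_total"]
        - values["zerofs_tombstones_processed_total"]
    )
```

Because the counters restart at zero with the process, the difference is only meaningful when both values come from the same run.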

Segment GC exports its own metric family, listed in full on the Prometheus page:

Footprint gauges that track the store's dead-space ratio between passes: `zerofs_segment_appended_bytes`, `zerofs_segment_live_bytes`, `zerofs_segment_reclaimable_bytes`, `zerofs_segment_dead_ratio`.
Per-pass counters for the work done: `zerofs_segment_gc_segments_deleted_total`, `zerofs_segment_gc_deleted_bytes_total`, `zerofs_segment_gc_segments_compacted_total`, `zerofs_segment_gc_frames_relocated_total`, `zerofs_segment_gc_tail_scrubbed_total`, `zerofs_segment_gc_chains_packed_total`, `zerofs_segment_gc_orphans_reclaimed_total`, plus the read-directed nomination and hot-seam counters.
Backlog gauges: `zerofs_segment_gc_candidate_backlog`, `zerofs_segment_gc_awaiting_delete`, and `zerofs_segment_gc_saturated`, which is 1 whenever a pass leaves actionable backlog.

Each pass also logs one info-level summary line carrying the same numbers, including the unpacked-chain breakdown by cause (deferred, warm, over-reserve). Of those causes, only the reserve-deferred count is exported as a gauge (`zerofs_segment_gc_chains_deferred`); warm and over-reserve appear in the log line alone. The pass then logs the chosen cadence plan (tier, next interval, batch floor, and reason). The bucket's object count and size under segments/ reflect the net effect.

NBD TRIM

TRIM on an NBD device skips tombstones: the discard transaction deletes the pointer keys of fully covered extents and zeroes the covered portion of partially covered extents (an extent that becomes all zeroes is also deleted). The same transaction debits the source segments' live-byte counters; segment GC then reclaims the space on its own schedule.
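Which extents a discard covers fully and which only partially follows from the 32 KiB extent grid. A minimal sketch, with an illustrative function name and return shape:

```python
EXTENT_SIZE = 32 * 1024


def plan_discard(offset, length):
    """Split a discard range into fully and partially covered extents.

    Returns (full, partial): `full` lists extent indices whose pointer keys
    would be deleted; `partial` maps an extent index to the (start, end)
    byte range inside that extent that would be zeroed.
    """
    full, partial = [], {}
    if length <= 0:
        return full, partial
    end = offset + length
    first = offset // EXTENT_SIZE
    last = (end - 1) // EXTENT_SIZE
    for idx in range(first, last + 1):
        ext_start = idx * EXTENT_SIZE
        lo = max(offset, ext_start) - ext_start
        hi = min(end, ext_start + EXTENT_SIZE) - ext_start
        if lo == 0 and hi == EXTENT_SIZE:
            full.append(idx)
        else:
            partial[idx] = (lo, hi)
    return full, partial
```

A 64 KiB discard starting at offset 16 KiB fully covers only extent 1 and zeroes half of extents 0 and 2. The sketch does not model the follow-up rule that a partially covered extent which becomes all zeroes is deleted as well.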

Configuration

The following values are configurable through the `[gc]` section (see Configuration): the base pass interval, the idle drain interval, the busy-backlog drain interval and its dead-space trigger, the per-pass batch floor, the per-round compaction budget, the tail-scrub floor, and the read-directed switch (`read_directed`, default true).