Durability & Consistency
This document describes the durability and consistency guarantees provided by ZeroFS, including write atomicity, operation ordering, and crash recovery semantics.
Overview
ZeroFS implements a transactional storage model using an LSM-tree database designed for object storage backends. This architecture provides three key properties:
- Write Atomicity: All filesystem operations are atomic at the transaction level
- Strict Ordering: Operations are totally ordered and this order is preserved across crashes
- Instant Recovery: No filesystem check or journal replay is required after unexpected termination
Write Atomicity
Each filesystem operation executes within a single database transaction. A transaction bundles all related modifications into an atomic unit that either commits entirely or has no effect.
Transaction Scope
A write operation includes the following modifications within a single transaction:
- File data chunks
- Inode metadata (size, timestamps, mode)
- Directory entry updates
- Global statistics counters
The database commits these changes using a WriteBatch, which provides all-or-nothing semantics. Partial writes cannot occur.
Failure Modes
| Event | Outcome |
|---|---|
| Crash before commit | Transaction discarded, no changes visible |
| Crash during commit | Transaction discarded, no changes visible |
| Crash after commit | All changes visible |
An intermediate state in which some but not all modifications are visible cannot occur.
Transaction Structure
```rust
// Create transaction
let mut txn = db.new_transaction();

// Bundle all modifications
chunk_store.write(&mut txn, id, offset, data);
inode.size = new_size;
inode.mtime = now;
inode_store.save(&mut txn, id, &inode);
directory_store.update_entry(&mut txn, ...);

// Atomic commit
db.write(txn).await;
```
Operation Ordering
ZeroFS provides a total order over all write operations. This order is preserved exactly across process termination and restart.
Ordering Guarantee
The WriteCoordinator assigns monotonically increasing sequence numbers to operations. Each operation must wait for all predecessors to complete before committing:
```rust
let seq = write_coordinator.allocate_sequence();
seq.wait_for_predecessors().await;
db.write(txn).await;
seq.mark_committed();
```
This mechanism ensures that if operation A completes before operation B begins, then in any consistent state of the filesystem, the visibility of B implies the visibility of A.
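The sketch below shows one way such a coordinator can be built, assuming a tokio runtime: an atomic counter hands out sequence numbers and a watch channel publishes the highest committed sequence. The struct and method names mirror the snippet above, but the internals are illustrative, not the actual ZeroFS implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::watch;

struct WriteCoordinator {
    next_seq: AtomicU64,                // next sequence number to hand out
    committed_tx: watch::Sender<u64>,   // highest sequence committed so far
    committed_rx: watch::Receiver<u64>,
}

impl WriteCoordinator {
    fn new() -> Self {
        let (committed_tx, committed_rx) = watch::channel(0);
        Self {
            next_seq: AtomicU64::new(1),
            committed_tx,
            committed_rx,
        }
    }

    /// Hand out monotonically increasing sequence numbers.
    fn allocate_sequence(&self) -> u64 {
        self.next_seq.fetch_add(1, Ordering::SeqCst)
    }

    /// Wait until every operation with a smaller sequence number has committed.
    async fn wait_for_predecessors(&self, seq: u64) {
        let mut rx = self.committed_rx.clone();
        while *rx.borrow_and_update() < seq - 1 {
            rx.changed().await.expect("coordinator dropped");
        }
    }

    /// Publish this sequence as committed, unblocking its successors.
    fn mark_committed(&self, seq: u64) {
        self.committed_tx.send_replace(seq);
    }
}
```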
Formal Property
For operations A and B where A completes before B starts:
visible(B) → visible(A)
The contrapositive also holds: if A is not visible, B cannot be visible.
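This invariant can be expressed as a checkable predicate over a recorded history of operations. The sketch below is illustrative only; the Op type and ordering_invariant_holds function are not part of ZeroFS.

```rust
/// One recorded operation: the interval during which it executed and
/// whether its effects are visible in the post-crash state.
struct Op {
    start: u64,
    end: u64,
    visible: bool,
}

/// For every visible operation B, every operation A that completed
/// before B started must also be visible.
fn ordering_invariant_holds(history: &[Op]) -> bool {
    history.iter().all(|b| {
        !b.visible
            || history
                .iter()
                .filter(|a| a.end < b.start)
                .all(|a| a.visible)
    })
}
```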
Comparison with Other Filesystems
Many filesystems permit write reordering for performance optimization:
- ext4: May reorder writes within journal transactions
- XFS: Delayed allocation can reorder data relative to metadata
- Hardware: SSDs and disk controllers may reorder writes internally
Such reordering can result in states where a later write is visible while an earlier write is not. Applications that depend on ordering semantics (databases, logging systems, configuration management) may observe inconsistent state after recovery.
ZeroFS eliminates this class of problems by construction. The sequence number mechanism enforces a total order that survives crashes.
Crash Recovery
ZeroFS requires no recovery procedure after unexpected termination.
Recovery-Free Design
The underlying storage model has several properties that eliminate the need for crash recovery:
Atomic Commits: WriteBatch operations are atomic at the database level. A transaction either appears in full or not at all.
Immutable Storage: SST files written to object storage are immutable. Once uploaded, they cannot be corrupted by subsequent operations.
Self-Describing State: The database manifest describes the complete set of valid SST files. On startup, ZeroFS reads the manifest and has immediate access to the latest consistent state.
No Write-Ahead Log Dependencies: Unlike filesystems that require WAL replay to reconstruct state, ZeroFS state is fully materialized in SST files and the manifest.
Startup Procedure
After any termination (graceful or unexpected):
- Read manifest from object storage
- Load database state from referenced SST files
- Resume operation
No scanning, replay, or repair is performed. The time to resume operation is independent of filesystem size or the nature of the previous termination.
Durability Semantics
ZeroFS follows standard POSIX durability semantics. Write operations are buffered in memory and persisted to durable storage upon explicit synchronization.
Buffering Model
Write data flows through the following stages:
- Application Buffer: User-space buffers managed by the application
- Page Cache: OS-managed in-memory buffers
- Memtable: In-memory LSM tree buffer
- SST Files: Immutable files in object storage
Data moves from memtable to SST files when:
- The application calls fsync()
- The memtable reaches capacity
- The process terminates gracefully
- A periodic background flush runs
POSIX Compliance
This behavior conforms to POSIX semantics, which specify that write() transfers data to system buffers while fsync() ensures data reaches stable storage. All major filesystems (ext4, XFS, APFS, ZFS, and others) implement this model.
Synchronization API
```c
// POSIX interface
write(fd, data, len);  // Buffer in memory
fsync(fd);             // Persist to storage
```

```python
# Python
f.write(data)
f.flush()
os.fsync(f.fileno())
```

```go
// Go
f.Write(data)
f.Sync()
```

```rust
// Rust
file.write_all(data)?;
file.sync_all()?;
```
Applications requiring immediate durability must call the appropriate synchronization function.
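For example, a Rust application that needs a newly created file to survive a crash would flush the data, fsync the file, and, as general POSIX practice, also fsync the parent directory so the new directory entry is durable. This is a generic POSIX usage sketch, not a ZeroFS-specific API.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

/// Write `data` to `path` and make both the file contents and the new
/// directory entry durable before returning.
fn write_durably(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?; // buffered: data may still be in memory
    file.sync_all()?;      // fsync(2): data and metadata reach stable storage

    // Sync the parent directory so the directory entry itself is durable
    // (standard POSIX practice on Unix-like systems).
    if let Some(parent) = path.parent() {
        File::open(parent)?.sync_all()?;
    }
    Ok(())
}
```

Over 9P, the fsync calls issued here map to Tfsync messages and return only after ZeroFS confirms durability, as described in the next section.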
Protocol Considerations
The durability guarantees available to applications depend on the access protocol.
9P Protocol
The 9P protocol provides a direct mapping of POSIX synchronization semantics:
- The fsync() system call generates a Tfsync protocol message
- ZeroFS flushes all buffered data to object storage
- The call returns only after durability is confirmed
This provides strong guarantees suitable for applications with strict durability requirements.
NFS Protocol
NFS client implementations do not reliably invoke the COMMIT operation:
- Clients typically report writes as stable without issuing COMMIT
- ZeroFS accepts this to avoid per-write latency penalties
- Effective durability depends on client behavior
For workloads requiring predictable durability semantics, the 9P protocol is recommended.
Verification Through Crash Testing
ZeroFS verifies its consistency guarantees through systematic crash simulation using failpoints injected throughout the data path.
Failpoint Coverage
Failpoints are placed at critical points within each filesystem operation, allowing tests to simulate crashes at any stage:
| Operation | Failpoints |
|---|---|
| write | after chunk write, after inode update, after commit |
| create | after inode allocation, after directory entry, after commit |
| remove | after inode delete, after tombstone, after directory unlink, after commit |
| rename | after target delete, after source unlink, after new entry, after commit |
| mkdir | after inode allocation, after directory entry, after commit |
| truncate | after chunk deletion, after inode update, after commit |
| link | after directory entry, after inode update, after commit |
| symlink | after inode allocation, after directory entry, after commit |
| rmdir | after inode delete, after directory cleanup |
| gc | after chunk delete, after tombstone update |
Each failpoint triggers an immediate process termination, simulating power loss or crash at that exact point.
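As an illustration of the pattern, the sketch below uses the fail crate's fail_point! macro, one common Rust failpoint library; the failpoint name and surrounding function are hypothetical, and ZeroFS's actual data path and failpoint mechanism may differ.

```rust
// Cargo.toml (sketch): fail = { version = "0.5", features = ["failpoints"] }
use fail::fail_point;

/// Hypothetical write path with a crash failpoint between the chunk write
/// and the transaction commit.
fn write_operation() -> std::io::Result<()> {
    // ... stage the chunk data and inode update in the transaction ...

    // When a test enables this failpoint, e.g. with
    // fail::cfg("write_after_chunk", "return"), the closure runs and aborts
    // the process, simulating power loss after the chunk write but before
    // the commit.
    fail_point!("write_after_chunk", |_| { std::process::abort() });

    // ... commit the transaction atomically ...
    Ok(())
}
```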
Consistency Verification
After each simulated crash, a comprehensive consistency checker validates the filesystem state:
```text
verify_all()
enumerate_inodes()
enumerate_tombstones()
walk_directory_tree()
verify_directory_counts()
verify_nlink_counts()
verify_directory_nlinks()
find_orphaned_inodes()
verify_stats_counters()
verify_tombstones()
verify_file_chunks()
verify_inode_counter()
verify_orphaned_chunks()
verify_dir_entry_scan_consistency()
verify_orphaned_directory_metadata()
verify_dir_cookie_counters()
```
The checker detects: dangling references, orphaned inodes, nlink mismatches, missing chunks, stale tombstones, counter inconsistencies, and directory entry corruption.
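For instance, an nlink check can be sketched as counting how many directory entries reference each inode and comparing against the link count recorded in the inode. The types and function below are hypothetical simplifications (they ignore details such as directory self-links), not the actual checker.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory snapshot; the real checker walks the inode and
/// directory stores in the database.
struct FsSnapshot {
    /// inode id -> nlink recorded in the inode
    inode_nlinks: HashMap<u64, u32>,
    /// (parent directory inode, entry name) -> target inode id
    dir_entries: HashMap<(u64, String), u64>,
}

/// Count directory entries per inode and compare against the recorded nlink.
fn verify_nlink_counts(fs: &FsSnapshot) -> Result<(), String> {
    let mut observed: HashMap<u64, u32> = HashMap::new();
    for &target in fs.dir_entries.values() {
        *observed.entry(target).or_insert(0) += 1;
    }
    for (&ino, &recorded) in &fs.inode_nlinks {
        let counted = observed.get(&ino).copied().unwrap_or(0);
        if counted != recorded {
            return Err(format!(
                "inode {ino}: nlink is {recorded}, but {counted} directory entries reference it"
            ));
        }
    }
    Ok(())
}
```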
Test Methodology
For each failpoint, the test suite:
- Performs a filesystem operation with the failpoint enabled
- Terminates the process at the injection point
- Restarts the filesystem from object storage
- Runs the full consistency checker
- Verifies either complete rollback or complete commit—never partial state
This systematic approach ensures that the atomicity and ordering guarantees hold under all crash scenarios.
Continuous Integration Testing
In addition to crash simulation, ZeroFS runs extensive integration tests on every commit through multiple industry-standard test suites.
POSIX Compliance
pjdfstest: A POSIX filesystem compliance test suite originally developed for FreeBSD. Tests cover file creation, permissions, hard links, symbolic links, timestamps, and other POSIX semantics. Runs on both NFS and 9P protocols.
xfstests: The Linux kernel's filesystem test suite, originally developed for XFS but now used across all major Linux filesystems. Provides extensive coverage of edge cases, concurrent operations, and error handling.
Stress Testing
stress-ng: Exercises filesystem operations under concurrent load, including directory operations, file metadata, links, renames, and attribute modifications. Tests run for sustained periods on both NFS and 9P.
Linux Kernel Compilation: Compiles the Linux kernel source tree on ZeroFS, exercising real-world workload patterns including parallel file creation, compilation, and linking across thousands of files.
Layered Filesystem Testing
The CI suite includes a test that creates a ZFS pool on a ZeroFS NBD block device:
- Create a 3GB block device file via 9P
- Connect via NBD and create a ZFS pool
- Extract the Linux kernel source (~80,000 files)
- Compute checksums of all files
- Export the ZFS pool and restart ZeroFS
- Reimport the pool and verify all checksums match
This test validates data integrity through multiple filesystem layers and across process restarts.
Test Matrix
| Test Suite | Protocol | Coverage |
|---|---|---|
| Unit tests | — | Core filesystem logic |
| Failpoint crash tests | — | Crash consistency at each operation stage |
| pjdfstest | NFS, 9P | POSIX compliance |
| xfstests | NFS, 9P | Linux filesystem semantics |
| stress-ng | NFS, 9P | Concurrent operations under load |
| Kernel compilation | NFS, 9P | Real-world build workload |
| ZFS integration | NBD + 9P | Block device integrity across restarts |
All tests run on every pull request and merge to the main branch.
Backend Durability
ZeroFS delegates storage durability to the object storage backend.
| Backend | Designed Durability | Replication |
|---|---|---|
| Amazon S3 | 99.999999999% | Automatic across availability zones |
| Azure Blob Storage | 99.999999999% | Configurable geo-redundancy |
| Google Cloud Storage | 99.999999999% | Configurable multi-region |
These services maintain multiple replicas across independent failure domains. Once data is persisted to object storage, it is protected against hardware failures, facility outages, and other localized events without additional configuration.
Summary
ZeroFS provides the following durability and consistency guarantees:
- Write Atomicity: Filesystem operations are atomic. Each operation either completes fully or has no effect. Partial or torn writes are not possible.
- Total Ordering: All operations are assigned sequence numbers that define a total order. This order is preserved across crashes. If operation B is visible after recovery, all operations that completed before B started are also visible.
- Immediate Recovery: No recovery procedure is required after unexpected termination. The filesystem resumes operation by reading the current manifest from object storage.
- POSIX Durability: Data is persisted to stable storage upon fsync() invocation, consistent with POSIX semantics and the behavior of other filesystems.
- Backend Replication: Persisted data inherits the durability properties of the object storage backend, typically providing automatic replication across multiple failure domains.