
Making POSIX filesystems replicated and highly available

ZeroFS stores a POSIX filesystem in a log-structured database that lives in an S3 bucket. Running that on one node is straightforward. Surviving the loss of the node is harder than it first looks, because the obvious ways to add a second one quietly weaken durability, and on a filesystem durability is the entire contract. Downtime is recoverable; a write the caller already saw acknowledged is not. Holding that line across a failover is most of what high availability in ZeroFS comes down to.

Where a single node runs out

Object storage is durable but slow and charged per request, so committing every filesystem operation as its own upload would saddle each write with tens of milliseconds and a billed PUT. ZeroFS does what any LSM engine does and buffers writes in an in-memory memtable, flushing them to the bucket in batches: on fsync, when the memtable fills, and on a periodic timer. A write() returns as soon as the data reaches the memtable, and fsync() is what pushes it the rest of the way to S3.
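The three flush triggers are easier to see in a toy model. The sketch below is not ZeroFS code: `BufferedStore`, its dictionary "bucket", the entry-count limit, and the injectable clock are all assumptions made for illustration.

```python
import time


class BufferedStore:
    """Toy write buffer: write() lands in memory, flush() uploads one batch."""

    def __init__(self, max_entries=4, flush_interval=5.0, clock=time.monotonic):
        self.memtable = {}        # acknowledged but not yet durable
        self.bucket = {}          # stands in for the S3 bucket
        self.put_count = 0        # number of billed uploads
        self.max_entries = max_entries
        self.flush_interval = flush_interval
        self.clock = clock
        self.last_flush = clock()

    def write(self, key, value):
        # Returns as soon as the data is in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.max_entries:
            self.flush()          # trigger: the memtable is full

    def fsync(self):
        self.flush()              # trigger: explicit durability request

    def tick(self):
        # trigger: periodic timer, driven by the caller in this sketch
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.memtable:
            self.bucket.update(self.memtable)   # one batched upload
            self.put_count += 1
            self.memtable.clear()
        self.last_flush = self.clock()

    def crash(self):
        self.memtable.clear()     # memory is lost, the bucket survives
```

Two writes followed by one `fsync()` cost a single upload here, and a `crash()` after a further write discards only that last write, which is the window the rest of this post is about.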

That leaves a single node exposed in two unrelated ways. Writes that have been acknowledged but not yet flushed exist only in that node's memory, so a crash before the next flush loses them; POSIX allows this, since the durability boundary is fsync rather than write, but the gap between the two is real. And the data that has already reached S3, while perfectly safe, has nobody to serve it for as long as the node is down. The first problem is about un-flushed writes, the second about an absent server, and a standby node can answer both at once. The work is in adding that standby without reopening the durability hole the single node already had.

Only one writer, ever

Two nodes over one bucket can fall into split-brain: both decide they are the leader and write at the same time. On a filesystem that means corruption, and once two writers have interleaved their updates there is no untangling them afterward. The protection therefore cannot depend on good timing; it has to hold under arbitrary delays, network partitions, and clock skew.

ZeroFS is single-writer per database, and the writer is chosen by the storage layer rather than negotiated between the nodes. Opening the database for writing issues a conditional update to its manifest that increments a writer_epoch and fences whoever held it before; the previous writer's next manifest write fails outright. Only one node can commit durable state at a time. The fenced leader's in-flight SST files are left unreferenced and later reclaimed by garbage collection, so they never overwrite live data, and the leader stops the moment its first write is rejected. Both nodes can briefly believe they lead, but only one will ever succeed in committing; the other finds it has been fenced and stops. This holds without reference to any clock, and everything else is layered on top of it.
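A minimal sketch of epoch fencing, assuming a manifest with a version counter that only a conditional update can advance. `Manifest`, `Writer`, and `FencedError` are hypothetical names, and the in-memory object stands in for the manifest in the bucket; the real manifest format is not shown here.

```python
class FencedError(Exception):
    """Raised when a writer discovers a newer epoch has replaced it."""


class Manifest:
    """Stands in for the manifest object; update() is a conditional write."""

    def __init__(self):
        self.version = 0
        self.writer_epoch = 0
        self.ssts = []

    def read(self):
        return self.version, self.writer_epoch, list(self.ssts)

    def update(self, expected_version, writer_epoch, ssts):
        # Succeeds only if nobody has updated the manifest since it was read.
        if expected_version != self.version:
            return False
        self.version += 1
        self.writer_epoch = writer_epoch
        self.ssts = list(ssts)
        return True


class Writer:
    def __init__(self, manifest):
        self.manifest = manifest
        # Opening for write bumps the epoch, retrying if another update raced us.
        while True:
            version, epoch, ssts = manifest.read()
            if manifest.update(version, epoch + 1, ssts):
                break
        self.version, self.epoch, self.ssts = version + 1, epoch + 1, ssts

    def commit(self, sst):
        ssts = self.ssts + [sst]
        if not self.manifest.update(self.version, self.epoch, ssts):
            # The manifest moved without us: a newer writer holds the epoch.
            raise FencedError(f"writer epoch {self.epoch} has been fenced")
        self.version += 1
        self.ssts = ssts
```

If writer `a` commits, writer `b` then opens the same manifest, and `a` tries to commit again, `a` gets `FencedError` and its SST never appears in the manifest. No timestamps are involved anywhere in the decision.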

Fencing relies on a conditional write (put-if-not-exists), which S3, Azure Blob, and Google Cloud Storage support directly; for S3-compatible stores that don't, ZeroFS reaches the same guarantee through Redis. At startup, a node also asks its peer which of them is currently active instead of trusting its configured role, so two writers can never come up together from cold.

Why two nodes, not three

The instinct from Raft and Paxos is that fault-tolerant agreement wants an odd number of participants: three nodes so a majority survives the loss of one, five to survive two. That arithmetic exists because the cluster has to agree with itself about who leads and which writes are committed, and a majority is how it breaks a tie without splitting. ZeroFS holds no such vote. The one fact the whole scheme rests on, which node currently holds the writer epoch, is settled by a single compare-and-swap against the manifest in the bucket, and it is the object store, not the nodes, that picks the winner.

A linearizable compare-and-swap register is on its own enough to solve consensus for any number of contenders; it is the strong primitive a protocol like Raft spends pages rebuilding out of weaker ones. The object store offers it directly through conditional writes. The agreement three machines would otherwise maintain between themselves therefore lives in the bucket, next to the committed data it is agreeing about. That leaves the ZeroFS nodes to do nothing but serve: one node is already correct on its own, and the second is there to take over and to hold the un-fsync'd tail, not to make up a quorum. Because there is no quorum to maintain, an even number of nodes is fine. If the bucket is unreachable the filesystem stops, but so would anything else reading from it, and the data inside is no less safe.
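To make the claim about compare-and-swap concrete, here is a small sketch in which any number of contenders agree on a winner with one CAS each. `CasRegister` is an assumption: a lock-protected value standing in for a conditional write against the bucket.

```python
import threading


class CasRegister:
    """A linearizable compare-and-swap register (a lock stands in for the object store)."""

    def __init__(self):
        self._value = None
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value


def decide(register, my_id):
    # Propose myself; whichever CAS lands first wins, and everyone adopts that value.
    register.compare_and_swap(None, my_id)
    return register.read()


def run(n):
    """Let n contenders race and return what each one decided."""
    register = CasRegister()
    decisions = [None] * n

    def contender(i):
        decisions[i] = decide(register, f"node-{i}")

    threads = [threading.Thread(target=contender, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return decisions
```

`run(2)`, `run(3)`, and `run(8)` each return a list whose entries are all the same node id. The number of contenders never enters the argument, which is why two nodes are enough.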

Keeping a standby in step

With that settled, the standby's job is to hold the writes the leader has accepted but not yet flushed. Replication is semi-synchronous and strictly ordered: for each write, the leader sends it to the standby and waits for an acknowledgement before applying the write locally and answering the client. Because the standby acknowledges first, anything a client has been able to observe was already replicated, and the standby never trails the visible state. Writes it has received but not yet applied sit in a tail buffer, ready to be replayed if it has to take over.
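The ordering is the whole point, so it is worth spelling out. The sketch below uses hypothetical `Leader` and `Standby` classes with direct method calls in place of the network, and it leaves out flushing and trimming the tail.

```python
class Standby:
    def __init__(self):
        self.tail = []                      # received but not yet applied

    def replicate(self, seq, op):
        self.tail.append((seq, op))
        return seq                          # the acknowledgement


class Leader:
    def __init__(self, standby):
        self.standby = standby
        self.state = {}
        self.next_seq = 1

    def write(self, key, value):
        seq = self.next_seq
        ack = self.standby.replicate(seq, (key, value))   # 1. ship and wait
        if ack != seq:
            raise RuntimeError("standby did not acknowledge")
        self.state[key] = value                            # 2. apply locally
        self.next_seq += 1
        return seq                                         # 3. answer the client


def take_over(standby):
    """Replay the buffered tail in order to rebuild the visible state."""
    state = {}
    for _, (key, value) in sorted(standby.tail, key=lambda entry: entry[0]):
        state[key] = value
    return state
```

Because step 1 precedes step 2, the leader's `state` can never contain a write that is missing from the standby's `tail`, so `take_over(standby)` always reproduces everything a client could have observed.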

Waiting for the standby is only reasonable while the standby is there. If it falls behind or drops off the network, the leader stops shipping and carries on alone rather than stalling the filesystem behind an unreachable replica, then resumes shipping once the standby returns. ZeroFS calls these states Connected and Solo. A leader in Solo makes its writes as durable as a standalone node's, which is to say the un-fsync'd ones are no longer backed by a second copy. Solo weakens durability and leaves safety untouched, since fencing owes nothing to replication: a leader running alone still cannot be joined by a second writer.
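The two states amount to a very small state machine. In this sketch `ReplicationLink` and its `send` callable are hypothetical, and a real implementation would resynchronise the standby before returning to Connected.

```python
from enum import Enum


class Mode(Enum):
    CONNECTED = "connected"
    SOLO = "solo"


class ReplicationLink:
    def __init__(self, send):
        self.send = send            # hypothetical callable; raises ConnectionError on failure
        self.mode = Mode.CONNECTED

    def ship(self, op):
        """Return True if the op now has a second copy on the standby."""
        if self.mode is Mode.SOLO:
            return False            # carry on alone, do not stall the filesystem
        try:
            self.send(op)
            return True
        except ConnectionError:
            self.mode = Mode.SOLO   # standby unreachable: stop shipping
            return False

    def standby_returned(self):
        # Resynchronisation is omitted in this sketch.
        self.mode = Mode.CONNECTED
```

Nothing in this class touches the writer epoch. The mode only decides whether un-fsync'd writes have a second copy; who may write is settled elsewhere.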

When the leader dies

The leader emits a heartbeat every 100 ms, each carrying its writer epoch. When the standby has gone without one for the takeover TTL of roughly two seconds, it promotes itself: it opens the database for writing, which fences the old leader, replays the tail it has buffered, and starts serving. Should the old leader come back, its heartbeats carry a stale epoch and are ignored, so it cannot reclaim the lead once a takeover is under way; and if it reaches a standby that still holds an un-replayed tail, that contact is the cue to promote at once rather than wait out the timer.

```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant S as Standby
    participant O as Object storage
    Note over C,O: Steady state
    C->>L: write
    L->>S: ship (semi-sync)
    S-->>L: replicated
    L-->>C: ack (after local apply)
    L->>O: flush on fsync
    L->>S: heartbeats (writer epoch)
    Note over L: Leader crashes or is partitioned
    L--xS: heartbeats stop
    Note over S: no heartbeat for ~2s (takeover TTL)
    S->>O: open as writer, fence old leader
    S->>S: replay buffered tail
    Note over S: now serving as the new leader
    C->>L: in-flight write
    L--xC: fail, timeout, or not-leader
    C->>S: reroute, resend with op-id
    S-->>C: definitive result (deduplicated)
    Note over L: a returning old leader stays fenced
```

A leader failure end to end: steady state, takeover, and the client's rerouted retry.
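The standby's side of the takeover timer can be sketched as follows. `StandbyMonitor` and the `open_as_writer` callback are hypothetical, time is passed in explicitly so the logic can be tested, and the exact TTL is an assumption based on the "roughly two seconds" above. The immediate promotion on contact from a returning leader is left out.

```python
HEARTBEAT_INTERVAL = 0.1   # seconds, as stated in the text
TAKEOVER_TTL = 2.0         # seconds; "roughly two", exact value assumed


class StandbyMonitor:
    def __init__(self, now, known_epoch):
        self.known_epoch = known_epoch
        self.last_heartbeat = now
        self.promoted = False

    def on_heartbeat(self, now, epoch):
        """Return True if the heartbeat was accepted."""
        if epoch < self.known_epoch:
            return False                    # stale epoch: a deposed leader, ignore it
        self.known_epoch = epoch
        if not self.promoted:
            self.last_heartbeat = now
        return True

    def tick(self, now, open_as_writer):
        """Promote once the leader has been silent for the takeover TTL."""
        if not self.promoted and now - self.last_heartbeat >= TAKEOVER_TTL:
            # Opening the database for writing fences the old leader
            # and hands back the new, higher epoch.
            self.known_epoch = open_as_writer()
            self.promoted = True
        return self.promoted
```

After promotion the monitor holds the new epoch, so a heartbeat from the old leader fails the comparison in `on_heartbeat` and has no effect.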

Clients hold the addresses of both nodes, so when one stops serving they find the other and carry on against it. The delicate case is the retry, since an operation in flight when the leader died may or may not have committed. Every mutating call carries a stable operation id that the new leader uses to deduplicate, which lets a client resend a mkdir or a rename and get a real answer back instead of a timeout or a duplicated effect. A deposed leader has to stop answering reads as well as writes: a node serves only while its lease is valid, the lease follows the storage engine's view of the database, and the manifest poll that notices the new epoch closes the stale database within a single interval. An idle, read-only ex-leader therefore steps aside on its own, without having to attempt a write to learn that it has lost.
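Here is a minimal sketch of why the operation id matters for a retried `mkdir`. The `Server` class, its string outcomes, and the idea that the new leader inherits a table of recorded results are assumptions made for illustration.

```python
import uuid


class Server:
    def __init__(self, dirs=None, results=None):
        self.dirs = set(dirs or ())
        self.results = dict(results or {})   # op_id -> outcome already returned

    def mkdir(self, op_id, path):
        if op_id in self.results:
            return self.results[op_id]       # a retry: replay the original answer
        if path in self.dirs:
            outcome = "EEXIST"
        else:
            self.dirs.add(path)
            outcome = "ok"
        self.results[op_id] = outcome
        return outcome


# The old leader commits the mkdir, then dies before the reply arrives.
old_leader = Server()
op_id = str(uuid.uuid4())
old_leader.mkdir(op_id, "/data")

# The new leader inherits the state (assumed replicated) and sees the retry.
new_leader = Server(old_leader.dirs, old_leader.results)
retry_result = new_leader.mkdir(op_id, "/data")
fresh_result = new_leader.mkdir(str(uuid.uuid4()), "/data")
```

The retry gets `"ok"`, the answer the client should have received the first time, while a genuinely new `mkdir` of the same path gets `"EEXIST"`. Without the id the two calls would be indistinguishable.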

The fsync that never lies

Replication narrows the un-fsync'd window but cannot close it. Lose both nodes at once, or lose a lone node while it runs Solo, and writes that were acknowledged but never flushed are gone. The easy thing would be to let that loss pass quietly and call it the price of the design. ZeroFS doesn't, because a durability guarantee that silently fails part of the time is one nothing can be built on. What it promises instead is exact, and it holds on a single node as much as on a pair: a successful fsync means every write acknowledged on that descriptor is durable in object storage, and under HA, that it also survives any failover. Where a recovery did not in fact carry those writes, the fsync returns ESTALE; it will not report success over data that no longer exists.

The whole thing turns on one value, a lineage token, which names a single unbroken durable lineage. A takeover keeps the token when it can show it inherited the leader's data, which is exactly the case of a connected standby replaying its tail, and mints a fresh one when it cannot: after a cold restart of both nodes, or after a takeover that follows a spell in Solo, where the un-shipped writes never reached the standby to begin with. A leader entering Solo records the lineage as un-inheritable in object storage before it acknowledges its first Solo write, which forces any later takeover to regenerate rather than lay claim to writes it does not hold. Each un-fsync'd change the client makes is tagged with the token current at the time; on fsync the client presents the oldest token still outstanding, the leader flushes and checks it against the live lineage, and a mismatch fails the call. Three cases follow from that one comparison:

| What happened | Token | fsync result |
| --- | --- | --- |
| Clean single-node crash | Kept: the standby held the writes and replayed the tail | Succeeds transparently, as on a single node |
| Both nodes lost before a flush | Regenerated: a cold restart can't prove it inherited anything | ESTALE: the writes were only in memory |
| Takeover after Solo | Regenerated: the Solo leader tainted the lineage first | ESTALE: the new leader never got the writes |
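The three cases reduce to one decision at takeover and one comparison at fsync. The sketch below assumes random hex strings as tokens and hypothetical function names; the flush that precedes the check is omitted.

```python
import errno
import os
import uuid


def new_token():
    return uuid.uuid4().hex


def takeover_token(current, inheritable, replayed_tail):
    """Token the new leader serves under after a recovery."""
    if inheritable and replayed_tail:
        return current        # it can show it inherited the old leader's data
    return new_token()        # cold restart, or the lineage was tainted by Solo


def fsync(oldest_outstanding_token, live_token):
    # The real call flushes to object storage first; only the check is shown.
    if oldest_outstanding_token is not None and oldest_outstanding_token != live_token:
        raise OSError(errno.ESTALE, os.strerror(errno.ESTALE))
    return None
```

With `t = new_token()`, a connected takeover is `takeover_token(t, True, True)`, which returns `t`, so the fsync succeeds. A cold restart (`replayed_tail=False`) or a takeover after Solo (`inheritable=False`) returns a fresh token, and the same fsync raises `OSError` with `errno.ESTALE`.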

The check is per descriptor and follows POSIX faithfully: it accounts for the whole file across every open handle, says nothing about any other file, and covers metadata, directory entries, and rename across both directories as well as ordinary data writes. When a failure does cross the line, the application hears about it at its next fsync: that call fails cleanly rather than succeeding falsely, and since the first fsync after a write is the binding promise, the failure is an honest statement that the un-fsync'd work since the last successful one is gone. Redo it and fsync again. The client libraries surface this as a distinct stale-handle error, so an application can tell it apart from any other failure and recover deliberately.

Trying to break it

A claim like "no split-brain" or "an fsync never lies" is worth only the testing behind it, so ZeroFS is checked with Jepsen, the framework built to find exactly these failures under fault injection.

Two layers run against it. The first treats a single node as a black box: Jepsen issues random sequences of filesystem operations over a 9P mount and checks every result against a reference model, shrinking any divergence to a minimal failing case. A crash mode kills the server mid-run, drops the in-memory memtable, recovers from object storage, and confirms the surviving state is consistent with the last fsync. This layer runs in CI on every change.
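The shape of that first layer, random operations checked against a reference model with crashes mixed in, can be shown in miniature. This is not the Jepsen suite: `ToyStore` and `check` are hypothetical, the model is a pair of dictionaries, and shrinking is left out.

```python
import random


class ToyStore:
    """System under test: writes sit in memory until fsync."""

    def __init__(self):
        self.mem, self.durable = {}, {}

    def write(self, key, value):
        self.mem[key] = value

    def fsync(self):
        self.durable.update(self.mem)
        self.mem.clear()

    def read(self, key):
        return self.mem.get(key, self.durable.get(key))

    def crash(self):
        self.mem.clear()          # drop the memtable, keep object storage


def check(seed, steps=200, make_store=ToyStore):
    """Return None if the store matches the model, else the first divergence."""
    rng = random.Random(seed)
    store = make_store()
    model = {}                    # what a reader should currently see
    synced = {}                   # what should survive a crash
    for step in range(steps):
        op = rng.choice(["write", "read", "fsync", "crash"])
        key = rng.choice("abc")
        if op == "write":
            value = rng.randrange(100)
            store.write(key, value)
            model[key] = value
        elif op == "read":
            if store.read(key) != model.get(key):
                return (seed, step, key)
        elif op == "fsync":
            store.fsync()
            synced = dict(model)
        else:
            store.crash()
            model = dict(synced)  # after recovery only fsync'd state remains
    return None
```

A correct store returns `None` for every seed. A store whose `fsync` silently does nothing is caught as soon as a read follows a crash, because the model expects the synced value to still be there.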

The second stands up a real leader and standby over MinIO and turns a nemesis loose on them: it kills the leader, the standby, or both; partitions the link between the two; pauses the object store so neither can reach it; and freezes the leader past its lease with a stop signal, all while a client keeps writing and calling fsync. After each run the checker asks the questions the guarantees above depend on. Did two nodes ever both commit? Did a write that was acknowledged and fsync'd ever vanish across a failover? When a frozen leader thaws after the standby has already taken over, does it come back fenced instead of serving stale data? Across these runs the pair shows no lost or duplicated writes and no stale reads, and the thawed leader finds itself fenced and steps aside rather than corrupting anything. The whole suite is in the repo.

What it doesn't do

A few limits are worth stating plainly. This is two nodes and a single standby, so it buys availability through failover and not write throughput; there is still exactly one writer, and spreading reads across machines is a separate feature. Failover is bounded rather than instantaneous, since on a crash the standby waits out the two-second takeover TTL, and clients ride that gap by blocking and retrying instead of erroring. Advisory byte-range locks live in one server's memory and are not replicated, so a failover drops them; they are advisory and do not gate data, but the same caveat attends networked file locks in general and is worth designing around. And fsync is still the durability contract: the connected standby additionally shelters un-fsync'd writes, but that is a cushion, not a reason to stop calling fsync where it matters. The full configuration, the complete table of guarantees, and each failover case in detail live in the high availability docs.


None of this closes the un-fsync'd window; no filesystem backed by object storage can. What it guarantees is that nothing crosses that window silently. Fencing settles the writer without trusting a clock, so a partition cannot corrupt the filesystem. Semi-sync keeps the standby current, so an ordinary failover loses nothing. And when a failure reaches past what replication covers, the next fsync returns an error instead of a false success. The guarantees ZeroFS makes are the ones it can keep, and it reports the failures it cannot prevent instead of hiding them.
