9P Protocol Extensions
ZeroFS serves standard 9P2000.L, the Linux dialect of the protocol. Its own client negotiates a set of private extensions on top of it: fewer round trips on metadata-heavy work, idempotent retry across a reconnect, fid rebinding that survives a server restart, and an fsync that verifies durability rather than assuming it. This page documents what each one does and how it is framed on the wire.
Overview
The extensions are transparent. A stock kernel 9P client never proposes them, so it negotiates plain 9P2000.L and behaves exactly as it would against any other 9P server. Only the bundled client uses them: zerofs mount and the client libraries both wrap the same zerofs-client core, so they share one set of capabilities and one negotiation path.
Every extension was added to remove a specific cost in the standard protocol:
- 9P2000.L makes the client walk a path and then stat it, or open a file and then read it, as separate requests. The fast paths fold those pairs into one round trip.
- 9P2000.L has no way to retry a mutating request safely: a mkdir whose reply was lost cannot be re-sent without risking a double-apply. An op-id makes retries idempotent.
- A reconnecting client cannot re-bind a fid to a file that was renamed while it was disconnected. Rebinding by inode id fixes that.
- 9P Tfsync returns success whether or not the connection's writes are still the live durable copy. The durability-verified variant returns success only when they are.
Negotiation
Capabilities are negotiated as part of the standard Tversion/Rversion exchange. The client proposes the highest level it supports as a single version string; the server echoes back the highest level it can honor. Levels are cumulative and the suffixes stack, so the client offers 9P2000.L.zerofs5.zerofs4.zerofs3 and an older server that only understands .zerofs replies with that, leaving the client to degrade rather than send messages the server cannot parse. Matching is by substring, so each side independently recognizes the levels it knows.
| Level | Version string | Adds |
|---|---|---|
| base | 9P2000.L | Standard Linux 9P. What a stock kernel client gets. |
| 1 | 9P2000.L.zerofs | Fast paths: Twalkgetattr (walk + getattr) and Treaddirattr (readdirplus). |
| 2 | 9P2000.L.zerofs2 | Compound create/open messages and stat-carrying mutation replies. |
| 3 | 9P2000.L.zerofs3 | Per-request idempotency op-id in the frame, for safe retry. |
| 4 | 9P2000.L.zerofs4 | Durability-verified fsync (Tgetlineage, Tfsyncdur). |
| 5 | 9P2000.L.zerofs5 | Folds a file open and its first read into one round trip. |
Level 4 is the durability feature and level 5 stacks on top of it, so a server started with ignore_fsync offers neither and caps negotiation at level 3. A client that needs verified durability is therefore never handed a connection that would silently fake it.
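The substring matching and the ignore_fsync cap can be sketched as follows. This is a minimal illustration rather than the ZeroFS implementation: the helper names `highest_level` and `server_reply` are hypothetical, and the suffix list is taken from the table above.

```python
BASE = "9P2000.L"
# Suffix per level, copied from the table above; index 0 is the base dialect.
SUFFIXES = ["", ".zerofs", ".zerofs2", ".zerofs3", ".zerofs4", ".zerofs5"]


def highest_level(version: str, ceiling: int = 5) -> int:
    """Return the highest level <= ceiling whose suffix occurs in `version`.

    Scanning from the top matters: ".zerofs" is a substring of every
    higher suffix, so it must be tested last.
    """
    if not version.startswith(BASE):
        return -1  # not a 9P2000.L dialect at all
    for level in range(ceiling, 0, -1):
        if SUFFIXES[level] in version:
            return level
    return 0


def server_reply(offered: str, server_max: int, ignore_fsync: bool = False) -> str:
    """Pick the version string a server would echo back (hypothetical helper)."""
    # A server started with ignore_fsync caps negotiation at level 3.
    ceiling = min(server_max, 3) if ignore_fsync else server_max
    level = highest_level(offered, ceiling)
    if level < 0:
        return "unknown"  # the standard 9P answer to an unsupported version
    return BASE + SUFFIXES[level]
```

The client can run the same `highest_level` scan over the Rversion string to learn which level it was granted, then restrict itself to the messages of that level and below.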
The op-id at level 3 is spliced into the frame only on the request types that carry one, and only once both peers have negotiated .zerofs3. An older peer never sees the extra bytes, so standard framing is byte-for-byte unchanged for it. The fid rebinding extension (Trebind) is not gated by a version level: it is a private message the bundled client sends only after a reconnect, and the server accepts it from any session.
Fewer round trips
These extensions exist because the common client patterns in 9P2000.L cost two requests where the server already has both answers in hand. Each one returns the second answer with the first.
Walk then stat
A path lookup in 9P is a Twalk followed by a Tgetattr on the resulting fid. Twalkgetattr does both in one message: it clones fid to newfid along wnames and returns the walk qids together with the final entry's full stat. It is all-or-nothing: on any component miss the server replies Rlerror rather than a partial walk.
Twalkgetattr / Rwalkgetattr
Twalkgetattr fid[4] newfid[4] nwname[2] nwname*( wname[s] )
Rwalkgetattr nwqid[2] nwqid*( wqid[13] ) stat[Stat]
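As an illustration of the request layout, the sketch below encodes a Twalkgetattr frame. It assumes standard 9P framing (little-endian integers, a size field that counts the whole frame, and strings as a 2-byte length followed by UTF-8 bytes); the function names are hypothetical and the type id is the one listed in the message reference.

```python
import struct

TWALKGETATTR = 252  # type id from the message reference table


def encode_string(s: str) -> bytes:
    """9P string: 2-byte little-endian length, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return struct.pack("<H", len(raw)) + raw


def encode_twalkgetattr(tag: int, fid: int, newfid: int, wnames: list[str]) -> bytes:
    """Build size[4] type[1] tag[2] fid[4] newfid[4] nwname[2] nwname*(wname[s])."""
    body = struct.pack("<IIH", fid, newfid, len(wnames))
    body += b"".join(encode_string(name) for name in wnames)
    # size[4] counts the whole frame, including the size field itself.
    size = 4 + 1 + 2 + len(body)
    return struct.pack("<IBH", size, TWALKGETATTR, tag) + body
```

For example, `encode_twalkgetattr(1, 0, 1, ["usr", "bin"])` produces a 27-byte frame: the 7-byte header, 10 bytes of fixed fields, and two 5-byte path components.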
Directory listing with attributes
Treaddirattr is readdirplus: each entry comes back with its full stat inline, so a listing that needs attributes (the usual case for ls -l or a stat-walk) does not pay a Tgetattr per entry.
Treaddirattr / Rreaddirattr
Treaddirattr fid[4] offset[8] count[4]
Rreaddirattr count[4] count*( DirEntryPlus )
DirEntryPlus qid[13] offset[8] type[1] name[s] stat[Stat]
Compound create and open
Level 2 adds messages that fold a fixed client-side sequence into one round trip:
- Tlopenat is Twalk(clone) + Tlopen. It opens fid's inode on a fresh newfid and leaves fid untouched, so the client keeps a handle on the directory it walked from.
- Tlcreateattr is Twalk(clone) + Tlcreate + Tgetattr. It creates and opens the child on newfid (without mutating the directory fid, unlike plain Tlcreate) and returns the child's full stat.
The remaining level-2 messages reuse the standard request layout but reply with the post-operation stat that the server computes anyway and the standard reply throws away: Tmkdirattr, Tsymlinkattr, Tmknodattr, Tlinkattr, and Tsetattrattr. Without them the client has to follow every create or attribute change with a Tgetattr to refresh its cache.
Tlopenat / Rlopenat
Tlopenat fid[4] newfid[4] flags[4]
Rlopenat qid[13] iounit[4]
Open then read
Level 5 goes one step further for reads. Tlopenatread opens fid's inode on newfid exactly like Tlopenat, and also prefetches up to count bytes from offset 0 in the same round trip. A file that fits in the prefetch window is read with no follow-up Tread at all.
The inline read is best-effort and never affects the open. On a read error the reply still carries the open result with empty data, and the client falls back to a normal Tread. The eof flag tells the client whether data reached the end of the file: when set, the whole file fit in count and the client may serve the file's reads entirely from the prefetched buffer; when clear, it discards the partial prefetch and reads normally.
Tlopenatread / Rlopenatread
Tlopenatread fid[4] newfid[4] flags[4] count[4]
Rlopenatread qid[13] iounit[4] eof[1] count[4] data[count]
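The client-side handling of the eof flag might look like the sketch below. It decodes the Rlopenatread body (the bytes after the 7-byte header) and keeps the prefetched data only when the whole file fit. The names `OpenRead` and `decode_rlopenatread` are hypothetical, and little-endian field encoding is assumed as in standard 9P.

```python
import struct
from typing import NamedTuple, Optional


class OpenRead(NamedTuple):
    qid: bytes                   # raw 13-byte qid, left undecoded here
    iounit: int
    prefetched: Optional[bytes]  # the whole file if it fit, else None


def decode_rlopenatread(body: bytes) -> OpenRead:
    """Parse qid[13] iounit[4] eof[1] count[4] data[count]."""
    qid = body[:13]
    iounit, eof, count = struct.unpack_from("<IBI", body, 13)
    data = body[22:22 + count]  # 13 + 4 + 1 + 4 = 22 bytes of fixed fields
    if len(data) != count:
        raise ValueError("truncated Rlopenatread body")
    # eof set: data is the entire file, so reads can be served from it.
    # eof clear: the prefetch is partial (or the inline read failed), so
    # discard it and let the caller fall back to a normal Tread.
    return OpenRead(qid, iounit, data if eof else None)
```

A caller would check `prefetched is not None` before serving reads from the buffer. An empty file is still a valid hit: eof is set and the buffer is zero bytes long.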
Idempotent retries
Standard 9P has no safe way to retry a mutating request. If a Tmkdir is sent and the connection drops before its reply arrives, the client cannot know whether the server applied it. Re-sending risks a double-apply (a second mkdir that fails with EEXIST, or worse, a second rename or unlink that acts on the wrong state). This matters most for the bundled client, which reconnects on its own and would otherwise have to surface an error for every in-flight mutation at the moment a server restarts.
Level 3 attaches a 16-byte idempotency op-id to each non-idempotent mutation. The op-id is deliberately not part of the message body. It is spliced into the frame immediately after the tag, so existing encode and decode paths keep producing and parsing standard frames and the op-id is simply absent when .zerofs3 is not negotiated.
Frame layout under .zerofs3
standard size[4] type[1] tag[2] body...
with op-id size[4] type[1] tag[2] op_id[16] body...
^ spliced after the 7-byte header,
on request types that carry one
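The splice can be sketched as a wrapper around an already-encoded standard frame, which is what lets the existing encode and decode paths stay untouched. This sketch assumes that size[4] counts the op-id bytes, as it counts every other byte of a 9P frame; the text above does not state that explicitly. The function names are hypothetical.

```python
import struct

HEADER_LEN = 7   # size[4] type[1] tag[2]
OP_ID_LEN = 16


def splice_op_id(frame: bytes, op_id: bytes) -> bytes:
    """Insert a 16-byte op-id right after the 7-byte header."""
    if len(op_id) != OP_ID_LEN:
        raise ValueError("op-id must be exactly 16 bytes")
    (size,) = struct.unpack_from("<I", frame, 0)
    if size != len(frame):
        raise ValueError("size field does not match frame length")
    # Assumption: size[4] grows by 16 so it still covers the whole frame.
    return (struct.pack("<I", size + OP_ID_LEN)
            + frame[4:HEADER_LEN]
            + op_id
            + frame[HEADER_LEN:])


def strip_op_id(frame: bytes) -> tuple[bytes, bytes]:
    """Inverse of splice_op_id: return (op_id, standard frame)."""
    if len(frame) < HEADER_LEN + OP_ID_LEN:
        raise ValueError("frame too short to carry an op-id")
    (size,) = struct.unpack_from("<I", frame, 0)
    op_id = frame[HEADER_LEN:HEADER_LEN + OP_ID_LEN]
    standard = (struct.pack("<I", size - OP_ID_LEN)
                + frame[4:HEADER_LEN]
                + frame[HEADER_LEN + OP_ID_LEN:])
    return op_id, standard
```

A receiver would call `strip_op_id` only for request types that carry an op-id and only on a session that negotiated .zerofs3, then hand the standard frame to the ordinary decoder.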
The server keeps a bounded cache keyed by op-id. The first time it sees an op-id it applies the operation and records the reply; a retry carrying the same op-id returns the recorded reply without re-applying. Concurrent retries of the same op-id are serialized through a single-flight gate, so two copies in flight at once still apply exactly once. An all-zero op-id means "no dedup requested" and always applies.
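A minimal sketch of such a cache, assuming least-recently-inserted eviction and a per-op-id event as the single-flight gate, is shown below. The class name `OpIdCache`, the eviction policy, and the default capacity are illustrative choices, not details confirmed by the text above.

```python
import threading
from collections import OrderedDict
from typing import Callable

ZERO_OP_ID = bytes(16)  # all-zero op-id: no dedup requested


class OpIdCache:
    def __init__(self, capacity: int = 1024) -> None:
        self._capacity = capacity
        self._replies: "OrderedDict[bytes, object]" = OrderedDict()
        self._inflight: dict[bytes, threading.Event] = {}
        self._lock = threading.Lock()

    def execute(self, op_id: bytes, apply: Callable[[], object]) -> object:
        """Run `apply` at most once per op-id and replay its reply on retries."""
        if op_id == ZERO_OP_ID:
            return apply()  # always applies, never recorded
        while True:
            with self._lock:
                if op_id in self._replies:
                    return self._replies[op_id]  # retry: recorded reply
                gate = self._inflight.get(op_id)
                if gate is None:
                    # First arrival becomes the one caller that applies.
                    gate = threading.Event()
                    self._inflight[op_id] = gate
                    break
            gate.wait()  # a concurrent copy is applying; wait, then re-check
        try:
            reply = apply()
            with self._lock:
                self._replies[op_id] = reply
                while len(self._replies) > self._capacity:
                    self._replies.popitem(last=False)  # evict the oldest entry
            return reply
        finally:
            with self._lock:
                del self._inflight[op_id]
            gate.set()  # release any waiters
```

If `apply` raises, nothing is recorded and the next waiter takes over as the applier, so a failed attempt does not poison the op-id.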
Only the non-idempotent mutations carry an op-id: the create family (Tlcreate, Tsymlink, Tmknod, Tmkdir, Tlink), Trename, Trenameat, Tunlinkat, and their level-2 compound forms. Operations that are already idempotent (reads, walks, attach, clunk, write, setattr, and Tlopenat) are left out, so they pay nothing.
The op-id is what makes zerofs mount reconnect cleanly. On a dropped connection the client re-sends in-flight mutations with their original op-ids, and the client-side failover path keeps a stable op-id across a re-route so a retried mkdir or rename applies once even when the request lands on a different node after a failover.
Reconnect by inode id
When the bundled client reconnects, it has to rebuild its open fids on the new session. Re-walking the original paths does not work if a file was renamed or had its last path link removed while the client was disconnected. Trebind binds a fresh fid directly to an inode by its server-assigned id rather than by a path, so a handle survives a rename or a hardlink shuffle across the outage.
Trebind / Rrebind
Trebind fid[4] inode_id[8] n_uname[4]
Rrebind qid[13]
The server only honors a rebind for an inode that is still reachable under the session's attach root, so the extension cannot be used to escape the directory the session attached to. This is a private message the bundled client sends only during reconnect; it is not tied to a version level.
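A reconnecting client could rebuild its fid table with one Trebind per open handle, roughly as sketched here. The fid-to-inode map, the tag allocation, and the function names are hypothetical; the field layout and the type id come from this page, and little-endian encoding is assumed as in standard 9P.

```python
import struct

TREBIND = 250  # type id from the message reference table


def encode_trebind(tag: int, fid: int, inode_id: int, n_uname: int) -> bytes:
    """Build size[4] type[1] tag[2] fid[4] inode_id[8] n_uname[4]."""
    body = struct.pack("<IQI", fid, inode_id, n_uname)
    return struct.pack("<IBH", 7 + len(body), TREBIND, tag) + body


def rebind_frames(open_fids: dict[int, int], n_uname: int, first_tag: int = 1) -> list[bytes]:
    """One Trebind per open handle; `open_fids` maps fid -> inode id."""
    frames = []
    for offset, (fid, inode_id) in enumerate(sorted(open_fids.items())):
        frames.append(encode_trebind(first_tag + offset, fid, inode_id, n_uname))
    return frames
```

Each frame is 23 bytes. A rebind that the server rejects (for instance because the inode is no longer reachable under the attach root) would come back as Rlerror, and the client would have to treat that handle as lost.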
Durability-verified fsync
A plain 9P Tfsync returns Rfsync once the server has flushed, but it says nothing about whether the connection's writes are still the live durable copy. After a failover, an old leader can flush and report success for writes that a new leader has already superseded. Level 4 closes that gap by tying fsync to a durability lineage token.
After negotiating .zerofs4, the client asks for its connection's lineage token once with Tgetlineage (off the hot path, sent at connect time). It then tracks the token of its oldest un-fsync'd write and presents it on Tfsyncdur. The server flushes and replies Rfsync only if that token is still the live durable lineage; otherwise it returns Rlerror(ESTALE). A successful fsync therefore implies that every write acknowledged to the client before it is durable and will survive a crash or failover. A token of 0 means the client has nothing un-fsync'd.
Durability-verified fsync (.zerofs4)
Tgetlineage (empty body)
Rgetlineage token[8]
Tfsyncdur fid[4] datasync[4] token[8]
-> Rfsync if token is the live durable lineage
-> Rlerror(ESTALE) otherwise
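The client-side bookkeeping can be sketched as a small tracker that remembers which lineage its oldest un-fsync'd write belongs to. The class and method names are hypothetical, and the sketch simplifies by assuming no writes are issued between sending Tfsyncdur and receiving its reply.

```python
import struct

TFSYNCDUR = 232  # type id from the message reference table


class LineageTracker:
    def __init__(self) -> None:
        self.current = 0  # token from the most recent Rgetlineage
        self.pending = 0  # token of the oldest un-fsync'd write, 0 if none

    def on_lineage(self, token: int) -> None:
        """Record the token returned by Tgetlineage at connect time."""
        self.current = token

    def on_write_acked(self) -> None:
        """Pin the lineage of the first write since the last good fsync."""
        if self.pending == 0:
            self.pending = self.current

    def on_fsync_ok(self) -> None:
        """Rfsync received: everything written so far is durable."""
        self.pending = 0

    def encode_tfsyncdur(self, tag: int, fid: int, datasync: int = 0) -> bytes:
        """Build size[4] type[1] tag[2] fid[4] datasync[4] token[8]."""
        body = struct.pack("<IIQ", fid, datasync, self.pending)
        return struct.pack("<IBH", 7 + len(body), TFSYNCDUR, tag) + body
```

Because `pending` is pinned at the first write, a reconnect that changes `current` does not change the token presented on the next Tfsyncdur. That is the property that lets the server answer ESTALE when the earlier write went to a lineage that is no longer live.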
This is the protocol-level half of the durability and consistency guarantee, and it is what lets the bundled client keep fsync honest through a high-availability failover. A server with ignore_fsync set never offers level 4, so a client that asks for verified durability is never silently downgraded to a hollow success.
Message reference
The table below lists every extension message with its 9P type id. Each reply type is paired with its request.
| Type | Message | Level | Purpose |
|---|---|---|---|
| 230 / 231 | Tlopenatread / Rlopenatread | 5 | Open plus first read in one round trip. |
| 232 | Tfsyncdur | 4 | Durability-verified fsync (replies with Rfsync, type 51). |
| 233 / 234 | Tgetlineage / Rgetlineage | 4 | Query the connection's durability lineage token. |
| 236 / 237 | Tlopenat / Rlopenat | 2 | Walk-clone plus open in one round trip. |
| 238 / 239 | Tlcreateattr / Rlcreateattr | 2 | Create and open a child, returning its stat. |
| 240 / 241 | Tmkdirattr / Rmkdirattr | 2 | mkdir returning the new directory's stat. |
| 242 / 243 | Tsymlinkattr / Rsymlinkattr | 2 | symlink returning the new link's stat. |
| 244 / 245 | Tmknodattr / Rmknodattr | 2 | mknod returning the new node's stat. |
| 246 / 247 | Tlinkattr / Rlinkattr | 2 | link returning the target's stat. |
| 248 / 249 | Tsetattrattr / Rsetattrattr | 2 | setattr returning the post-change stat. |
| 250 / 251 | Trebind / Rrebind | private | Bind a fresh fid to an inode by id, for reconnect. |
| 252 / 253 | Twalkgetattr / Rwalkgetattr | 1 | Walk plus getattr in one round trip. |
| 254 / 255 | Treaddirattr / Rreaddirattr | 1 | Readdirplus: entries with their stat inline. |
The op-id at level 3 is not a separate message. It is a 16-byte field carried in the frame of the mutating request types listed under idempotent retries.
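For a client or dissector that needs these ids in code, the request side of the table can be transcribed as an enum. The names `ZeroFsRequest`, `LEVEL`, and `reply_type` are hypothetical; the numbers are copied from the table, and Trebind is given level 0 here only to mark it as ungated.

```python
from enum import IntEnum

RFSYNC = 51  # standard 9P2000.L reply, reused by Tfsyncdur


class ZeroFsRequest(IntEnum):
    TLOPENATREAD = 230
    TFSYNCDUR = 232
    TGETLINEAGE = 233
    TLOPENAT = 236
    TLCREATEATTR = 238
    TMKDIRATTR = 240
    TSYMLINKATTR = 242
    TMKNODATTR = 244
    TLINKATTR = 246
    TSETATTRATTR = 248
    TREBIND = 250
    TWALKGETATTR = 252
    TREADDIRATTR = 254


# Minimum negotiated level per request; 0 marks the ungated private message.
LEVEL = {
    ZeroFsRequest.TLOPENATREAD: 5,
    ZeroFsRequest.TFSYNCDUR: 4,
    ZeroFsRequest.TGETLINEAGE: 4,
    ZeroFsRequest.TLOPENAT: 2,
    ZeroFsRequest.TLCREATEATTR: 2,
    ZeroFsRequest.TMKDIRATTR: 2,
    ZeroFsRequest.TSYMLINKATTR: 2,
    ZeroFsRequest.TMKNODATTR: 2,
    ZeroFsRequest.TLINKATTR: 2,
    ZeroFsRequest.TSETATTRATTR: 2,
    ZeroFsRequest.TREBIND: 0,
    ZeroFsRequest.TWALKGETATTR: 1,
    ZeroFsRequest.TREADDIRATTR: 1,
}


def reply_type(request: ZeroFsRequest) -> int:
    """Type id of the success reply for a request in the table."""
    if request is ZeroFsRequest.TFSYNCDUR:
        return RFSYNC  # no dedicated reply type
    return request + 1  # every other reply id is the request id plus one


def allowed(request: ZeroFsRequest, negotiated_level: int) -> bool:
    """Whether a client may send `request` at the negotiated level."""
    return negotiated_level >= LEVEL[request]
```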