Walrus Architecture
Walrus keeps the public API tiny while a set of specialised components handles allocation, durability, and cleanup behind the scenes. This section describes how the engine in src/ is wired together.
clients ─┐
│ append / batch_append ┌──────────────┐
├────────────────────────────────►│ Walrus facade│
│ └────┬─────────┘
│ │ get/create writer
│ ▼
│ ┌──────────────────┐
│ │ per-topic Writer │◄─┐
│ └───┬──────────────┘ │ alloc sealed blocks
│ │ writes │
│ ▼ │
│ ┌──────────────────┐ │
│ │ BlockAllocator │────────┘
│ └───┬──────────────┘
│ │ hands out 10 MB units, tracks files
│ ▼
│ wal_files/<namespace>/<timestamp>
│ ▲
│ │ sealed blocks appended
│ read / batch_read │
├─────────────────────────┘
│
│ ┌──────────────────┐
└─────────►│ Reader + WalIndex│───► persisted checkpoints
└──────────────────┘
Pieces you meet in src/wal
- `Walrus` facade (`runtime/walrus.rs`) – Owns the shared services: the `BlockAllocator`, the global `Reader`, a `RwLock<HashMap>` of per-topic `Writer`s, the fsync channel, and the persisted `WalIndex`. Constructors pick the data directory (`paths.rs`), set the global fsync schedule, and run recovery before returning an instance.
- `Writer` (`runtime/writer.rs`) – One per topic. Holds the live block, the current offset, and an atomic flag that keeps batch appends exclusive. When a block fills, the writer flushes it, seals it, hands it to the reader chain, and requests a fresh block from the allocator.
- `Reader` (`runtime/reader.rs`, `walrus_read.rs`) – Keeps a sealed block chain per topic plus in-memory tail progress for the active writer block. `read_next` walks sealed blocks first, then falls through to the live tail using a snapshot from the writer, and persists offsets based on the selected `ReadConsistency`.
- `BlockAllocator` (`runtime/allocator.rs`) – Spin-locked allocator that hands out 10 MB units inside pre-sized 1 GB files. Tracks block/file state so sealed + checkpointed files can be reclaimed.
- Storage backends (`storage.rs`) – `SharedMmap` (the portable default) or the fd-backed path that enables io_uring on Linux. `config.rs` exposes `enable_fd_backend`/`disable_fd_backend`; the active schedule decides whether we open files with `O_SYNC`.
- `WalIndex` (`runtime/index.rs`) – rkyv-serialised map of per-topic read positions, fsync'd every time we persist a checkpoint.
- Background workers (`runtime/background.rs`) – Drain the fsync queue, batch flushes (a single io_uring submit on Linux), and delete files once every block in them is sealed, unlocked, and checkpointed.
- Path manager (`paths.rs`) – Builds namespaced directory roots, creates new timestamped WAL files, and fsyncs directory entries so the files survive a crash.
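To make the ownership story concrete, here is a minimal sketch of how a facade shaped like this could hold its shared services. The struct layout, field names, and the `writer_for` helper are illustrative assumptions, not the actual definitions in `runtime/walrus.rs`.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{mpsc, Arc, RwLock};

// Stand-ins for the real components described above (hypothetical placeholders).
struct BlockAllocator;
struct Reader;
struct Writer;
struct WalIndex;

/// Rough shape of the facade: it owns the shared services and hands out
/// per-topic writers on demand.
struct Walrus {
    allocator: Arc<BlockAllocator>,
    reader: Arc<Reader>,
    writers: RwLock<HashMap<String, Arc<Writer>>>,
    fsync_tx: mpsc::Sender<PathBuf>,
    index: Arc<WalIndex>,
}

impl Walrus {
    /// Get or create the writer for a topic (step 1 of the write path).
    fn writer_for(&self, topic: &str) -> Arc<Writer> {
        if let Some(w) = self.writers.read().unwrap().get(topic) {
            return Arc::clone(w);
        }
        let mut writers = self.writers.write().unwrap();
        Arc::clone(writers.entry(topic.to_string()).or_insert_with(|| Arc::new(Writer)))
    }
}

fn main() {
    let (fsync_tx, _fsync_rx) = mpsc::channel();
    let wal = Walrus {
        allocator: Arc::new(BlockAllocator),
        reader: Arc::new(Reader),
        writers: RwLock::new(HashMap::new()),
        fsync_tx,
        index: Arc::new(WalIndex),
    };
    let _events_writer = wal.writer_for("events");
}
```

The `RwLock<HashMap>` mirrors the get-or-create pattern in step 1 of the write path: reads take the shared lock, and only a missing topic escalates to the write lock.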
Storage layout & block lifecycle
- Files live under `wal_files/<namespace>/`. Each file is 1 GB (`DEFAULT_BLOCK_SIZE` × `BLOCKS_PER_FILE`), preallocated on disk.
- Writers operate on 10 MB logical blocks. Batch appends can reserve multiple contiguous blocks; regular appends seal the block when there is no room left.
- Every entry is prefixed with a 64-byte header carrying the topic name, payload length, checksum (FNV-1a), and a hint pointing at the next block boundary. Reads verify the checksum before returning data.
- `BlockStateTracker`/`FileStateTracker` record lock, checkpoint, and allocation state. When a file is fully allocated AND every block is released and checkpointed, the deletion worker removes the file.
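The header format itself is not reproduced here, but the checksum step is easy to illustrate. Below is a self-contained FNV-1a sketch; the constants are the standard FNV-1a parameters, while the choice of the 64-bit variant and the idea of storing the hash in the header are assumptions for illustration.

```rust
/// FNV-1a (64-bit) over a byte slice. Whether Walrus uses the 32- or 64-bit
/// variant is an assumption here; only the hash family is named above.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

fn main() {
    let payload = b"hello walrus";
    // On the write path the checksum goes into the entry header;
    // on the read path it is recomputed and compared before returning data.
    let stored = fnv1a_64(payload);
    assert_eq!(stored, fnv1a_64(payload));
    println!("checksum = {stored:#018x}");
}
```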
Per-Topic View (continuous abstraction):
Topic A sees:
┌────────┬────────┬────────┬────────┐
│Block 0 │Block 2 │Block 5 │Block 8 │ ← Appears continuous
└────────┴────────┴────────┴────────┘
(file 0) (file 0) (file 0) (file 1)
Topic B sees:
┌────────┬────────┬────────┐
│Block 1 │Block 3 │Block 6 │ ← Also appears continuous
└────────┴────────┴────────┘
(file 0) (file 0) (file 0)
Actual layout on disk (interleaved):
File 0: [B0:A][B1:B][B2:A][B3:B][B4:?][B5:A][B6:B]...
Block allocation is dynamic, topics don't know about each other
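A toy allocator makes the interleaving above concrete: blocks are handed out in request order, so one topic's logical chain can be scattered across files. Everything below (names, bookkeeping, the 100-blocks-per-file figure derived from 1 GB / 10 MB) is a simplified assumption, not the real `BlockAllocator`.

```rust
/// Toy allocator: blocks are handed out in request order, so consecutive
/// blocks for one topic need not be adjacent on disk.
struct ToyAllocator {
    next_block: u64,
    blocks_per_file: u64,
}

impl ToyAllocator {
    /// Returns (file index, block index within that file) for the caller.
    fn alloc(&mut self, topic: &str) -> (u64, u64) {
        let block = self.next_block;
        self.next_block += 1;
        let file = block / self.blocks_per_file;
        println!("topic {topic} gets global block {block} in file {file}");
        (file, block % self.blocks_per_file)
    }
}

fn main() {
    let mut alloc = ToyAllocator { next_block: 0, blocks_per_file: 100 };
    // Topics A and B interleave exactly as in the "actual layout" row above.
    alloc.alloc("A"); // B0:A
    alloc.alloc("B"); // B1:B
    alloc.alloc("A"); // B2:A
    alloc.alloc("B"); // B3:B
}
```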
Write path (single entry)
User calls append_for_topic("events", data)
│
▼
┌────────────────────────┐
│ 1. Get/create Writer │ ◄─── Walrus.writers RwLock
└────────┬───────────────┘
│
▼
┌────────────────────────────┐
│ 2. Check space in block? │
├────────┬───────────────────┤
│ Yes │ No │
└────┬───┴───────────────────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ Seal current blk │
│ │ → Reader chain │
│ │ Alloc new block │
│ └──────┬───────────┘
│ │
▼ ▼
┌───────────────────────────┐
│ 3. Block::write() │
│ • Serialize metadata │
│ • Compute checksum │
│ • memcpy to mmap/fd │
│ • Update offset │
└────────┬──────────────────┘
│
▼
┌────────────────────────────┐
│ 4. Fsync policy │
├────────────────────────────┤
│ • SyncEach → flush now │
│ • Milliseconds(n) → queue │
│ • NoFsync → skip │
└────────────────────────────┘
- `Walrus::append_for_topic` fetches or creates the topic's `Writer`.
- The writer verifies there is enough room; if not, it flushes and seals the block, appends it to the reader chain, and grabs a new block from the allocator.
- `Block::write` serialises metadata + payload into the shared mmap/fd.
- Fsync policy (sketched below):
  - `SyncEach` → flush immediately.
  - `Milliseconds(n)` → enqueue the path on the fsync channel.
  - `NoFsync` → skip the flush entirely for raw throughput.
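Step 4 can be sketched as a single match on the schedule. The enum shape, helper names, and channel wiring below are assumptions; the real policy handling lives in the writer and config code.

```rust
use std::path::PathBuf;
use std::sync::mpsc;

/// Assumed shape of the fsync schedule; the real type may differ in naming.
enum FsyncSchedule {
    SyncEach,
    Milliseconds(u64),
    NoFsync,
}

/// Step 4 of the write path: decide what happens after the bytes
/// have been copied into the block.
fn after_write(schedule: &FsyncSchedule, path: PathBuf, fsync_tx: &mpsc::Sender<PathBuf>) {
    match schedule {
        // Flush the file immediately before acknowledging the append.
        FsyncSchedule::SyncEach => flush_now(&path),
        // Hand the path to the background worker; it batches the flushes.
        FsyncSchedule::Milliseconds(_) => fsync_tx.send(path).expect("fsync worker gone"),
        // Skip durability work entirely for raw throughput.
        FsyncSchedule::NoFsync => {}
    }
}

fn flush_now(path: &std::path::Path) {
    // Placeholder for the real flush (File::sync_data / io_uring fsync).
    println!("fsync {}", path.display());
}

fn main() {
    let (tx, _rx) = mpsc::channel();
    after_write(&FsyncSchedule::Milliseconds(200), PathBuf::from("wal_files/default/0001"), &tx);
}
```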
Batch appends
- Guarded by a compare-and-swap flag so only one batch per topic runs at a time.
- Precomputes how many blocks it needs and borrows them from the allocator up front.
- On Linux with the fd backend, the batch turns into a series of io_uring write ops submitted together; other platforms fall back to sequential writes.
- Any failure (allocation, write, completion) rolls back offsets and releases provisional blocks.
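The exclusivity guard in the first bullet boils down to a compare-and-swap on an atomic flag. Here is a minimal sketch under assumed names; the real writer also reserves blocks up front and rolls back on failure, which is elided.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Sketch of the per-topic batch guard: a compare-and-swap on an atomic flag
/// admits one batch at a time. Field and error names are illustrative.
struct TopicWriter {
    batch_in_flight: AtomicBool,
}

impl TopicWriter {
    fn batch_append(&self, entries: &[&[u8]]) -> Result<(), &'static str> {
        // Try to claim the batch slot; fail fast if another batch is running.
        if self
            .batch_in_flight
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return Err("another batch append is in progress for this topic");
        }
        // ... reserve blocks, write entries, roll back offsets on failure ...
        println!("wrote {} entries in one batch", entries.len());
        // Release the guard whether the batch succeeded or was rolled back.
        self.batch_in_flight.store(false, Ordering::Release);
        Ok(())
    }
}

fn main() {
    let writer = TopicWriter { batch_in_flight: AtomicBool::new(false) };
    writer.batch_append(&[b"a".as_slice(), b"b".as_slice()]).unwrap();
}
```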
Read path
User calls read_next("events", checkpoint=true)
│
▼
┌───────────────────────────┐
│ 1. Get ColReaderInfo │ ◄─── Hydrate from WalIndex if first read
│ (cur_block_idx, │
│ cur_block_offset, │
│ tail_block_id, │
│ tail_offset) │
└────────┬──────────────────┘
│
▼
┌───────────────────────────────────┐
│ 2. Try sealed chain first │
├───────────────────────────────────┤
│ If cur_block_idx < chain.len(): │
│ • Read block[idx] at offset │
│ • Advance offset │
│ • Mark checkpointed if done │
│ • Return entry │
└────────┬──────────────────────────┘
│ Chain exhausted?
▼
┌───────────────────────────────────┐
│ 3. Try tail (active writer block) │
├───────────────────────────────────┤
│ • Snapshot writer (block_id, off) │
│ • If rotated: fold tail→sealed │
│ • Read from tail_offset │
│ • Advance tail_offset (in-memory) │
│ • Return entry │
└────────┬──────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ 4. Checkpoint decision │
├────────────────────────────────────┤
│ • StrictlyAtOnce: persist now │
│ • AtLeastOnce: count++ │
│ if count % persist_every == 0: │
│ persist to WalIndex │
│ │
│ Index stores: │
│ • Sealed: (idx, offset) │
│ • Tail: (block_id | 1<<63, offset)│
└────────────────────────────────────┘
- `Walrus::read_next` (and `batch_read_for_topic`) obtain the per-topic `ColReaderInfo`. On first use we hydrate the position from `WalIndex`.
- Sealed chain first: walk blocks in order, marking each block checkpointed as we drain it.
- Tail second: snapshot the writer's live block + offset and read new entries directly from it, keeping a tail cursor in memory.
- Checkpointing rules (see the sketch after this list):
  - `StrictlyAtOnce` persists after every successful read.
  - `AtLeastOnce { persist_every }` counts reads and only persists every N reads unless we must force a checkpoint (e.g., folding the tail into the sealed chain).
- `WalIndex` stores either a sealed-chain index/offset or a tail sentinel (`block_id | 1 << 63`) that represents progress inside the writer's live block.
- `batch_read_for_topic` follows the same logic but builds a bounded read plan so we never exceed `max_bytes` or the global `MAX_BATCH_ENTRIES` (2000) limit.
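The checkpoint cadence and the tail sentinel are both small mechanisms worth seeing in code. The sketch below uses assumed names for the consistency enum and helpers; only the `block_id | 1 << 63` encoding comes directly from the description above.

```rust
/// Assumed shape of the consistency knob; the real enum may differ.
enum ReadConsistency {
    StrictlyAtOnce,
    AtLeastOnce { persist_every: u64 },
}

const TAIL_FLAG: u64 = 1 << 63;

/// Tail sentinel: the high bit marks "progress inside the writer's live
/// block" as opposed to an index into the sealed chain.
fn encode_tail(block_id: u64) -> u64 {
    block_id | TAIL_FLAG
}

/// Decide whether this read should persist a checkpoint to the index.
fn should_persist(consistency: &ReadConsistency, reads_since_persist: u64) -> bool {
    match consistency {
        ReadConsistency::StrictlyAtOnce => true,
        ReadConsistency::AtLeastOnce { persist_every } => {
            reads_since_persist % persist_every == 0
        }
    }
}

fn main() {
    assert!(should_persist(&ReadConsistency::StrictlyAtOnce, 1));
    assert!(!should_persist(&ReadConsistency::AtLeastOnce { persist_every: 10 }, 3));
    assert_eq!(encode_tail(7) & TAIL_FLAG, TAIL_FLAG);
    assert_eq!(encode_tail(7) & !TAIL_FLAG, 7);
}
```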
Background fsync & reclamation
- Writers push file paths onto the fsync channel whenever they produce data under `FsyncSchedule::Milliseconds(_)`.
- The background worker deduplicates paths, opens storage handles on demand, and flushes in batches. With the fd backend on Linux we emit one io_uring `FSYNC` opcode per file and submit them together.
- `flush_check` watches block/file counters. Once a file is fully allocated, unlocked, and every block is checkpointed, the deletion queue removes it the next time the worker drops its mmap/fd cache.
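A stripped-down version of that worker loop looks roughly like the following. The channel shape, the timing, and the `println!` standing in for the actual flush (and for io_uring batching on Linux) are all assumptions.

```rust
use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::mpsc;
use std::time::Duration;

/// Sketch of the background fsync loop: drain whatever paths have queued up,
/// deduplicate them, then flush each file once. The real worker also manages
/// an mmap/fd cache and handles file deletion.
fn fsync_worker(rx: mpsc::Receiver<PathBuf>, interval: Duration) {
    loop {
        // Wait for the first path of this cycle (or exit when senders are gone).
        let first = match rx.recv_timeout(interval) {
            Ok(path) => path,
            Err(mpsc::RecvTimeoutError::Timeout) => continue,
            Err(mpsc::RecvTimeoutError::Disconnected) => return,
        };
        // Deduplicate: many appends to one file need only one flush.
        let mut pending: HashSet<PathBuf> = HashSet::new();
        pending.insert(first);
        while let Ok(path) = rx.try_recv() {
            pending.insert(path);
        }
        for path in pending {
            // Placeholder for File::sync_data or a batched io_uring FSYNC.
            println!("flush {}", path.display());
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = std::thread::spawn(move || fsync_worker(rx, Duration::from_millis(200)));
    tx.send(PathBuf::from("wal_files/default/0001")).unwrap();
    tx.send(PathBuf::from("wal_files/default/0001")).unwrap(); // deduplicated
    drop(tx); // closing the channel lets the worker exit
    handle.join().unwrap();
}
```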
Recovery (startup choreography)
- Walk the namespace directory, ignoring `_index.db` files.
- For each timestamped file:
  - mmap it or open it through the fd backend,
  - scan in 10 MB strides until we hit zeroed regions,
  - replay metadata to rebuild the per-topic block chains and populate the block trackers.
- Rehydrate the read index to restore cursor positions (including tail sentinels).
- Trigger `flush_check` on every file so the background worker can immediately reclaim anything that is already sealed and checkpointed.
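The 10 MB stride scan is the part that is easiest to show in isolation. The sketch below treats a file as an in-memory byte slice and stops at the first fully zeroed block; real recovery parses entry headers and rebuilds chains instead of just counting blocks.

```rust
// Mirrors the size quoted above: 10 MB logical blocks.
const BLOCK_SIZE: usize = 10 * 1024 * 1024;

/// Walk a mapped file in block-sized strides and stop at the first fully
/// zeroed block. Real recovery would replay entry headers in each live block
/// to rebuild the per-topic chains and block trackers.
fn scan_file(file: &[u8]) -> usize {
    let mut live_blocks = 0;
    for block in file.chunks(BLOCK_SIZE) {
        if block.iter().all(|&b| b == 0) {
            break; // zeroed region: nothing past here was ever written
        }
        live_blocks += 1;
    }
    live_blocks
}

fn main() {
    // Two written blocks followed by an untouched (zeroed) third block.
    let mut file = vec![0u8; 3 * BLOCK_SIZE];
    file[0] = 1;
    file[BLOCK_SIZE] = 1;
    assert_eq!(scan_file(&file), 2);
}
```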
With this setup the external API stays minimal (`append`, `batch_append`, `read_next`, `batch_read`), while the engine beneath handles allocation, durability, and cleanup without the caller having to micromanage anything.