Walrus Architecture

Walrus keeps the public API tiny while a set of specialised components handles allocation, durability, and cleanup behind the scenes. This section describes how the engine in src/ is wired together.

clients ─┐
         │  append / batch_append          ┌──────────────┐
         ├────────────────────────────────►│ Walrus facade│
         │                                 └────┬─────────┘
         │                                      │ get/create writer
         │                                      ▼
         │                           ┌──────────────────┐
         │                           │ per-topic Writer │◄─┐
         │                           └───┬──────────────┘  │ alloc sealed blocks
         │                               │ writes          │
         │                               ▼                 │
         │                     ┌──────────────────┐        │
         │                     │ BlockAllocator   │────────┘
         │                     └───┬──────────────┘
         │                         │ hands out 10 MB units, tracks files
         │                         ▼
         │              wal_files/<namespace>/<timestamp>
         │                         ▲
         │                         │ sealed blocks appended
         │ read / batch_read       │
         ├─────────────────────────┘
         │
         │          ┌──────────────────┐
         └─────────►│ Reader + WalIndex│───► persisted checkpoints
                    └──────────────────┘

Pieces you meet in src/wal

  • Walrus facade (runtime/walrus.rs) – Owns the shared services: BlockAllocator, the global Reader, a RwLock<HashMap> of per-topic Writers, the fsync channel, and the persisted WalIndex. Constructors pick the data directory (paths.rs), set the global fsync schedule, and run recovery before returning an instance (sketched after this list).
  • Writer (runtime/writer.rs) – One per topic. Holds the live block, current offset, and an atomic flag to keep batch appends exclusive. When a block fills it flushes, seals it, hands it to the reader chain, and requests a fresh block from the allocator.
  • Reader (runtime/reader.rs, walrus_read.rs) – Keeps a sealed block chain per topic plus in-memory tail progress for the active writer block. read_next walks sealed blocks first, then falls through to the live tail using a snapshot from the writer, and persists offsets based on the selected ReadConsistency.
  • BlockAllocator (runtime/allocator.rs) – Spin-locked allocator that hands out 10 MB units inside pre-sized 1 GB files. Tracks block/file state so sealed + checkpointed files can be reclaimed.
  • Storage backends (storage.rs) – SharedMmap (portable default) or the fd-backed path that enables io_uring on Linux. config.rs exposes enable_fd_backend / disable_fd_backend; the active schedule decides if we open files with O_SYNC.
  • WalIndex (runtime/index.rs) – rkyv-serialised map of per-topic read positions, fsync’d every time we persist a checkpoint.
  • Background workers (runtime/background.rs) – Drain the fsync queue, batch flushes (single io_uring submit on Linux), and delete files once every block in them is sealed, unlocked, and checkpointed.
  • Path manager (paths.rs) – Builds namespaced directory roots, creates new timestamped WAL files, and fsyncs directory entries so the files survive a crash.
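
To make the ownership concrete, here is a hypothetical struct-level sketch of the facade. The field names and the get-or-create locking are illustrative, not the real definitions in runtime/walrus.rs.

  use std::collections::HashMap;
  use std::path::PathBuf;
  use std::sync::{mpsc, Arc, RwLock};

  // Illustrative stand-ins for the real types under src/wal.
  struct BlockAllocator;
  struct Reader;
  struct Writer;
  struct WalIndex;

  // Hypothetical shape of the facade: shared allocator, reader, and
  // index, a lazily populated map of per-topic writers, and the
  // sender half of the fsync queue.
  struct Walrus {
      allocator: Arc<BlockAllocator>,
      reader: Arc<Reader>,
      writers: RwLock<HashMap<String, Arc<Writer>>>,
      fsync_tx: mpsc::Sender<PathBuf>,
      index: Arc<WalIndex>,
  }

  impl Walrus {
      // Get-or-create, mirroring step 1 of the write path below.
      fn writer_for(&self, topic: &str) -> Arc<Writer> {
          if let Some(w) = self.writers.read().unwrap().get(topic) {
              return Arc::clone(w);
          }
          let mut map = self.writers.write().unwrap();
          Arc::clone(map.entry(topic.to_string()).or_insert_with(|| Arc::new(Writer)))
      }
  }

  fn main() {
      let (tx, _rx) = mpsc::channel();
      let wal = Walrus {
          allocator: Arc::new(BlockAllocator),
          reader: Arc::new(Reader),
          writers: RwLock::new(HashMap::new()),
          fsync_tx: tx,
          index: Arc::new(WalIndex),
      };
      let _events = wal.writer_for("events"); // creates the topic's writer on first use
  }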

Storage layout & block lifecycle

  • Files live under wal_files/<namespace>/. Each file is 1 GB (DEFAULT_BLOCK_SIZE × BLOCKS_PER_FILE) preallocated on disk.
  • Writers operate on 10 MB logical blocks. Batch appends can reserve multiple contiguous blocks; regular appends seal the block when there is no room left.
  • Every entry is prefixed with a 64-byte header carrying the topic name, payload length, checksum (FNV-1a), and a hint pointing at the next block boundary. Reads verify the checksum before returning data (a header sketch follows this list).
  • BlockStateTracker / FileStateTracker record lock, checkpoint, and allocation state. When a file is fully allocated AND every block is released and checkpointed, the deletion worker removes the file.
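
To make the header concrete, here is a hypothetical layout paired with a standard 64-bit FNV-1a. Field widths and ordering are assumptions; only the 64-byte total, the FNV-1a choice, and the next-block hint come from the bullet above.

  // Hypothetical 64-byte entry header; the real layout in src/wal may
  // size and order these fields differently.
  #[repr(C)]
  struct EntryHeader {
      topic: [u8; 32],      // topic name, NUL-padded (assumed width)
      payload_len: u64,     // bytes of payload following the header
      checksum: u64,        // FNV-1a over the payload
      next_block_hint: u64, // hint pointing at the next block boundary
      _pad: [u8; 8],        // padding up to the documented 64 bytes
  }
  const _: () = assert!(std::mem::size_of::<EntryHeader>() == 64);

  // Standard 64-bit FNV-1a; reads recompute this and compare it with
  // the stored checksum before returning data.
  fn fnv1a_64(data: &[u8]) -> u64 {
      let mut hash: u64 = 0xcbf29ce484222325;
      for &byte in data {
          hash ^= byte as u64;
          hash = hash.wrapping_mul(0x100000001b3);
      }
      hash
  }

  fn main() {
      println!("checksum = {:#018x}", fnv1a_64(b"hello walrus"));
  }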

Topic Block Mapping

Per-Topic View (continuous abstraction):

Topic A sees:
┌────────┬────────┬────────┬────────┐
│Block 0 │Block 2 │Block 5 │Block 8 │  ← Appears continuous
└────────┴────────┴────────┴────────┘
  (file 0)  (file 0)  (file 0)  (file 1)

Topic B sees:
┌────────┬────────┬────────┐
│Block 1 │Block 3 │Block 6 │          ← Also appears continuous
└────────┴────────┴────────┘
  (file 0)  (file 0)  (file 0)

Actual layout on disk (interleaved):
File 0: [B0:A][B1:B][B2:A][B3:B][B4:?][B5:A][B6:B]...
        Block allocation is dynamic; topics don't know about each other
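
The same idea in miniature, with illustrative types: the allocator hands out block IDs globally, each topic records only its own ordered chain, and every per-topic view stays continuous.

  use std::collections::HashMap;

  fn main() {
      let mut chains: HashMap<&str, Vec<u64>> = HashMap::new();

      // Interleaved allocation matching the layout row above; block 4
      // (the "?") goes to some third topic.
      for (id, topic) in ["A", "B", "A", "B", "C", "A", "B"].into_iter().enumerate() {
          chains.entry(topic).or_default().push(id as u64);
      }

      println!("Topic A: {:?}", chains["A"]); // [0, 2, 5], continuous from A's view
      println!("Topic B: {:?}", chains["B"]); // [1, 3, 6], continuous from B's view
  }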

Write path (single entry)

User calls append_for_topic("events", data)
        │
        ▼
┌────────────────────────┐
│ 1. Get/create Writer   │  ◄─── Walrus.writers RwLock
└────────┬───────────────┘
         │
         ▼
┌────────────────────────────┐
│ 2. Check space in block?   │
├────────┬───────────────────┤
│  Yes   │  No               │
└────┬───┴───────────────────┘
     │           │
     │           ▼
     │    ┌──────────────────┐
     │    │ Seal current blk │
     │    │ → Reader chain   │
     │    │ Alloc new block  │
     │    └──────┬───────────┘
     │           │
     ▼           ▼
┌───────────────────────────┐
│ 3. Block::write()         │
│  • Serialize metadata     │
│  • Compute checksum       │
│  • memcpy to mmap/fd      │
│  • Update offset          │
└────────┬──────────────────┘
         │
         ▼
┌────────────────────────────┐
│ 4. Fsync policy            │
├────────────────────────────┤
│ • SyncEach → flush now     │
│ • Milliseconds(n) → queue  │
│ • NoFsync → skip           │
└────────────────────────────┘
  1. Walrus::append_for_topic fetches or creates the topic’s Writer.
  2. The writer verifies there is enough room; if not, it flushes and seals the block, appends it to the reader chain, and grabs a new block from the allocator.
  3. Block::write serialises metadata + payload into the shared mmap/fd.
  4. Fsync policy:
    • SyncEach → flush immediately.
    • Milliseconds(n) → enqueue the path on the fsync channel.
    • NoFsync → skip flush entirely for raw throughput.
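
A minimal sketch of step 4's dispatch, assuming an enum shaped like the schedule in config.rs; the helper names are hypothetical.

  use std::path::{Path, PathBuf};
  use std::sync::mpsc;

  // Illustrative stand-in for the real schedule type in config.rs.
  enum FsyncSchedule {
      SyncEach,
      Milliseconds(u64),
      NoFsync,
  }

  // Hypothetical post-write hook deciding durability per the schedule.
  fn after_write(schedule: &FsyncSchedule, path: PathBuf, fsync_tx: &mpsc::Sender<PathBuf>) {
      match schedule {
          // Flush immediately: strongest durability, slowest appends.
          FsyncSchedule::SyncEach => flush_now(&path),
          // Queue the path; the background worker batches the flushes.
          FsyncSchedule::Milliseconds(_) => fsync_tx.send(path).expect("fsync worker alive"),
          // Skip entirely: raw throughput, wider loss window on crash.
          FsyncSchedule::NoFsync => {}
      }
  }

  fn flush_now(path: &Path) {
      // Placeholder for sync_all on the real storage handle.
      let _ = std::fs::OpenOptions::new().write(true).open(path).and_then(|f| f.sync_all());
  }

  fn main() {
      let (tx, _rx) = mpsc::channel();
      after_write(&FsyncSchedule::NoFsync, PathBuf::from("wal_files/default/0001"), &tx);
  }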

Batch appends

  • Guarded by a compare-and-swap flag so only one batch per topic runs at a time (see the guard sketch after this list).
  • Precomputes how many blocks it needs and borrows them from the allocator up front.
  • On Linux with the fd backend, the batch turns into a series of io_uring write ops submitted together; other platforms fall back to sequential writes.
  • Any failure (allocation, write, completion) rolls back offsets and releases provisional blocks.
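
A minimal sketch of the exclusivity guard from the first bullet, assuming a plain AtomicBool per writer; the RAII wrapper is illustrative, but it shows why rollback paths still release the flag.

  use std::sync::atomic::{AtomicBool, Ordering};

  // Illustrative per-topic flag keeping batch appends exclusive.
  struct BatchGuard<'a>(&'a AtomicBool);

  impl<'a> BatchGuard<'a> {
      // The compare-and-swap succeeds for exactly one caller at a
      // time, giving the "one batch per topic" rule.
      fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
          flag.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
              .ok()
              .map(|_| BatchGuard(flag))
      }
  }

  // Releasing on drop also covers the failure path: rollback code can
  // bail early and the flag is still cleared.
  impl Drop for BatchGuard<'_> {
      fn drop(&mut self) {
          self.0.store(false, Ordering::Release);
      }
  }

  fn main() {
      let busy = AtomicBool::new(false);
      let first = BatchGuard::try_acquire(&busy);
      assert!(first.is_some());
      assert!(BatchGuard::try_acquire(&busy).is_none()); // a second batch is refused
      drop(first);
      assert!(BatchGuard::try_acquire(&busy).is_some()); // free again after release
  }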

Read path

User calls read_next("events", checkpoint=true)
        │
        ▼
┌───────────────────────────┐
│ 1. Get ColReaderInfo      │  ◄─── Hydrate from WalIndex if first read
│    (cur_block_idx,        │
│     cur_block_offset,     │
│     tail_block_id,        │
│     tail_offset)          │
└────────┬──────────────────┘
         │
         ▼
┌───────────────────────────────────┐
│ 2. Try sealed chain first         │
├───────────────────────────────────┤
│ If cur_block_idx < chain.len():   │
│   • Read block[idx] at offset     │
│   • Advance offset                │
│   • Mark checkpointed if done     │
│   • Return entry                  │
└────────┬──────────────────────────┘
         │ Chain exhausted?
         ▼
┌───────────────────────────────────┐
│ 3. Try tail (active writer block) │
├───────────────────────────────────┤
│ • Snapshot writer (block_id, off) │
│ • If rotated: fold tail→sealed    │
│ • Read from tail_offset           │
│ • Advance tail_offset (in-memory) │
│ • Return entry                    │
└────────┬──────────────────────────┘
         │
         ▼
┌────────────────────────────────────┐
│ 4. Checkpoint decision             │
├────────────────────────────────────┤
│ • StrictlyAtOnce: persist now      │
│ • AtLeastOnce: count++             │
│   if count % persist_every == 0:   │
│     persist to WalIndex            │
│                                    │
│ Index stores:                      │
│  • Sealed: (idx, offset)           │
│  • Tail: (block_id | 1<<63, offset)│
└────────────────────────────────────┘
  1. Walrus::read_next (and batch_read_for_topic) obtains the per-topic ColReaderInfo. On first use we hydrate the position from WalIndex.
  2. Sealed chain first: walk blocks in order, marking each block checkpointed as we drain it.
  3. Tail second: snapshot the writer’s live block + offset and read new entries directly from it, keeping a tail cursor in-memory.
  4. Checkpointing rules:
    • StrictlyAtOnce persists after every successful read.
    • AtLeastOnce { persist_every } counts reads and only persists every N reads unless we must force a checkpoint (e.g., folding the tail into the sealed chain).
  5. WalIndex stores either a sealed chain index/offset or a tail sentinel (block_id | 1<<63) that represents progress inside the writer’s live block.
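
The tail sentinel in step 5 is a single bit stolen from the block ID; a minimal encoding sketch, assuming u64 positions:

  // The high bit distinguishes "offset inside the writer's live block"
  // (tail) from "index into the sealed chain".
  const TAIL_BIT: u64 = 1 << 63;

  fn encode_tail(block_id: u64) -> u64 {
      block_id | TAIL_BIT
  }

  fn decode(stored: u64) -> (bool, u64) {
      (stored & TAIL_BIT != 0, stored & !TAIL_BIT)
  }

  fn main() {
      let (is_tail, block_id) = decode(encode_tail(42));
      assert!(is_tail && block_id == 42);
      assert_eq!(decode(7), (false, 7)); // sealed positions keep the bit clear
  }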

batch_read_for_topic follows the same logic but builds a bounded read plan so we never exceed max_bytes or the global MAX_BATCH_ENTRIES (2000) limit.
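
A sketch of what such a bounded plan can look like, assuming entry sizes are known up front; only the MAX_BATCH_ENTRIES value comes from the text above.

  const MAX_BATCH_ENTRIES: usize = 2000;

  // Illustrative: count how many pending entries fit in one batch
  // without blowing past max_bytes or the global entry cap.
  fn plan_batch(pending_sizes: &[usize], max_bytes: usize) -> usize {
      let mut taken = 0;
      let mut bytes = 0;
      for &size in pending_sizes.iter().take(MAX_BATCH_ENTRIES) {
          if bytes + size > max_bytes {
              break;
          }
          bytes += size;
          taken += 1;
      }
      taken
  }

  fn main() {
      // Three entries fit in 1 KiB; the fourth would overflow.
      assert_eq!(plan_batch(&[300, 300, 300, 300], 1024), 3);
  }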

Background fsync & reclamation

  • Writers push file paths onto the fsync channel whenever they produce data under FsyncSchedule::Milliseconds(_).
  • The background worker deduplicates paths, opens storage handles on demand, and flushes in batches. With the fd backend on Linux we emit one io_uring FSYNC opcode per file and submit them together.
  • flush_check watches block/file counters. Once a file is fully allocated and every block in it is unlocked and checkpointed, the deletion queue removes it the next time the worker drops its mmap/fd cache.
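
A sketch of the drain-and-dedupe cycle, assuming a channel of paths; the portable sync_all fallback stands in for the batched io_uring FSYNC path described above.

  use std::collections::HashSet;
  use std::path::PathBuf;
  use std::sync::mpsc;

  // Illustrative worker body: drain everything queued, dedupe the
  // paths, then flush each file once per cycle.
  fn drain_and_flush(rx: &mpsc::Receiver<PathBuf>) {
      let mut unique: HashSet<PathBuf> = HashSet::new();
      while let Ok(path) = rx.try_recv() {
          unique.insert(path); // many appends to one file, one fsync
      }
      for path in unique {
          // Portable fallback; the fd backend would instead submit
          // one io_uring FSYNC op per file in a single batch.
          if let Ok(file) = std::fs::File::open(&path) {
              let _ = file.sync_all();
          }
      }
  }

  fn main() {
      let (tx, rx) = mpsc::channel();
      tx.send(PathBuf::from("wal_files/default/0001")).unwrap();
      tx.send(PathBuf::from("wal_files/default/0001")).unwrap(); // duplicate collapses
      drain_and_flush(&rx);
  }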

Recovery (startup choreography)

  1. Walk the namespace directory, ignoring _index.db files.
  2. For each timestamped file:
    • mmap or open through the fd backend,
    • scan in 10 MB strides until we hit zeroed regions (see the sketch after this list),
    • replay metadata to rebuild the per-topic block chains and populate block trackers.
  3. Rehydrate the read index to restore cursor positions (including tail sentinels).
  4. Trigger flush_check on every file so the background worker can immediately reclaim anything that is already sealed and checkpointed.
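
A sketch of the stride scan in step 2, assuming an all-zero probe marks space that was never written; the metadata replay and tracker updates are elided.

  const BLOCK_SIZE: usize = 10 * 1024 * 1024; // 10 MB strides

  // Illustrative: walk a mapped file block by block until we reach a
  // region still zeroed from preallocation.
  fn scan_blocks(file: &[u8]) -> usize {
      let mut live_blocks = 0;
      for block in file.chunks(BLOCK_SIZE) {
          // Probe a header-sized prefix; all zeroes means nothing was
          // ever written here.
          if block[..block.len().min(64)].iter().all(|&b| b == 0) {
              break;
          }
          // The real replay parses entry headers here to rebuild the
          // per-topic chain and populate the block trackers.
          live_blocks += 1;
      }
      live_blocks
  }

  fn main() {
      let mut fake_file = vec![0u8; 3 * BLOCK_SIZE];
      fake_file[0] = 1; // only the first block holds data
      assert_eq!(scan_blocks(&fake_file), 1);
  }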

With this setup the external API stays minimal (append, batch_append, read_next, batch_read), while the engine beneath handles allocation, durability, and cleanup without the caller having to micromanage anything.