Walrus Internals

This page explores the engineering decisions, performance optimizations, and implementation details that make Walrus fast and reliable. If you want to understand how the pieces actually work under the hood, you’re in the right place.

Table of Contents

  1. Walrus Internals
    1. High-Level Architecture
    2. The Spin-Lock Allocator
      1. Why Spin Locks?
      2. The Implementation
      3. Block Allocation Flow
      4. Safety Invariants
    3. Storage Backend Architecture
      1. Dual Backend Design
      2. Backend Architecture Diagram
      3. When Each Backend Shines
      4. Switching at Runtime
      5. The SharedMmap Abstraction
    4. io_uring Batching: The Secret Sauce
      1. Traditional I/O vs io_uring
      2. The Three-Phase Batch Write
        1. Phase 1: Planning (no locks held)
        2. Phase 2: io_uring Preparation (FD backend only)
        3. Phase 3: Submission (single syscall)
      3. Batch Read Optimization
      4. Fallback Behavior
    5. Tail Reading: The Zero-Copy Optimization
      1. The Problem
      2. The Walrus Solution: Tail Reads
      3. In-Memory Tail Progress
      4. The Tail Sentinel: Bit 63
      5. Folding Tail into Sealed Chain
    6. Block and File State Tracking
      1. BlockStateTracker
      2. FileStateTracker
      3. Deletion Conditions
    7. Concurrency Model
      1. Lock Hierarchy (to prevent deadlocks)
      2. Thread Safety Guarantees
      3. Snapshot Semantics
      4. Batch Write Exclusion
    8. Checksum Verification and Corruption Handling
      1. Why FNV-1a?
      2. Checksum Computation
      3. Read-Time Verification
      4. Corruption Handling
    9. Atomic Index Persistence
      1. The Classic Atomic Write Pattern
      2. Why This Works
      3. Recovery on Startup
    10. Rollback Mechanism
      1. The Challenge
      2. The Solution: Header Zeroing
    11. Background Worker Architecture
      1. The Event Loop
      2. Batched Fsync (io_uring)
      3. File Handle Pooling
      4. Deletion Timing
    12. Performance Deep Dive
      1. Allocation Costs
      2. io_uring vs Sequential I/O
      3. Tail Read Optimization
      4. Fsync Batching
    13. Design Philosophy
      1. No External Dependencies (Where Possible)
      2. Zero-Copy Where Safe
      3. Fail Gracefully
      4. Optimize for the 99th Percentile
    14. Future Directions
    15. Closing Thoughts

High-Level Architecture

Before diving into specifics, here’s how all the pieces fit together:

Walrus Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Walrus Facade                            │
│  (coordinates: allocator, readers, writers, index, bg worker)   │
└────┬─────────────────────┬──────────────────────┬───────────────┘
     │                     │                      │
     │ append              │ read                 │ background
     ▼                     ▼                      ▼
┌──────────┐          ┌──────────┐         ┌─────────────┐
│  Writer  │          │  Reader  │         │ Fsync/Delete│
│(per topic│──────────▶(per topic│         │   Worker    │
│  mutex)  │ seal blk │  chains) │         └─────────────┘
└────┬─────┘          └────┬─────┘
     │                     │
     │ alloc block         │ read from sealed/tail
     ▼                     ▼
┌──────────────────────────────────────────┐
│         BlockAllocator (spin lock)        │
│   hands out 10MB blocks from 1GB files   │
└────┬─────────────────────────────────────┘
     │
     ▼
┌──────────────────────────────────────────┐
│       Storage Layer (Mmap or FD)         │
│  memory-mapped 1GB files (sparse alloc)  │
└──────────────────────────────────────────┘

The Spin-Lock Allocator

One of Walrus’s key performance features is the BlockAllocator, which hands out 10 MB blocks in sub-microsecond time using a spin lock instead of OS-level synchronization.

Why Spin Locks?

Traditional mutexes fall back to futex syscalls under contention, and a single syscall costs 1-5 microseconds. For an operation that needs to:

  1. Increment an offset
  2. Check if we’ve exceeded file size
  3. Return a block descriptor

…that’s unacceptable overhead when allocating thousands of blocks per second.

The Implementation

pub struct BlockAllocator {
    next_block: UnsafeCell<Block>,     // Pre-computed next block
    lock: AtomicBool,                  // Spin lock (userspace only)
    paths: Arc<WalPathManager>,
}

The spin lock uses a simple compare-and-swap loop:

loop {
    match self.lock.compare_exchange_weak(
        false, true,                    // Acquire lock
        Ordering::AcqRel,
        Ordering::Relaxed
    ) {
        Ok(_) => break,                 // Got it!
        Err(_) => std::hint::spin_loop(), // Try again
    }
}

Critical section operations:

  1. Check if next_block.offset + size > MAX_FILE_SIZE
  2. If rollover: create new file, update mmap, reset offset
  3. Register block in state tracker
  4. Increment next_block.id
  5. Release lock

Amortized cost: ~200-500 nanoseconds per allocation (no syscalls).
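
To make the critical section concrete, here is a hedged sketch of the allocation path following steps 1-5 above. The acquire/release helpers stand for the CAS loop shown earlier; new_file and the Block fields beyond those shown on this page are assumptions, not the exact internals.

pub fn alloc_block(&self, size: u64) -> std::io::Result<Block> {
    self.acquire();                                   // CAS spin loop from above
    let next = unsafe { &mut *self.next_block.get() };

    // 1-2. Rollover: start a fresh file if this allocation would overflow it.
    if next.offset + size > MAX_FILE_SIZE {
        let (path, mmap) = self.paths.new_file()?;    // assumed helper on WalPathManager
        next.path = path;
        next.mmap = mmap;
        next.offset = 0;
    }

    // 3-4. Hand out the current block, register it, and pre-compute the next one.
    let block = next.clone();
    register_block(block.id, &block.path);            // BlockStateTracker, described later
    next.id += 1;
    next.offset += size;

    self.release();                                   // 5. lock.store(false, Release)
    Ok(block)
}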

Block Allocation Flow

Thread requests block
        │
        ▼
   ┌─────────┐
   │ CAS loop│ ◄──── Spin if contended
   │ acquire │       (std::hint::spin_loop)
   └────┬────┘
        │ Got lock!
        ▼
   ┌──────────────────────────┐
   │ Check: offset + size      │
   │       > MAX_FILE_SIZE?    │
   └────┬───────────────┬──────┘
        │ No            │ Yes
        │               ▼
        │          ┌─────────────────┐
        │          │ Create new file │
        │          │ Update mmap ref │
        │          │ Reset offset=0  │
        │          └────┬────────────┘
        │               │
        ▼               ▼
   ┌────────────────────────────┐
   │ Register block in tracker  │
   │ Increment next_block.id    │
   └────┬───────────────────────┘
        ▼
   ┌─────────┐
   │ Release │
   │  lock   │
   └────┬────┘
        ▼
   Return block to writer
   (path, offset, limit, mmap)

Safety Invariants

The UnsafeCell<Block> requires careful handling:

  • The spin lock guarantees exclusive access during allocation
  • Once handed out, blocks have unique ownership (single writer per topic)
  • Blocks are never deallocated (recycled via file deletion instead)

Storage Backend Architecture

Walrus supports two storage backends optimized for different platforms and workloads.

Dual Backend Design

Mmap Backend (default fallback):

  • Uses memory-mapped files via memmap2
  • Portable across all platforms
  • OS manages page cache automatically
  • Writes are volatile until msync() or flush()

FD Backend (Linux optimization):

  • Uses file descriptors with pwrite/pread
  • Enables io_uring batching (see below)
  • Supports O_SYNC flag for kernel-managed durability
  • Position-independent I/O (thread-safe without seek)
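
As a rough sketch of how both backends can sit behind one write interface, here is a minimal enum dispatch; the variant contents and the write_at signature are assumptions, not the real StorageImpl:

use std::io;
use std::os::unix::fs::FileExt;   // pwrite-style positional I/O (Unix)

enum StorageImpl {
    Mmap(memmap2::MmapMut),   // portable: writes land in the page cache, flushed via msync/flush
    Fd(std::fs::File),        // Linux path: position-independent writes, io_uring-friendly
}

impl StorageImpl {
    fn write_at(&mut self, offset: u64, data: &[u8]) -> io::Result<()> {
        match self {
            StorageImpl::Mmap(map) => {
                let start = offset as usize;
                map[start..start + data.len()].copy_from_slice(data);
                Ok(())
            }
            StorageImpl::Fd(file) => file.write_all_at(data, offset),
        }
    }
}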

Backend Architecture Diagram

┌─────────────────────────────────────────┐
│          StorageImpl (enum)             │
├──────────────────┬──────────────────────┤
│   Mmap Backend   │     FD Backend       │
│   (portable)     │   (Linux optimized)  │
├──────────────────┼──────────────────────┤
│ memmap2 crate    │ raw file descriptor  │
│ memory-mapped    │ pwrite/pread         │
│ msync() flush    │ O_SYNC or io_uring   │
│ OS page cache    │ io_uring batching    │
└──────────────────┴──────────────────────┘
         │                    │
         └──────────┬─────────┘
                    ▼
         ┌─────────────────────┐
         │    SharedMmap       │
         │  (Arc<RwLock<...>>) │
         │  + atomic timestamp │
         └─────────────────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │  Global Keeper Map  │
         │ (path → SharedMmap) │
         │  deduplicates mmaps │
         └─────────────────────┘

When Each Backend Shines

Workload           Recommended Backend    Why
Linux production   FD backend             io_uring batching, O_SYNC support
macOS/Windows      Mmap backend           Only portable option
Benchmarking       FD backend (no sync)   Eliminates OS noise
Single-threaded    Either                 Minimal difference

Switching at Runtime

use walrus_rust::{enable_fd_backend, disable_fd_backend};

enable_fd_backend();   // Use FD + io_uring on Linux
disable_fd_backend();  // Force mmap everywhere

Set WALRUS_QUIET=1 to suppress backend switch logging.

The SharedMmap Abstraction

Both backends implement a common interface through SharedMmap:

pub struct SharedMmap {
    inner: Arc<RwLock<StorageImpl>>,   // Mmap or FD backend
    last_modified: AtomicU64,          // Timestamp tracking
}

Interior mutability pattern:

  • Implements Sync + Send for concurrent access
  • RwLock protects the storage handle
  • Atomic timestamp prevents stale handle reuse
  • Global “keeper” map deduplicates file handles
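
A hedged sketch of what that keeper map could look like: one Arc<SharedMmap> per path, handed to every caller so each file is mapped only once. The static name and the SharedMmap::open constructor are assumptions for illustration.

use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Global keeper map: path -> shared storage handle.
static KEEPER: OnceLock<Mutex<HashMap<String, Arc<SharedMmap>>>> = OnceLock::new();

fn shared_mmap_for(path: &str) -> Arc<SharedMmap> {
    let keeper = KEEPER.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = keeper.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| Arc::new(SharedMmap::open(path)))   // assumed constructor
        .clone()   // cheap Arc clone: every caller shares the same mapping
}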

io_uring Batching: The Secret Sauce

On Linux with the FD backend enabled, Walrus batches multiple operations into a single syscall using io_uring. This is the key to sub-millisecond batch operations.

Traditional I/O vs io_uring

Traditional approach (N syscalls):

for each entry:
    pwrite(fd, data)      // syscall

Result: 2000 entries = 2000 syscalls

io_uring approach (1 syscall):

Userspace                          Kernel
─────────────────────────────────────────────

for each entry:
  prepare_write() ────────┐
  queue in SQ ring        │
                          │
submit_and_wait() ────────┼──────────────────┐
                          │                  │
    (BLOCKS)              │                  │
                          ▼                  ▼
                     io_uring processes all ops
                          │                  │
                          ▼                  │
                     completions in CQ ring  │
                          │                  │
                          └──────────────────┘
    (UNBLOCKS)            │
                          ▼
check completion results

Result: 2000 entries written with 1 syscall instead of 2000.

The Three-Phase Batch Write

When you call batch_append_for_topic(topic, entries), here’s what happens:

┌───────────────────────────────────────────────────────┐
│ Phase 1: Planning (no locks held)                     │
├───────────────────────────────────────────────────────┤
│ • Calculate entry sizes                               │
│ • Determine how many blocks needed                    │
│ • Pre-allocate blocks from BlockAllocator             │
│ • Build write plan (block, offset, size) tuples      │
└─────────────────────┬─────────────────────────────────┘
                      ▼
┌───────────────────────────────────────────────────────┐
│ Phase 2: io_uring Prep (FD backend, Linux only)      │
├───────────────────────────────────────────────────────┤
│ • Serialize all entries to buffers                    │
│ • For each entry:                                     │
│     - prep_write(fd, buf, len, offset)                │
│     - queue in submission queue (SQ)                  │
│ • NO syscalls yet (all userspace)                     │
└─────────────────────┬─────────────────────────────────┘
                      ▼
┌───────────────────────────────────────────────────────┐
│ Phase 3: Submit & Verify (SINGLE syscall)            │
├───────────────────────────────────────────────────────┤
│ • io_uring.submit_and_wait(num_entries)               │
│ • Kernel executes all writes                          │
│ • Check completion queue (CQ) for errors              │
│ • If any failure → rollback (zero headers)            │
│ • If success → update writer offset, seal blocks      │
└───────────────────────────────────────────────────────┘

Phase 1: Planning (no locks held)

// Compute block layout
let mut needed_blocks = vec![];
let mut current_block_space = self.current_block.lock().limit - offset;

for entry in batch {
    let entry_size = PREFIX_META_SIZE + entry.len();
    if entry_size > current_block_space {
        needed_blocks.push(allocator.alloc_block(entry_size)?);
        current_block_space = block_size;
    }
    current_block_space -= entry_size;
}

Why this matters: Pre-allocating blocks without holding writer locks prevents blocking other operations.

Phase 2: io_uring Preparation (FD backend only)

let ring = IoUring::new(batch.len())?;
let mut buffers = Vec::new();

for (block, entry, offset) in &write_plan {
    buffers.push(serialize_entry(entry, topic));
    let buf = buffers.last().unwrap();   // keep the buffer alive until submission completes

    unsafe {
        let sqe = ring.submission().next_sqe().unwrap();
        sqe.prep_write(
            block.fd,
            buf.as_ptr(),
            buf.len(),
            *offset
        );
    }
}

Why it’s fast: All write operations are queued in userspace.

Phase 3: Submission (single syscall)

ring.submit_and_wait(batch.len())?;

// Check completion results
for cqe in ring.completion() {
    if cqe.result() < 0 {
        // Rollback: zero headers, revert offsets
        return Err(...);
    }
}

Result: 2000 entries written with 1 syscall instead of 2000.

Batch Read Optimization

The same principle applies to batch_read_for_topic:

// Build read plan (up to max_bytes)
let plan = build_read_plan(sealed_chain, tail_block, max_bytes);

// Submit all reads via io_uring (Linux FD backend)
let ring = IoUring::new(plan.len())?;
for (block, start, end) in &plan {
    let sqe = ring.submission().next_sqe().unwrap();
    sqe.prep_read(block.fd, buffer, end - start, start);
}
ring.submit_and_wait(plan.len())?;

// Parse entries from buffers
let entries = parse_entries_from_buffers(buffers)?;

Performance impact: Batch reads complete in ~100-500 microseconds vs 2-10 milliseconds with traditional I/O.

Fallback Behavior

When io_uring is unavailable (non-Linux or mmap backend):

  • Batch writes: Sequential mmap.write() calls (still atomic via offset tracking)
  • Batch reads: Sequential block.read() calls
  • Still correct, just slower (~5-10x overhead)
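
As a rough illustration, the fallback simply replays the same write plan one entry at a time (write_plan and serialize_entry follow the names used in the io_uring phases above; the write call itself is illustrative):

// Same plan as the io_uring path, executed as plain sequential writes.
for (block, entry, offset) in &write_plan {
    let buf = serialize_entry(entry, topic);
    block.mmap.write(*offset, &buf)?;   // one write per entry instead of one batched submit
}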

Tail Reading: The Zero-Copy Optimization

One of Walrus’s cleverest tricks is allowing readers to consume from the active writer’s block without forcing block rotation.

The Problem

Traditional WAL designs require:

  1. Writer fills block completely
  2. Writer seals block
  3. Writer notifies readers
  4. Readers can now read sealed block

Issue: Low-throughput topics waste space and add latency (must wait for block to fill).

Traditional WAL (forced rotation):

Writer must wait:     Reader must wait:
    ▼                     ▼
┌─────────────────────────────────────┐
│ Block 1 [###########·············] │  ← 50% full, but...
└─────────────────────────────────────┘
         │
         │  Writer: "Need to fill whole block before sealing"
         │  Reader: "Can't read until sealed"
         │  Latency: seconds to hours for low-volume topics!
         ▼
    (waiting...)

The Walrus Solution: Tail Reads

Readers can read from the writer’s current block using snapshot semantics:

Walrus (tail reading):

Writer continues:        Reader reads immediately:
    ▼                           ▼
┌─────────────────────────────────────┐
│ Active Block [###########·········] │
│               ▲         ▲           │
│               │         │           │
│           last_read  current_offset │
└───────────────┼─────────┼───────────┘
                │         │
                │         └─ Writer appending here (atomic)
                │
                └─ Reader can read [last_read..current_offset)
                   (snapshot semantics, no locks!)

Latency: sub-millisecond!

// Reader requests snapshot
let (block_id, current_offset) = writer.snapshot_block();

// Reader reads entries from [last_offset, current_offset)
for offset in last_offset..current_offset {
    let entry = block.read(offset)?;
    process(entry);
}

Key insight: Writers only append (monotonically increasing offset), so readers can safely read behind the write cursor without blocking.

In-Memory Tail Progress

Readers track two positions:

pub struct ColReaderInfo {
    // Sealed chain position
    cur_block_idx: usize,
    cur_block_offset: u64,

    // Tail position (in active writer block)
    tail_block_id: u64,
    tail_offset: u64,
}

Reading algorithm:

  1. Try sealed chain first (fully checkpointed blocks)
  2. If exhausted, request writer snapshot
  3. Read from [tail_offset, writer.current_offset)
  4. Update tail_offset (in-memory only initially)
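
Roughly, that algorithm looks like the sketch below. Only ColReaderInfo and snapshot_block come from this page; block.len(), entry.total_size(), and writer.active_block() are assumed helpers for illustration.

fn next_entry(
    info: &mut ColReaderInfo,
    sealed_chain: &[Block],
    writer: &Writer,
) -> std::io::Result<Option<Entry>> {
    // 1. Sealed chain first (fully checkpointed blocks).
    while info.cur_block_idx < sealed_chain.len() {
        let block = &sealed_chain[info.cur_block_idx];
        if info.cur_block_offset < block.len() {
            let entry = block.read(info.cur_block_offset)?;
            info.cur_block_offset += entry.total_size();
            return Ok(Some(entry));
        }
        info.cur_block_idx += 1;      // block exhausted, move to the next sealed block
        info.cur_block_offset = 0;
    }

    // 2-3. Sealed chain exhausted: snapshot the active writer block, read the tail.
    let (block_id, current_offset) = writer.snapshot_block();
    if info.tail_offset < current_offset {
        let entry = writer.active_block().read(info.tail_offset)?;
        info.tail_block_id = block_id;
        info.tail_offset += entry.total_size();   // 4. in-memory progress only
        return Ok(Some(entry));
    }

    Ok(None)   // fully caught up
}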

The Tail Sentinel: Bit 63

When persisting read positions, Walrus distinguishes sealed vs tail using bit 63 in BlockPos:

BlockPos struct (persisted to index):

┌──────────────────────────────────────────────────┐
│ cur_block_idx: u64                               │
│ ┌──┬─────────────────────────────────────────┐  │
│ │63│ 62  ...  Block ID or Chain Index     0 │  │
│ └┬─┴─────────────────────────────────────────┘  │
│  │                                               │
│  └─ TAIL_FLAG (bit 63)                          │
│     • 0 = Reading from sealed chain             │
│     • 1 = Reading from active writer tail       │
└──────────────────────────────────────────────────┘

Examples:
  0x0000000000000005 = sealed chain, block index 5
  0x800000000000002A = tail read, block ID 42

pub struct BlockPos {
    cur_block_idx: u64,  // High bit = tail flag
    cur_block_offset: u64,
}

Encoding:

  • Bit 63 = 0: Reading from sealed chain at cur_block_idx
  • Bit 63 = 1: Reading from active writer tail (block_id in lower 63 bits)

On recovery:

if pos.cur_block_idx & (1u64 << 63) != 0 {
    // Tail read in progress
    let tail_block_id = pos.cur_block_idx & !(1u64 << 63);
    info.tail_block_id = tail_block_id;
    info.tail_offset = pos.cur_block_offset;
} else {
    // Sealed chain read
    info.cur_block_idx = pos.cur_block_idx as usize;
    info.cur_block_offset = pos.cur_block_offset;
}

Folding Tail into Sealed Chain

When the writer rotates blocks, readers must transition:

// Writer sealed block 42
reader.append_block_to_chain(topic, sealed_block_42);

// Reader's tail progress (if on block 42) moves to sealed chain
if reader.tail_block_id == sealed_block_42.id {
    reader.cur_block_idx = sealed_chain.len() - 1;
    reader.cur_block_offset = reader.tail_offset;
    reader.tail_offset = 0;  // Reset tail
}

Result: Seamless transition from in-memory tail to durable sealed chain.


Block and File State Tracking

Walrus tracks per-block and per-file state to determine when files are safe to delete. This happens entirely in userspace with zero syscalls.

┌──────────────────────────────────────────────────────────┐
│                   Global State Trackers                   │
│                  (static OnceLock<Mutex<...>>)            │
├─────────────────────────────┬────────────────────────────┤
│   BlockStateTracker         │   FileStateTracker         │
│   HashMap<BlockID, State>   │   HashMap<FilePath, State> │
├─────────────────────────────┼────────────────────────────┤
│ • block_id → file_path      │ • locked_blocks (AtomicU16)│
│ • is_checkpointed (bool)    │ • checkpoint_blocks (")    │
│                             │ • total_blocks (")         │
│                             │ • is_fully_allocated (bool)│
└─────────────────────────────┴────────────────────────────┘
              │                            │
              │ Updates trigger            │ Updates trigger
              ▼                            ▼
    ┌──────────────────┐         ┌─────────────────┐
    │ Reader finishes  │         │ Writer seals    │
    │ block            │         │ block           │
    └────────┬─────────┘         └────────┬────────┘
             │                            │
             ▼                            ▼
    set_checkpointed(id) ────────► increment counters
             │                            │
             └─────────┬──────────────────┘
                       ▼
              ┌────────────────┐
              │  flush_check() │
              │  evaluates:    │
              │  • fully_alloc?│
              │  • locked==0?  │
              │  • checkpoint  │
              │    >= total?   │
              └────────┬───────┘
                       │ All true?
                       ▼
              ┌────────────────┐
              │ Send to        │
              │ deletion queue │
              └────────────────┘

BlockStateTracker

Static global state (initialized once):

static BLOCK_STATE: OnceLock<Mutex<HashMap<u64, BlockState>>> = OnceLock::new();

struct BlockState {
    file_path: String,
    is_checkpointed: bool,
}

Operations:

  • register_block(id, path) - called when block allocated
  • set_checkpointed_true(id) - called when reader finishes block
  • Auto-increments file’s checkpointed_blocks counter
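
A hedged sketch of how set_checkpointed_true might tie into the file-level counters; increment_checkpointed_blocks is an assumed helper, and flush_check is covered under Deletion Conditions below:

fn set_checkpointed_true(block_id: u64) {
    // Mark the block checkpointed and grab its owning file path.
    let file_path = {
        let mut blocks = BLOCK_STATE.get().unwrap().lock().unwrap();
        match blocks.get_mut(&block_id) {
            Some(state) if !state.is_checkpointed => {
                state.is_checkpointed = true;
                Some(state.file_path.clone())
            }
            _ => None,
        }
    };

    // Bump the file's checkpointed_blocks counter, then re-evaluate deletion.
    if let Some(path) = file_path {
        increment_checkpointed_blocks(&path);   // assumed helper
        flush_check(&path);
    }
}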

FileStateTracker

Tracks aggregate file state:

static FILE_STATE: OnceLock<Mutex<HashMap<String, FileState>>> = OnceLock::new();

struct FileState {
    locked_blocks: u32,       // Writers holding blocks
    checkpointed_blocks: u32, // Readers finished blocks
    total_blocks: u32,        // Blocks allocated from this file
    is_fully_allocated: bool, // File reached MAX_FILE_SIZE
}

Deletion Conditions

After every checkpoint, flush_check() evaluates:

fn flush_check(file_path: &str) {
    let files = FILE_STATE.get().unwrap().lock().unwrap();
    let Some(state) = files.get(file_path) else { return };

    if state.is_fully_allocated
       && state.locked_blocks == 0
       && state.total_blocks > 0
       && state.checkpointed_blocks >= state.total_blocks {
        // Safe to delete!
        send_to_deletion_queue(file_path);
    }
}

English translation:

  • File is full (no more allocations coming)
  • No writers holding blocks (all sealed)
  • At least one block was allocated (not empty file)
  • All blocks have been read (checkpointed)

Result: Files delete automatically as soon as safe, with zero manual intervention.


Concurrency Model

Walrus supports high concurrency while maintaining safety and avoiding deadlocks.

Lock Hierarchy (to prevent deadlocks)

Lock Acquisition Order (top to bottom, NEVER reverse):

    1. Walrus.writers (RwLock<HashMap>)
       │  Brief: lookup/insert Writer
       │  Scope: Minimal
       │
       └──> Release immediately after get_or_create

    2. Writer.current_block / Writer.current_offset (Mutex)
       │  Per-topic: No cross-topic contention
       │  Scope: During write operation
       │
       └──> Can hold while allocating blocks

    3. Reader.data (outer RwLock, inner per-topic RwLock)
       │  Outer: Read lock for topic lookup
       │  Inner: Write lock for position updates
       │  Scope: Per-read operation
       │
       └──> Release before index persistence

    4. WalIndex (RwLock)
       │  Global: Protects index file
       │  Scope: Brief, during flush to disk
       │
       └──> Held LAST, released ASAP

Deadlock Prevention:
  ✓ Always acquire in this order
  ✗ Never hold multiple topic locks simultaneously
  ✗ Never acquire Walrus.writers after any other lock

Rule: Always acquire locks in this order. Never hold multiple topic locks simultaneously.

Thread Safety Guarantees

Component        Concurrency Model
Walrus           Shared across threads (Arc<Walrus>)
Writer           One per topic, callable from any thread
Reader           Shared global, per-topic locks
BlockAllocator   Thread-safe via spin lock
WalIndex         Thread-safe via RwLock
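
In practice the table above boils down to sharing one handle across threads. A usage sketch, assuming append_for_topic and read_for_topic as the single-entry counterparts of the batch APIs mentioned earlier (the method names are assumptions, not the confirmed public API):

use std::sync::Arc;
use std::thread;

fn run(wal: Arc<Walrus>) {
    let w = Arc::clone(&wal);
    let writer = thread::spawn(move || {
        // Writes take only the per-topic writer mutex.
        w.append_for_topic("metrics", b"cpu=42").unwrap();
    });

    let r = Arc::clone(&wal);
    let reader = thread::spawn(move || {
        // Reads take per-topic reader locks; they never block the writer thread.
        if let Ok(Some(entry)) = r.read_for_topic("metrics") {
            println!("read {} bytes", entry.len());
        }
    });

    writer.join().unwrap();
    reader.join().unwrap();
}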

Snapshot Semantics

Readers never block writers (and vice versa):

Timeline: Writer and Reader Operating Concurrently

Writer Thread:                Reader Thread:
─────────────────────────────────────────────────────────

t0: Lock(current_offset)
t1: Append entry A
t2: offset = 100
t3: Unlock
    ║
    ║                         t4: Request snapshot
    ║                         t5: Read offset atomically
    ║                         t6: (got: block_id=42, offset=100)
    ║                         t7: Release (no writer lock needed!)
    ║
t8: Lock(current_offset)      t8: Read entries [0..100)
t9: Append entry B            t9:   (from snapshot, no lock!)
t10: offset = 200             t10:  (writer continues freely)
t11: Unlock                   t11:
    ║                         t12: Process entries
    ║
    ║                         Later: Request new snapshot
    ║                         (got: block_id=42, offset=200)

Key: Writer and reader never wait for each other!

// Writer path
writer.append(data);  // Holds writer locks only

// Reader path
let snapshot = writer.snapshot_block();  // Quick atomic reads
// Writer lock released here!
reader.read_from_snapshot(snapshot);     // No writer involvement

Key: Snapshot captures (block_id, offset) atomically, then releases writer lock. Reader works with stale-but-consistent snapshot.

Batch Write Exclusion

Only one batch write per topic allowed:

pub struct Writer {
    is_batch_writing: AtomicBool,
    // ...
}

pub fn batch_write(&self, entries: &[&[u8]]) -> io::Result<()> {
    // Try to acquire batch lock
    if self.is_batch_writing.compare_exchange(
        false, true,
        Ordering::AcqRel,
        Ordering::Acquire
    ).is_err() {
        return Err(ErrorKind::WouldBlock.into());
    }

    // ... perform batch ...

    self.is_batch_writing.store(false, Ordering::Release);
    Ok(())
}

Why: Prevents two threads from concurrently batching to the same topic (which would corrupt block layout).

User experience: ErrorKind::WouldBlock returned if collision occurs (rare, and retryable).


Checksum Verification and Corruption Handling

Every entry carries a FNV-1a 64-bit checksum verified on read.

Entry Structure in Block:

┌────────────────────────────────────────────────────────┐
│ 2-byte Length Prefix                                   │
├────────────────────────────────────────────────────────┤
│ Metadata (rkyv serialized):                            │
│   • read_size: usize                                   │
│   • owned_by: String (topic name)                      │
│   • next_block_start: u64                              │
│   • checksum: u64  ◄─── FNV-1a(payload data)          │
├────────────────────────────────────────────────────────┤
│ Payload Data (read_size bytes)                         │
│ [user data bytes...]                                   │
└────────────────────────────────────────────────────────┘

On Read:
  1. Deserialize metadata
  2. Read payload
  3. Compute FNV-1a(payload)
  4. Compare with metadata.checksum
  5. If mismatch → ErrorKind::InvalidData

Why FNV-1a?

Algorithm   Speed     Collision Resistance   Use Case
CRC32       Fast      Good                   Network packets
FNV-1a      Faster    Good enough            In-memory hash tables, WAL entries
xxHash      Fastest   Better                 Modern checksums
SHA256      Slow      Cryptographic          Security

Decision: FNV-1a offers ~5-10 ns/byte on modern CPUs with excellent distribution for our entry sizes (KB-MB range).

Checksum Computation

pub fn checksum64(data: &[u8]) -> u64 {
    const FNV_OFFSET: u64 = 14695981039346656037;
    const FNV_PRIME: u64 = 1099511628211;

    let mut hash = FNV_OFFSET;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

Stored in metadata:

pub struct Metadata {
    pub read_size: usize,
    pub owned_by: String,
    pub next_block_start: u64,
    pub checksum: u64,  // <-- FNV-1a of entry data
}

Read-Time Verification

Every block.read(offset) call:

pub fn read(&self, offset: u64) -> io::Result<Entry> {
    // Deserialize metadata
    let meta = deserialize_metadata(offset)?;

    // Read data
    let data = read_bytes(offset + meta_size, meta.read_size);

    // Verify checksum
    let computed = checksum64(&data);
    if computed != meta.checksum {
        eprintln!("Checksum mismatch at offset {}", offset);
        return Err(ErrorKind::InvalidData.into());
    }

    Ok(Entry { data })
}

Corruption Handling

On checksum failure:

  1. Error logged to stderr (unless WALRUS_QUIET=1)
  2. Entry skipped (reader advances to next)
  3. Read operation returns None or truncated batch

Philosophy: Fail gracefully rather than crash. Corruption is logged for forensics, but system remains operational.

Recovery options:

  • Re-write corrupted entries (if source available)
  • Rebuild topic from upstream source
  • Accept data loss for corrupted range

Atomic Index Persistence

Read positions survive process restarts via WalIndex, persisted atomically using the write-tmp-rename pattern.

The Classic Atomic Write Pattern

pub fn persist(&self, db_path: &str) -> io::Result<()> {
    let tmp_path = format!("{}.tmp", db_path);

    // 1. Serialize to temporary file
    let bytes = rkyv::to_bytes(&self.positions)?;
    fs::write(&tmp_path, bytes)?;

    // 2. Fsync temporary file
    let file = File::open(&tmp_path)?;
    file.sync_all()?;
    drop(file);

    // 3. Atomic rename (POSIX guarantees atomicity)
    fs::rename(&tmp_path, db_path)?;

    // 4. Fsync parent directory (ensures rename is durable)
    let parent = Path::new(db_path).parent().unwrap();
    let dir = File::open(parent)?;
    dir.sync_all()?;

    Ok(())
}

Why This Works

POSIX guarantee: rename() is atomic—either old file or new file visible, never partial.

Crash scenarios:

  • Crash during write to .tmp: old index intact
  • Crash during fsync: .tmp may be partial, old index intact
  • Crash during rename: either old or new index visible (both valid)
  • Crash after rename, before dir fsync: new index may not survive power loss (OS-dependent)

Result: With the directory fsync, persisted read positions survive power loss, so entries are delivered at most once across restarts; without it, readers may resume from an older index and re-read entries (at-least-once delivery).

Recovery on Startup

fn load_or_rebuild_index(db_path: &str) -> WalIndex {
    match WalIndex::load(db_path) {
        Ok(index) => index,
        Err(_) => {
            eprintln!("Index corrupted or missing, rebuilding...");
            rebuild_index_from_wal_scan()
        }
    }
}

Fallback: Scan all WAL files, infer positions from checkpointed blocks.


Rollback Mechanism

Batch writes can fail mid-flight (disk full, io_uring errors, etc.). Walrus rolls back atomically.

The Challenge

Consider a batch write across 3 blocks:

Batch Write Failure Scenario:

┌─────────────────────────────────────────────────────┐
│ Block 100                                           │
│ [entry1][entry2][entry3] ✓ Written successfully    │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ Block 101                                           │
│ [entry4][entry5] ✓ Written successfully            │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ Block 102                                           │
│ [entry6] ✗ FAILED (disk full / io error)           │
└─────────────────────────────────────────────────────┘

Problem: Readers might see partial batch (entries 1-5 but not 6), violating atomicity.

The Solution: Header Zeroing

Walrus invalidates metadata headers on failure:

Rollback Process:

1. Detect failure (io_uring completion error)
   │
   ▼
2. For each block in written_blocks[]:
   │
   ├──> Block 100: zero_range(offset, PREFIX_META_SIZE)
   │    ┌─────────────────────────────────────┐
   │    │ [0x00][0x00]...    │ entry2 │ entry3│
   │    └─────────────────────────────────────┘
   │     ▲ Zeroed metadata = invalid entry
   │
   ├──> Block 101: zero_range(offset, PREFIX_META_SIZE)
   │    ┌─────────────────────────────────────┐
   │    │ [0x00][0x00]...    │ entry5         │
   │    └─────────────────────────────────────┘
   │
   └──> Block 102: (failed write, nothing to clean)

3. Revert writer.current_offset to original value
   │
4. Release provisional blocks to allocator
   │
   ▼
5. Return error to caller

Result: Readers see NO trace of failed batch (clean rollback)

fn rollback(&self, written_blocks: &[(Block, u64)]) -> io::Result<()> {
    for (block, start_offset) in written_blocks {
        // Zero the 2-byte length prefix + metadata
        block.zero_range(*start_offset, PREFIX_META_SIZE)?;
    }

    // Revert writer offsets
    self.current_offset.store(original_offset, Ordering::Release);

    // Release provisional blocks back to allocator
    for (block, _) in written_blocks {
        allocator.release(block.id);
    }

    Ok(())
}

Why this works:

  • Readers deserialize metadata by reading 2-byte length prefix
  • Zeroed prefix → invalid length → read fails gracefully
  • Even if data bytes survived, metadata corruption prevents read

Result: Failed batches leave no visible artifacts. Readers see consistent pre-batch state.


Background Worker Architecture

The background thread handles two jobs: batched fsyncs and deferred file deletion.

┌──────────────────────────────────────────────────────────────┐
│           Background Worker Event Loop (infinite)             │
└────┬────────────────────────────────────────┬────────────────┘
     │                                        │
     ▼                                        ▼
┌─────────────────┐                  ┌─────────────────┐
│  Fsync Channel  │                  │ Deletion Channel│
│ (mpsc receiver) │                  │ (mpsc receiver) │
└────────┬────────┘                  └────────┬────────┘
         │                                    │
         ▼                                    ▼
    Collect paths                        Accumulate paths
    (deduplicate)                        (pending_deletions)
         │                                    │
         ▼                                    │
    ┌─────────────────┐                      │
    │ Batch Flush:    │                      │
    │ • Open handles  │                      │
    │   from pool     │                      │
    │ • io_uring prep │                      │
    │ • submit_and_   │                      │
    │   wait() ◄──────┼── 1 syscall/batch   │
    └────────┬────────┘                      │
             │                                │
             │   Every 1000 cycles (~2 min): │
             │   ┌───────────────────────────▼──┐
             │   │ 1. file_pool.clear()         │
             │   │    (drops all handles)       │
             │   │                              │
             │   │ 2. for path in pending:      │
             └───┤      fs::remove_file(path)   │
                 │                              │
                 │ 3. pending_deletions.clear() │
                 └──────────────────────────────┘
                            │
                            ▼
                   sleep(100ms), loop

The Event Loop

fn background_worker(
    fsync_rx: mpsc::Receiver<String>,
    del_rx: mpsc::Receiver<String>,
) {
    let mut fsync_cycle = 0;
    let mut file_pool = HashMap::new();  // Cached file handles
    let mut pending_deletions = Vec::new();

    loop {
        // Phase 1: Collect paths to flush (batch up to 1 second)
        let mut paths = HashSet::new();
        while let Ok(path) = fsync_rx.try_recv() {
            paths.insert(path);
        }

        // Phase 2: Flush all files (io_uring batch on Linux)
        if !paths.is_empty() {
            // Errors are logged and the loop keeps running (fsync failures are non-fatal).
            let _ = flush_batch(&paths, &mut file_pool);
        }

        // Phase 3: Collect deletion requests
        while let Ok(path) = del_rx.try_recv() {
            pending_deletions.push(path);
        }

        // Phase 4: Perform deletions every ~1000 cycles
        fsync_cycle += 1;
        if fsync_cycle % 1000 == 0 {
            // Drop all file handles first!
            file_pool.clear();

            // Now safe to delete
            for path in pending_deletions.drain(..) {
                let _ = fs::remove_file(&path);
            }
        }

        thread::sleep(Duration::from_millis(100));
    }
}

Batched Fsync (io_uring)

When multiple writers enqueue paths:

fn flush_batch(paths: &HashSet<String>, pool: &mut HashMap<String, FdBackend>) -> io::Result<()> {
    let ring = IoUring::new(paths.len())?;

    for path in paths {
        let storage = pool.entry(path.clone())
            .or_insert_with(|| FdBackend::open(path));

        let sqe = ring.submission().next_sqe().unwrap();
        sqe.prep_fsync(storage.fd, 0);  // Queue fsync
    }

    ring.submit_and_wait(paths.len())?;  // Single syscall!
    Ok(())
}

Performance: 100 files fsynced in ~1-2 milliseconds (vs 50-100ms with sequential fsyncs).

File Handle Pooling

The file_pool caches open handles across fsync cycles:

Benefits:

  • Amortizes open() syscall cost
  • Reduces file descriptor churn
  • Better for filesystems with open/close overhead

Tradeoff: Unbounded growth if files never deleted

Solution: Purge every 1000 cycles (~2 minutes with 100ms sleep):

if fsync_cycle % 1000 == 0 {
    file_pool.clear();  // Drops all handles
}

Deletion Timing

Why wait 1000 cycles?

  1. Allows fsync queue to drain (files must be flushed before deletion)
  2. Batches deletions (reduces syscall overhead)
  3. Prevents race: file handle in file_pool while trying to delete

Sequence:

  1. flush_check() sends path to del_rx
  2. Background worker accumulates paths
  3. Every 1000 cycles: drop handles, delete files
  4. Fresh start with empty pool

Performance Deep Dive

Let’s quantify the optimizations.

Allocation Costs

Allocator Type       Latency      Syscalls      Contention
Mutex (pthread)      1-5 μs       Yes (futex)   High
Spin lock (Walrus)   200-500 ns   No            Low

Speedup: 5-25x faster allocation.

Contention handling: The spin lock uses compare_exchange_weak with the spin_loop() hint, which signals the CPU (and, on some platforms, the hypervisor) that the thread is busy-waiting so it can yield pipeline resources.

io_uring vs Sequential I/O

Batch size: 1000 entries, 1 KB each

Backend             Syscalls   Latency
Sequential pwrite   1000       ~10 ms
io_uring batch      1          ~0.5 ms

Speedup: 20x reduction in latency.

Tail Read Optimization

Scenario: Low-throughput topic (1 msg/sec, 10 MB blocks)

Strategy              Block Rotation Frequency   Waste
Seal-only reads       Every 10 MB                0%
Tail reads (Walrus)   Every 10 MB                0%, but immediate reads

Benefit: Read latency drops from “when block fills” (~10,000 seconds) to “immediately” (<1ms).

Fsync Batching

Scenario: 100 active topics, fsync every 1 second

Strategy                 Fsyncs/sec   Syscalls/sec
Per-topic fsync          100          100
Batched fsync (Walrus)   1            1

Reduction: 100x fewer syscalls.


Design Philosophy

A few words on why Walrus is built this way.

No External Dependencies (Where Possible)

Notice the lack of third-party concurrency crates:

  • No crossbeam, flume, or async
  • Standard library mpsc channels
  • Hand-rolled spin locks

Why? Predictability and control. Every microsecond matters in a WAL, and understanding exactly how your primitives behave (syscalls, memory ordering, contention) is critical.

Zero-Copy Where Safe

  • Memory-mapped files eliminate userspace/kernel copies
  • io_uring reduces buffer copying
  • rkyv avoids serialization overhead (zero-copy deserialization)

Tradeoff: Unsafe code, careful alignment, manual memory management. Worth it for the throughput gains.

Fail Gracefully

Corruption, disk full, slow peers—systems fail. Walrus logs, skips, and continues:

  • Checksum failures → skip entry, log error
  • Batch write failures → rollback, return error
  • Background fsync failures → log, continue

Philosophy: Availability over correctness for transient errors. Permanent corruption (bad checksums) is logged for forensic analysis.

Optimize for the 99th Percentile

Spin locks hurt worst-case latency (unbounded spinning). But:

  • P50 latency: 200 ns (vs 2 μs for mutex)
  • P99 latency: 1 μs (vs 5 μs for mutex)
  • P99.9 latency: 10 μs (vs 50 μs for mutex)

Result: Better tail latency despite theoretical unbounded worst case.


Future Directions

Walrus is production-ready for single-node workloads. Distributed features are in progress (see WIP files in the repo).

Coming eventually:

  • Raft-based cluster consensus (see distributed coordination.md)
  • Quorum writes with leader/follower replication
  • Hierarchical consensus (sub-cluster leases)
  • Lock-free MPSC/SPSC queues for inter-node ACKs (see quorum writes.md)

The architecture is designed to extend cleanly—distributed features will layer on top without changing the core WAL engine.


Closing Thoughts

Walrus achieves high performance through careful engineering at every layer:

  • Spin locks eliminate syscalls in hot paths
  • io_uring batches operations to single syscalls
  • Tail reads provide immediate consistency without write amplification
  • Dual backends balance portability and performance

If you made it this far, you now know more about Walrus internals than most database engineers know about their WALs. Go build something fast.