Troubleshooting

Common issues, debugging strategies, and operational guidance for running Walrus in production.

Table of Contents

  1. Troubleshooting
    1. Common Issues
      1. Checksum Verification Failures
      2. ErrorKind::WouldBlock on Batch Writes
      3. Disk Space Exhaustion
      4. Slow Performance
        1. Check fsync schedule
        2. Check backend (Linux only)
        3. Check for lock contention
        4. Check disk I/O
        5. Common bottlenecks
      5. Index Corruption
      6. Out of Memory (OOM)
    2. Performance Tuning
      1. Maximizing Write Throughput
      2. Minimizing Read Latency
      3. Balancing Durability vs Speed
    3. Debugging Strategies
      1. Enable Debug Logging
      2. Inspect WAL Files
      3. Monitor Background Worker
      4. Verify Index Persistence
    4. Production Checklist
    5. Getting Help
    6. Known Limitations

Common Issues

Checksum Verification Failures

Symptom:

Checksum mismatch at offset 12345

Causes:

  1. Disk corruption (bad sectors, failing drive)
  2. Memory corruption (bad RAM, cosmic rays)
  3. Process crash during write (partial entry written)
  4. Manual file modification

Diagnosis:

# Check system logs for disk errors
dmesg | grep -i "error\|fail"

# Run filesystem check (unmount first!)
fsck /dev/sdX

# Test RAM with memtest86+

Resolution:

  • If rare (< 1 per million entries): Likely cosmic ray or transient error, monitor and continue
  • If frequent: Hardware issue, replace disk/RAM immediately
  • If correlated with crashes: Bug in write path, file an issue with reproduction steps

Data recovery: Walrus skips corrupted entries and continues. To recover:

// Re-write from source if available
for entry in source_data {
    wal.append_for_topic("recovered", entry)?;
}

ErrorKind::WouldBlock on Batch Writes

Symptom:

wal.batch_append_for_topic("my-topic", &entries)?;
// Error: ErrorKind::WouldBlock

Cause: Another thread is currently performing a batch write to the same topic.

Why this happens: Only one batch write per topic is allowed at a time (enforced via an AtomicBool flag). This prevents concurrent batches from corrupting the block layout.

Resolution:

// Option 1: Retry after brief delay
loop {
    match wal.batch_append_for_topic(topic, &entries) {
        Ok(()) => break,
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            thread::sleep(Duration::from_micros(100));
            continue;
        }
        Err(e) => return Err(e),
    }
}

// Option 2: Use regular append (slower but never blocks)
for entry in entries {
    wal.append_for_topic(topic, entry)?;
}

// Option 3: Coordinate batches at application level
// (one producer per topic for batch writes)

Prevention: Design your application so only one thread performs batch writes per topic, or use a coordinator.
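
A minimal sketch of the coordinator approach, assuming entries can be funneled over a standard mpsc channel to a single dedicated thread that owns all batch writes for the topic (the topic name and error handling are illustrative):

use std::sync::mpsc;
use std::thread;

// One dedicated batcher thread per topic; producers send entries over a channel
// instead of calling batch_append_for_topic themselves, so WouldBlock never occurs.
let (tx, rx) = mpsc::channel::<Vec<u8>>();

thread::spawn(move || {
    let mut pending: Vec<Vec<u8>> = Vec::new();
    while let Ok(entry) = rx.recv() {
        pending.push(entry);
        // Drain whatever else is already queued, then write one batch
        while let Ok(more) = rx.try_recv() {
            pending.push(more);
        }
        let refs: Vec<&[u8]> = pending.iter().map(|e| e.as_slice()).collect();
        if let Err(e) = wal.batch_append_for_topic("my-topic", &refs) {
            eprintln!("batch append failed: {e}");
        }
        pending.clear();
    }
});

// Producers just send:
tx.send(b"event-payload".to_vec()).unwrap();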


Disk Space Exhaustion

Symptom:

Error: No space left on device (os error 28)

Causes:

  1. Files aren’t being deleted (readers not checkpointing)
  2. Write throughput exceeds read throughput (backlog growing)
  3. Insufficient disk capacity for workload

Diagnosis:

# Check Walrus data directory size
du -sh wal_files/

# Count files
ls -1 wal_files/ | wc -l

# Check the oldest file (its timestamp should be recent if deletion is working)
ls -lt wal_files/ | tail -n 1

Resolution:

If files aren’t being deleted:

// Ensure readers checkpoint regularly
while let Some(entry) = wal.read_next(topic, true)? {  // true = checkpoint!
    process(entry);
}

// With AtLeastOnce, ensure persist_every isn't too large
let wal = Walrus::with_consistency(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 }  // Not 1_000_000!
)?;

If backlog is growing:

  • Scale read capacity (more reader threads/processes; see the sketch after this list)
  • Reduce write rate at source
  • Add more disk space
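
A sketch of adding per-topic reader threads, assuming a single Walrus handle can be shared behind an Arc (as in the writer example later in this guide); the topic names, the process function, and the empty-batch-means-caught-up check are illustrative assumptions:

use std::sync::Arc;
use std::thread;
use std::time::Duration;

let wal = Arc::new(wal);
for topic in ["orders", "payments", "audit"] {
    let wal = Arc::clone(&wal);
    thread::spawn(move || loop {
        // Drain up to ~1 MB per call with checkpointing so old files become deletable
        match wal.batch_read_for_topic(topic, 1_048_576, true) {
            Ok(entries) if entries.is_empty() => thread::sleep(Duration::from_millis(10)),
            Ok(entries) => entries.into_iter().for_each(process),
            Err(e) => eprintln!("read failed for {topic}: {e}"),
        }
    });
}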

Emergency cleanup:

# Find fully checkpointed files (manual intervention)
# WARNING: Only delete if you're certain readers are done!
find wal_files/ -name "read_offset_idx_index.db" -exec cat {} \;
# Compare with file list, manually remove ancient files

Slow Performance

Symptom: Throughput lower than expected.

Diagnosis checklist:

1. Check fsync schedule

// Too aggressive?
Walrus::with_consistency_and_schedule(
    ReadConsistency::StrictlyAtOnce,
    FsyncSchedule::SyncEach,  // VERY slow!
)?;

// Better for throughput:
Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 5_000 },
    FsyncSchedule::Milliseconds(2_000),
)?;

2. Check backend (Linux only)

# Set WALRUS_QUIET=0 to see backend selection
WALRUS_QUIET=0 cargo run

# Output should show:
# "Using FD backend with io_uring"

# If it says "Using mmap backend", call enable_fd_backend() from your
# Rust code before constructing the instance (see the sketch below)
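
A minimal sketch of forcing the FD backend at startup, assuming the call should happen before any instance is constructed (the constructor shown is one of the variants used elsewhere in this guide):

// Call early, before constructing Walrus instances, so the FD/io_uring path is selected
enable_fd_backend();

let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;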

3. Check for lock contention

# Profile with perf (Linux)
perf record -g ./your_program
perf report

# Look for:
# - High time in spin loops (BlockAllocator)
# - Mutex contention (Writer locks)
# - Syscalls (fsync, write)

4. Check disk I/O

# Monitor disk utilization
iostat -x 1

# %util should be high (90-100%) if disk-bound
# If low, you're CPU or lock-bound

5. Common bottlenecks

Symptom             | Cause                  | Fix
Low CPU, low disk   | Lock contention        | Reduce topics per thread, use more threads
High CPU, low disk  | Serialization overhead | Use larger entries (batch small items)
Low CPU, high disk  | Disk-bound             | Faster SSD, reduce fsync frequency
High syscalls       | Mmap backend on Linux  | enable_fd_backend()
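
For the "High CPU, low disk" row, one way to batch small items into larger entries is a simple length-prefixed packing at the application level; the framing scheme and the small_records source here are examples, not part of Walrus:

// Pack many small records into one WAL entry; readers reverse the framing.
let small_records: Vec<Vec<u8>> = collect_records();  // hypothetical source
let mut packed = Vec::new();
for record in &small_records {
    packed.extend_from_slice(&(record.len() as u32).to_le_bytes());
    packed.extend_from_slice(record);
}
wal.append_for_topic("events", &packed)?;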

Index Corruption

Symptom:

Index corrupted or missing, rebuilding...

Causes:

  1. Process killed during index write
  2. Filesystem corruption
  3. Manual deletion of read_offset_idx_index.db

Automatic recovery: Walrus rebuilds the index by scanning WAL files and inferring positions from checkpointed state.

Manual intervention:

# Delete corrupted index (Walrus will rebuild)
rm wal_files/read_offset_idx_index.db

# Or for keyed instance:
rm wal_files/my-instance-key/read_offset_idx_index.db

Prevention:

  • Use FsyncSchedule::Milliseconds(n) to ensure index writes are durable
  • Avoid kill -9 (use SIGTERM for graceful shutdown; see the sketch below)
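
A hedged sketch of a graceful-shutdown loop; the signal_hook crate is an assumption used for illustration here, not a Walrus dependency:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

let shutdown = Arc::new(AtomicBool::new(false));
// Flip the flag when SIGTERM arrives instead of dying mid-write under kill -9
signal_hook::flag::register(signal_hook::consts::SIGTERM, Arc::clone(&shutdown))?;

while !shutdown.load(Ordering::Relaxed) {
    // ... normal append/read work ...
}

// Exiting cleanly (and dropping `wal`) gives any shutdown-time cleanup a chance
// to run before the process terminates, unlike kill -9.
drop(wal);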

Out of Memory (OOM)

Symptom: Process killed by OOM killer or allocation failure.

Causes:

  1. Too many concurrent topics (each has reader/writer state)
  2. Large batch sizes (entries buffered in memory)
  3. Memory leaks (file handles not released)

Diagnosis:

# Monitor memory usage
top -p $(pgrep your_program)

# Check mmap count (Linux)
cat /proc/$(pgrep your_program)/maps | wc -l

# Should be ~1-2 per WAL file, if hundreds → leak

Resolution:

Large batches:

// Don't do this:
let huge_batch: Vec<&[u8]> = vec![/* 10,000 entries */];
wal.batch_append_for_topic(topic, &huge_batch)?;

// Do this instead:
for chunk in huge_batch.chunks(1000) {
    wal.batch_append_for_topic(topic, chunk)?;
}

Too many topics:

// Avoid creating millions of topics
// Multiplex into fewer topics with entry metadata instead
let entry = serialize(Metadata { user_id: 12345, data: payload });
wal.append_for_topic("events", &entry)?;

Performance Tuning

Maximizing Write Throughput

// 1. Disable durability (benchmarking only!)
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 100_000 },
    FsyncSchedule::NoFsync,
)?;

// 2. Use FD backend (Linux)
enable_fd_backend();

// 3. Batch writes
wal.batch_append_for_topic(topic, &entries)?;  // Not append() in loop

// 4. Multiple writer threads (one topic per thread)
let wal = Arc::new(wal);
for topic in topics {
    let wal = Arc::clone(&wal);
    thread::spawn(move || {
        // `?` won't compile in a closure returning (); handle the error explicitly
        if let Err(e) = wal.append_for_topic(&topic, data) {
            eprintln!("append failed for {topic}: {e}");
        }
    });
}

Expected throughput (consumer laptop):

  • Single thread: ~200-300k ops/sec
  • 8 threads: ~1M ops/sec
  • 16 threads: ~1.5M ops/sec (disk-bound)

Minimizing Read Latency

// 1. StrictlyAtOnce for immediate visibility
let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;

// 2. Checkpoint every read
while let Some(entry) = wal.read_next(topic, true)? {  // true = checkpoint
    process(entry);
}

// 3. Batch reads for throughput
let entries = wal.batch_read_for_topic(topic, 1_048_576, true)?;  // 1 MB

Balancing Durability vs Speed

Priority    | Configuration
Speed       | AtLeastOnce{10_000} + NoFsync
Balance     | AtLeastOnce{1_000} + Milliseconds(1_000)
Durability  | StrictlyAtOnce + SyncEach
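
The same three rows expressed as constructor calls, using the constructors shown earlier in this guide (pick the one matching your priority):

// Speed
let fast = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 10_000 },
    FsyncSchedule::NoFsync,
)?;

// Balance
let balanced = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 },
    FsyncSchedule::Milliseconds(1_000),
)?;

// Durability (SyncEach fsyncs every write; slowest)
let durable = Walrus::with_consistency_and_schedule(
    ReadConsistency::StrictlyAtOnce,
    FsyncSchedule::SyncEach,
)?;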

Debugging Strategies

Enable Debug Logging

# See all internal events
WALRUS_QUIET=0 ./your_program

# Output includes:
# - Backend selection
# - Block allocations
# - File creations
# - Checksum failures
# - Deletion triggers

Inspect WAL Files

# List files in data directory
ls -lh wal_files/

# Files are named by epoch milliseconds:
# 1700000000 (older)
# 1700000100
# 1700000200 (newer)

# Check file sizes (should be ~1 GB when full)
du -h wal_files/*

# Partially filled files indicate active allocation

Monitor Background Worker

The background thread logs fsync and deletion events:

[Background] Flushed 15 files
[Background] Deleted file: wal_files/1700000000

If you never see deletions, readers aren’t checkpointing.

Verify Index Persistence

# Index files (one per instance)
ls -lh wal_files/read_offset_idx_index.db

# For keyed instances
ls -lh wal_files/*/read_offset_idx_index.db

# Check modification time (should update regularly if readers active)
stat wal_files/read_offset_idx_index.db

Production Checklist

Before deploying Walrus to production (a combined startup sketch follows the checklist):

  • Choose appropriate ReadConsistency for your durability needs
  • Set FsyncSchedule based on acceptable data loss window
  • Enable FD backend on Linux (enable_fd_backend())
  • Set WALRUS_DATA_DIR to dedicated disk/partition
  • Monitor disk space (alert at 80% usage)
  • Test crash recovery (kill -9, verify reads resume)
  • Load test with realistic workload (topics, entry sizes, concurrency)
  • Verify file deletion works (run for hours, check old files deleted)
  • Set up monitoring for checksum failures
  • Document your instance key strategy (if using keyed instances)
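
A combined startup sketch covering the configuration items above; the data directory path is an example, and the env var, function, and constructor names are those used elsewhere in this guide:

// Set WALRUS_DATA_DIR before launch, pointing at a dedicated disk/partition, e.g.:
//   WALRUS_DATA_DIR=/mnt/wal/wal_files ./your_program

// Prefer the FD/io_uring backend on Linux
#[cfg(target_os = "linux")]
enable_fd_backend();

// Pick the consistency/fsync pair that matches your acceptable data-loss window
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 },
    FsyncSchedule::Milliseconds(1_000),
)?;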

Getting Help

If you encounter issues not covered here:

  1. Check the logs with WALRUS_QUIET=0
  2. Search existing issues: GitHub Issues
  3. File a bug report with:
    • Walrus version
    • Operating system and kernel version
    • Reproduction steps
    • Relevant logs
    • Configuration (consistency, fsync schedule, backend)

For performance issues, include:

  • Hardware specs (CPU, disk type, RAM)
  • Benchmark results (make bench-writes, etc.)
  • Output of perf record (Linux) or profiler

Known Limitations

Current release (v0.1.0):

  • Single-node only (no replication)
  • Maximum entry size: 1 GB (file size limit)
  • io_uring optimizations Linux-only
  • No built-in compression
  • No encryption at rest

Workarounds:

  • Large entries: Split at application level
  • Compression: Compress before calling append_for_topic() (sketched below)
  • Encryption: Use dm-crypt/LUKS for disk encryption
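
A hedged sketch of application-level compression; the zstd crate, the serialize helper, and the entry-to-bytes handling are illustrative assumptions, not part of Walrus:

// Compress each entry before appending
let payload: Vec<u8> = serialize(&event)?;                   // hypothetical serializer
let compressed = zstd::encode_all(payload.as_slice(), 3)?;   // compression level 3
wal.append_for_topic("events", &compressed)?;

// On the read side, decompress after read_next (assumes the entry exposes its bytes as &[u8])
if let Some(entry) = wal.read_next("events", true)? {
    let original = zstd::decode_all(&entry[..])?;
    process(original);
}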

Future releases will address distributed features (see roadmap).