# Troubleshooting

Common issues, debugging strategies, and operational guidance for running Walrus in production.
## Common Issues
### Checksum Verification Failures

**Symptom:**

```
Checksum mismatch at offset 12345
```
**Causes:**
- Disk corruption (bad sectors, failing drive)
- Memory corruption (bad RAM, cosmic rays)
- Process crash during write (partial entry written)
- Manual file modification
**Diagnosis:**

```bash
# Check system logs for disk errors
dmesg | grep -i "error\|fail"

# Run filesystem check (unmount the filesystem first!)
fsck /dev/sdX

# Test RAM with memtest86+
```
**Resolution:**
- If rare (< 1 per million entries): Likely cosmic ray or transient error, monitor and continue
- If frequent: Hardware issue, replace disk/RAM immediately
- If correlated with crashes: Bug in write path, file an issue with reproduction steps
**Data recovery:** Walrus skips corrupted entries and continues. To recover, re-write the lost entries from the original source if one is available:

```rust
// Re-write from source if available
for entry in source_data {
    wal.append_for_topic("recovered", entry)?;
}
```
### `ErrorKind::WouldBlock` on Batch Writes

**Symptom:**

```rust
wal.batch_append_for_topic("my-topic", &entries)?;
// Error: ErrorKind::WouldBlock
```
**Cause:** Another thread is currently performing a batch write to the same topic.

**Why this happens:** Only one batch write per topic is allowed at a time (enforced via an `AtomicBool` flag). This prevents concurrent batches from corrupting the block layout.
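For intuition, here is a sketch of that guard pattern (illustrative only, not Walrus's actual source): the first caller atomically flips the flag and proceeds; any concurrent caller finds it already set and gets `WouldBlock`.

```rust
use std::io::{Error, ErrorKind};
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative sketch of a per-topic batch guard. The first caller wins
// the compare-exchange; everyone else gets ErrorKind::WouldBlock.
fn try_begin_batch(in_batch: &AtomicBool) -> std::io::Result<()> {
    in_batch
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .map_err(|_| Error::new(ErrorKind::WouldBlock, "batch write already in progress"))?;
    Ok(()) // the winner must in_batch.store(false, Ordering::Release) when done
}
```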
**Resolution:**

```rust
// Option 1: Retry after a brief delay
loop {
    match wal.batch_append_for_topic(topic, &entries) {
        Ok(()) => break,
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            thread::sleep(Duration::from_micros(100));
            continue;
        }
        Err(e) => return Err(e),
    }
}

// Option 2: Use regular appends (slower, but never blocks)
for entry in entries {
    wal.append_for_topic(topic, entry)?;
}

// Option 3: Coordinate batches at the application level
// (one producer per topic for batch writes)
```
**Prevention:** Design your application so that only one thread performs batch writes per topic, or funnel all batches for a topic through a coordinator, sketched below.
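If a coordinator fits your design, a minimal sketch is a dedicated thread per topic that owns all batch writes, fed through a channel. This is application-level code, not part of the Walrus API; it assumes `wal: Arc<Walrus>` and the `batch_append_for_topic(&str, &[&[u8]])` shape used in the examples above.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// One coordinator thread owns every batch write for `topic`, so
// batch_append_for_topic never sees a concurrent caller and never
// returns ErrorKind::WouldBlock.
fn spawn_batch_coordinator(wal: Arc<Walrus>, topic: String) -> mpsc::Sender<Vec<Vec<u8>>> {
    let (tx, rx) = mpsc::channel::<Vec<Vec<u8>>>();
    thread::spawn(move || {
        for batch in rx {
            // Borrow the owned buffers in the &[&[u8]] shape used above.
            let refs: Vec<&[u8]> = batch.iter().map(|b| b.as_slice()).collect();
            if let Err(e) = wal.batch_append_for_topic(&topic, &refs) {
                eprintln!("batch write to {topic} failed: {e}");
            }
        }
    });
    tx // producers clone this Sender and send their batches through it
}
```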
### Disk Space Exhaustion

**Symptom:**

```
Error: No space left on device (os error 28)
```
**Causes:**
- Files aren’t being deleted (readers not checkpointing)
- Write throughput exceeds read throughput (backlog growing)
- Insufficient disk capacity for workload
**Diagnosis:**

```bash
# Check Walrus data directory size
du -sh wal_files/

# Count files
ls -1 wal_files/ | wc -l

# Check the oldest file (should be recent if deletion is working)
ls -lt wal_files/ | tail -n 1
```
**Resolution:**

If files aren't being deleted:

```rust
// Ensure readers checkpoint regularly
while let Some(entry) = wal.read_next(topic, true)? { // true = checkpoint!
    process(entry);
}

// With AtLeastOnce, ensure persist_every isn't too large
let wal = Walrus::with_consistency(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 }, // Not 1_000_000!
)?;
```
If the backlog is growing:

- Scale read capacity (more reader threads or processes; see the sketch after this list)
- Reduce the write rate at the source
- Add more disk space
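A minimal sketch of scaling read capacity with one thread per topic, assuming the `Arc<Walrus>` pattern from the tuning section below and that `read_next` returns `io::Result<Option<...>>` as the earlier examples suggest; the topic names are hypothetical and `process` is a placeholder for your handler:

```rust
use std::sync::Arc;
use std::thread;

let wal = Arc::new(wal);
let topics = vec!["orders".to_string(), "payments".to_string()];
let readers: Vec<_> = topics
    .into_iter()
    .map(|topic| {
        let wal = Arc::clone(&wal);
        thread::spawn(move || -> std::io::Result<()> {
            // true = checkpoint, so drained files become deletable
            while let Some(entry) = wal.read_next(&topic, true)? {
                process(entry);
            }
            Ok(())
        })
    })
    .collect();
```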
**Emergency cleanup:**

```bash
# Find fully checkpointed files (manual intervention)
# WARNING: Only delete if you're certain readers are done!
find wal_files/ -name "read_offset_idx_index.db" -exec cat {} \;

# Compare with the file list, then manually remove ancient files
```
### Slow Performance

**Symptom:** Throughput lower than expected.

**Diagnosis checklist:**
1. **Check the fsync schedule**

   ```rust
   // Too aggressive?
   let wal = Walrus::with_consistency_and_schedule(
       ReadConsistency::StrictlyAtOnce,
       FsyncSchedule::SyncEach, // VERY slow!
   )?;

   // Better for throughput:
   let wal = Walrus::with_consistency_and_schedule(
       ReadConsistency::AtLeastOnce { persist_every: 5_000 },
       FsyncSchedule::Milliseconds(2_000),
   )?;
   ```
2. **Check the backend (Linux only)**

   ```bash
   # Set WALRUS_QUIET=0 to see backend selection
   WALRUS_QUIET=0 cargo run

   # Output should show:
   # "Using FD backend with io_uring"
   ```

   If it reports "Using mmap backend" instead, force the FD backend by calling `enable_fd_backend()`.
3. **Check for lock contention**

   ```bash
   # Profile with perf (Linux)
   perf record -g ./your_program
   perf report

   # Look for:
   # - High time in spin loops (BlockAllocator)
   # - Mutex contention (Writer locks)
   # - Syscalls (fsync, write)
   ```
4. **Check disk I/O**

   ```bash
   # Monitor disk utilization
   iostat -x 1

   # %util should be high (90-100%) if disk-bound
   # If it's low, you're CPU- or lock-bound
   ```
5. **Common bottlenecks**

   | Symptom | Cause | Fix |
   |---|---|---|
   | Low CPU, low disk | Lock contention | Reduce topics per thread, use more threads |
   | High CPU, low disk | Serialization overhead | Use larger entries (batch small items; see below) |
   | Low CPU, high disk | Disk-bound | Faster SSD, reduce fsync frequency |
   | High syscalls | Mmap backend on Linux | `enable_fd_backend()` |
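For the serialization-overhead row, the usual fix is to pack many small items into one larger entry before appending. A minimal sketch with `u32` length prefixes (the framing format is an illustration, not something Walrus prescribes):

```rust
// Pack small items into one entry with little-endian u32 length prefixes.
fn pack(items: &[Vec<u8>]) -> Vec<u8> {
    let mut buf = Vec::new();
    for item in items {
        buf.extend_from_slice(&(item.len() as u32).to_le_bytes());
        buf.extend_from_slice(item);
    }
    buf
}

// Recover the individual items on the read side.
fn unpack(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut items = Vec::new();
    while buf.len() >= 4 {
        let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        items.push(buf[4..4 + len].to_vec());
        buf = &buf[4 + len..];
    }
    items
}
```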
### Index Corruption

**Symptom:**

```
Index corrupted or missing, rebuilding...
```
**Causes:**

- Process killed during an index write
- Filesystem corruption
- Manual deletion of `read_offset_idx_index.db`
**Automatic recovery:** Walrus rebuilds the index by scanning WAL files and inferring positions from checkpointed state.

**Manual intervention:**
```bash
# Delete the corrupted index (Walrus will rebuild it)
rm wal_files/read_offset_idx_index.db

# Or, for a keyed instance:
rm wal_files/my-instance-key/read_offset_idx_index.db
```
**Prevention:**

- Use `FsyncSchedule::Milliseconds(n)` to ensure index writes are durable
- Avoid `kill -9`; use `SIGTERM` for a graceful shutdown, as in the sketch after this list
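One way to turn `SIGTERM` into a clean exit is a shutdown flag. This sketch assumes the third-party `signal-hook` crate as a dependency and a read loop like the ones above; the constructor, topic name, and `process` handler are placeholders:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() -> std::io::Result<()> {
    // Set the flag when SIGTERM arrives instead of dying mid-write.
    let shutdown = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGTERM, Arc::clone(&shutdown))?;

    let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;
    while !shutdown.load(Ordering::Relaxed) {
        if let Some(entry) = wal.read_next("events", true)? {
            process(entry);
        }
    }
    Ok(()) // exit normally so in-flight index writes can complete
}
```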
### Out of Memory (OOM)

**Symptom:** Process killed by the OOM killer, or allocation failure.

**Causes:**
- Too many concurrent topics (each has reader/writer state)
- Large batch sizes (entries buffered in memory)
- Memory leaks (file handles not released)
**Diagnosis:**

```bash
# Monitor memory usage
top -p $(pgrep your_program)

# Check mmap count (Linux)
cat /proc/$(pgrep your_program)/maps | wc -l
# Expect ~1-2 mappings per WAL file; hundreds per file suggest a leak
```
**Resolution:**

Large batches:

```rust
// Don't do this:
let huge_batch: Vec<&[u8]> = vec![/* 10,000 entries */];
wal.batch_append_for_topic(topic, &huge_batch)?;

// Do this instead:
for chunk in huge_batch.chunks(1000) {
    wal.batch_append_for_topic(topic, chunk)?;
}
```
Too many topics:

```rust
// Avoid creating millions of topics.
// Multiplex into fewer topics with entry metadata instead:
let entry = serialize(Metadata { user_id: 12345, data: payload });
wal.append_for_topic("events", &entry)?;
```
## Performance Tuning

### Maximizing Write Throughput
```rust
// 1. Disable durability (benchmarking only!)
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 100_000 },
    FsyncSchedule::NoFsync,
)?;

// 2. Use the FD backend (Linux)
enable_fd_backend();

// 3. Batch writes
wal.batch_append_for_topic(topic, &entries)?; // Not append() in a loop

// 4. Multiple writer threads
let wal = Arc::new(wal);
for topic in topics {
    let wal = Arc::clone(&wal);
    thread::spawn(move || {
        // `?` can't propagate out of a closure returning (), so handle errors here
        wal.append_for_topic(&topic, data).expect("append failed");
    });
}
```
**Expected throughput (consumer laptop):**
- Single thread: ~200-300k ops/sec
- 8 threads: ~1M ops/sec
- 16 threads: ~1.5M ops/sec (disk-bound)
### Minimizing Read Latency

```rust
// 1. StrictlyAtOnce for immediate visibility
let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;

// 2. Checkpoint on every read
while let Some(entry) = wal.read_next(topic, true)? { // true = checkpoint
    process(entry);
}

// 3. Batch reads for throughput
let entries = wal.batch_read_for_topic(topic, 1_048_576, true)?; // 1 MiB budget
```
### Balancing Durability vs. Speed

| Priority | Configuration |
|---|---|
| Speed | `AtLeastOnce { persist_every: 10_000 }` + `NoFsync` |
| Balance | `AtLeastOnce { persist_every: 1_000 }` + `Milliseconds(1_000)` |
| Durability | `StrictlyAtOnce` + `SyncEach` |
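For instance, the Balance row spelled out with the constructor used earlier:

```rust
// Balanced durability vs. speed: persist read state every 1,000 entries,
// fsync dirty files every second.
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 },
    FsyncSchedule::Milliseconds(1_000),
)?;
```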
## Debugging Strategies

### Enable Debug Logging

```bash
# See all internal events
WALRUS_QUIET=0 ./your_program

# Output includes:
# - Backend selection
# - Block allocations
# - File creations
# - Checksum failures
# - Deletion triggers
```
### Inspect WAL Files

```bash
# List files in the data directory
ls -lh wal_files/

# Files are named by epoch milliseconds:
# 1700000000 (older)
# 1700000100
# 1700000200 (newer)

# Check file sizes (should be ~1 GB when full)
du -h wal_files/*
# Partially filled files indicate active allocation
```
### Monitor the Background Worker

The background thread logs fsync and deletion events:

```
[Background] Flushed 15 files
[Background] Deleted file: wal_files/1700000000
```

If you never see deletions, readers aren't checkpointing.
### Verify Index Persistence

```bash
# Index files (one per instance)
ls -lh wal_files/read_offset_idx_index.db

# For keyed instances
ls -lh wal_files/*/read_offset_idx_index.db

# Check the modification time (should update regularly if readers are active)
stat wal_files/read_offset_idx_index.db
```
## Production Checklist

Before deploying Walrus to production:

- Choose an appropriate `ReadConsistency` for your durability needs
- Set a `FsyncSchedule` based on your acceptable data-loss window
- Enable the FD backend on Linux (`enable_fd_backend()`)
- Set `WALRUS_DATA_DIR` to a dedicated disk/partition
- Monitor disk space (alert at 80% usage)
- Test crash recovery (`kill -9`, then verify reads resume; see the sketch after this list)
- Load test with a realistic workload (topics, entry sizes, concurrency)
- Verify file deletion works (run for hours, check that old files are deleted)
- Set up monitoring for checksum failures
- Document your instance-key strategy (if using keyed instances)
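A minimal sketch of the crash-recovery test, assuming the constructors shown earlier and a hypothetical `crash-test` topic. Run it with `--write`, `kill -9` the process partway through, then run it again without the flag:

```rust
fn main() -> std::io::Result<()> {
    let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;

    if std::env::args().any(|a| a == "--write") {
        // Phase 1: write until killed.
        for i in 0u64.. {
            wal.append_for_topic("crash-test", &i.to_le_bytes())?;
        }
    }

    // Phase 2 (after restart): entries written before the kill should
    // still read back in order.
    let mut recovered = 0u64;
    while let Some(_entry) = wal.read_next("crash-test", true)? {
        recovered += 1;
    }
    println!("recovered {recovered} entries after crash");
    Ok(())
}
```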
## Getting Help

If you encounter issues not covered here:

- Check the logs with `WALRUS_QUIET=0`
- Search existing issues: GitHub Issues
- File a bug report with:
  - Walrus version
  - Operating system and kernel version
  - Reproduction steps
  - Relevant logs
  - Configuration (consistency, fsync schedule, backend)
For performance issues, include:

- Hardware specs (CPU, disk type, RAM)
- Benchmark results (`make bench-writes`, etc.)
- Output of `perf record` (Linux) or a profiler
## Known Limitations

Current release (v0.1.0):
- Single-node only (no replication)
- Maximum entry size: 1 GB (file size limit)
- io_uring optimizations are Linux-only
- No built-in compression
- No encryption at rest
**Workarounds:**

- Large entries: Split at the application level
- Compression: Compress before calling `append_for_topic()` (see the sketch below)
- Encryption: Use dm-crypt/LUKS for disk encryption
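A minimal sketch of the compression workaround, assuming the third-party `flate2` crate as a dependency; any compressor works the same way, since Walrus just sees opaque bytes:

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;

// Compress the payload before handing it to Walrus.
fn append_compressed(wal: &Walrus, topic: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(payload)?;
    let compressed = encoder.finish()?;
    wal.append_for_topic(topic, &compressed)?;
    Ok(())
}
```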
Future releases will address distributed features (see roadmap).