# Troubleshooting

Common issues, debugging strategies, and operational guidance for running Walrus in production.
## Common Issues
### Checksum Verification Failures

**Symptom:**

```
Checksum mismatch at offset 12345
```
**Causes:**
- Disk corruption (bad sectors, failing drive)
- Memory corruption (bad RAM, cosmic rays)
- Process crash during write (partial entry written)
- Manual file modification
**Diagnosis:**

```bash
# Check system logs for disk errors
dmesg | grep -i "error\|fail"

# Run filesystem check (unmount the filesystem first!)
fsck /dev/sdX

# Test RAM with memtest86+
```
**Resolution:**
- If rare (< 1 per million entries): Likely cosmic ray or transient error, monitor and continue
- If frequent: Hardware issue, replace disk/RAM immediately
- If correlated with crashes: Bug in write path, file an issue with reproduction steps
**Data recovery:** Walrus skips corrupted entries and continues. To recover, re-write the lost entries from the original source if one is available:

```rust
// Re-write from source if available
for entry in source_data {
    wal.append_for_topic("recovered", entry)?;
}
```
### `ErrorKind::WouldBlock` on Batch Writes

**Symptom:**

```rust
wal.batch_append_for_topic("my-topic", &entries)?;
// Error: ErrorKind::WouldBlock
```
**Cause:** Another thread is currently performing a batch write to the same topic.

**Why this happens:** Only one batch write per topic is allowed at a time (enforced via an `AtomicBool` flag). This prevents concurrent batches from corrupting the block layout.
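For intuition, here is a sketch of that guard pattern (illustrative only, not Walrus's actual source): the first caller atomically flips the flag and proceeds; any concurrent caller finds it already set and gets `WouldBlock`.

```rust
use std::io::{Error, ErrorKind};
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative sketch of a per-topic batch guard. The first caller wins
// the compare-exchange; everyone else gets ErrorKind::WouldBlock.
fn try_begin_batch(in_batch: &AtomicBool) -> std::io::Result<()> {
    in_batch
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .map_err(|_| Error::new(ErrorKind::WouldBlock, "batch write already in progress"))?;
    Ok(()) // the winner must in_batch.store(false, Ordering::Release) when done
}
```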
**Resolution:**

```rust
// Option 1: Retry after a brief delay
loop {
    match wal.batch_append_for_topic(topic, &entries) {
        Ok(()) => break,
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            thread::sleep(Duration::from_micros(100));
            continue;
        }
        Err(e) => return Err(e),
    }
}

// Option 2: Use regular appends (slower, but never blocks)
for entry in entries {
    wal.append_for_topic(topic, entry)?;
}

// Option 3: Coordinate batches at the application level
// (one producer per topic for batch writes)
```
**Prevention:** Design your application so that only one thread performs batch writes per topic, or funnel all batches for a topic through a coordinator, sketched below.
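If a coordinator fits your design, a minimal sketch is a dedicated thread per topic that owns all batch writes, fed through a channel. This is application-level code, not part of the Walrus API; it assumes `wal: Arc<Walrus>` and the `batch_append_for_topic(&str, &[&[u8]])` shape used in the examples above.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// One coordinator thread owns every batch write for `topic`, so
// batch_append_for_topic never sees a concurrent caller and never
// returns ErrorKind::WouldBlock.
fn spawn_batch_coordinator(wal: Arc<Walrus>, topic: String) -> mpsc::Sender<Vec<Vec<u8>>> {
    let (tx, rx) = mpsc::channel::<Vec<Vec<u8>>>();
    thread::spawn(move || {
        for batch in rx {
            // Borrow the owned buffers in the &[&[u8]] shape used above.
            let refs: Vec<&[u8]> = batch.iter().map(|b| b.as_slice()).collect();
            if let Err(e) = wal.batch_append_for_topic(&topic, &refs) {
                eprintln!("batch write to {topic} failed: {e}");
            }
        }
    });
    tx // producers clone this Sender and send their batches through it
}
```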
### Disk Space Exhaustion

**Symptom:**

```
Error: No space left on device (os error 28)
```
**Causes:**
- Files aren’t being deleted (readers not checkpointing)
- Write throughput exceeds read throughput (backlog growing)
- Insufficient disk capacity for workload
**Diagnosis:**

```bash
# Check Walrus data directory size
du -sh wal_files/

# Count files
ls -1 wal_files/ | wc -l

# Check the oldest file (should be recent if deletion is working)
ls -lt wal_files/ | tail -n 1
```
**Resolution:**

If files aren't being deleted:

```rust
// Ensure readers checkpoint regularly
while let Some(entry) = wal.read_next(topic, true)? { // true = checkpoint!
    process(entry);
}

// With AtLeastOnce, ensure persist_every isn't too large
let wal = Walrus::with_consistency(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 }, // Not 1_000_000!
)?;
```
If the backlog is growing:

- Scale read capacity (more reader threads or processes; see the sketch after this list)
- Reduce the write rate at the source
- Add more disk space
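A minimal sketch of scaling read capacity with one thread per topic, assuming the `Arc<Walrus>` pattern from the tuning section below and that `read_next` returns `io::Result<Option<...>>` as the earlier examples suggest; the topic names are hypothetical and `process` is a placeholder for your handler:

```rust
use std::sync::Arc;
use std::thread;

let wal = Arc::new(wal);
let topics = vec!["orders".to_string(), "payments".to_string()];
let readers: Vec<_> = topics
    .into_iter()
    .map(|topic| {
        let wal = Arc::clone(&wal);
        thread::spawn(move || -> std::io::Result<()> {
            // true = checkpoint, so drained files become deletable
            while let Some(entry) = wal.read_next(&topic, true)? {
                process(entry);
            }
            Ok(())
        })
    })
    .collect();
```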
**Emergency cleanup:**

```bash
# Find fully checkpointed files (manual intervention)
# WARNING: Only delete if you're certain readers are done!
find wal_files/ -name "read_offset_idx_index.db" -exec cat {} \;

# Compare with the file list, then manually remove ancient files
```
### Slow Performance

**Symptom:** Throughput lower than expected.

**Diagnosis checklist:**
1. **Check the fsync schedule**

   ```rust
   // Too aggressive?
   let wal = Walrus::with_consistency_and_schedule(
       ReadConsistency::StrictlyAtOnce,
       FsyncSchedule::SyncEach, // VERY slow!
   )?;

   // Better for throughput:
   let wal = Walrus::with_consistency_and_schedule(
       ReadConsistency::AtLeastOnce { persist_every: 5_000 },
       FsyncSchedule::Milliseconds(2_000),
   )?;
   ```
2. **Check the backend (Linux only)**

   ```bash
   # Set WALRUS_QUIET=0 to see backend selection
   WALRUS_QUIET=0 cargo run

   # Output should show:
   # "Using FD backend with io_uring"
   ```

   If it reports "Using mmap backend" instead, force the FD backend by calling `enable_fd_backend()`.
3. **Check for lock contention**

   ```bash
   # Profile with perf (Linux)
   perf record -g ./your_program
   perf report

   # Look for:
   # - High time in spin loops (BlockAllocator)
   # - Mutex contention (Writer locks)
   # - Syscalls (fsync, write)
   ```
4. **Check disk I/O**

   ```bash
   # Monitor disk utilization
   iostat -x 1

   # %util should be high (90-100%) if disk-bound
   # If it's low, you're CPU- or lock-bound
   ```
5. **Common bottlenecks**

   | Symptom | Cause | Fix |
   |---|---|---|
   | Low CPU, low disk | Lock contention | Reduce topics per thread, use more threads |
   | High CPU, low disk | Serialization overhead | Use larger entries (batch small items; see below) |
   | Low CPU, high disk | Disk-bound | Faster SSD, reduce fsync frequency |
   | High syscalls | Mmap backend on Linux | `enable_fd_backend()` |
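For the serialization-overhead row, the usual fix is to pack many small items into one larger entry before appending. A minimal sketch with `u32` length prefixes (the framing format is an illustration, not something Walrus prescribes):

```rust
// Pack small items into one entry with little-endian u32 length prefixes.
fn pack(items: &[Vec<u8>]) -> Vec<u8> {
    let mut buf = Vec::new();
    for item in items {
        buf.extend_from_slice(&(item.len() as u32).to_le_bytes());
        buf.extend_from_slice(item);
    }
    buf
}

// Recover the individual items on the read side.
fn unpack(mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut items = Vec::new();
    while buf.len() >= 4 {
        let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        items.push(buf[4..4 + len].to_vec());
        buf = &buf[4 + len..];
    }
    items
}
```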
### Index Corruption

**Symptom:**

```
Index corrupted or missing, rebuilding...
```
**Causes:**

- Process killed during an index write
- Filesystem corruption
- Manual deletion of `read_offset_idx_index.db`
**Automatic recovery:** Walrus rebuilds the index by scanning WAL files and inferring positions from checkpointed state.

**Manual intervention:**
```bash
# Delete the corrupted index (Walrus will rebuild it)
rm wal_files/read_offset_idx_index.db

# Or, for a keyed instance:
rm wal_files/my-instance-key/read_offset_idx_index.db
```
**Prevention:**

- Use `FsyncSchedule::Milliseconds(n)` to ensure index writes are durable
- Avoid `kill -9`; use `SIGTERM` for a graceful shutdown, as in the sketch after this list
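One way to turn `SIGTERM` into a clean exit is a shutdown flag. This sketch assumes the third-party `signal-hook` crate as a dependency and a read loop like the ones above; the constructor, topic name, and `process` handler are placeholders:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() -> std::io::Result<()> {
    // Set the flag when SIGTERM arrives instead of dying mid-write.
    let shutdown = Arc::new(AtomicBool::new(false));
    signal_hook::flag::register(signal_hook::consts::SIGTERM, Arc::clone(&shutdown))?;

    let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;
    while !shutdown.load(Ordering::Relaxed) {
        if let Some(entry) = wal.read_next("events", true)? {
            process(entry);
        }
    }
    Ok(()) // exit normally so in-flight index writes can complete
}
```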
### Out of Memory (OOM)

**Symptom:** Process killed by the OOM killer, or allocation failure.

**Causes:**
- Too many concurrent topics (each has reader/writer state)
- Large batch sizes (entries buffered in memory)
- Memory leaks (file handles not released)
**Diagnosis:**

```bash
# Monitor memory usage
top -p $(pgrep your_program)

# Check mmap count (Linux)
cat /proc/$(pgrep your_program)/maps | wc -l
# Expect ~1-2 mappings per WAL file; hundreds per file suggest a leak
```
**Resolution:**

Large batches:

```rust
// Don't do this:
let huge_batch: Vec<&[u8]> = vec![/* 10,000 entries */];
wal.batch_append_for_topic(topic, &huge_batch)?;

// Do this instead:
for chunk in huge_batch.chunks(1000) {
    wal.batch_append_for_topic(topic, chunk)?;
}
```
Too many topics:

```rust
// Avoid creating millions of topics.
// Multiplex into fewer topics with entry metadata instead:
let entry = serialize(Metadata { user_id: 12345, data: payload });
wal.append_for_topic("events", &entry)?;
```
## Performance Tuning

### Maximizing Write Throughput
```rust
// 1. Disable durability (benchmarking only!)
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 100_000 },
    FsyncSchedule::NoFsync,
)?;

// 2. Use the FD backend (Linux)
enable_fd_backend();

// 3. Batch writes
wal.batch_append_for_topic(topic, &entries)?; // Not append() in a loop

// 4. Multiple writer threads
let wal = Arc::new(wal);
for topic in topics {
    let wal = Arc::clone(&wal);
    thread::spawn(move || {
        // `?` can't propagate out of a closure returning (), so handle errors here
        wal.append_for_topic(&topic, data).expect("append failed");
    });
}
```
**Expected throughput (consumer laptop):**
- Single thread: ~200-300k ops/sec
- 8 threads: ~1M ops/sec
- 16 threads: ~1.5M ops/sec (disk-bound)
### Minimizing Read Latency

```rust
// 1. StrictlyAtOnce for immediate visibility
let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;

// 2. Checkpoint on every read
while let Some(entry) = wal.read_next(topic, true)? { // true = checkpoint
    process(entry);
}

// 3. Batch reads for throughput
let entries = wal.batch_read_for_topic(topic, 1_048_576, true)?; // 1 MiB budget
```
### Balancing Durability vs. Speed

| Priority | Configuration |
|---|---|
| Speed | `AtLeastOnce { persist_every: 10_000 }` + `NoFsync` |
| Balance | `AtLeastOnce { persist_every: 1_000 }` + `Milliseconds(1_000)` |
| Durability | `StrictlyAtOnce` + `SyncEach` |
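For instance, the Balance row spelled out with the constructor used earlier:

```rust
// Balanced durability vs. speed: persist read state every 1,000 entries,
// fsync dirty files every second.
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::AtLeastOnce { persist_every: 1_000 },
    FsyncSchedule::Milliseconds(1_000),
)?;
```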
## Debugging Strategies

### Enable Debug Logging

```bash
# See all internal events
WALRUS_QUIET=0 ./your_program

# Output includes:
# - Backend selection
# - Block allocations
# - File creations
# - Checksum failures
# - Deletion triggers
```
### Inspect WAL Files

```bash
# List files in the data directory
ls -lh wal_files/

# Files are named by epoch milliseconds:
# 1700000000 (older)
# 1700000100
# 1700000200 (newer)

# Check file sizes (should be ~1 GB when full)
du -h wal_files/*
# Partially filled files indicate active allocation
```
### Monitor the Background Worker

The background thread logs fsync and deletion events:

```
[Background] Flushed 15 files
[Background] Deleted file: wal_files/1700000000
```

If you never see deletions, readers aren't checkpointing.
### Verify Index Persistence

```bash
# Index files (one per instance)
ls -lh wal_files/read_offset_idx_index.db

# For keyed instances
ls -lh wal_files/*/read_offset_idx_index.db

# Check the modification time (should update regularly if readers are active)
stat wal_files/read_offset_idx_index.db
```
## Production Checklist

Before deploying Walrus to production:

- Choose an appropriate `ReadConsistency` for your durability needs
- Set a `FsyncSchedule` based on your acceptable data-loss window
- Enable the FD backend on Linux (`enable_fd_backend()`)
- Set `WALRUS_DATA_DIR` to a dedicated disk/partition
- Monitor disk space (alert at 80% usage)
- Test crash recovery (`kill -9`, then verify reads resume; see the sketch after this list)
- Load test with a realistic workload (topics, entry sizes, concurrency)
- Verify file deletion works (run for hours, check that old files are deleted)
- Set up monitoring for checksum failures
- Document your instance-key strategy (if using keyed instances)
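A minimal sketch of the crash-recovery test, assuming the constructors shown earlier and a hypothetical `crash-test` topic. Run it with `--write`, `kill -9` the process partway through, then run it again without the flag:

```rust
fn main() -> std::io::Result<()> {
    let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;

    if std::env::args().any(|a| a == "--write") {
        // Phase 1: write until killed.
        for i in 0u64.. {
            wal.append_for_topic("crash-test", &i.to_le_bytes())?;
        }
    }

    // Phase 2 (after restart): entries written before the kill should
    // still read back in order.
    let mut recovered = 0u64;
    while let Some(_entry) = wal.read_next("crash-test", true)? {
        recovered += 1;
    }
    println!("recovered {recovered} entries after crash");
    Ok(())
}
```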
## Getting Help

If you encounter issues not covered here:

- Check the logs with `WALRUS_QUIET=0`
- Search existing issues: GitHub Issues
- File a bug report with:
  - Walrus version
  - Operating system and kernel version
  - Reproduction steps
  - Relevant logs
  - Configuration (consistency, fsync schedule, backend)
For performance issues, include:

- Hardware specs (CPU, disk type, RAM)
- Benchmark results (`make bench-writes`, etc.)
- Output of `perf record` (Linux) or a profiler
## Known Limitations

Current release (v0.1.0):
- Single-node only (no replication)
- Maximum entry size: 1 GB (file size limit)
- io_uring optimizations are Linux-only
- No built-in compression
- No encryption at rest
**Workarounds:**

- Large entries: Split at the application level
- Compression: Compress before calling `append_for_topic()` (see the sketch below)
- Encryption: Use dm-crypt/LUKS for disk encryption
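A minimal sketch of the compression workaround, assuming the third-party `flate2` crate as a dependency; any compressor works the same way, since Walrus just sees opaque bytes:

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;

// Compress the payload before handing it to Walrus.
fn append_compressed(wal: &Walrus, topic: &str, payload: &[u8]) -> std::io::Result<()> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(payload)?;
    let compressed = encoder.finish()?;
    wal.append_for_topic(topic, &compressed)?;
    Ok(())
}
```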
Future releases will address distributed features (see roadmap).