We Asked Two AI Agents to Build an LSM Database. Only One Got Database Compaction Right.
We gave the same specification to two AI code-generation agents and asked each to build a persistent, crash-safe LSM database in Rust. The kind of storage engine that powers LevelDB, RocksDB, and the write path of half the infrastructure you use daily.
The setup
Both implementations target the same spec: a key-value store with put, get, delete, scan, and close. Writes must survive process crashes (via a write-ahead log), background compaction must never change what queries return, and deleted keys must never come back.
Both received the same architectural constraints: sequence numbers on every entry, a memtable keyed by (key, reverse sequence number), leveled compaction, and bloom filters with no-false-negatives guarantees. These constraints were extracted from Verus proof skeletons to ensure the proofs could apply to either implementation.
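To make the "(key, reverse sequence number)" constraint concrete, here is a minimal sketch of what such a memtable might look like. The names (`Memtable`, `Entry`) and the `BTreeMap` representation are ours for illustration, not taken from either agent's code; the point is that ordering by `(key, Reverse(seq))` makes the newest version of a key sort first.

```rust
use std::cmp::Reverse;
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Seq = u64;

#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Put(Vec<u8>),
    Delete, // tombstone
}

struct Memtable {
    // Keyed by (key, Reverse(seq)): for a given key, the highest
    // sequence number sorts first.
    map: BTreeMap<(Key, Reverse<Seq>), Entry>,
}

impl Memtable {
    fn new() -> Self {
        Memtable { map: BTreeMap::new() }
    }

    fn insert(&mut self, key: Key, seq: Seq, entry: Entry) {
        self.map.insert((key, Reverse(seq)), entry);
    }

    // Latest-write-wins: the first entry in the per-key range is the
    // one with the highest sequence number.
    fn get(&self, key: &[u8]) -> Option<&Entry> {
        self.map
            .range((key.to_vec(), Reverse(u64::MAX))..=(key.to_vec(), Reverse(0)))
            .next()
            .map(|(_, e)| e)
    }
}
```

With this layout, a scan that walks the map in order and keeps only the first entry per key automatically returns the latest visible version of each key.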
The difference is what happened next:
- Claude Code went straight from the prompt to code. No intermediate specification, no proofs — just the architectural constraints and an implementation.
- Aretta first translated the high-level prompt into a low-level formal specification, then built Verus proof skeletons for that specification, and only then implemented the code. The proofs cover 13 machine-checked properties across data-structure invariants and store-level ordering guarantees. (The full list is in the appendix below.)
A torture test suite of 45 tests catches zero bugs
We wrote 45 tests across seven categories: latest-write-wins, tombstone correctness, scan ordering, crash recovery, compaction transparency, scale, and edge cases. The suite includes overwrite storms of 100,000 puts to the same key, mass deletion of 10,000 keys followed by a full-range scan, and round-trip checks for binary keys containing null bytes. Every test passed on both implementations. This is the part where most teams would ship.
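To show the shape of one of these tests, here is a sketch of the overwrite-storm check. The `KvStore` trait and the `HashMap` stand-in are ours so the example is self-contained; the real suite runs the same assertions against both agents' engines.

```rust
use std::collections::HashMap;

// Hypothetical trait mirroring the spec's API; names are ours.
trait KvStore {
    fn put(&mut self, k: &[u8], v: &[u8]);
    fn get(&self, k: &[u8]) -> Option<Vec<u8>>;
    fn delete(&mut self, k: &[u8]);
}

// Toy in-memory stand-in so the test shape compiles on its own.
impl KvStore for HashMap<Vec<u8>, Vec<u8>> {
    fn put(&mut self, k: &[u8], v: &[u8]) {
        self.insert(k.to_vec(), v.to_vec());
    }
    fn get(&self, k: &[u8]) -> Option<Vec<u8>> {
        HashMap::get(self, k).cloned()
    }
    fn delete(&mut self, k: &[u8]) {
        self.remove(k);
    }
}

// Overwrite storm: 100,000 puts to one key; only the last may be visible.
fn overwrite_storm(store: &mut impl KvStore) {
    for i in 0..100_000u32 {
        store.put(b"hot", &i.to_be_bytes());
    }
    assert_eq!(store.get(b"hot"), Some(99_999u32.to_be_bytes().to_vec()));
}
```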
The bug that tests can't find
LSM databases periodically run compaction — merging and reorganizing on-disk files (SSTables) to reclaim space and maintain read performance. Here is the database compaction code from Claude Code:
```rust
// Step 1: Remove old SSTables from in-memory tracking
levels[level].retain(|h| !source_ids.contains(&h.meta.id));
manifest.remove_sstables(&all_removed_ids);

// Step 2: Write merged output to new SSTables
let new_handles = write_split_sstables(dir, manifest, &merged)?;
```
See full source — agent-cc/src/compaction.rs
Do you see the bug?
Step 1 removes references to the old SSTables. Step 2 writes their replacement. If step 2 fails due to an I/O error, the references are already gone and there's nothing to replace them. From this point on, get() silently returns None for every key that lived in those files.
The data is still on disk. A restart recovers it. But the running process has no idea it's broken. It just serves wrong answers until someone notices.
What the proof found
Aretta's proof skeleton specifies that new SSTables must be durable before the manifest is updated, and the manifest must be durable before old files are deleted. The underlying invariant is: at every reachable state during compaction, either old SSTables are present or new SSTables are present (or both). Aretta first proved this property on an abstract model of the compaction phases, then used that proof to guide the implementation — ensuring the code maintains the invariant by construction.
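The abstract model can be sketched in a few lines of plain Rust. This is our simplification of the idea, not the actual Verus proof: each compaction phase is a state recording whether the old and new SSTables are referenced, and the invariant demands that at least one set is always present.

```rust
// Abstract model of compaction phases (our sketch, not the Verus proof).
#[derive(Debug, Clone, Copy)]
struct State {
    old_present: bool,
    new_present: bool,
}

fn invariant(s: State) -> bool {
    s.old_present || s.new_present
}

// Remove-then-write ordering: the intermediate state references
// neither the old nor the new SSTables.
fn remove_then_write() -> Vec<State> {
    vec![
        State { old_present: true, new_present: false },  // before
        State { old_present: false, new_present: false }, // after remove
        State { old_present: false, new_present: true },  // after write
    ]
}

// Write-then-swap ordering: every reachable state keeps at least one
// set of SSTables referenced.
fn write_then_swap() -> Vec<State> {
    vec![
        State { old_present: true, new_present: false }, // before
        State { old_present: true, new_present: true },  // after write
        State { old_present: false, new_present: true }, // after swap
    ]
}
```

Enumerating the states makes the difference obvious: the remove-then-write sequence visits a state where the invariant is false, while every state in the write-then-swap sequence satisfies it.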
When we tried proving this same property on Claude Code's generated code, Verus reported that the invariant is violated during compaction, at the gap between steps 1 and 2.
Confirming the bug: fault injection
To confirm this causes real harm, we built a fault injection hook that fails the N-th SSTableWriter::finish() call with a simulated ENOSPC error.
```rust
pub fn fail_nth_sst_finish(n: u32) {
    SST_FINISH_COUNTDOWN.with(|c| c.set(n));
}
```
See full source — agent-cc/src/testing.rs
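The consuming side of the hook sits inside SSTableWriter::finish(). A minimal sketch of how such a countdown might work (the helper name `maybe_inject_fault` and the use of a generic I/O error to simulate ENOSPC are our assumptions, not the repo's exact code):

```rust
use std::cell::Cell;
use std::io;

thread_local! {
    // Armed by fail_nth_sst_finish(n); 0 means disarmed.
    static SST_FINISH_COUNTDOWN: Cell<u32> = Cell::new(0);
}

pub fn fail_nth_sst_finish(n: u32) {
    SST_FINISH_COUNTDOWN.with(|c| c.set(n));
}

// Hypothetical check called at the top of SSTableWriter::finish():
// when the countdown reaches 1 on this call, simulate a full disk.
fn maybe_inject_fault() -> io::Result<()> {
    SST_FINISH_COUNTDOWN.with(|c| {
        let n = c.get();
        if n == 1 {
            c.set(0); // disarm after firing once
            return Err(io::Error::new(io::ErrorKind::Other, "simulated ENOSPC"));
        }
        if n > 1 {
            c.set(n - 1);
        }
        Ok(())
    })
}
```

Arming the hook with `fail_nth_sst_finish(2)` lets the first finish() succeed and fails the second, which is exactly the window needed to interrupt compaction after it has started writing output.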
We wrote four sentinel keys into separate L0 SSTables, then triggered compaction with the fault armed. On Claude Code's implementation, all four sentinel keys vanished from the live instance — get() returned None for every one of them. A reopen recovered the data (the files were still on disk), but the running process had been silently serving wrong answers with no indication of failure.
On Aretta's implementation, the same fault injection had no effect. All four sentinel keys remained visible throughout. The proof-guided implementation writes output first, then calls manifest.atomic_swap() which only updates in-memory state after the new SSTables are confirmed on disk:
```rust
// Step 1: Write new SSTables (confirmed on disk)
let new_sstables = self.write_level_sstables(manifest, merge_iter, 1)?;

// Step 2: Atomic manifest swap — ADDs before REMOVEs, fsync,
// THEN update in-memory state
manifest.atomic_swap(new_sstables, old_file_numbers)?;

// Step 3: Delete old files (manifest already durable)
for entry in old_entries { fs::remove_file(...)?; }
```
See full source — agent-a/src/compaction.rs
If the write fails at step 1, atomic_swap is never called and the in-memory state is unchanged. The invariant holds at every phase.
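To illustrate the ordering inside such a swap, here is a simplified sketch. The `Manifest` struct and its in-memory `log` (standing in for the on-disk, fsynced manifest file) are ours; the real implementation persists the records durably before touching in-memory state.

```rust
struct Manifest {
    // Stand-in for the on-disk manifest log; a real implementation
    // appends these records to a file and fsyncs it.
    log: Vec<String>,
    // In-memory set of live SSTable ids.
    live: Vec<u64>,
}

impl Manifest {
    fn atomic_swap(&mut self, adds: Vec<u64>, removes: Vec<u64>) -> std::io::Result<()> {
        // 1. Record ADDs before REMOVEs, as one batch.
        for id in &adds {
            self.log.push(format!("ADD {id}"));
        }
        for id in &removes {
            self.log.push(format!("REMOVE {id}"));
        }
        // 2. A real implementation fsyncs the log here; any failure
        //    returns before in-memory state is touched.
        // 3. Only now mutate the in-memory view.
        self.live.retain(|id| !removes.contains(id));
        self.live.extend(adds);
        Ok(())
    }
}
```

Because the in-memory mutation is the last step, an error anywhere earlier leaves the running process still serving reads from the old SSTables.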
Why this bug is hard to catch
This isn't a logic error in the traditional sense. The happy path is correct — when compaction succeeds, the store works perfectly. The bug only manifests when an I/O operation fails between two specific lines of code. That makes it invisible to every standard testing technique:
- Unit tests only see the happy path. Compaction works, data looks right, test passes. Nobody writes a test that says "now make the disk full between these two lines."
- Integration tests run on healthy local SSDs. The I/O failure that triggers this bug simply never happens in CI.
- Fuzz testing throws random keys and values at the API. It doesn't fuzz the filesystem. You could fuzz for a year and never trigger ENOSPC at the right moment.
- Code review reads "clear old state, write new state" as perfectly reasonable cleanup. It's very easy to miss the two-line gap between remove and write.
The proof catches it because it reasons about all reachable states, not just the states that tests happen to visit. It asks: "is there any point during compaction where the invariant doesn't hold?" And it finds one.
What this means
Tests are essential, and the 45-test torture suite we built provides real value as ongoing regression coverage. But tests exercise the code with working disks, successful writes, and cooperative hardware. The bugs that matter in production — the ones that cause silent data loss at 3 AM when a disk fills up — live in failure paths that no test in our suite could reach.
Formal verification works differently. Rather than running the code and checking outputs, it reasons about the full semantics of the program. The Verus proof examined every reachable state during compaction and determined that the invariant is violated at a specific program point. It did not need to simulate a disk failure to reach that conclusion.
Neither technique subsumes the other. Tests caught bugs that the proofs did not cover (like the missing file lock and the scan panic on empty input). The proof caught a bug that tests could not reach. In practice, the two approaches are complementary: tests cover breadth across the API surface, while proofs cover depth on critical invariants. Using both together gives stronger assurance than either one alone.
The full proof trees are available for Aretta here and Claude Code here.
The full source code, proof skeletons, and fault injection tests are available on GitHub. If you're building infrastructure that needs stronger correctness guarantees than testing alone can provide, reach out.
Appendix: proof properties
The 13 machine-checked properties verified by Aretta's Verus proof skeletons:
| Property | Formal |
|---|---|
| Entry validity | is_delete ==> value.is_none() |
| Encoding roundtrip | decode(encode(n)) == n |
| Bloom filter | add(k) ==> probe(k) == true |
| Memtable ordering | get(k) == max_by_seq(entries_for_key(k)) |
| Merge iterator | is_sorted(collect(mi)) and completeness |
| WAL-before-memtable | wal.sync() happens-before memtable.insert() |
| Flush ordering | sstable.finish() happens-before manifest.add() happens-before wal.rotate() |
| Compaction ordering | output.finish() happens-before manifest.swap() happens-before delete(inputs) |
| Tombstone correctness | delete(k) && compact() ==> get(k) == None |
| Compaction transparency | forall k: get_before(k) == get_after(k) |
| Latest-write-wins | put(k,v1); put(k,v2) ==> get(k) == v2 |
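As a taste of what one of these properties means in code, here is the encoding-roundtrip property for a LEB128-style varint. The encoding choice is ours for illustration; the proof obligation is the same regardless of which integer encoding the implementations actually use.

```rust
// Roundtrip property: decode(encode(n)) == n for all u64 values,
// sketched with a LEB128-style varint (our illustrative choice).
fn encode(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
    out
}

fn decode(buf: &[u8]) -> u64 {
    let mut n = 0u64;
    for (i, &b) in buf.iter().enumerate() {
        n |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            break;
        }
    }
    n
}
```

A test can only check the roundtrip for the values it happens to try; the Verus property quantifies over every `u64`.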