Towards Autonomous Protocol Proofs
Writing formal proofs for distributed protocols is incredibly tedious and, in most cases, not worth the effort. First, you have to come up with an inductive invariant, which is already a very difficult task. At least a few separate PhD theses in the past few years were devoted entirely to automating this step, and the resulting techniques still usually don’t scale beyond protocols of moderate size. After coming up with a candidate inductive invariant, you then also need to prove that the invariant is actually inductive, which on its own can be another difficult task. In general, even this checking step is undecidable, so it cannot be fully automated.
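As a toy illustration of the distinction (my own example, not from the Raft spec): consider a counter that starts at 0 and steps by 2. The property `x /= 1` holds in every reachable state, but it is not inductive on its own, since an unreachable state like `x = -1` satisfies it yet steps to `x' = 1`. Strengthening it with typing and parity information yields an inductive invariant:

```tla
---- MODULE Counter ----
EXTENDS Integers
VARIABLE x

Init == x = 0
Next == x' = x + 2

\* Holds in all reachable states, but is NOT inductive:
\* x = -1 satisfies Safe, yet its successor x' = 1 violates it.
Safe == x /= 1

\* Inductive strengthening: Ind /\ Next => Ind', and Ind => Safe.
Ind == x \in Nat /\ x % 2 = 0
====
```

The strengthened `Ind` is preserved by every step and implies the original property, which is exactly the shape an inductive invariant proof needs.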
If you can’t hand the proof of an inductive invariant to an SMT solver and have it discharged automatically, you need to put in manual effort to essentially write a detailed proof yourself, with the help of a proof assistant. Doing this kind of proof in TLA+ via the TLA+ Proof System (TLAPS) is feasible, but a major challenge for any nontrivial protocol.
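For context, a TLAPS inductiveness proof typically follows a shape like the sketch below (a generic skeleton with hypothetical action names `ActionA`/`ActionB`, not taken from the Raft proof): the top-level theorem is split into one case per protocol action, and each case is discharged by a backend prover via `BY` steps, unfolding definitions with `DEF`.

```tla
THEOREM Inductive == Ind /\ Next => Ind'
  \* Split the next-state relation into one case per action.
  <1>1. CASE ActionA
        BY <1>1 DEF Ind, ActionA
  <1>2. CASE ActionB
        \* Harder cases may need intermediate facts.
        <2>1. SomeIntermediateFact
              BY <1>2 DEF Ind, ActionB
        <2>2. QED BY <1>2, <2>1 DEF Ind, ActionB
  <1>3. QED BY <1>1, <1>2 DEF Next
```

When the backends fail on a leaf step, the human (or, here, the agent) must decompose it further into smaller obligations, which is where most of the manual effort goes.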
In some prior research, we worked on better ways to come up with inductive invariants, and one product of this work was a candidate inductive invariant for an abstract version of the Raft protocol specified in TLA+. The invariant consists of 12 smaller lemma invariants, and it could not be automatically verified by TLAPS without further manual proof effort. You can use TLC (for tiny instances) or Apalache to verify correctness for finite versions of the protocol (e.g., a fixed number of nodes), but neither gives you a full, general proof of correctness.
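Concretely, using the lemma names from the summary table below, the candidate invariant has the shape of a large conjunction (the exact formulation is in the spec; the `TypeOK` typing conjunct is my assumption about the usual structure):

```tla
IndGlobal ==
    /\ TypeOK                            \* typing conjunct (assumed)
    /\ H_OnePrimaryPerTerm
    /\ H_PrimaryHasOwnEntries
    /\ H_LogMatching
    /\ H_PrimaryTermGTELogTerm
    /\ H_QuorumsSafeAtTerms
    /\ H_UniformLogEntries
    /\ H_TermsMonotonic
    /\ H_LogEntryImpliesSafeAtTerm
    /\ H_LeaderCompleteness
    /\ H_LaterLogsHaveEarlierCommitted
    /\ H_CommittedEntryIsOnQuorum
    /\ H_StateMachineSafety
```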
In recent experiments re-checking these proofs, the latest Claude models (Opus 4.6) appear to do an excellent job of writing these proofs automatically. In the past, and from direct experience on prior research projects, this was typically at least a months-long effort for a highly capable master’s or PhD student. It is also incredibly tedious, meticulous, and mentally taxing work for a human to carry out. In other words, it can be a huge time sink for questionable value.
We take the candidate inductive invariant as a starting point and give Claude a few basic instructions on how to run TLAPS to check proof obligations. The ultimate goal is to check the induction step for each lemma against each action of the protocol. We then let it go, asking it to prove each theorem one by one. We also ask for a nice HTML report of the proofs, regenerated after each one is done, where we can drill into the proof of each individual theorem.
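The per-lemma obligations have roughly this shape (a sketch using the lemma and invariant names from the table below; the theorem names are hypothetical and the actual statements live in the spec): each theorem asserts that the full conjunction `IndGlobal` is strong enough to re-establish one lemma after any `Next` step, and a final theorem checks the base case.

```tla
\* Induction step for one lemma: assuming the full conjunction holds,
\* the lemma is preserved by every protocol action.
THEOREM L_1 == IndGlobal /\ Next => H_OnePrimaryPerTerm'

\* ...one such theorem per lemma invariant...

\* Base case: every initial state satisfies the full conjunction.
THEOREM InitImpliesInd == Init => IndGlobal
```

Together, the base case and the twelve induction-step theorems establish that `IndGlobal` is inductive, and hence an invariant of the protocol.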
AbstractRaft Inductive Invariant Proofs
Theorem Summary
| Theorem | Invariant | Obligations | SMT | Zenon | TLAPM | Prover Time | Claude Time | Attempts |
|---|---|---|---|---|---|---|---|---|
| L_1 | H_OnePrimaryPerTerm | 139/139 | 47 | 2 | 90 | 11.9s | ~8 min | 8 |
| L_2 | H_PrimaryHasOwnEntries | 182/182 | 63 | 0 | 119 | 2.4s | ~10 min | 4 |
| L_3 | H_LogMatching | 21/21 | 7 | 0 | 14 | 1.3s | ~4 min | 2 |
| L_4 | H_PrimaryTermGTELogTerm | 95/95 | 37 | 0 | 58 | 1.7s | ~15 min | 9 |
| L_5 | H_QuorumsSafeAtTerms | 132/132 | 132 | 0 | 0 | cached | — | 6 |
| L_6 | H_UniformLogEntries | 640/640 | 640 | 0 | 0 | cached | — | — |
| L_7 | H_TermsMonotonic | 227/227 | 227 | 0 | 0 | cached | ~8 min | 6 |
| L_8 | H_LogEntryImpliesSafeAtTerm | 20/20 | 20 | 0 | 0 | cached | — | — |
| L_9 | H_LeaderCompleteness | 418/418 | 159 | 0 | 257 | 4.1s | — | — |
| L_10 | H_LaterLogsHaveEarlierCommitted | 422/422 | 131 | 0 | 291 | cached | — | — |
| L_11 | H_CommittedEntryIsOnQuorum | 125/125 | 125 | 0 | 0 | cached | — | — |
| L_12 | H_StateMachineSafety | 138/138 | 54 | 0 | 84 | cached | ~5 min | 3 |
| — | Init => IndGlobal | N/A | — | — | — | — | — | — |
Overall, each theorem required no more than roughly 30-40 minutes of agent thinking time, with little to no human intervention. The arguments and proof steps Claude generated automatically were all impressive, and something I was not expecting it to do so well at. The final version of the generated proof can be found here, along with the associated commits made as it proved each theorem. The full file with complete proofs is around 1720 lines, up from a baseline of 296 lines in the starting, unproven file.
There are definitely a few caveats here, but the results are still incredibly impressive, and something completely out of the realm of possibility a few years ago. Importantly, there is a lot of information about the Raft protocol on the web; it is probably one of the most well-studied and widely implemented consensus protocols to date. Similarly, there has been some prior work on formally verified Raft proofs, though not specifically in TLA+. Having said that, I still think this is a super impressive achievement. Even for an experienced engineer or researcher, going off and reading documentation on existing Raft proofs and synthesizing it into a correct TLAPS proof would be an extremely nontrivial task.
From a practical systems engineering and building standpoint, I’m still not that interested in fully automated proofs. I often find that finite-state model checking over adequately large state spaces is a better cost-benefit tradeoff, and it makes things much easier to automate and re-verify after modifications. I also remain very unconvinced that we will ever be able to prove (or care about proving) properties of the actual, running system implementations. Regardless, this task is a great benchmark for understanding the frontier capabilities of these models.
Another thing that’s becoming increasingly relevant is the speed bottleneck on these models. Right now, the execution loops feel roughly in line with the speed of a human, or at least comprehensible to a human. But for tasks that are largely inference-bottlenecked, it is interesting to think about what becomes possible if they can be massively parallelized and/or run at 100x or 1000x their current speed. This calculus also leads to other interesting questions, since we have done deep research and development on intricate and efficient algorithms for automatically model checking these types of protocols. If we can run a general (super)intelligence at 1000x human speed, do these kinds of specialized algorithms become increasingly less necessary? Special-purpose algorithms may always be more efficient for specific tasks, but in a world where the cost of compute continues to fall, we might not care that much, especially if the AI-driven methods are more flexible and general.