llmstep: [L]LM proofstep suggestions in Lean


  • [11.2023] Experimental Llemma suggestions that leverage file context
  • [10.2023] New paper describing version 1.0.0 of llmstep: [paper]
  • [10.2023] Support for Reprover
  • [9.2023] Support for free GPU servers via Google Colab

llmstep is a Lean 4 tactic for suggesting proof steps using a language model:

Calling llmstep "prefix" gives suggestions that start with prefix:

example (f : ℕ → ℕ) : Monotone f → ∀ n, f n ≤ f (n + 1) := by
  intro h n
  llmstep "exact"

==> Lean Infoview
  Try This:
    * exact h (Nat.le_succ _)
    * exact h (Nat.le_succ n)
    * exact h (Nat.le_add_right _ _)

Clicking a suggestion places it in the proof:

example (f : ℕ → ℕ) : Monotone f → ∀ n, f n ≤ f (n + 1) := by
  intro h n
  exact h (Nat.le_succ _)

llmstep checks the language model suggestions in Lean, and highlights those that close the proof.

Quick start

First, install Lean 4 in VS Code and the python requirements (pip install -r requirements.txt).

Then start a server:

python python/server.py

Open LLMstep/Examples.lean in VS Code and try out llmstep.

Use llmstep in a project

  1. Add llmstep in lakefile.lean:
require llmstep from git

Then run lake update.

  1. Import llmstep in a Lean file:
import LLMstep
  1. Start a server based on your runtime environment. For instance:
python python/server.py

Please see the recommended servers below.


The llmstep tactic communicates with a server that you can run in your own environment (e.g., CPU, GPU, Google Colab).

The table below shows the recommended language model and server scripts. To start a server, use python {script}, e.g. python python/server_vllm.py:

EnvironmentScriptDefault ModelContextSpeedminiF2F-test
CPUpython/server_encdec.pyLeanDojo ByT5 300mState3.16s22.1%
Colab GPUSee Colab setupllmstep Pythia 2.8bState1.68s27.9%
CUDA GPUpython/server_vllm.pyllmstep Pythia 2.8bState0.25s27.9%
CUDA GPU*python/server_llemma.pyLlemma 7bState, current file 🔥N/AN/A

Please refer to our paper for further information on the benchmarks.

llmstep aims to be a model-agnostic tool. We welcome contributions of new models.

* File context support (e.g. with Llemma) is currently experimental.


llmstep has three parts:

  1. a Lean tactic
  2. a language model
  3. a Python server

The Lean tactic sends a request to the server.
The server calls the language model and returns the generated suggestions.
The suggestions are displayed by the tactic in VS Code.

Google Colab

To use Google Colab's free GPU to run a server, follow these instructions:

  1. Open and run this notebook to start a server: Open In Colab

  2. In your local environment, set the environment variable LLMSTEP_HOST equal to the url printed out in this notebook (for example, https://04fa-34-125-110-83.ngrok.io/).

  3. In your local environment, set the environment variable LLMSTEP_SERVER=COLAB.

  4. Use llmstep.

VS Code steps (2) and (3)

To set environment variables in VS Code, go to:

  • Settings (Command + , on Mac)

  • Extensions -> Lean 4

  • Add the environment variables to Server Env. For example:

  • Then restart the Lean Server (Command + t, then type > Lean 4: Restart Server):

Language model

By default, llmstep uses a Pythia 2.8b language model fine-tuned on LeanDojo Benchmark 4:

The python/train directory shows how the model was fine-tuned.


You can use the non-retrieval version of Reprover, which we refer to as LeanDojo ByT5 300m:

python python/server_encdec.py

By default, this runs the leandojo-lean4-tacgen-byt5-small model.
This model is particularly useful on CPU due to its small parameter count.

Using a different model

Swap in other decoder-only language models with the --hf-model argument:

python server.py --hf-model some/other-model-7B

Use --hf-model with python/server_encdec.py for encoder-decoder models.

Use --hf-model with python/server_llemma.py for prompted base models (e.g. CodeLlama).

Fine-tuning a model

The scripts in python/train show how to finetune a model.

Additional Notes

  • The llmstep tactic is inspired by gpt-f.
  • Fine-tuning data for the Pythia-2.8b model is from LeanDojo.
  • The fine-tuning code is based on the script from Stanford Alpaca.
  • The tactic implementation adopts ideas and code from Mathlib4's Polyrith and Std.Tactic.TryThis.
  • Thank you to Mario Carneiro and Scott Morrison for reviewing the tactic implementation.

llmstep was initially created for an IJCAI-2023 tutorial on neural theorem proving.
It aims to be a model-agnostic platform for integrating language models and Lean.


Please cite:

    title={LLMSTEP: LLM proofstep suggestions in Lean},
    author={Sean Welleck and Rahul Saha},
    journal={arXiv preprint arXiv:2310.18457},