lean_scout | Reservoir

Lean Scout

Lean Scout is a tool for creating datasets from Lean projects.

Requirements

To use this tool, you must have:

A basic Lean4 installation, including elan, lake, and lean. The supported toolchain is tracked in lean-toolchain.
Python 3.13+.
The uv Python package manager.

Quickstart

Add Lean Scout as a dependency in your project.

`lakefile.toml`

[[require]]
name = "lean_scout"
git = "https://github.com/mathlib-initiative/lean_scout.git"
rev = "main" # Prefer pinning to a release tag or commit in production

`lakefile.lean`

require lean_scout from git
  "https://github.com/mathlib-initiative/lean_scout.git" @ "main"

Then, from the root of your Lean4 project, run Lean Scout directly via Lake:

lake update
lake run scout --command tactics --parquet --library MyLibrary

Swap the flags for any invocation (e.g. --parquet, --jsonl, --read, --imports, --dataDir, shard counts).

Note: The old hosted extract.sh wrapper has been removed. Add Lean Scout as a normal Lake dependency and invoke lake run scout ... directly.

If you invoke Lean Scout from outside your project root (for example from CI with a different working directory or from another script), pass --cmdRoot /path/to/project/root so relative --read inputs and output paths stay anchored to that directory.

GitHub Actions

You can run Lean Scout directly inside CI once your project declares lean_scout as a Lake dependency. Here is an example workflow that extracts data from a Lean4 project and uploads it to Hugging Face:

name: Upload Lean dataset to HuggingFace Hub

on:
  push:
    branches:
      - main
  workflow_dispatch:

permissions:
  contents: read

env:
  HF_DATASET_NAME: my-dataset

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6
      - uses: leanprover/lean-action@v1

      - name: Install uv
        uses: astral-sh/setup-uv@v4
        with:
          python-version: "3.13"

      - name: Create temp directory
        id: tempdir
        run: echo "path=$(mktemp -d)" >> "$GITHUB_OUTPUT"

      - name: Generate parquet files
        run: |
          lake run scout \
            --command types \
            --parquet \
            --dataDir "${{ steps.tempdir.outputs.path }}" \
            --imports MyLeanModule

      - name: Verify parquet files exist
        run: |
          if ! ls "${{ steps.tempdir.outputs.path }}"/*.parquet 1>/dev/null 2>&1; then
            echo "::error::No parquet files were generated"
            exit 1
          fi
          echo "Generated data:"
          ls -lh "${{ steps.tempdir.outputs.path }}"/*.parquet

      - name: Upload to HuggingFace Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: |
          uvx hf upload \
            "${{ env.HF_DATASET_NAME }}" \
            "${{ steps.tempdir.outputs.path }}" \
            --repo-type dataset \
            --private \
            --commit-message "Update dataset from ${{ github.sha }}"

To use this in your own Lean4 project on GitHub, you must:

Add Lean Scout as a Lake dependency in your project.
Set up your Hugging Face write token as a repository secret under HF_TOKEN.
Change HF_DATASET_NAME: my-dataset to the dataset you want to update.
Change the parameters passed to lake run scout. The current options extract data about types contained in the environment obtained by importing MyLeanModule.
If your workflow runs from a different working directory, pass --cmdRoot "$GITHUB_WORKSPACE" (or another appropriate project root) to keep relative paths anchored correctly.

Basic usage

Once Lean Scout is available as a dependency, run it from your Lean4 project root.

Extract from imports

lake run scout --command types --parquet --imports Lean

This will run the types command to extract types of constants from an environment created by importing the Lean module.

Extract from files

# Single file
lake run scout --command tactics --parquet --read MyFile.lean

# Multiple files in parallel (one subprocess per file)
lake run scout --command tactics --parquet --parallel 4 --read File1.lean File2.lean File3.lean

# Extract from entire library (recommended for large codebases)
lake run scout --command tactics --parquet --parallel 8 --library LeanScoutTest

For tactics, Lean Scout treats syntax, import, elaboration, and type errors in the target file as extraction failures: the run returns a nonzero exit code and no records are emitted for that file.

If you have Lean Scout as a dependency with Mathlib as another dependency, you can similarly run:

lake run scout --command types --parquet --imports Mathlib

In both cases, the data will be written to parquet files in the ./data/ subdirectory of your Lean4 project. You can specify the base directory where data is stored as follows:

lake run scout --command types --parquet --dataDir $HOME/storage/types --imports Mathlib

This will write the data to files located within the $HOME/storage/types/ directory. The default location is ./data/.

By default Lean Scout resolves both outputs and relative read targets from the directory where you invoke the command (--cmdRoot, default: current working directory). If you run from outside the project root or from automation that changes the working directory, pass --cmdRoot /path/to/where/paths/are/relative so relative --read paths and outputs stay anchored to that location.

Lean Scout is strict about extraction failures: if any extractor subprocess or the Parquet writer fails, the overall run returns a nonzero exit code, stops launching new targets after the first detected failure, and cancels already-running extractor subprocesses as aggressively as possible.

If an extraction stops early (for example because of Ctrl+C or because the run exits with an error), Lean Scout leaves the output directory on disk. If the failed run wrote partial Parquet files, remove the previous output directory or point --dataDir to a fresh location before retrying.

JSON lines

The flag --jsonl can be used to extract data directly to stdout. Parquet files will not be written if using --jsonl.

Note: logging information is sent to stderr. In --parquet mode, malformed JSON lines reaching the Python writer are treated as fatal errors rather than skipped.

Extraction Modes

Lean Scout supports multiple extraction modes:

--imports: Extract from an environment created by importing modules (single subprocess)
- Best for: Extracting types, declarations, or other environment-level data
- Example: lake run scout --command types --parquet --imports Lean
--read: Extract from specific files (parallel subprocesses, one per file)
- Best for: Processing specific files with per-file data extraction
- Example: lake run scout --command tactics --parquet --parallel 4 --read File1.lean File2.lean
--library: Extract from all modules in a library (parallel subprocesses, recommended)
- Best for: Processing entire libraries or large codebases
- Uses lake query -q <library>:module_paths to automatically discover all module files
- Example: lake run scout --command tactics --parquet --parallel 8 --library LeanScoutTest

Note: The --library flag is the recommended approach for extracting data from entire libraries, as it automatically discovers all modules without requiring manual file management.

Important: The target flags (--imports, --library, --read) consume all remaining command-line arguments. Place other flags like --parquet, --jsonl, --parallel, --dataDir before the target specification.

Extractor Configuration

Extractors can be configured using the --config flag, which accepts a JSON object:

lake run scout --config '{"taskLimit": 8}' --command types --parquet --imports Lean

Filtering philosophy

Built-in extractors do not filter data during extraction. Instead, they emit enough metadata for users to filter downstream in whatever way matches their use case.

In other words:

extraction should preserve the raw data
built-in extractors should expose useful filter metadata
filtering policy belongs to downstream consumers, not the extraction step

Built-in extractors validate config strictly. Unknown keys and values of the wrong type are treated as extraction errors and cause a nonzero exit code.

Breaking change: built-in filter config is no longer supported. Filtering should now be done downstream using emitted metadata such as allowCompletion and kind.

Available Configuration Options

Option	Type	Default	Description
`taskLimit`	natural number	unset	Maximum number of concurrent per-constant worker tasks for imports-mode extractors (`types`, `const_dep`)

Examples:

# Bound imports-mode worker parallelism for imports-mode extractors
lake run scout --config '{"taskLimit": 8}' --command const_dep --parquet --imports Lean

# Tactics accepts an empty config only; filter downstream on `kind`
lake run scout --command tactics --parquet --library MyLib

Sharding

By default, data is organized into 128 parquet shards. The shard associated with a datapoint is computed by hashing a key, which is specified directly in each data extractor. The number of shards used can be controlled with the --numShards option:

lake run scout --command types --parquet --numShards 32 --imports Lean

Available Data Extractors

We provide three built-in data extractors: types, tactics, and const_dep.

`types`

Extracts constant declarations with their types, modules, and docstrings.

Supported modes: --imports only

Example:

lake run scout --command types --parquet --imports Lean

Output schema:

name (string): Constant name
module (string, nullable): Module containing the constant
type (string): Type signature
docString (string, nullable): Declaration docstring, if one exists
allowCompletion (bool): Whether Lean.Meta.allowCompletion holds for the constant

Configuration:

taskLimit (optional natural number): Bounds concurrent per-constant worker tasks during imports-mode extraction

Downstream filtering example:

filtered = dataset.filter(lambda x: x["allowCompletion"])

`tactics`

Extracts tactic invocations with before/after goal states, proof-term views of goals, used constants, used free variables, used goals, elaborator info, syntax kinds, and source locations.

Supported modes: --read, --library

Example:

lake run scout --command tactics --parquet --parallel 4 --library LeanScoutTest

Output schema:

module (string, nullable): Module containing the tactic (null for plain --read files without module setup)
startPos (struct): Start position of the tactic syntax in the source file
- line (nat): 1-based line number
- column (nat): 0-based column number
endPos (struct): End position of the tactic syntax in the source file
- line (nat): 1-based line number
- column (nat): 0-based column number
nextStartPos (struct): Position after the tactic's trailing whitespace; this is the start position of the next syntax in the source file, if any, or the end of the file
- line (nat): 1-based line number
- column (nat): 0-based column number
goals (list): List of goal states before the tactic
- pp (string): Pretty-printed goal
- ppTerm (string): Pretty-printed term representation of the original goal metavariable before elaborating the tactic
- assigned (bool): Whether the original goal metavariable is assigned after elaborating the tactic
- usedConstants (list of strings): Constants referenced in the instantiated goal, including through delayed-assigned metavariables
- usedFVars (list of strings): Free variables referenced in the instantiated goal, using the same sanitized names as the pretty-printed goal
- usedGoals (list): Goals/metavariables referenced in the instantiated goal after following delayed assignments
  - new (bool): Whether the used goal was created by this tactic rather than present before elaborating it
  - index (nat, nullable): Index of the used goal in goalsAfter, or null if it is not present there
  - kind (string): Metavariable kind (natural, synthetic, or syntheticOpaque)
  - pp (string): Pretty-printed used goal
  - ppTerm (string): Pretty-printed term representation of the used goal metavariable
goalsAfter (list of strings): Pretty-printed goal states after the tactic
ppTac (string): Pretty-printed tactic syntax
elaborator (string): Name of the elaborator that produced this tactic
kind (string): Syntax node kind for the tactic

Configuration:

no built-in extractor-specific options; config must be {}

Downstream filtering example:

structural = dataset.filter(lambda x: x["kind"] == "Lean.Parser.Tactic.tacticSeq")

`const_dep`

Extracts constant dependency information, mapping each constant to the set of constants it uses.

Supported modes: --imports only

Example:

lake run scout --command const_dep --parquet --imports Lean

Output schema:

name (string): Constant name
module (string, nullable): Module containing the constant
deps (list of strings): Names of constants directly used by this constant
allowCompletion (bool): Whether Lean.Meta.allowCompletion holds for the parent constant

Configuration:

taskLimit (optional natural number): Bounds concurrent per-constant worker tasks during imports-mode extraction

Downstream filtering example:

filtered_rows = dataset.filter(lambda x: x["allowCompletion"])
# If you need dependency-level metadata, join dependency names against `types` output.

Creating datasets

It is straightforward to create a dataset (in the sense of datasets) from a list of parquet files. For example, once you run

lake run scout --command types --parquet --imports Lean

to create parquet files of the form ./data/*.parquet, a dataset can be created in python as follows (see data.ipynb):

from datasets import Dataset
import glob

dataset = Dataset.from_parquet(glob.glob("./data/*.parquet"))

or as follows:

from datasets import load_dataset

dataset = load_dataset("parquet", data_dir="./data", split="train")

How does LeanScout work?

The Lean orchestrator (Main.lean) manages one or more Lean subprocess(es) that extract data and output JSON lines to stdout
For --parquet output, the orchestrator spawns a Python process (cli.py) that reads JSON from stdin and writes to Parquet files
For --jsonl output, the orchestrator writes JSON directly to stdout

The orchestration logic is implemented in Main.lean, with the Parquet writing handled by src/lean_scout/cli.py.

Testing

Running Tests

For broad coverage, run both:

lake test                                        # Lean schema tests (LeanScoutTest.lean)
./run_tests                                      # Main automation suite (ruff, mypy, build, internals, integration, extractors)

To run individual components:

uv run pytest test/internals/ -v                 # Python parquet writer tests
./test/integration/test_lean_orchestrator.sh     # Lean orchestrator integration tests
uv run pytest test/extractors/ -v                # End-to-end extractor tests

Lean Schema Tests

The lake test command runs LeanScoutTest.lean, which validates:

Schema JSON roundtrip: All registered data extractors have schemas that serialize to JSON and deserialize back correctly
Schema Python roundtrip: Schema definitions are correctly parsed by the Python parquet writer (test/schema.py)