[{"data":1,"prerenderedAt":4},["ShallowReactive",2],{"wgm4UaRTRC":3},"# Lean Scout\n\nLean Scout is a tool for creating datasets from Lean projects.\n\n## Requirements\n\nTo use this tool, you must have:\n- A basic Lean4 installation, including `elan`, `lake`, and `lean`. The supported toolchain is tracked in `lean-toolchain`.\n- Python 3.13+.\n- The `uv` Python package manager.\n\n## Quickstart\n\nAdd Lean Scout as a dependency in your project.\n\n### `lakefile.toml`\n```toml\n[[require]]\nname = \"lean_scout\"\ngit = \"https://github.com/mathlib-initiative/lean_scout.git\"\nrev = \"main\" # Prefer pinning to a release tag or commit in production\n```\n\n### `lakefile.lean`\n```lean\nrequire lean_scout from git\n  \"https://github.com/mathlib-initiative/lean_scout.git\" @ \"main\"\n```\n\nThen, from the root of your Lean4 project, run Lean Scout directly via Lake:\n```bash\nlake update\nlake run scout --command tactics --parquet --library MyLibrary\n```\n\nSwap the flags for any invocation (e.g. `--parquet`, `--jsonl`, `--read`, `--imports`, `--dataDir`, shard counts).\n\n> **Note**: The old hosted `extract.sh` wrapper has been removed. Add Lean Scout as a normal Lake dependency and invoke `lake run scout ...` directly.\n\nIf you invoke Lean Scout from outside your project root (for example from CI with a different working directory or from another script), pass `--cmdRoot /path/to/project/root` so relative `--read` inputs and output paths stay anchored to that directory.\n\n## GitHub Actions\n\nYou can run Lean Scout directly inside CI once your project declares `lean_scout` as a Lake dependency. 
Here is an example workflow that extracts data from a Lean4 project and uploads it to Hugging Face:\n```yml\nname: Upload Lean dataset to HuggingFace Hub\n\non:\n  push:\n    branches:\n      - main\n  workflow_dispatch:\n\npermissions:\n  contents: read\n\nenv:\n  HF_DATASET_NAME: my-dataset\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/checkout@v6\n      - uses: leanprover/lean-action@v1\n\n      - name: Install uv\n        uses: astral-sh/setup-uv@v4\n        with:\n          python-version: \"3.13\"\n\n      - name: Create temp directory\n        id: tempdir\n        run: echo \"path=$(mktemp -d)\" >> \"$GITHUB_OUTPUT\"\n\n      - name: Generate parquet files\n        run: |\n          lake run scout \\\n            --command types \\\n            --parquet \\\n            --dataDir \"${{ steps.tempdir.outputs.path }}\" \\\n            --imports MyLeanModule\n\n      - name: Verify parquet files exist\n        run: |\n          if ! ls \"${{ steps.tempdir.outputs.path }}\"/*.parquet 1>/dev/null 2>&1; then\n            echo \"::error::No parquet files were generated\"\n            exit 1\n          fi\n          echo \"Generated data:\"\n          ls -lh \"${{ steps.tempdir.outputs.path }}\"/*.parquet\n\n      - name: Upload to HuggingFace Hub\n        env:\n          HF_TOKEN: ${{ secrets.HF_TOKEN }}\n        run: |\n          uvx hf upload \\\n            \"${{ env.HF_DATASET_NAME }}\" \\\n            \"${{ steps.tempdir.outputs.path }}\" \\\n            --repo-type dataset \\\n            --private \\\n            --commit-message \"Update dataset from ${{ github.sha }}\"\n```\n\nTo use this in your own Lean4 project on GitHub, you must:\n- Add Lean Scout as a Lake dependency in your project.\n- Set up your Hugging Face write token as a repository secret under `HF_TOKEN`.\n- Change `HF_DATASET_NAME: my-dataset` to the dataset you want to update.\n- Change the parameters passed to `lake run scout`. 
The current options extract data about types contained in the environment obtained by importing `MyLeanModule`.\n- If your workflow runs from a different working directory, pass `--cmdRoot \"$GITHUB_WORKSPACE\"` (or another appropriate project root) to keep relative paths anchored correctly.\n\n## Basic usage\n\nOnce Lean Scout is available as a dependency, run it from your Lean4 project root.\n\n### Extract from imports\n```bash\nlake run scout --command types --parquet --imports Lean\n```\n\nThis will run the `types` command to extract types of constants from an environment created by importing the `Lean` module.\n\n### Extract from files\n```bash\n# Single file\nlake run scout --command tactics --parquet --read MyFile.lean\n\n# Multiple files in parallel (one subprocess per file)\nlake run scout --command tactics --parquet --parallel 4 --read File1.lean File2.lean File3.lean\n\n# Extract from entire library (recommended for large codebases)\nlake run scout --command tactics --parquet --parallel 8 --library LeanScoutTest\n```\n\nFor `tactics`, Lean Scout treats syntax, import, elaboration, and type errors in the target file as extraction failures: the run returns a nonzero exit code and no records are emitted for that file.\n\nIf you have Lean Scout as a dependency with `Mathlib` as another dependency, you can similarly run:\n```bash\nlake run scout --command types --parquet --imports Mathlib\n```\n\nIn both cases, the data will be written to `parquet` files in the `./data/` subdirectory of your Lean4 project.\nYou can specify the base directory where data is stored as follows:\n```bash\nlake run scout --command types --parquet --dataDir $HOME/storage/types --imports Mathlib\n```\n\nThis will write the data to files located within the `$HOME/storage/types/` directory.\nThe default location is `./data/`.\n\nBy default Lean Scout resolves both outputs and relative read targets from the directory where you invoke the command (`--cmdRoot`, default: current working 
directory). If you run from outside the project root or from automation that changes the working directory, pass `--cmdRoot /path/to/where/paths/are/relative` so relative `--read` paths and outputs stay anchored to that location.\n\nLean Scout is strict about extraction failures: if any extractor subprocess or the Parquet writer fails, the overall run returns a nonzero exit code, stops launching new targets after the first detected failure, and cancels already-running extractor subprocesses as aggressively as possible.\n\nIf an extraction stops early (for example because of `Ctrl+C` or because the run exits with an error), Lean Scout leaves the output directory on disk. If the failed run wrote partial Parquet files, remove the previous output directory or point `--dataDir` to a fresh location before retrying.\n\n### JSON lines\n\nThe flag `--jsonl` can be used to extract data directly to stdout.\nParquet files will not be written if using `--jsonl`.\n\n**Note**: logging information is sent to stderr.\nIn `--parquet` mode, malformed JSON lines reaching the Python writer are treated as fatal errors rather than skipped.\n\n## Extraction Modes\n\nLean Scout supports multiple extraction modes:\n\n1. **`--imports`**: Extract from an environment created by importing modules (single subprocess)\n   - Best for: Extracting types, declarations, or other environment-level data\n   - Example: `lake run scout --command types --parquet --imports Lean`\n\n2. **`--read`**: Extract from specific files (parallel subprocesses, one per file)\n   - Best for: Processing specific files with per-file data extraction\n   - Example: `lake run scout --command tactics --parquet --parallel 4 --read File1.lean File2.lean`\n\n3. 
**`--library`**: Extract from all modules in a library (parallel subprocesses, recommended)\n   - Best for: Processing entire libraries or large codebases\n   - Uses `lake query -q \u003Clibrary>:module_paths` to automatically discover all module files\n   - Example: `lake run scout --command tactics --parquet --parallel 8 --library LeanScoutTest`\n\n**Note**: The `--library` flag is the recommended approach for extracting data from entire libraries, as it automatically discovers all modules without requiring manual file management.\n\n**Important**: The target flags (`--imports`, `--library`, `--read`) consume all remaining command-line arguments. Place other flags like `--parquet`, `--jsonl`, `--parallel`, `--dataDir` before the target specification.\n\n## Extractor Configuration\n\nExtractors can be configured using the `--config` flag, which accepts a JSON object:\n\n```bash\nlake run scout --config '{\"taskLimit\": 8}' --command types --parquet --imports Lean\n```\n\n### Filtering philosophy\n\nBuilt-in extractors do **not** filter data during extraction.\nInstead, they emit enough metadata for users to filter downstream in whatever way matches their use case.\n\nIn other words:\n- extraction should preserve the raw data\n- built-in extractors should expose useful filter metadata\n- filtering policy belongs to downstream consumers, not the extraction step\n\nBuilt-in extractors validate config strictly. Unknown keys and values of the wrong type are treated as extraction errors and cause a nonzero exit code.\n\n**Breaking change**: built-in `filter` config is no longer supported. 
Filtering should now be done downstream using emitted metadata such as `allowCompletion` and `kind`.\n\n### Available Configuration Options\n\n| Option | Type | Default | Description |\n|--------|------|---------|-------------|\n| `taskLimit` | natural number | unset | Maximum number of concurrent per-constant worker tasks for imports-mode extractors (`types`, `const_dep`) |\n\n**Examples**:\n```bash\n# Bound worker parallelism for imports-mode extractors\nlake run scout --config '{\"taskLimit\": 8}' --command const_dep --parquet --imports Lean\n\n# Tactics accepts an empty config only; filter downstream on `kind`\nlake run scout --command tactics --parquet --library MyLib\n```\n\n## Sharding\n\nBy default, data is organized into 128 parquet shards.\nThe shard associated with a datapoint is computed by hashing a key, which is specified directly in each data extractor.\nThe number of shards used can be controlled with the `--numShards` option:\n```bash\nlake run scout --command types --parquet --numShards 32 --imports Lean\n```\n\n## Available Data Extractors\n\nWe provide three built-in data extractors: `types`, `tactics`, and `const_dep`.\n\n### `types`\nExtracts constant declarations with their types and modules.\n\n**Supported modes**: `--imports` only\n\n**Example**:\n```bash\nlake run scout --command types --parquet --imports Lean\n```\n\n**Output schema**:\n- `name` (string): Constant name\n- `module` (string, nullable): Module containing the constant\n- `type` (string): Type signature\n- `allowCompletion` (bool): Whether `Lean.Meta.allowCompletion` holds for the constant\n\n**Configuration**:\n- `taskLimit` (optional natural number): Bounds concurrent per-constant worker tasks during imports-mode extraction\n\n**Downstream filtering example**:\n```python\nfiltered = dataset.filter(lambda x: x[\"allowCompletion\"])\n```\n\n### `tactics`\nExtracts tactic invocations with before/after goal states, proof-term views of goals, used constants, used 
free variables, used goals, elaborator info, syntax kinds, and source locations.\n\n**Supported modes**: `--read`, `--library`\n\n**Example**:\n```bash\nlake run scout --command tactics --parquet --parallel 4 --library LeanScoutTest\n```\n\n**Output schema**:\n- `module` (string, nullable): Module containing the tactic (`null` for plain `--read` files without module setup)\n- `startPos` (struct): Start position of the tactic syntax in the source file\n  - `line` (nat): 1-based line number\n  - `column` (nat): 0-based column number\n- `endPos` (struct): End position of the tactic syntax in the source file\n  - `line` (nat): 1-based line number\n  - `column` (nat): 0-based column number\n- `nextStartPos` (struct): Position after the tactic's trailing whitespace; this is the start position of the next syntax in the source file, if any, or the end of the file\n  - `line` (nat): 1-based line number\n  - `column` (nat): 0-based column number\n- `goals` (list): List of goal states before the tactic\n  - `pp` (string): Pretty-printed goal\n  - `ppTerm` (string): Pretty-printed term representation of the original goal metavariable before elaborating the tactic\n  - `assigned` (bool): Whether the original goal metavariable is assigned after elaborating the tactic\n  - `usedConstants` (list of strings): Constants referenced in the instantiated goal, including through delayed-assigned metavariables\n  - `usedFVars` (list of strings): Free variables referenced in the instantiated goal, using the same sanitized names as the pretty-printed goal\n  - `usedGoals` (list): Goals/metavariables referenced in the instantiated goal after following delayed assignments\n    - `new` (bool): Whether the used goal was created by this tactic rather than present before elaborating it\n    - `index` (nat, nullable): Index of the used goal in `goalsAfter`, or `null` if it is not present there\n    - `kind` (string): Metavariable kind (`natural`, `synthetic`, or `syntheticOpaque`)\n    - `pp` 
(string): Pretty-printed used goal\n    - `ppTerm` (string): Pretty-printed term representation of the used goal metavariable\n- `goalsAfter` (list of strings): Pretty-printed goal states after the tactic\n- `ppTac` (string): Pretty-printed tactic syntax\n- `elaborator` (string): Name of the elaborator that produced this tactic\n- `kind` (string): Syntax node kind for the tactic\n\n**Configuration**:\n- No built-in extractor-specific options; config must be `{}`\n\n**Downstream filtering example**:\n```python\nstructural = dataset.filter(lambda x: x[\"kind\"] == \"Lean.Parser.Tactic.tacticSeq\")\n```\n\n### `const_dep`\nExtracts constant dependency information, mapping each constant to the set of constants it uses.\n\n**Supported modes**: `--imports` only\n\n**Example**:\n```bash\nlake run scout --command const_dep --parquet --imports Lean\n```\n\n**Output schema**:\n- `name` (string): Constant name\n- `module` (string, nullable): Module containing the constant\n- `deps` (list of strings): Names of constants directly used by this constant\n- `allowCompletion` (bool): Whether `Lean.Meta.allowCompletion` holds for the parent constant\n\n**Configuration**:\n- `taskLimit` (optional natural number): Bounds concurrent per-constant worker tasks during imports-mode extraction\n\n**Downstream filtering example**:\n```python\nfiltered_rows = dataset.filter(lambda x: x[\"allowCompletion\"])\n# If you need dependency-level metadata, join dependency names against `types` output.\n```\n\n## Creating datasets\n\nIt is straightforward to create a dataset (in the sense of the Hugging Face `datasets` library) from a list of parquet files.\nFor example, once you run\n```bash\nlake run scout --command types --parquet --imports Lean\n```\nto create `parquet` files of the form `./data/*.parquet`, a dataset can be created in Python as follows (see `data.ipynb`):\n```python\nfrom datasets import Dataset\nimport glob\n\ndataset = Dataset.from_parquet(glob.glob(\"./data/*.parquet\"))\n```\nor as 
follows:\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"parquet\", data_dir=\"./data\", split=\"train\")\n```\n\n## How does Lean Scout work?\n\n1. The Lean orchestrator (`Main.lean`) manages one or more Lean subprocesses that extract data and output JSON lines to stdout.\n2. For `--parquet` output, the orchestrator spawns a Python process (`cli.py`) that reads JSON from stdin and writes to Parquet files.\n3. For `--jsonl` output, the orchestrator writes JSON directly to stdout.\n\nThe orchestration logic is implemented in `Main.lean`, with the Parquet writing handled by `src/lean_scout/cli.py`.\n\n## Testing\n\n### Running Tests\n\nFor broad coverage, run both:\n```bash\nlake test                                        # Lean schema tests (LeanScoutTest.lean)\n./run_tests                                      # Main automation suite (ruff, mypy, build, internals, integration, extractors)\n```\n\nTo run individual components:\n```bash\nuv run pytest test/internals/ -v                 # Python parquet writer tests\n./test/integration/test_lean_orchestrator.sh     # Lean orchestrator integration tests\nuv run pytest test/extractors/ -v                # End-to-end extractor tests\n```\n\n### Lean Schema Tests\n\nThe `lake test` command runs `LeanScoutTest.lean`, which validates:\n1. **Schema JSON roundtrip**: All registered data extractors have schemas that serialize to JSON and deserialize back correctly.\n2. **Schema Python roundtrip**: Schema definitions are correctly parsed by the Python parquet writer (`test/schema.py`).\n",1780242005328]