[{"data":1,"prerenderedAt":4},["ShallowReactive",2],{"CJcQLHL73q":3},"# Mathematics Distillation Challenge — Equational Theories — Stage 2\n\n> Official competition page:\n> \u003Chttps://competition.sair.foundation/competitions/mathematics-distillation-challenge-equational-theories-stage2/overview>\n\n## Background\n\nThe pilot task is **equational implication over magmas** (a set with\none binary operation `◇`): given two laws `E₁` and `E₂`, decide whether\n`E₁ ⇒ E₂` holds across **every** magma.\n\nThis challenge is based on the [Equational Theories Project](https://teorth.github.io/equational_theories/),\ninitiated by Terence Tao:\n\n- Raw implication graph: [export_raw_implications](https://teorth.github.io/equational_theories/implications/)\n- Law list — 4694 laws of order ≤ 4: [equations.txt](https://github.com/teorth/equational_theories/blob/main/data/equations.txt)\n- Larger law list of order 5 used by Stage 2: bundled at [`examples/problems/eq_size5.txt`](examples/problems/eq_size5.txt) (~62 K laws)\n\nExample: `E_4: x = x * y` implies `E_3: x = x * x`.\n\nStage 1 asked models for a yes/no answer. **Stage 2 raises the bar**:\nevery answer must come with a machine-verifiable Lean 4 certificate —\na proof for true implications, or a finite magma witness where the\nhypothesis holds but the goal fails. A deterministic Lean judge\naccepts or rejects each answer — no partial credit, no probabilistic\nscoring, no LLM-as-judge. The same judge code runs locally and at the\nofficial evaluation: if the harness in this repo turns green for\nyour `solver.py`, the judge returns the same verdict in production.\n\nThe submission is a single `solver.py`. The competition runs **two\ntracks** with shared judging but different solver shapes — pick\nwhichever fits your strategy.\n\n## Pick Your Track\n\nThe competition has **two tracks**. 
Both share the same judge, the same\nfive-status verdict mapping (`accepted` / `unparsed` / `malformed` /\n`incomplete_proof` / `incorrect`), and the same submission contract:\n**a single `solver.py` file, ≤ 500 KB**. They differ only in how\nproblems and budgets are shaped; one solver source can support both.\n\n### → Solo track\n\n- **One problem per solver subprocess.** Every problem gets a fresh process.\n- **Fixed per-problem budget**: 3600 s wall-clock; LLM calls capped at 65 536 output tokens each; submitted Lean code ≤ 100 KB.\n- Communication: stdin (problem JSON) → stdout (answer JSON), one line each.\n- **Best for**: getting started, deep single-problem search.\n- **Quick Start**: [Solo Quick Start](#solo-quick-start) below.\n- **Full spec**: [`docs/solo_mode.md`](docs/solo_mode.md).\n\n### → Marathon track\n\n- **N problems per solver subprocess** (reference: N=100). One process, one shared global budget.\n- **Compressed global budget**: `compression_ratio × N × Marathon per-problem reference` (600 s and 65 536 tokens per problem; deliberately tighter than Solo's wall-clock, see [`docs/marathon_mode.md`](docs/marathon_mode.md)). The default is `compression_ratio = 0.5`, so the solver cannot finish all N at the per-problem reference cost and must triage.\n- Communication: file-based (read manifest JSONL, append answers JSONL).\n- **Best for**: triage strategies, cross-problem caching, prompt reuse.\n- **Quick Start**: [Marathon Quick Start](#marathon-quick-start) below.\n- **Full spec**: [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\nMost contestants start with Solo.
Marathon is the long-form track where\nstrategic budget allocation is rewarded.\n\n---\n\n## Solo Quick Start\n\n```bash\n# One-command setup (installs Lean, fetches Mathlib, builds judge modules)\nbash scripts/setup.sh\n\n# Install Python deps (OpenAI SDK — defaults to OpenRouter; override\n# via OPENAI_BASE_URL / OPENAI_API_KEY to hit api.openai.com)\npip install openai\n\n# Activate the environment\nsource .env.judge\n\n# Verify the judge works\npython3 scripts/run_harness.py\n\n# Run a demo solver on 20 sample problems\npython3 -m pipeline.runner \\\n  --submission examples/solo/demos/baseline \\\n  --problems examples/problems/sample_20.json\n```\n\n### Prerequisites\n\n- **OS**: macOS (Apple Silicon / Intel) or Linux (x86_64). Windows\n  users should run under WSL 2 — the setup targets POSIX shells.\n- **Disk**: ~3 GB free (Lean toolchain + Mathlib olean cache — this\n  repo is a self-contained lake package depending only on Mathlib; no\n  `equational_theories` clone required).\n- **RAM**: 8 GB minimum, 16 GB recommended.\n- **Network**: Required for initial setup only.\n- **Python**: 3.8+ (with `openai` for pipeline LLM calls).\n- **Git**: 2.x+.\n\n### Manual setup (step-by-step)\n\nIf you prefer to set things up step by step instead of using `setup.sh`:\n\n1. **Install elan** (Lean version manager):\n   ```bash\n   curl -sSf https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh | sh -s -- -y --default-toolchain none\n   export PATH=\"$HOME/.elan/bin:$PATH\"\n   ```\n\n2. **Install the Lean toolchain** (version from this repo's `lean-toolchain`):\n   ```bash\n   TOOLCHAIN=$(cat lean-toolchain | tr -d '[:space:]')\n   elan toolchain install \"$TOOLCHAIN\"\n   elan default \"$TOOLCHAIN\"\n   ```\n\n3. 
**Fetch Mathlib and build the judge modules**:\n   ```bash\n   lake update                  # pin Mathlib per lakefile.lean\n   lake exe cache get           # ~2 GB of pre-compiled Mathlib oleans\n   lake build JudgeMagma.Magma JudgeDecide.DecideBang \\\n              JudgeFinOp.MemoFinOp JudgeSupport.Inspect\n   ```\n\n4. **Configure environment**:\n   ```bash\n   cat > .env.judge \u003C\u003CEOF\n   export LEAN_BIN=\"$(which lean)\"\n   export LAKE_BIN=\"$(which lake)\"\n   export PATH=\"\\$HOME/.elan/bin:\\$PATH\"\n   EOF\n   source .env.judge\n   ```\n\n5. **Verify**: `python3 scripts/run_harness.py`\n\n---\n\n## Marathon Quick Start\n\nMarathon mode runs **one solver subprocess against N problems** under a\nsingle global budget instead of one subprocess per problem. The solver\ncontract is the same single-file `solver.py`; the difference is the I/O\nshape (file-based) and the budgeting.\n\n```bash\n# Run the bundled sequential baseline against a 5-problem manifest.\npython3 scripts/run_marathon.py \\\n  --solver examples/marathon/demos/baseline \\\n  --manifest tests/marathon_fixtures/manifests/normal_5.jsonl\n\n# Run a strategic, LLM-using marathon solver. Set OPENROUTER_API_KEY\n# (or OPENAI_API_KEY) first — it's used by the marathon proxy, never\n# forwarded into the solver subprocess.\nexport OPENROUTER_API_KEY=sk-...\npython3 scripts/run_marathon.py \\\n  --solver examples/marathon/demos/triage \\\n  --manifest examples/problems/marathon/normal_100.jsonl \\\n  --compression-ratio 0.5\n# 100 problems × 600 s × 0.5 ≈ 30 000 s wall-clock. 
Swap in\n# examples/problems/normal.jsonl (1000 problems, ~83 h at the same\n# compression) when you're ready for the full reference set.\n```\n\nThe runner derives `budget_seconds` and `budget_tokens` from\n`compression_ratio × N × Marathon-per-problem-reference` (600 s and\n65 536 tokens; see [`docs/marathon_mode.md`](docs/marathon_mode.md)).\nOverride either budget directly with `--budget-seconds` /\n`--budget-tokens`, or change just the multiplier with\n`--compression-ratio` (default `0.5`; smaller squeezes harder, `1.0` =\nno compression).\n\nRegression harness (separate from `run_harness.py`):\n\n```bash\npython3 scripts/run_marathon_harness.py\n```\n\nFull Marathon spec: [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\n---\n\n## What You Submit\n\nYour submission is a **single Python file** named `solver.py`, up to\n**500 KB**. The file is identical in shape for both tracks; what differs\nis the I/O it implements (Solo: stdin/stdout JSON; Marathon:\nmanifest-in / answers-out files). One source file can support both — see\n[`docs/marathon_mode.md`](docs/marathon_mode.md) for the env-var trigger.\n\n```\nmy_submission/\n└── solver.py       # Your program. For Solo: stdin/stdout JSON protocol.\n                    #                For Marathon: see docs/marathon_mode.md.\n                    # If it uses the LLM in Solo, a module-level\n                    # PROMPT = \"\"\"...\"\"\"  string holds the template.\n```\n\nThe Solo proxy extracts the `PROMPT` constant from `solver.py` via AST\nparsing (the module is never imported or executed on the host), fills\nplaceholders, and sends the rendered prompt to the LLM on the solver's\nbehalf. The Marathon path uses the helper `from marathon_llm import\ncall_llm` (or any OpenAI-SDK call) instead — it does not parse a\n`PROMPT` constant.\n\nThe solver is a free-form program. 
There are no required function\nsignatures — the only requirement is the I/O protocol of the track you\nare running (described in the Solo sections below and in\n[`docs/marathon_mode.md`](docs/marathon_mode.md) respectively).\n\n---\n\n## Reference problem sets\n\nThis repo bundles four problem sets at\n[`examples/problems/`](examples/problems/) — mirrored from the\nHuggingFace dataset\n[`SAIRfoundation/equational-theories-selected-problems`](https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems)\n— **as practice and training material**. The Stage 2 final evaluation\nruns on a held-back set drawn from the same underlying corpus\n(including order-5 laws), so the bundled sets are not the eval set;\nthey are the reference distribution you tune your solver against.\n\n| Set       | Size  | True / False split | Difficulty                                                    |\n|-----------|-------|---------------------|---------------------------------------------------------------|\n| `normal`  | 1 000 | 500 / 500           | Reference distribution. Start here.                           |\n| `hard1`   |    69 |  24 / 45            | Tightly packed pairs; small set, high \"compute / row\" ratio. |\n| `hard2`   |   200 | 100 / 100           | Where the easy patterns run out.                              |\n| `hard3`   |   400 | 195 / 205           | Highest difficulty in the public split.                       |\n\nPlus two synthetic samples for fast iteration:\n[`examples/problems/sample_20.json`](examples/problems/sample_20.json)\n(smoke test) and\n[`examples/problems/sample_200.json`](examples/problems/sample_200.json)\n(200, 100 true / 100 false). Beginners should validate their solver on\n`sample_20.json` first, then move to `normal` once the loop is\nreliable.\n\n---\n\n## Examples & Tutorial\n\nThe `examples/` directory contains demo submissions, sample problems, and per-track tutorials. 
Each track has **3 reference demos** chosen as a learning ladder (skeleton → entry-level LLM → flagship strategy):\n\n```\nexamples/\n├── problems/                     # Sample sets + HF JSONL mirrors\n│   ├── sample_20.json            #   20 sample problems (quick test)\n│   ├── sample_200.json           #   200 problems (100 true + 100 false)\n│   └── (normal|hard1|hard2|hard3).jsonl\n├── solo/\n│   ├── TUTORIAL.md               # Solo: 3 annotated walkthroughs\n│   └── demos/\n│       ├── baseline/             #   Brute-force + singleton + generic LLM fallback (start here)\n│       │   └── solver.py\n│       ├── twophase/             #   gpt-oss-120b: deeper search + analysis-then-implementation LLM\n│       │   └── solver.py\n│       └── opnorm/               #   gpt-oss-120b: 16 deterministic strategies + structural-context LLM (flagship)\n│           └── solver.py\n└── marathon/\n    ├── TUTORIAL.md               # Marathon: 3 walkthroughs (baseline / triage / cross-problem state)\n    └── demos/\n        ├── baseline/             #   Sequential brute-force, no LLM (start here, zero token cost)\n        │   └── solver.py\n        ├── triage/               #   Difficulty-sorted Pass B + Pass C deeper-thought retry on Pass-B no-shows (entry-level LLM)\n        │   └── solver.py\n        └── fewshot/              #   In-run lemma cache + few-shot transfer (cross-problem state, Marathon-only)\n            └── solver.py\n```\n\nEvery submission — including every demo — is a single `solver.py` (≤ 500 KB). 
If the solver uses the LLM, the prompt template lives as a top-level `PROMPT = \"\"\"...\"\"\"` constant inside that same file (Solo) or is embedded inline in the solver (Marathon).\n\n### Running demos\n\nActivate the environment, then run any of the demos:\n\n```bash\nsource .env.judge\n\n# Baseline demo on 20 problems\npython3 -m pipeline.runner \\\n  --submission examples/solo/demos/baseline \\\n  --problems examples/problems/sample_20.json\n\n# OSS two-phase demo on 200 problems\npython3 -m pipeline.runner \\\n  --submission examples/solo/demos/twophase \\\n  --problems examples/problems/sample_200.json\n\n# OSS opnorm reference solver on 200 problems\npython3 -m pipeline.runner \\\n  --submission examples/solo/demos/opnorm \\\n  --problems examples/problems/sample_200.json\n\n# Custom output path\npython3 -m pipeline.runner \\\n  --submission examples/solo/demos/baseline \\\n  --problems examples/problems/sample_200.json \\\n  --output results.json\n```\n\n**Resume behavior**: If the output file already exists, solved problems are skipped (their entries are kept verbatim). Failed entries are dropped on resume and re-run, and the new outcome replaces the old entry, so only one row per problem id ever lands in the output file. To start fresh, delete or rename the output file.\n\n### Interactive CLI\n\n`scripts/submit.py` wraps the same `pipeline.proxy.run_solver` engine as `pipeline.runner`, but adds colorized per-problem rows and a per-problem debug log, and it exits `0` if and only if every selected problem is solved.
Use it when you want a tighter feedback loop than the plain runner.\n\n```bash\n# Quick smoke on the bundled 20-problem sample\npython3 scripts/submit.py \\\n  --submission examples/solo/demos/baseline \\\n  --problems   examples/problems/sample_20.json\n\n# Narrow to a handful of IDs and stream JSON results to disk atomically\npython3 scripts/submit.py \\\n  --submission examples/solo/demos/baseline \\\n  --problems   examples/problems/hard1.jsonl \\\n  --problem-ids hard1_0001,hard1_0007,hard1_0012 \\\n  --output     pipeline/results/hard1_spot.json \\\n  --verbose\n```\n\nAny typo in `--problem-ids` or an empty problem set fails with exit code `2` rather than silently running nothing, so a mistyped flag never masquerades as success.\n\n### Tutorial\n\n**Solo** — see [`examples/solo/TUTORIAL.md`](examples/solo/TUTORIAL.md) for three annotated walkthroughs showing the full solver-proxy-judge interaction:\n\n1. **Deterministic counterexample** -- solver finds a Fin 5 counterexample, no LLM needed (1.9s)\n2. **LLM feedback loop** -- LLM tries 4 times with judge error feedback until proof accepted (77s)\n3. **MATCH-COLLAPSE** -- 9 deterministic strategies fail, then 1 LLM call with specialized prompt succeeds (73s)\n\n**Marathon** — see [`examples/marathon/TUTORIAL.md`](examples/marathon/TUTORIAL.md) for three walkthroughs of marathon-specific strategies:\n\n1. **Free counterexample harvest** -- baseline brute-force pass clears ~40-50% of `normal` at zero token cost\n2. **Triage + deeper-thought retry** -- difficulty-sorted Pass B + budget-aware Pass C re-attempt on Pass-B no-shows with bumped reasoning effort (`triage`)\n3. 
**Marathon-distinctive: in-run lemma cache + few-shot transfer** -- `fewshot` accumulates winning patterns across problems and prepends them to later prompts; cross-problem state is structurally impossible in Solo\n\n## Problem Format\n\nProblems use the [HuggingFace-aligned format](https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems). The binary operation may use `◇` or `*` (auto-normalized to `◇` for Lean):\n\n```json\n{\n  \"id\": \"normal_0646\",\n  \"eq1_id\": 2034,\n  \"eq2_id\": 2417,\n  \"equation1\": \"x = (y ◇ (z ◇ w)) ◇ (u ◇ v)\",\n  \"equation2\": \"x = (y ◇ (z ◇ (w ◇ x))) ◇ z\",\n  \"answer\": true\n}\n```\n\nThe question: **Does equation1 imply equation2?**\n\n## Answer Format\n\n```json\n{\"verdict\": \"true\", \"code\": \"\u003Cfull Lean 4 source code>\"}\n```\n\n- `verdict`: `\"true\"` (prove implication) or `\"false\"` (prove non-implication)\n- `code`: Complete Lean 4 source exposing a `submission : Goal` term (see below)\n\nThe judge writes a per-verify `JudgeProblem.lean` with the two problem\nequations bound as `EquationLHS` / `EquationRHS` plus a verdict-specific\n`abbrev Goal`. Submitter code lives in `Submission.lean` and only has\nto expose a term named `submission` whose type is definitionally equal\nto `Goal`. 
The goal statement itself is judge-controlled (lives in a\nseparately-generated `Problem.lean`), so the submitter doesn't need to\nwrite the theorem header at all.\n\n**Lean primitives the certificates use** (all provided by the judge — no\nexternal Mathlib imports needed for the canonical false-cert shape):\n\n- `◇` — the magma's binary operation (single character; `*` in your\n  problem text is auto-normalized to `◇`)\n- `Magma G` — Lean type class declaring `G` as a magma; `[Magma G]`\n  introduces an instance bringing `◇` into scope\n- `Fin n` — the standard finite type `{0, 1, …, n-1}`; the canonical\n  false-certificate domain\n- `finOpTable \"\u003Cjson>\"` — judge helper that turns a JSON-encoded n×n\n  table into a `Fin n → Fin n → Fin n` operation\n- `decideFin!` — judge tactic that closes a finite-domain goal by\n  exhaustive evaluation of the magma's operation table\n\n### True certificate\n\n`Goal` expands to `∀ (G : Type) [Magma G], EquationLHS G → EquationRHS G`.\n\n```lean\nimport JudgeProblem\n\ndef submission : Goal := by\n  intro G _ h x y z\n  …tactics that produce EquationRHS G using h …\n```\n\n### False certificate\n\n`Goal` expands to `∃ (G : Type) (_ : Magma G), EquationLHS G ∧ ¬ EquationRHS G`.\n\n```lean\nimport JudgeProblem\nimport JudgeDecide.DecideBang\nimport JudgeFinOp.MemoFinOp\nopen MemoFinOp\n\ndef submission : Goal := by\n  let m : Magma (Fin 2) := { op := finOpTable \"[[0,0],[1,1]]\" }\n  refine ⟨Fin 2, m, ?_⟩\n  decideFin!\n```\n\n> **Universe note**: `Goal` is pinned to concrete `Type` (= `Type 0`)\n> in both branches because `abbrev Goal : Prop := ∀ (G : Type _) …`\n> leaves a stuck universe meta that Lean can't resolve at `abbrev`\n> elaboration. 
Submitters work with small types (`Fin n`, concrete\n> magmas) which all live in `Type 0`, so this isn't a practical\n> restriction.\n>\n> **Backward compatibility**: old-style `theorem submission :\n> \u003Cexplicit goal> := …` submissions still verify if they use the new\n> `import JudgeProblem` imports — `Goal` is `@[reducible]`, so the\n> explicit type and `Goal` unify by definitional equality.\n\n---\n\n## System Architecture (Solo)\n\n> Marathon's architecture is parallel but file-based and uses a local\n> HTTP LLM proxy instead of stdin/stdout — see\n> [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\n```\n┌──────────────────────────────────────────────────────────────┐\n│                       Proxy (organizer)                       │\n│                                                               │\n│  1. Start solver as subprocess (sandboxed in production)      │\n│  2. Send problem + budget to solver via stdin                 │\n│  3. Wait for solver requests on stdout                        │\n│  4. For judge calls: forward to judge, return result          │\n│  5. For LLM calls: fill PROMPT template, call LLM API         │\n│  6. On judge \"accepted\" → record result                       │\n│  7. 
On wall-clock timeout → terminate solver                  │\n│                                                               │\n│  ┌────────────────┐                     ┌──────────────────┐ │\n│  │     Solver      │  stdin/stdout JSON  │      Proxy       │ │\n│  │  (contestant)   │◄═══════════════════►│   (organizer)    │ │\n│  │                 │                     │       │    │     │ │\n│  │  - isolated     │                     │       │    │     │ │\n│  │  - no secrets   │                     │       ▼    ▼     │ │\n│  │                 │                     │   ┌─────┐ ┌───┐ │ │\n│  │                 │                     │   │Judge│ │LLM│ │ │\n│  └────────────────┘                     │   └─────┘ └───┘ │ │\n└──────────────────────────────────────────────────────────────┘\n```\n\n| Component | Provider | Network | Description |\n|-----------|----------|---------|-------------|\n| **Solver** | Contestant | Isolated (no secrets, sandboxed in production) | Your program; communicates with proxy via stdin/stdout |\n| **Proxy** | Organizer | Online | Launches solver, mediates all I/O, fills prompt templates, calls LLM API, enforces limits |\n| **Judge** | Organizer | Offline | Deterministic Lean verifier, returns `accepted` or an error |\n| **LLM** | Organizer | Online | Generates proofs/counterexamples when prompted |\n| **Prompt** | Contestant | N/A | `PROMPT` constant inside `solver.py` (single-file submission); proxy fills its placeholders before each LLM call |\n\n---\n\n## Communication Protocol\n\n> **Solo only.** Marathon uses a file-based contract (manifest in, JSONL out) plus a local HTTP LLM proxy — see [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\nAll communication between solver and proxy uses **JSON messages over stdin/stdout**, one JSON object per line. The proxy starts **one solver process per problem**. 
No state carries between problems.\n\n### Startup: Proxy -> Solver\n\nWhen the solver process starts, the proxy writes the problem and budget to stdin:\n\n```json\n{\n  \"problem\": {\n    \"id\": \"normal_0646\",\n    \"eq1_id\": 2034,\n    \"eq2_id\": 2417,\n    \"equation1\": \"x = (y ◇ (z ◇ w)) ◇ (u ◇ v)\",\n    \"equation2\": \"x = (y ◇ (z ◇ (w ◇ x))) ◇ z\"\n  },\n  \"budget\": {\n    \"timeout_seconds\": 3600,\n    \"max_code_length\": 100000,\n    \"max_false_cert_bytes\": 20000\n  }\n}\n```\n\n### Solver -> Proxy: Judge Request\n\n```json\n{\"call\": \"judge\", \"verdict\": \"true\", \"code\": \"import JudgeProblem\\n\\ndef submission : Goal := by\\n...\"}\n```\n\nThe proxy forwards the request to the judge and returns:\n\n```json\n{\"status\": \"accepted\"}\n```\n\nor:\n\n```json\n{\"status\": \"incorrect\", \"stderr\": \"type mismatch...\"}\n```\n\nWhen the proxy sees `\"status\": \"accepted\"`, it records the result automatically. The solver does NOT need a separate \"submit\" action.\n\n### Solver -> Proxy: LLM Request\n\nThe solver sends a context dict (not a raw prompt).
The proxy reads the `PROMPT` constant from `solver.py`, fills all placeholders, and sends the assembled prompt to the LLM.\n\n```json\n{\"call\": \"llm\", \"context\": {\"analysis\": \"No counterexample on Fin 2-3\"}}\n```\n\nProxy fills template, calls LLM, returns:\n\n```json\n{\"response\": \"{\\\"verdict\\\": \\\"true\\\", \\\"proof\\\": \\\"intro x y\\\\n...\\\"}\"}\n```\n\n### Full Example Session\n\n```\nProxy  ──stdin──→  {\"problem\": {...}, \"budget\": {...}}\n\n                   (solver reads problem, does brute-force search, prepares context)\n\nSolver ──stdout─→  {\"call\": \"llm\", \"context\": {\"analysis\": \"No counterexample on Fin 2-5\"}}\nProxy  ──stdin──→  {\"response\": \"{\\\"verdict\\\": \\\"true\\\", \\\"proof\\\": \\\"intro ...\\\"}\"}\n\n                   (solver parses LLM response, builds full Lean code)\n\nSolver ──stdout─→  {\"call\": \"judge\", \"verdict\": \"true\", \"code\": \"import ...\"}\nProxy  ──stdin──→  {\"status\": \"incorrect\", \"stderr\": \"type mismatch ...\"}\n\n                   (solver retries — proxy auto-includes error in {history.*})\n\nSolver ──stdout─→  {\"call\": \"llm\", \"context\": {\"analysis\": \"Judge rejected: type mismatch...\"}}\nProxy  ──stdin──→  {\"response\": \"{\\\"verdict\\\": \\\"true\\\", \\\"proof\\\": \\\"have ...\\\"}\"}\n\nSolver ──stdout─→  {\"call\": \"judge\", \"verdict\": \"true\", \"code\": \"import ...\"}\nProxy  ──stdin──→  {\"status\": \"accepted\"}\n\n                   (proxy records result, terminates solver process)\n```\n\n---\n\n## Prompt Template System\n\n> **Solo only.** The Marathon proxy is an HTTP forwarder — solvers build their own prompts and call it via the OpenAI SDK or `marathon_llm.call_llm`. See [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\nContestants provide a prompt template as a `PROMPT` string constant inside `solver.py`, using placeholders from three namespaces. 
The proxy fills them before each LLM call.\n\n### `{problem.*}` -- Problem data (auto-filled)\n\n| Placeholder | Example |\n|-------------|---------|\n| `{problem.id}` | `normal_0646` |\n| `{problem.eq1_id}` | `2034` |\n| `{problem.eq2_id}` | `2417` |\n| `{problem.eq1_name}` | `Equation2034` |\n| `{problem.eq2_name}` | `Equation2417` |\n| `{problem.equation1}` | `x = (y ◇ (z ◇ w)) ◇ (u ◇ v)` |\n| `{problem.equation2}` | `x = (y ◇ (z ◇ (w ◇ x))) ◇ z` |\n\n### `{history.*}` -- Judge history (auto-accumulated)\n\n| Placeholder | Description |\n|-------------|-------------|\n| `{history.attempts}` | Formatted log of each attempt's verdict, status, and error |\n| `{history.round}` | Number of judge calls so far (`0`, `1`, `2`, ...) |\n| `{history.last_error}` | stderr or message from the most recent rejection |\n| `{history.last_status}` | `incorrect`, `incomplete_proof`, etc. |\n\n### `{solver.*}` -- Solver context (dynamic)\n\nThe solver sends arbitrary key-value pairs in the `context` field of its LLM request. 
The proxy maps each key `k` to `{solver.k}`.\n\nExample: `{\"call\": \"llm\", \"context\": {\"analysis\": \"...\"}}` fills `{solver.analysis}` in the template.\n\n### Unfilled placeholders\n\nAny `{problem.*}`, `{solver.*}`, or `{history.*}` placeholder not matched is silently removed.\n\n### Example PROMPT constant\n\n```python\nPROMPT = \"\"\"You are an expert in universal algebra and Lean 4 theorem proving.\n\nDoes {problem.eq1_name} imply {problem.eq2_name}?\n\nHypothesis ({problem.eq1_name}): ∀ elements, {problem.equation1}\nGoal ({problem.eq2_name}): ∀ elements, {problem.equation2}\n\n## Solver's analysis\n\n{solver.analysis}\n\n## Previous attempts (round {history.round})\n\n{history.attempts}\n\n## Response format\n\nONLY valid JSON, no markdown fences:\n{\"verdict\": \"true\", \"proof\": \"\u003Ctactic body>\"}\nor\n{\"verdict\": \"false\", \"counterexample_table\": [[0,1],[1,0]]}\n\"\"\"\n```\n\n---\n\n## Judge\n\n### Statuses\n\n| Status | Meaning |\n|--------|---------|\n| `accepted` | Proof compiles, type-checks, and passes dependency policy |\n| `unparsed` | Answer is not valid JSON |\n| `malformed` | JSON parses but violates required schema |\n| `incomplete_proof` | Uses `sorry`, `admit`, or banned axioms/dependencies |\n| `incorrect` | Structurally valid but Lean rejects the proof |\n\nThe judge is deterministic: same input always produces same output.\n\n### Constraints\n\n| Constraint | Value |\n|------------|-------|\n| Max code length | 100,000 characters |\n| Max false certificate code | 20,000 bytes |\n| Lean timeout | 300 seconds per proof |\n| Banned tokens | `sorry`, `admit`, `sorryAx`, `dbg_trace`, `dbgTrace`, `run_tac`, `mkSorry`, `initialize`, `builtin_initialize` |\n\n### Available Imports\n\nYour code runs with a sandboxed LEAN_PATH covering the judge's own\nmodules and the Mathlib olean cache. 
Available imports:\n\n- `JudgeProblem` — binds `EquationLHS` / `EquationRHS` to the two\n  problem equations (generated per-verify) plus an `abbrev Goal`\n  whose body is the verdict-specific ∀ / ∃ statement\n- `JudgeDecide.DecideBang` — `decideFin!` / `decide!` tactics for\n  finite-model checking\n- `JudgeFinOp.MemoFinOp` — `open MemoFinOp` exposes `finOpTable`, a\n  JSON-string → `Fin n → Fin n → Fin n` helper for building finite\n  magmas\n- `JudgeMagma.Magma` — the `◇` operator (re-imported by\n  `JudgeProblem`, so you rarely need this directly)\n- `Mathlib.*` — any Mathlib module, pinned by `lakefile.lean`\n\n---\n\n## Configuration\n\n> Below: **Solo** reference budgets and LLM parameters. Marathon derives its global budgets from these via `compression_ratio` — see [`docs/marathon_mode.md`](docs/marathon_mode.md).\n\n> The numbers in `pipeline/config.json` (wall-clock timeout, Lean timeout, code-size caps, sandbox limits, LLM parameters) are a **reference configuration** for Stage 2. They will be tuned based on community feedback as the competition progresses — expect the wall-clock budget and sandbox limits in particular to settle once we see how contestant solvers actually behave. The single-file solver contract and the public five-status verdict semantics are stable; the numerical knobs are not.\n\n### LLM Parameters\n\nAll LLM parameters are fixed by the organizer in `pipeline/config.json`. Contestants cannot change them.\n\n| Parameter | Value |\n|-----------|-------|\n| Model | `openai/gpt-oss-120b` |\n| Provider | `deepinfra/bf16` |\n| Max output tokens | 65,536 |\n| Temperature | 0.0 |\n| Reasoning effort | medium |\n| Seed | 0 (deterministic) |\n\n`reasoning_effort` is pinned to `medium` as a reference: at `high`, `openai/gpt-oss-120b` on `deepinfra/bf16` has been observed to burn the entire HTTP budget inside the reasoning chain and return empty `content` on hard problems. 
`medium` consistently emits substantive Lean within the budget and is the value we run the reference solver against. Subject to change as providers and models evolve.\n\nLLM calls go through the OpenAI SDK with `base_url` pointing at\nOpenRouter by default. Set one of `OPENAI_API_KEY` or\n`OPENROUTER_API_KEY`; flip to OpenAI directly by also setting\n`OPENAI_BASE_URL=https://api.openai.com/v1` (and adjusting the model\nname).\n\nThere are **two routing styles**, both work end-to-end:\n\n1. **Env-driven (single global provider)** — `OPENAI_BASE_URL` +\n   `OPENAI_API_KEY` set in the shell apply to every config. Best for\n   \"I just want to swap OpenRouter for OpenAI everywhere\".\n2. **Config-driven (per-run / per-experiment provider)** — set\n   `llm.base_url` and `llm.api_key_env` in the config JSON. The\n   environment value of `llm.api_key_env` is read at call time:\n\n   ```json\n   \"llm\": {\n     \"model\": \"deepseek-v4-flash\",\n     \"base_url\": \"https://api.deepseek.com/v1\",\n     \"api_key_env\": \"DEEPSEEK_API_KEY\",\n     \"max_output_tokens\": 8192,\n     \"temperature\": 0.2\n   }\n   ```\n\n   The proxy talks to any OpenAI-compatible endpoint this way:\n   DeepSeek, Kimi/Moonshot, GLM/Zhipu, Minimax, Qwen, api.openai.com,\n   etc. — no code changes. OpenRouter-only fields (`provider`,\n   `reasoning_effort`) are emitted only when `base_url` actually\n   points at OpenRouter, so a config that adds just `base_url` +\n   `api_key_env` to the default never leaks OpenRouter routing hints\n   to a direct provider.\n\n### Solver Budgets\n\n| Limit | Value | Description |\n|-------|-------|-------------|\n| Wall-clock timeout | 3600s | Single per-problem budget; pacing LLM/judge calls within this is the solver's responsibility. Widened from the earlier 600s reference so multi-round LLM loops have room to finish under `reasoning_effort=medium`. 
|\n| Solver file size | 500 KB | `solver.py` larger than this is rejected pre-launch |\n\n### Environment Variables\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `LEAN_BIN` | auto-detected | Path to the `lean` binary |\n| `LAKE_BIN` | auto-detected | Path to the `lake` binary |\n| `JUDGE_ARTIFACT_DIR` | `.artifacts` | Where per-verify `JudgeProblem.lean`, `Submission.lean`, and `Problem.lean` are written |\n| `JUDGE_LEAN_PATH` | (none; falls back to `lake env`) | Operator override for `LEAN_PATH`; useful when `.lake/` is read-only and `lake env` can't recompute it |\n| `LEAN_TIMEOUT_SECONDS` | `120` (raw `judge/verify.py`) / `300` (via pipeline, from `judge.lean_timeout_seconds` in `pipeline/config.json`) | Per-proof compilation timeout. The pipeline's 300 s value is what actually runs during evaluation; the 120 s default only applies if you invoke `judge/verify.py` directly without the runner. |\n| `OPENAI_API_KEY` | (none) | Preferred API key for LLM calls; the OpenAI SDK reads it first |\n| `OPENROUTER_API_KEY` | (none) | Fallback key if `OPENAI_API_KEY` is unset; same wire format |\n| `OPENAI_BASE_URL` | `https://openrouter.ai/api/v1` | Env-level base URL; overridden by `llm.base_url` in the config |\n| `\u003Cllm.api_key_env>` | (none) | Whichever name the config's `llm.api_key_env` points at, e.g. `DEEPSEEK_API_KEY` for direct DeepSeek routing |\n\n### Solver Sandbox (optional, MVP)\n\nA contestant's `solver.py` can be run inside a Docker container for host isolation. The mode is set in `pipeline/config.json`:\n\n```json\n\"sandbox\": {\n  \"mode\": \"none\",            // \"none\" (default) | \"docker\"\n  \"image\": \"ee-solver:latest\",\n  \"memory_mb\": 2048,\n  \"cpus\": 2,\n  \"pids_limit\": 64,\n  \"tmpfs_size_mb\": 64\n}\n```\n\nWith `mode = \"none\"` the `memory_mb`, `cpus`, `pids_limit`, and `tmpfs_size_mb` fields are inert: the solver runs as an unsandboxed subprocess directly on the host and inherits the host's resources.
They take effect only when `mode = \"docker\"`, where they are mapped to `--memory` / `--memory-swap` / `--cpus` / `--pids-limit` / `--tmpfs /tmp:size=` on the `docker run` invocation below. The values shown are **reference** numbers that let the bundled reference solver and demos finish within budget on a modest box; like the rest of the config they will be refined based on community feedback once Docker-mode runs are common.\n\nWhen `mode = \"docker\"` the solver is launched as:\n\n```\ndocker run --rm -i --network=none --read-only \\\n  --cap-drop=ALL --security-opt=no-new-privileges:true \\\n  --memory=\u003Cmemory_mb>m --memory-swap=\u003Cmemory_mb>m \\\n  --cpus=\u003Ccpus> --pids-limit=\u003Cpids_limit> \\\n  --tmpfs /tmp:size=\u003Ctmpfs_size_mb>m \\\n  -v \u003Csubmission>:/solver:ro -e PYTHONUNBUFFERED=1 \u003Cimage>\n```\n\nHardening layers: no network, read-only root FS, all capabilities dropped, no-new-privileges, non-root `solver` user (from the image), `--memory-swap` pinned to `--memory` so swap can't double the effective limit, bounded CPU/pid/tmpfs, `/solver` mount read-only. The host `docker` CLI inherits the full host environment (so DOCKER_HOST / DOCKER_CONFIG / TLS vars reach the daemon); the container sees only the minimal env injected via explicit `-e` flags.\n\nBuilding the image: `bash scripts/setup.sh` will build `ee-solver:latest` automatically when Docker is running (silently skipped otherwise).\n\nVerifying the sandbox: `python3 scripts/sandbox_smoke.py` runs four checks (benign solver boots, network blocked, mounted dir read-only, container runs non-root with capability bitmap cleared). 
Exits `2` (skip) if the Docker daemon is unreachable; not part of the canonical harness yet.\n\nThe default remains `\"none\"` so existing setups work unchanged; opt in by flipping `mode` to `\"docker\"` after `setup.sh` succeeds.\n\n#### Sandbox Python environment\n\nThe sandbox image is `python:3.11-slim` plus a small approved set of third-party packages (versions pinned in `Dockerfile`):\n\n| Package | Version | Purpose |\n|---------|---------|---------|\n| `sympy` | `1.13.3` | Symbolic algebra — useful for term parsing, substitution, equation normalization. Magma reasoning is non-associative, so most of sympy's group/ring engine doesn't apply directly, but the parser, free-variable utilities, and pattern matcher are still helpful. |\n\nThe standard library is otherwise the only thing available — no `numpy`, `z3`, `networkx`, etc. Submitting a solver that imports an unlisted package will fail at runtime with `ModuleNotFoundError`. To request additions, open an issue referencing the use case (see `CONTRIBUTING.md`).\n\n### Testing & Harness\n\nThe canonical completion gate is `python3 scripts/run_harness.py` — deterministic, offline, non-interactive. Exit `0` means every suite below passed.\n\n| Suite | Current count | Source of truth | Covers |\n|---|---|---|---|\n| Judge cases | 66 | `tests/harness_manifest.json` | Accepted / malformed / unparsed / incomplete_proof / incorrect on curated fixtures (incl. 
FALSE_CERT_TOO_LARGE) |\n| Judge internals | 32 | `run_judge_internal_cases` in `scripts/run_harness.py` | Unit-level invariants on verify.py helpers (equation normalization, byte-length cap, path stripping, render template stability, JudgeConfig budget-field plumbing for the three judge caps) |\n| Banned tokens | 24 | `run_banned_token_cases` in `scripts/run_harness.py` | Placeholder-detector word-boundary + substring matrix for every entry in `BANNED_PROOF_TOKENS` |\n| Repeatability | 4 | `repeatability_cases` in the same manifest | Selected cases run 3× and must project byte-identical results |\n| Pipeline regressions | 55 | Inlined in `scripts/run_harness.py` | Single-file `PROMPT` extraction (all bundled demos), stray `prompt.txt` is ignored, AST extractor hostile inputs (scope, type, first-wins, AnnAssign, NUL / invalid UTF-8), sandbox argv shape (none / docker / unknown), host-vs-container env selection, stderr drained into bounded ring buffer (so contestant tracebacks land in a `solver_stderr` log entry instead of being silently dropped, without re-introducing the kernel-pipe deadlock), 500 KB `solver.py` intake cap, single-file layout (helper / payload / subdir / symlink rejected), stdout line cap, wall-clock deadline clamping LLM + Lean timeouts, docker-cleanup-in-finally static check, doc-drift guard, public-allowlist demo count, `_call_llm` falls back to DeepSeek-style `reasoning_content` (streaming + non-streaming) when `content` is empty and surfaces `truncated: True` when `finish_reason=length` left no final answer |\n| Verify branches | 3 | `run_verify_branch_cases` in `scripts/run_harness.py` | LEAN_TIMEOUT via mocked `subprocess.run`; FALSE_CERT_TOO_LARGE rejection respects `JudgeConfig.max_false_cert_bytes` (cap=10 KB rejects 15 KB; cap=20 KB admits the same payload) |\n| Public challenger | 79 | `tests/challenger_manifest.json :: public_attack_cases` | Bypass attempts (banned placeholder / axiom / declaration smuggling, stdout injection) plus 
positive-control regressions for previously-false-negative proofs |\n| Infra challenger | 4 | same manifest, `infra_attack_cases` | Organizer-side malformed problems must raise `JudgeConfigurationError`, never map to a contestant verdict |\n\nCurrent repo baseline: **267 green checks** across the suites above (the harness also runs submit-CLI and loader smoke tests, plus a README self-check; the JSON summary lists every `passed_*_count` field separately). The README self-check (`run_readme_consistency_check`) reads the live `summary` map after every suite has run and compares each cell here to the matching `passed_*_count` — so adding a regression auto-bumps the canonical numbers, and any drift here fails the gate. Any nonzero exit blocks completion — do not weaken a test to get green.\n\nReading the JSON summary the harness prints:\n\n- `passed_case_count` / `case_count` — judge suite\n- `passed_pipeline_count` / `pipeline_count` — proxy-layer tests\n- `passed_repeatability_count` / `repeatability_count` — determinism\n- `challenger.passed_public_attack_count` / `public_attack_count` — challenger public\n- `challenger.passed_infra_attack_count` / `infra_attack_count` — organizer infra\n- `failing_*` arrays are empty on green; populated with the offending case detail on failure\n\nAdding a new regression (quickest path):\n\n1. Drop the fixture into `tests/fixtures/` (or `tests/challenger/` for adversarial cases).\n2. Append an entry to the matching manifest with `expected_status` and `expected_error_code`.\n3. 
Rerun `python3 scripts/run_harness.py` and confirm it picks up the new case.\n\nOpt-in Docker sandbox check — *not* part of the canonical gate because it needs the Docker daemon:\n\n```\npython3 scripts/sandbox_smoke.py\n```\n\nExits `0` when the sandbox image boots, blocks network, and blocks writes to the mounted solver dir; `2` when Docker is unreachable (treated as skip); `1` on any assertion failure.\n\n---\n\n## Project Structure\n\n```\n.\n├── README.md                        # This file (entry point + Pick Your Track)\n├── docs/                            # Track specs (read these before submitting)\n│   ├── solo_mode.md                 #   Solo track: I/O contract, budgets, scoring\n│   └── marathon_mode.md             #   Marathon track: same, plus compression_ratio\n│\n├── judge/                           # Deterministic Lean verifier (shared by both tracks)\n│   ├── verify.py                    #   Core verification logic\n│   ├── challenger.py                #   Adversarial test runner\n│   ├── JudgeMagma/Magma.lean        #   `◇` operator + Magma class\n│   ├── JudgeDecide/DecideBang.lean  #   `decideFin!` / `decide!` tactics\n│   ├── JudgeFinOp/MemoFinOp.lean    #   `finOpTable` helper for finite magmas\n│   └── JudgeSupport/Inspect.lean    #   #judge_report dep-tracking metaprogram\n│\n├── pipeline/                        # Evaluation orchestration\n│   ├── proxy.py                     #   Solo: launches solver, mediates stdin/stdout, fills prompts\n│   ├── runner.py                    #   Solo: batch evaluation entry point\n│   ├── config.json                  #   Solo per-problem budgets + LLM parameters\n│   ├── marathon_runner.py           #   Marathon: snapshot manifest, dual-budget watchdog\n│   ├── marathon_proxy.py            #   Marathon: local HTTP proxy (key isolation + token meter)\n│   ├── marathon_score.py            #   Marathon: last-write-wins parser + per-line verify_answer\n│   └── marathon_llm.py              #   Marathon: 
solver-side LLM helper (call_llm)\n│\n├── examples/                        # Demo submissions + sample problems\n│   ├── problems/                    #   Sample sets + HF JSONL mirrors\n│   │   ├── sample_20.json           #     20 sample problems\n│   │   ├── sample_200.json          #     200 problems (100 true + 100 false)\n│   │   └── (normal|hard1|hard2|hard3).jsonl   # HF SAIR sets\n│   ├── solo/                        #   Solo track: 3 reference demos + tutorial\n│   │   ├── TUTORIAL.md\n│   │   └── demos/\n│   │       ├── baseline/            #     Brute-force + singleton + LLM fallback (start here)\n│   │       ├── twophase/            #     gpt-oss-120b + two-phase strategy\n│   │       └── opnorm/              #     gpt-oss-120b + opnorm flagship reference solver\n│   └── marathon/                    #   Marathon track: 3 reference demos + tutorial\n│       ├── TUTORIAL.md\n│       └── demos/\n│           ├── baseline/            #     Sequential brute-force, no LLM (start here, zero token cost)\n│           ├── triage/              #     Difficulty-sorted Pass B + Pass C deeper-thought retry on Pass-B no-shows\n│           └── fewshot/             #     In-run lemma cache + few-shot transfer (Marathon-only strategy)\n│           # Each demo is a single solver.py\n│\n├── tests/                           # Test data\n│   ├── harness_manifest.json        #   Solo harness cases\n│   ├── challenger_manifest.json     #   Solo adversarial cases\n│   ├── fixtures/                    #   Solo fixtures\n│   ├── marathon_manifest.json       #   Marathon harness cases\n│   └── marathon_fixtures/           #   Marathon fixtures (manifests + fixture solvers)\n│\n├── scripts/\n│   ├── setup.sh                     #   One-command environment setup\n│   ├── run_harness.py               #   Solo harness — canonical green gate\n│   ├── run_marathon.py              #   Marathon CLI entry (run + score)\n│   ├── run_marathon_harness.py      #   Marathon harness — separate 
green gate\n│   └── submit.py                    #   Interactive CLI runner (colorized; Solo)\n│\n├── lakefile.lean                    #   Self-contained lake package (depends only on Mathlib)\n├── lake-manifest.json               #   Pinned Mathlib revision\n├── lean-toolchain                   #   Pinned Lean toolchain version\n└── .env.judge                       #   (gitignored) generated environment config\n```\n\n## Troubleshooting\n\n**\"missing lean/lake binary\"**\n-- `source .env.judge` to set the correct paths, or install elan and re-run setup.\n\n**Lean timeout on valid proofs**\n-- The pipeline already passes `judge.lean_timeout_seconds = 300` from `pipeline/config.json`; that value is what runs during evaluation. If you're invoking `judge/verify.py` directly (outside the runner), it falls back to a 120 s default — `export LEAN_TIMEOUT_SECONDS=300` matches the pipeline. To raise the cap globally, edit `pipeline/config.json`.\n\n**\"lake env failed\"**\n-- Mathlib isn't built in this working tree. Run `lake update && lake exe cache get && lake build JudgeMagma.Magma JudgeDecide.DecideBang JudgeFinOp.MemoFinOp JudgeSupport.Inspect`, or re-run `bash scripts/setup.sh`.\n\n**\"JudgeProblem does not have expected universe\"** / universe inference errors in the judge output\n-- Your submission's type uses `Type _` in a position where Lean can't infer the universe at elaboration. The judge's `Goal` is pinned to concrete `Type` (= `Type 0`); use `Type` in any explicit type annotations that must unify with `Goal`. See the Universe note under [Answer Format](#answer-format) for details.\n\n**LLM call returns an empty response with `reasoning` populated**\n-- The model exhausted its token budget mid-chain-of-thought. The proxy will fall back to `message.reasoning` automatically. 
The default `reasoning_effort` in `pipeline/config.json` is already `medium`; if you keep hitting this, drop it to `low` or `minimal`, or trim your PROMPT so the model has room to emit a structured answer after reasoning.\n\n**\"OPENAI_API_KEY or OPENROUTER_API_KEY not set\"**\n-- Set either one in the environment (they're interchangeable at the wire level). Persist to `.env` if you want it across shells.\n\n## License\n\nLicensed under the [Apache License, Version 2.0](LICENSE). See `LICENSE` for the full text.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) — issue-first policy, bug-report\nrequired fields, and trivial-fix exceptions are documented there.\n",1780113385210]