Mathematics Distillation Challenge — Equational Theories — Stage 2
Official competition page: https://competition.sair.foundation/competitions/mathematics-distillation-challenge-equational-theories-stage2/overview
Background
The pilot task is equational implication over magmas (a set with
one binary operation ◇): given two laws E₁ and E₂, decide whether
E₁ ⇒ E₂ holds across every magma.
This challenge is based on the Equational Theories Project, initiated by Terence Tao:
- Raw implication graph: export_raw_implications
- Law list — 4694 laws of order ≤ 4: equations.txt
- Larger law list of order 5 used by Stage 2: bundled at
examples/problems/eq_size5.txt(~62 K laws)
Example: E_4: x = x * y implies E_3: x = x * x.
Stage 1 asked models for a yes/no answer. Stage 2 raises the bar:
every answer must come with a machine-verifiable Lean 4 certificate —
a proof for true implications, or a finite magma witness where the
hypothesis holds but the goal fails. A deterministic Lean judge
accepts or rejects each answer — no partial credit, no probabilistic
scoring, no LLM-as-judge. The same judge code runs locally and at the
official evaluation: if the harness in this repo turns green for
your solver.py, the judge returns the same verdict in production.
The submission is a single solver.py. The competition runs two
tracks with shared judging but different solver shapes — pick
whichever fits your strategy.
Pick Your Track
The competition has two tracks. Both share the same judge, the same
five-status verdict mapping (accepted / unparsed / malformed /
incomplete_proof / incorrect), and the same submission contract:
a single solver.py file, ≤ 500 KB. They differ only in how
problems and budgets are shaped — one solver source can support both.
→ Solo track
- One problem per solver subprocess. Every problem gets a fresh process.
- Fixed per-problem budget: 3600 s wall-clock; LLM calls capped at 65 536 output tokens each; submitted Lean code ≤ 100 KB.
- Communication: stdin (problem JSON) → stdout (answer JSON), one line each.
- Best for: getting started, deep single-problem search.
- Quick Start: Solo Quick Start below.
- Full spec:
docs/solo_mode.md.
→ Marathon track
- N problems per solver subprocess (reference: N=100). One process, one shared global budget.
- Compressed global budget:
compression_ratio × N × Marathon per-problem reference(600 s + 65 536 tokens per problem; deliberately tighter than Solo's wall-clock, seedocs/marathon_mode.md). Defaultcompression_ratio = 0.5— solver cannot finish all N at the per-problem reference cost and must triage. - Communication: file-based (read manifest JSONL, append answers JSONL).
- Best for: triage strategies, cross-problem caching, prompt reuse.
- Quick Start: Marathon Quick Start below.
- Full spec:
docs/marathon_mode.md.
Most contestants start with Solo. Marathon is the long-form track where strategic budget allocation is rewarded.
Solo Quick Start
# One-command setup (installs Lean, fetches Mathlib, builds judge modules)
bash scripts/setup.sh
# Install Python deps (OpenAI SDK — defaults to OpenRouter; override
# via OPENAI_BASE_URL / OPENAI_API_KEY to hit api.openai.com)
pip install openai
# Activate the environment
source .env.judge
# Verify the judge works
python3 scripts/run_harness.py
# Run a demo solver on 20 sample problems
python3 -m pipeline.runner \
--submission examples/solo/demos/baseline \
--problems examples/problems/sample_20.json
Prerequisites
- OS: macOS (Apple Silicon / Intel) or Linux (x86_64). Windows users should run under WSL 2 — the setup targets POSIX shells.
- Disk: ~3 GB free (Lean toolchain + Mathlib olean cache — this
repo is a self-contained lake package depending only on Mathlib; no
equational_theoriesclone required). - RAM: 8 GB minimum, 16 GB recommended.
- Network: Required for initial setup only.
- Python: 3.8+ (with
openaifor pipeline LLM calls). - Git: 2.x+.
Manual setup (step-by-step)
If you prefer to set things up step by step instead of using setup.sh:
-
Install elan (Lean version manager):
curl -sSf https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh | sh -s -- -y --default-toolchain none export PATH="$HOME/.elan/bin:$PATH" -
Install the Lean toolchain (version from this repo's
lean-toolchain):TOOLCHAIN=$(cat lean-toolchain | tr -d '[:space:]') elan toolchain install "$TOOLCHAIN" elan default "$TOOLCHAIN" -
Fetch Mathlib and build the judge modules:
lake update # pin Mathlib per lakefile.lean lake exe cache get # ~2 GB of pre-compiled Mathlib oleans lake build JudgeMagma.Magma JudgeDecide.DecideBang \ JudgeFinOp.MemoFinOp JudgeSupport.Inspect -
Configure environment:
cat > .env.judge <<EOF export LEAN_BIN="$(which lean)" export LAKE_BIN="$(which lake)" export PATH="\$HOME/.elan/bin:\$PATH" EOF source .env.judge -
Verify:
python3 scripts/run_harness.py
Marathon Quick Start
Marathon mode runs one solver subprocess against N problems under a
single global budget instead of one subprocess per problem. The solver
contract is the same single-file solver.py; the difference is the I/O
shape (file-based) and the budgeting.
# Run the bundled sequential baseline against a 5-problem manifest.
python3 scripts/run_marathon.py \
--solver examples/marathon/demos/baseline \
--manifest tests/marathon_fixtures/manifests/normal_5.jsonl
# Run a strategic, LLM-using marathon solver. Set OPENROUTER_API_KEY
# (or OPENAI_API_KEY) first — it's used by the marathon proxy, never
# forwarded into the solver subprocess.
export OPENROUTER_API_KEY=sk-...
python3 scripts/run_marathon.py \
--solver examples/marathon/demos/triage \
--manifest examples/problems/marathon/normal_100.jsonl \
--compression-ratio 0.5
# 100 problems × 600 s × 0.5 ≈ 30 000 s wall-clock. Swap in
# examples/problems/normal.jsonl (1000 problems, ~83 h at the same
# compression) when you're ready for the full reference set.
The runner derives budget_seconds and budget_tokens from
compression_ratio × N × Marathon-per-problem-reference (600 s and
65 536 tokens; see docs/marathon_mode.md).
Override either budget directly with --budget-seconds /
--budget-tokens, or change just the multiplier with
--compression-ratio (default 0.5; smaller squeezes harder, 1.0 =
no compression).
Regression harness (separate from run_harness.py):
python3 scripts/run_marathon_harness.py
Full Marathon spec: docs/marathon_mode.md.
What You Submit
Your submission is a single Python file named solver.py, up to
500 KB. The file is identical in shape for both tracks; what differs
is the I/O it implements (Solo: stdin/stdout JSON; Marathon:
manifest-in / answers-out files). One source file can support both — see
docs/marathon_mode.md for the env-var trigger.
my_submission/
└── solver.py # Your program. For Solo: stdin/stdout JSON protocol.
# For Marathon: see docs/marathon_mode.md.
# If it uses the LLM in Solo, a module-level
# PROMPT = """...""" string holds the template.
The Solo proxy extracts the PROMPT constant from solver.py via AST
parsing (the module is never imported or executed on the host), fills
placeholders, and sends the rendered prompt to the LLM on the solver's
behalf. The Marathon path uses the helper from marathon_llm import call_llm (or any OpenAI-SDK call) instead — it does not parse a
PROMPT constant.
The solver is a free-form program. There are no required function
signatures — the only requirement is the I/O protocol of the track you
are running (described in the Solo sections below and in
docs/marathon_mode.md respectively).
Reference problem sets
This repo bundles four problem sets at
examples/problems/ — mirrored from the
HuggingFace dataset
SAIRfoundation/equational-theories-selected-problems
— as practice and training material. The Stage 2 final evaluation
runs on a held-back set drawn from the same underlying corpus
(including order-5 laws), so the bundled sets are not the eval set;
they are the reference distribution you tune your solver against.
| Set | Size | True / False split | Difficulty |
|---|---|---|---|
normal | 1 000 | 500 / 500 | Reference distribution. Start here. |
hard1 | 69 | 24 / 45 | Tightly packed pairs; small set, high "compute / row" ratio. |
hard2 | 200 | 100 / 100 | Where the easy patterns run out. |
hard3 | 400 | 195 / 205 | Highest difficulty in the public split. |
Plus two synthetic samples for fast iteration:
examples/problems/sample_20.json
(smoke test) and
examples/problems/sample_200.json
(200, 100 true / 100 false). Beginners should validate their solver on
sample_20.json first, then move to normal once the loop is
reliable.
Examples & Tutorial
The examples/ directory contains demo submissions, sample problems, and per-track tutorials. Each track has 3 reference demos chosen as a learning ladder (skeleton → entry-level LLM → flagship strategy):
examples/
├── problems/ # Sample sets + HF JSONL mirrors
│ ├── sample_20.json # 20 sample problems (quick test)
│ ├── sample_200.json # 200 problems (100 true + 100 false)
│ └── (normal|hard1|hard2|hard3).jsonl
├── solo/
│ ├── TUTORIAL.md # Solo: 3 annotated walkthroughs
│ └── demos/
│ ├── baseline/ # Brute-force + singleton + generic LLM fallback (start here)
│ │ └── solver.py
│ ├── twophase/ # gpt-oss-120b: deeper search + analysis-then-implementation LLM
│ │ └── solver.py
│ └── opnorm/ # gpt-oss-120b: 16 deterministic strategies + structural-context LLM (flagship)
│ └── solver.py
└── marathon/
├── TUTORIAL.md # Marathon: 3 walkthroughs (baseline / triage / cross-problem state)
└── demos/
├── baseline/ # Sequential brute-force, no LLM (start here, zero token cost)
│ └── solver.py
├── triage/ # Difficulty-sorted Pass B + Pass C deeper-thought retry on Pass-B no-shows (entry-level LLM)
│ └── solver.py
└── fewshot/ # In-run lemma cache + few-shot transfer (cross-problem state, Marathon-only)
└── solver.py
Every submission — including every demo — is a single solver.py (≤ 500 KB). If the solver uses the LLM, the prompt template lives as a top-level PROMPT = """...""" constant inside that same file (Solo) or is embedded inline in the solver (Marathon).
Running demos
After source .env.judge:
source .env.judge
# Baseline demo on 20 problems
python3 -m pipeline.runner \
--submission examples/solo/demos/baseline \
--problems examples/problems/sample_20.json
# OSS two-phase demo on 200 problems
python3 -m pipeline.runner \
--submission examples/solo/demos/twophase \
--problems examples/problems/sample_200.json
# OSS opnorm reference solver on 200 problems
python3 -m pipeline.runner \
--submission examples/solo/demos/opnorm \
--problems examples/problems/sample_200.json
# Custom output path
python3 -m pipeline.runner \
--submission examples/solo/demos/baseline \
--problems examples/problems/sample_200.json \
--output results.json
Resume behavior: If the output file already exists, solved problems are skipped (their entries are kept verbatim). Failed entries are dropped on resume and re-run, and the new outcome replaces the old entry — only one row per problem id ever lands in the output file. To start fresh, delete or rename the output file.
Interactive CLI
scripts/submit.py wraps the same pipeline.proxy.run_solver engine as pipeline.runner, but adds colorized per-problem rows, a per-problem debug log, and exits 0 iff every selected problem is solved. Use it when you want a tighter feedback loop than the plain runner.
# Quick smoke on the bundled 20-problem sample
python3 scripts/submit.py \
--submission examples/solo/demos/baseline \
--problems examples/problems/sample_20.json
# Narrow to a handful of IDs and stream JSON results to disk atomically
python3 scripts/submit.py \
--submission examples/solo/demos/baseline \
--problems examples/problems/hard1.jsonl \
--problem-ids hard1_0001,hard1_0007,hard1_0012 \
--output pipeline/results/hard1_spot.json \
--verbose
Any typo in --problem-ids or an empty problem set fails with exit code 2 rather than silently running nothing, so a mistyped flag never masquerades as success.
Tutorial
Solo — see examples/solo/TUTORIAL.md for three annotated walkthroughs showing the full solver-proxy-judge interaction:
- Deterministic counterexample -- solver finds a Fin 5 counterexample, no LLM needed (1.9s)
- LLM feedback loop -- LLM tries 4 times with judge error feedback until proof accepted (77s)
- MATCH-COLLAPSE -- 9 deterministic strategies fail, then 1 LLM call with specialized prompt succeeds (73s)
Marathon — see examples/marathon/TUTORIAL.md for three walkthroughs of marathon-specific strategies:
- Free counterexample harvest -- baseline brute-force pass clears ~40-50% of
normalat zero token cost - Triage + deeper-thought retry -- difficulty-sorted Pass B + budget-aware Pass C re-attempt on Pass-B no-shows with bumped reasoning effort (
triage) - Marathon-distinctive: in-run lemma cache + few-shot transfer --
fewshotaccumulates winning patterns across problems and prepends them to later prompts; cross-problem state is structurally impossible in Solo
Problem Format
Problems use the HuggingFace-aligned format. The binary operation may use ◇ or * (auto-normalized to ◇ for Lean):
{
"id": "normal_0646",
"eq1_id": 2034,
"eq2_id": 2417,
"equation1": "x = (y ◇ (z ◇ w)) ◇ (u ◇ v)",
"equation2": "x = (y ◇ (z ◇ (w ◇ x))) ◇ z",
"answer": true
}
The question: Does equation1 imply equation2?
Answer Format
{"verdict": "true", "code": "<full Lean 4 source code>"}
verdict:"true"(prove implication) or"false"(prove non-implication)code: Complete Lean 4 source exposing asubmission : Goalterm (see below)
The judge writes a per-verify JudgeProblem.lean with the two problem
equations bound as EquationLHS / EquationRHS plus a verdict-specific
abbrev Goal. Submitter code lives in Submission.lean and only has
to expose a term named submission whose type is definitionally equal
to Goal. The goal statement itself is judge-controlled (lives in a
separately-generated Problem.lean), so the submitter doesn't need to
write the theorem header at all.
Lean primitives the certificates use (all provided by the judge — no external Mathlib imports needed for the canonical false-cert shape):
◇— the magma's binary operation (single character;*in your problem text is auto-normalized to◇)Magma G— Lean type class declaringGas a magma;[Magma G]introduces an instance bringing◇into scopeFin n— the standard finite type{0, 1, …, n-1}; the canonical false-certificate domainfinOpTable "<json>"— judge helper that turns a JSON-encoded n×n table into aFin n → Fin n → Fin noperationdecideFin!— judge tactic that closes a finite-domain goal by exhaustive evaluation of the magma's operation table
True certificate
Goal expands to ∀ (G : Type) [Magma G], EquationLHS G → EquationRHS G.
import JudgeProblem
def submission : Goal := by
intro G _ h x y z
…tactics that produce EquationRHS G using h …
False certificate
Goal expands to ∃ (G : Type) (_ : Magma G), EquationLHS G ∧ ¬ EquationRHS G.
import JudgeProblem
import JudgeDecide.DecideBang
import JudgeFinOp.MemoFinOp
open MemoFinOp
def submission : Goal := by
let m : Magma (Fin 2) := { op := finOpTable "[[0,0],[1,1]]" }
refine ⟨Fin 2, m, ?_⟩
decideFin!
Universe note:
Goalis pinned to concreteType(=Type 0) in both branches becauseabbrev Goal : Prop := ∀ (G : Type _) …leaves a stuck universe meta that Lean can't resolve atabbrevelaboration. Submitters work with small types (Fin n, concrete magmas) which all live inType 0, so this isn't a practical restriction.Backward compatibility: old-style
theorem submission : <explicit goal> := …submissions still verify if they use the newimport JudgeProblemimports —Goalis@[reducible], so the explicit type andGoalunify by definitional equality.
System Architecture (Solo)
Marathon's architecture is parallel but file-based and uses a local HTTP LLM proxy instead of stdin/stdout — see
docs/marathon_mode.md.
┌──────────────────────────────────────────────────────────────┐
│ Proxy (organizer) │
│ │
│ 1. Start solver as subprocess (sandboxed in production) │
│ 2. Send problem + budget to solver via stdin │
│ 3. Wait for solver requests on stdout │
│ 4. For judge calls: forward to judge, return result │
│ 5. For LLM calls: fill PROMPT template, call LLM API │
│ 6. On judge "accepted" → record result │
│ 7. On wall-clock timeout → terminate solver │
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ Solver │ stdin/stdout JSON │ Proxy │ │
│ │ (contestant) │◄═══════════════════►│ (organizer) │ │
│ │ │ │ │ │ │ │
│ │ - isolated │ │ │ │ │ │
│ │ - no secrets │ │ ▼ ▼ │ │
│ │ │ │ ┌─────┐ ┌───┐ │ │
│ │ │ │ │Judge│ │LLM│ │ │
│ └────────────────┘ │ └─────┘ └───┘ │ │
└──────────────────────────────────────────────────────────────┘
| Component | Provider | Network | Description |
|---|---|---|---|
| Solver | Contestant | Isolated (no secrets, sandboxed in production) | Your program; communicates with proxy via stdin/stdout |
| Proxy | Organizer | Online | Launches solver, mediates all I/O, fills prompt templates, calls LLM API, enforces limits |
| Judge | Organizer | Offline | Deterministic Lean verifier, returns accepted or an error |
| LLM | Organizer | Online | Generates proofs/counterexamples when prompted |
| Prompt | Contestant | N/A | PROMPT constant inside solver.py (single-file submission); proxy fills its placeholders before each LLM call |
Communication Protocol
Solo only. Marathon uses a file-based contract (manifest in, JSONL out) plus a local HTTP LLM proxy — see
docs/marathon_mode.md.
All communication between solver and proxy uses JSON messages over stdin/stdout, one JSON object per line. The proxy starts one solver process per problem. No state carries between problems.
Startup: Proxy -> Solver
When the solver process starts, the proxy writes the problem and budget to stdin:
{
"problem": {
"id": "normal_0646",
"eq1_id": 2034,
"eq2_id": 2417,
"equation1": "x = (y ◇ (z ◇ w)) ◇ (u ◇ v)",
"equation2": "x = (y ◇ (z ◇ (w ◇ x))) ◇ z"
},
"budget": {
"timeout_seconds": 3600,
"max_code_length": 100000,
"max_false_cert_bytes": 20000
}
}
Solver -> Proxy: Judge Request
{"call": "judge", "verdict": "true", "code": "import JudgeProblem\n\ndef submission : Goal := by\n..."}
Proxy forwards to judge, returns:
{"status": "accepted"}
or:
{"status": "incorrect", "stderr": "type mismatch..."}
When proxy sees "status": "accepted", it records the result automatically. The solver does NOT need a separate "submit" action.
Solver -> Proxy: LLM Request
The solver sends a context dict (not a raw prompt). The proxy reads the PROMPT constant from solver.py, fills all placeholders, and sends the assembled prompt to the LLM.
{"call": "llm", "context": {"analysis": "No counterexample on Fin 2-3"}}
Proxy fills template, calls LLM, returns:
{"response": "{\"verdict\": \"true\", \"proof\": \"intro x y\\n...\"}"}
Full Example Session
Proxy ──stdin──→ {"problem": {...}, "budget": {...}}
(solver reads problem, does brute-force search, prepares context)
Solver ──stdout─→ {"call": "llm", "context": {"analysis": "No counterexample on Fin 2-5"}}
Proxy ──stdin──→ {"response": "{\"verdict\": \"true\", \"proof\": \"intro ...\"}"}
(solver parses LLM response, builds full Lean code)
Solver ──stdout─→ {"call": "judge", "verdict": "true", "code": "import ..."}
Proxy ──stdin──→ {"status": "incorrect", "stderr": "type mismatch ..."}
(solver retries — proxy auto-includes error in {history.*})
Solver ──stdout─→ {"call": "llm", "context": {"analysis": "Judge rejected: type mismatch..."}}
Proxy ──stdin──→ {"response": "{\"verdict\": \"true\", \"proof\": \"have ...\"}"}
Solver ──stdout─→ {"call": "judge", "verdict": "true", "code": "import ..."}
Proxy ──stdin──→ {"status": "accepted"}
(proxy records result, terminates solver process)
Prompt Template System
Solo only. The Marathon proxy is an HTTP forwarder — solvers build their own prompts and call it via the OpenAI SDK or
marathon_llm.call_llm. Seedocs/marathon_mode.md.
Contestants provide a prompt template as a PROMPT string constant inside solver.py, using placeholders from three namespaces. The proxy fills them before each LLM call.
{problem.*} -- Problem data (auto-filled)
| Placeholder | Example |
|---|---|
{problem.id} | normal_0646 |
{problem.eq1_id} | 2034 |
{problem.eq2_id} | 2417 |
{problem.eq1_name} | Equation2034 |
{problem.eq2_name} | Equation2417 |
{problem.equation1} | x = (y ◇ (z ◇ w)) ◇ (u ◇ v) |
{problem.equation2} | x = (y ◇ (z ◇ (w ◇ x))) ◇ z |
{history.*} -- Judge history (auto-accumulated)
| Placeholder | Description |
|---|---|
{history.attempts} | Formatted log of each attempt's verdict, status, and error |
{history.round} | Number of judge calls so far (0, 1, 2, ...) |
{history.last_error} | stderr or message from the most recent rejection |
{history.last_status} | incorrect, incomplete_proof, etc. |
{solver.*} -- Solver context (dynamic)
The solver sends arbitrary key-value pairs in the context field of its LLM request. The proxy maps each key k to {solver.k}.
Example: {"call": "llm", "context": {"analysis": "..."}} fills {solver.analysis} in the template.
Unfilled placeholders
Any {problem.*}, {solver.*}, or {history.*} placeholder not matched is silently removed.
Example PROMPT constant
PROMPT = """You are an expert in universal algebra and Lean 4 theorem proving.
Does {problem.eq1_name} imply {problem.eq2_name}?
Hypothesis ({problem.eq1_name}): ∀ elements, {problem.equation1}
Goal ({problem.eq2_name}): ∀ elements, {problem.equation2}
## Solver's analysis
{solver.analysis}
## Previous attempts (round {history.round})
{history.attempts}
## Response format
ONLY valid JSON, no markdown fences:
{"verdict": "true", "proof": "<tactic body>"}
or
{"verdict": "false", "counterexample_table": [[0,1],[1,0]]}
"""
Judge
Statuses
| Status | Meaning |
|---|---|
accepted | Proof compiles, type-checks, and passes dependency policy |
unparsed | Answer is not valid JSON |
malformed | JSON parses but violates required schema |
incomplete_proof | Uses sorry, admit, or banned axioms/dependencies |
incorrect | Structurally valid but Lean rejects the proof |
The judge is deterministic: same input always produces same output.
Constraints
| Constraint | Value |
|---|---|
| Max code length | 100,000 characters |
| Max false certificate code | 20,000 bytes |
| Lean timeout | 300 seconds per proof |
| Banned tokens | sorry, admit, sorryAx, dbg_trace, dbgTrace, run_tac, mkSorry, initialize, builtin_initialize |
Available Imports
Your code runs with a sandboxed LEAN_PATH covering the judge's own modules and the Mathlib olean cache. Available imports:
JudgeProblem— bindsEquationLHS/EquationRHSto the two problem equations (generated per-verify) plus anabbrev Goalwhose body is the verdict-specific ∀ / ∃ statementJudgeDecide.DecideBang—decideFin!/decide!tactics for finite-model checkingJudgeFinOp.MemoFinOp—open MemoFinOpexposesfinOpTable, a JSON-string →Fin n → Fin n → Fin nhelper for building finite magmasJudgeMagma.Magma— the◇operator (re-imported byJudgeProblem, so you rarely need this directly)Mathlib.*— any Mathlib module, pinned bylakefile.lean
Configuration
Below: Solo reference budgets and LLM parameters. Marathon derives its global budgets from these via
compression_ratio— seedocs/marathon_mode.md.
The numbers in
pipeline/config.json(wall-clock timeout, Lean timeout, code-size caps, sandbox limits, LLM parameters) are a reference configuration for Stage 2. They will be tuned based on community feedback as the competition progresses — expect the wall-clock budget and sandbox limits in particular to settle once we see how contestant solvers actually behave. The single-file solver contract and the public five-status verdict semantics are stable; the numerical knobs are not.
LLM Parameters
All LLM parameters are fixed by the organizer in pipeline/config.json. Contestants cannot change them.
| Parameter | Value |
|---|---|
| Model | openai/gpt-oss-120b |
| Provider | deepinfra/bf16 |
| Max output tokens | 65,536 |
| Temperature | 0.0 |
| Reasoning effort | medium |
| Seed | 0 (deterministic) |
reasoning_effort is pinned to medium as a reference: at high, openai/gpt-oss-120b on deepinfra/bf16 has been observed to burn the entire HTTP budget inside the reasoning chain and return empty content on hard problems. medium consistently emits substantive Lean within the budget and is the value we run the reference solver against. Subject to change as providers and models evolve.
LLM calls go through the OpenAI SDK with base_url pointing at
OpenRouter by default. Set one of OPENAI_API_KEY or
OPENROUTER_API_KEY; flip to OpenAI directly by also setting
OPENAI_BASE_URL=https://api.openai.com/v1 (and adjusting the model
name).
There are two routing styles, both work end-to-end:
-
Env-driven (single global provider) —
OPENAI_BASE_URL+OPENAI_API_KEYset in the shell apply to every config. Best for "I just want to swap OpenRouter for OpenAI everywhere". -
Config-driven (per-run / per-experiment provider) — set
llm.base_urlandllm.api_key_envin the config JSON. The environment value ofllm.api_key_envis read at call time:"llm": { "model": "deepseek-v4-flash", "base_url": "https://api.deepseek.com/v1", "api_key_env": "DEEPSEEK_API_KEY", "max_output_tokens": 8192, "temperature": 0.2 }The proxy talks to any OpenAI-compatible endpoint this way: DeepSeek, Kimi/Moonshot, GLM/Zhipu, Minimax, Qwen, api.openai.com, etc. — no code changes. OpenRouter-only fields (
provider,reasoning_effort) are emitted only whenbase_urlactually points at OpenRouter, so a config that adds justbase_url+api_key_envto the default never leaks OpenRouter routing hints to a direct provider.
Solver Budgets
| Limit | Value | Description |
|---|---|---|
| Wall-clock timeout | 3600s | Single per-problem budget; pacing LLM/judge calls within this is the solver's responsibility. Widened from the earlier 600s reference so multi-round LLM loops have room to finish under reasoning_effort=medium. |
| Solver file size | 500 KB | solver.py larger than this is rejected pre-launch |
Environment Variables
| Variable | Default | Description |
|---|---|---|
LEAN_BIN | auto-detected | Path to the lean binary |
LAKE_BIN | auto-detected | Path to the lake binary |
JUDGE_ARTIFACT_DIR | .artifacts | Where per-verify JudgeProblem.lean, Submission.lean, and Problem.lean are written |
JUDGE_LEAN_PATH | (none; falls back to lake env) | Operator override for LEAN_PATH — useful when .lake/ is read-only and lake env can't recompute |
LEAN_TIMEOUT_SECONDS | 120 (raw judge/verify.py) / 300 (via pipeline, from judge.lean_timeout_seconds in pipeline/config.json) | Per-proof compilation timeout. The pipeline's 300 s value is what actually runs during evaluation; the 120 s default only applies if you invoke judge/verify.py directly without the runner. |
OPENAI_API_KEY | (none) | Preferred API key for LLM calls — OpenAI SDK reads it first |
OPENROUTER_API_KEY | (none) | Fallback key if OPENAI_API_KEY is unset; same wire format |
OPENAI_BASE_URL | https://openrouter.ai/api/v1 | Env-level base URL; overridden by llm.base_url in the config |
<llm.api_key_env> | (none) | Whichever name the config's llm.api_key_env points at — e.g. DEEPSEEK_API_KEY for direct DeepSeek routing |
Solver Sandbox (optional, MVP)
Contestant solver.py can be run inside a Docker container for host isolation. Mode is controlled by pipeline/config.json:
"sandbox": {
"mode": "none", // "none" (default) | "docker"
"image": "ee-solver:latest",
"memory_mb": 2048,
"cpus": 2,
"pids_limit": 64,
"tmpfs_size_mb": 64
}
With mode = "none" the memory_mb, cpus, pids_limit, and tmpfs_size_mb fields are inert — the solver runs in-process on the host and inherits the host's resources. They take effect only when mode = "docker", where they are mapped to --memory / --memory-swap / --cpus / --pids-limit / --tmpfs /tmp:size= on the docker run invocation below. The values shown are reference numbers that let the bundled reference solver and demos finish within budget on a modest box; like the rest of the config they will be refined from community-feedback once Docker-mode runs are common.
When mode = "docker" the solver is launched as:
docker run --rm -i --network=none --read-only \
--cap-drop=ALL --security-opt=no-new-privileges:true \
--memory=<memory_mb>m --memory-swap=<memory_mb>m \
--cpus=<cpus> --pids-limit=<pids_limit> \
--tmpfs /tmp:size=<tmpfs_size_mb>m \
-v <submission>:/solver:ro -e PYTHONUNBUFFERED=1 <image>
Hardening layers: no network, read-only root FS, all capabilities dropped, no-new-privileges, non-root solver user (from the image), --memory-swap pinned to --memory so swap can't double the effective limit, bounded CPU/pid/tmpfs, /solver mount read-only. The host docker CLI inherits the full host environment (so DOCKER_HOST / DOCKER_CONFIG / TLS vars reach the daemon); the container sees only the minimal env injected via explicit -e flags.
Building the image: bash scripts/setup.sh will build ee-solver:latest automatically when Docker is running (silently skipped otherwise).
Verifying the sandbox: python3 scripts/sandbox_smoke.py runs four checks (benign solver boots, network blocked, mounted dir read-only, container runs non-root with capability bitmap cleared). Exits 2 (skip) if the Docker daemon is unreachable; not part of the canonical harness yet.
The default remains "none" so existing setups work unchanged; opt in by flipping mode to "docker" after setup.sh succeeds.
Sandbox Python environment
The sandbox image is python:3.11-slim plus a small approved set of third-party packages (versions pinned in Dockerfile):
| Package | Version | Purpose |
|---|---|---|
sympy | 1.13.3 | Symbolic algebra — useful for term parsing, substitution, equation normalization. Magma reasoning is non-associative, so most of sympy's group/ring engine doesn't apply directly, but the parser, free-variable utilities, and pattern matcher are still helpful. |
The standard library is otherwise the only thing available — no numpy, z3, networkx, etc. Submitting a solver that imports an unlisted package will fail at runtime with ModuleNotFoundError. To request additions, open an issue referencing the use case (see CONTRIBUTING.md).
Testing & Harness
The canonical completion gate is python3 scripts/run_harness.py — deterministic, offline, non-interactive. Exit 0 means every suite below passed.
| Suite | Current count | Source of truth | Covers |
|---|---|---|---|
| Judge cases | 66 | tests/harness_manifest.json | Accepted / malformed / unparsed / incomplete_proof / incorrect on curated fixtures (incl. FALSE_CERT_TOO_LARGE) |
| Judge internals | 32 | run_judge_internal_cases in scripts/run_harness.py | Unit-level invariants on verify.py helpers (equation normalization, byte-length cap, path stripping, render template stability, JudgeConfig budget-field plumbing for the three judge caps) |
| Banned tokens | 24 | run_banned_token_cases in scripts/run_harness.py | Placeholder-detector word-boundary + substring matrix for every entry in BANNED_PROOF_TOKENS |
| Repeatability | 4 | repeatability_cases in the same manifest | Selected cases run 3× and must project byte-identical results |
| Pipeline regressions | 55 | Inlined in scripts/run_harness.py | Single-file PROMPT extraction (all bundled demos), stray prompt.txt is ignored, AST extractor hostile inputs (scope, type, first-wins, AnnAssign, NUL / invalid UTF-8), sandbox argv shape (none / docker / unknown), host-vs-container env selection, stderr drained into bounded ring buffer (so contestant tracebacks land in a solver_stderr log entry instead of being silently dropped, without re-introducing the kernel-pipe deadlock), 500 KB solver.py intake cap, single-file layout (helper / payload / subdir / symlink rejected), stdout line cap, wall-clock deadline clamping LLM + Lean timeouts, docker-cleanup-in-finally static check, doc-drift guard, public-allowlist demo count, _call_llm falls back to DeepSeek-style reasoning_content (streaming + non-streaming) when content is empty and surfaces truncated: True when finish_reason=length left no final answer |
| Verify branches | 3 | run_verify_branch_cases in scripts/run_harness.py | LEAN_TIMEOUT via mocked subprocess.run; FALSE_CERT_TOO_LARGE rejection respects JudgeConfig.max_false_cert_bytes (cap=10 KB rejects 15 KB; cap=20 KB admits the same payload) |
| Public challenger | 79 | tests/challenger_manifest.json :: public_attack_cases | Bypass attempts (banned placeholder / axiom / declaration smuggling, stdout injection) plus positive-control regressions for previously-false-negative proofs |
| Infra challenger | 4 | same manifest, infra_attack_cases | Organizer-side malformed problems must raise JudgeConfigurationError, never map to a contestant verdict |
Current repo baseline: 267 green checks across the suites above (the harness also runs submit-CLI and loader smoke tests, plus a README self-check; the JSON summary lists every passed_*_count field separately). The README self-check (run_readme_consistency_check) reads the live summary map after every suite has run and compares each cell here to the matching passed_*_count — so adding a regression auto-bumps the canonical numbers, and any drift here fails the gate. Any nonzero exit blocks completion — do not weaken a test to get green.
Reading the JSON summary the harness prints:
passed_case_count/case_count— judge suitepassed_pipeline_count/pipeline_count— proxy-layer testspassed_repeatability_count/repeatability_count— determinismchallenger.passed_public_attack_count/public_attack_count— challenger publicchallenger.passed_infra_attack_count/infra_attack_count— organizer infrafailing_*arrays are empty on green; populated with the offending case detail on failure
Adding a new regression (quickest path):
- Drop the fixture into
tests/fixtures/(ortests/challenger/for adversarial cases). - Append an entry to the matching manifest with
expected_statusandexpected_error_code. - Rerun
python3 scripts/run_harness.pyand confirm it picks up the new case.
Opt-in Docker sandbox check — not part of the canonical gate because it needs the Docker daemon:
python3 scripts/sandbox_smoke.py
Exits 0 when the sandbox image boots, blocks network, and blocks writes to the mounted solver dir; 2 when Docker is unreachable (treated as skip); 1 on any assertion failure.
Project Structure
.
├── README.md # This file (entry point + Pick Your Track)
├── docs/ # Track specs (read these before submitting)
│ ├── solo_mode.md # Solo track: I/O contract, budgets, scoring
│ └── marathon_mode.md # Marathon track: same, plus compression_ratio
│
├── judge/ # Deterministic Lean verifier (shared by both tracks)
│ ├── verify.py # Core verification logic
│ ├── challenger.py # Adversarial test runner
│ ├── JudgeMagma/Magma.lean # `◇` operator + Magma class
│ ├── JudgeDecide/DecideBang.lean # `decideFin!` / `decide!` tactics
│ ├── JudgeFinOp/MemoFinOp.lean # `finOpTable` helper for finite magmas
│ └── JudgeSupport/Inspect.lean # #judge_report dep-tracking metaprogram
│
├── pipeline/ # Evaluation orchestration
│ ├── proxy.py # Solo: launches solver, mediates stdin/stdout, fills prompts
│ ├── runner.py # Solo: batch evaluation entry point
│ ├── config.json # Solo per-problem budgets + LLM parameters
│ ├── marathon_runner.py # Marathon: snapshot manifest, dual-budget watchdog
│ ├── marathon_proxy.py # Marathon: local HTTP proxy (key isolation + token meter)
│ ├── marathon_score.py # Marathon: last-write-wins parser + per-line verify_answer
│ └── marathon_llm.py # Marathon: solver-side LLM helper (call_llm)
│
├── examples/ # Demo submissions + sample problems
│ ├── problems/ # Sample sets + HF JSONL mirrors
│ │ ├── sample_20.json # 20 sample problems
│ │ ├── sample_200.json # 200 problems (100 true + 100 false)
│ │ └── (normal|hard1|hard2|hard3).jsonl # HF SAIR sets
│ ├── solo/ # Solo track: 3 reference demos + tutorial
│ │ ├── TUTORIAL.md
│ │ └── demos/
│ │ ├── baseline/ # Brute-force + singleton + LLM fallback (start here)
│ │ ├── twophase/ # gpt-oss-120b + two-phase strategy
│ │ └── opnorm/ # gpt-oss-120b + opnorm flagship reference solver
│ └── marathon/ # Marathon track: 3 reference demos + tutorial
│ ├── TUTORIAL.md
│ └── demos/
│ ├── baseline/ # Sequential brute-force, no LLM (start here, zero token cost)
│ ├── triage/ # Difficulty-sorted Pass B + Pass C deeper-thought retry on Pass-B no-shows
│ └── fewshot/ # In-run lemma cache + few-shot transfer (Marathon-only strategy)
│ # Each demo is a single solver.py
│
├── tests/ # Test data
│ ├── harness_manifest.json # Solo harness cases
│ ├── challenger_manifest.json # Solo adversarial cases
│ ├── fixtures/ # Solo fixtures
│ ├── marathon_manifest.json # Marathon harness cases
│ └── marathon_fixtures/ # Marathon fixtures (manifests + fixture solvers)
│
├── scripts/
│ ├── setup.sh # One-command environment setup
│ ├── run_harness.py # Solo harness — canonical green gate
│ ├── run_marathon.py # Marathon CLI entry (run + score)
│ ├── run_marathon_harness.py # Marathon harness — separate green gate
│ └── submit.py # Interactive CLI runner (colorized; Solo)
│
├── lakefile.lean # Self-contained lake package (depends only on Mathlib)
├── lake-manifest.json # Pinned Mathlib revision
├── lean-toolchain # Pinned Lean toolchain version
└── .env.judge # (gitignored) generated environment config
Troubleshooting
"missing lean/lake binary"
-- source .env.judge to set the correct paths, or install elan and re-run setup.
Lean timeout on valid proofs
-- The pipeline already passes judge.lean_timeout_seconds = 300 from pipeline/config.json; that value is what runs during evaluation. If you're invoking judge/verify.py directly (outside the runner), it falls back to a 120 s default — export LEAN_TIMEOUT_SECONDS=300 matches the pipeline. To raise the cap globally, edit pipeline/config.json.
"lake env failed"
-- Mathlib isn't built in this working tree. Run lake update && lake exe cache get && lake build JudgeMagma.Magma JudgeDecide.DecideBang JudgeFinOp.MemoFinOp JudgeSupport.Inspect, or re-run bash scripts/setup.sh.
"JudgeProblem does not have expected universe" / universe inference errors in the judge output
-- Your submission's type uses Type _ in a position where Lean can't infer the universe at elaboration. The judge's Goal is pinned to concrete Type (= Type 0); use Type in any explicit type annotations that must unify with Goal. See the Universe note under Answer Format for details.
LLM call returns an empty response with reasoning populated
-- The model exhausted its token budget mid-chain-of-thought. The proxy will fall back to message.reasoning automatically. The default reasoning_effort in pipeline/config.json is already medium; if you keep hitting this, drop it to low or minimal, or trim your PROMPT so the model has room to emit a structured answer after reasoning.
"OPENAI_API_KEY or OPENROUTER_API_KEY not set"
-- Set either one in the environment (they're interchangeable at the wire level). Persist to .env if you want it across shells.
License
Licensed under the Apache License, Version 2.0. See LICENSE for the full text.
Contributing
See CONTRIBUTING.md — issue-first policy, bug-report required fields, and trivial-fix exceptions are documented there.