Verina: Benchmarking Verifiable Code Generation

Prerequisites

Setup

uv sync              # Install Python dependencies into the project environment
lake exe cache get   # Fetch prebuilt Lean/Mathlib build artifacts
lake update          # Update Lean package dependencies

API Key Configuration

See .env.example for the required environment variables.
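
As a minimal sketch, a .env file might look like the example below; the exact variable names are the ones listed in .env.example, and OPENAI_API_KEY is only an assumption matching the openai provider used in the example configuration further down.

# Hypothetical sketch: copy .env.example and fill in the keys it actually lists
OPENAI_API_KEY=<your-openai-api-key>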

Running Benchmarks with Prefect

Starting Prefect Server

Before running the benchmarks, start the Prefect server:

docker compose up -d  # Start the Prefect database in the background
prefect server start

The server will be available at http://127.0.0.1:4200 with the API at http://127.0.0.1:4200/api.
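
Optionally, instead of prefixing every command with PREFECT_API_URL as in the examples below, you can point the Prefect CLI and client at the local server once:

prefect config set PREFECT_API_URL="http://127.0.0.1:4200/api"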

Using Configuration Files

Benchmark runs are configured with TOML files in configs/. An annotated example:

output_dir = "output/gpt-4o-mini-5rnd"  # Directory where all generated output will be saved
max_workers = 128  # Number of parallel worker tasks to use for generation and evaluation
rounds = 5  # Total number of rounds to run each benchmark task
fewshot_example_names = ["verina_basic_15", "verina_basic_44"]  # Identifiers for example tasks used in few-shot prompting

# Tasks selected to run:
code_gen = false         # Code generation task
spec_gen = false         # Specification generation task
proof_gen = true         # Proof generation task
code_spec_gen = false    # Combined code and specification generation task
code_proof_gen = true    # Combined code and proof generation task
spec_proof_gen = true    # Combined specification and proof generation task
code_spec_proof_gen = true  # Combined code, specification and proof generation task

[gen_lm_config]  # Language model settings for generation; see utils/lm.py for details
provider = "openai"     # The language model API provider to use
model_name = "gpt-4o-mini"  # Specific model name for generation

[baseline_config]
name = "baseline" # Name of the baseline method ["baseline", "proof-refinement]
resume_from_checkpoint = true # whether to resume from the previous result file
refinements = 64 # Number of refinements to run for the proof refinement baseline

Running a Benchmark

To run a benchmark with Prefect:

PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/<config_name>.toml

It is faster to run the benchmarks with the generation and evaluation steps separated:

# For generation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/<config_name>.toml --no-eval

# For evaluation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/<config_name>.toml --no-gen -ew <evaluation_worker_num_override>

Running Quality Assurance

To run quality assurance with Prefect:

PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/quality_assurance.py -c configs/qa.toml

Development

Setup

# Install the git pre-commit hooks
pre-commit install
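
After installing the hooks, you can optionally run them once across the whole repository:

# Run every pre-commit hook on all files
pre-commit run --all-files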

Code Formatting

# Sort imports
ruff check --select I --fix

# Format code
ruff format