VerifiedGPU: Graded-Effect Semantics for Race‑Free WebGPU Kernels (Lean 4)
VerifiedGPU is a proof-of-concept Lean 4 project that formalizes a lightweight kernel model close to the WebGPU Shading Language (WGSL), and uses a language quantale to reason compositionally about absence of data races and about barrier placement. WGSL is the shading language for the modern web graphics standard called WebGPU. This proof of concept demonstrates the approach with a verified Blelloch‑style scan (exclusive prefix sum) featuring workgroup barriers that are intended to be optimal, along with certified WGSL code emission. You can run the certified WGSL in the browser via the WebGPU harness at bkase.github.io/verified-gpu.
- Quantale of languages:
Lang α := Set (Word α)with concatenation and arbitrary joins - GPU “phase alphabet”:
Phase(reads/writes/guards/space) - Grades: words over
Phase(phase traces) with singleton denotation into the quantale - Per-phase obligations: pairwise-disjoint writes (
WritesDisjointPhase) and no cross-thread read-after-write within a phase (NoRAWIntraPhase) - Goal: Type-checking implies data-race freedom in workgroups; barriers adequate and, we hope, minimal
- WGSL backend:
WGSL.Codegenlowers the IR to WGSL, proves grade preservation, and produces a certified kernel viaCertifiedEmit_wgScan
Table of Contents
Why this exists
Kernels for graphics processing units are easy to get subtly wrong (missing barriers, aliasing writes, read‑after‑write hazards, and write‑after‑write hazards). Rather than ad‑hoc proofs, we:
- track who reads/writes what, and when (per phase),
- split phases at barriers,
- prove pairwise disjointness and intra‑phase read‑after‑write safety,
- lift into a quantale of phase traces for compositional laws,
- emit WGSL code matching the proven barrier structure.
What is the Blelloch scan? It is the classic work‑efficient parallel exclusive prefix‑sum that proceeds in two tree‑shaped passes over an array: an “upsweep” reduction to build partial sums, followed by a “downsweep” to produce exclusive prefixes; each level uses synchronization. It performs a total amount of work that scales linearly with the input size and has span that grows logarithmically with the input size. It is a standard primitive for graphics processing units. See the overview in GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
Getting started
Toolchain is pinned in lean-toolchain (currently Lean 4.25 RC). Build and run everything via lake (mathlib required).
# optional: reproducible shell
direnv allow # or: nix develop / devenv shell
# ensure toolchain
elan toolchain install $(cat lean-toolchain)
lake build
# default target only checks `LanguageQuantale`; build other libraries explicitly
lake build Effects # phase alphabet + grades + normalization
lake build Kernel # Syntax, Typing, Lemmas (e.g., affine facts)
lake build Tests # sample programs / grade synthesis smoke tests
lake build WGSL # WGSL AST + footprint + certified emitter
# run the executable kernel harness once proofs/build succeed
cargo test -p wgsl-harness --manifest-path wgsl-harness/Cargo.toml
Repository layout (high‑level):
LanguageQuantale.lean -- Words and languages; quantale instances
Effects.lean -- Phase alphabet for graphics processing units; grade operations, normalization, denotation; safety predicates
Kernel/ -- Core intermediate representation close to the WGSL, and the typing judgment
Syntax.lean -- Expressions, locations, statements (abstract syntax tree)
Typing.lean -- Graded typing rules and synthesizer (`gradeOf`)
Lemmas/ -- Reusable arithmetic and guard reasoning facts
Affine.lean -- Affine index lemmas for Blelloch‑style scans
Examples/ -- Drop-in proof patterns for specific kernels
Blelloch.lean -- Upsweep/downsweep and end-to-end scan grade proofs
Tests/ -- Quick grade synthesis checks
GradeEval.lean -- End-to-end sample touching every statement form
WGSL/ -- WGSL AST, footprint/erasure logic, and certified emitter
AST.lean -- Minimal WGSL syntax + pretty-printer
Footprint.lean -- WGSL-side phases/grades with erasure into Effects.Grade
Codegen.lean -- IR → WGSL lowering, grade preservation, CertifiedEmit_wgScan
Design in 90 seconds
-
A Phase summarizes the intra‑epoch memory footprint:
reads,writes(each a list ofAccesswithspace,base,idx : Aff2,guard). A phase is the unit of reasoning between barriers. -
A Grade is a word of phases (a phase trace). Concretely, it is a sequence that records, in order, the abstract memory footprint that each epoch performs, with explicit barrier cuts. “Phase trace” means we track only the “who, where, and when” of memory actions (memory spaces, bases, indices, and guards) — not the computed values — and we keep the sequence boundaries that matter for synchronization. Barriers appear as a special separator that phases never cross.
-
Denotation embeds a concrete grade as a singleton language
{word}in the quantaleLang Phase. Why denotation? It gives an extensional meaning we can equate modulo normalization: two grades are “the same behavior” if their denotations (after normalization that fuses adjacent non‑empty phases but never crosses barriers) coincide. This avoids baking rewrite rules into syntax and lets us use set‑theoretic reasoning for joins and refinement. -
The quantale gives us exactly the algebra we need for compositional effect reasoning:
- monoidal sequencing
⊗for;(concatenate traces), - arbitrary joins
⋁for branching/families (if,for, non‑det choice), - order
≤for refinement (“implementation refines specification”), - distributivity
x ⊗ ⋁S = ⋁{x ⊗ s | s ∈ S}so local reasoning composes through control‑flow. Using the language quantale (sets of words, concat, union) makes these properties immediate.
- monoidal sequencing
-
The concurrency rule checks per phase:
WritesDisjointPhase: distinct threads never write the same combination of memory space, base, and address
-
NoRAWIntraPhase: within a single phase, no read from one thread aliases a write from another thread
Typing rules
Our typing judgment has the shape Γ ⊢ s ▷ g: “In environment Γ, statement s synthesizes grade g.” See Kernel/Typing.lean for constructors like:
g_for_threads(parallel threads, data‑race‑freedom side‑conditions),g_if_guard(guard stamping),g_seq,g_bar_wg/g_bar_st,g_load,g_store, …
The synthesis function gradeOf : Stmt → Effects.Grade computes a conservative phase trace and is sound for the thread‑free fragment; the for_threads rule imposes the data‑race‑freedom side‑conditions via PhasesSafe g.
Rule reference (inference form)
Abbreviations:
ε= empty grade (Grade.eps)g ⊙ h= sequential composition (Grade.seq g h)B(g)= append a phase boundary (g ⊙ Grade.ofBarrier)R(loc)/W(loc)= singleton read/write footprints fromlocstamp(g,guard)= attachguardto all accesses ing
Basic constructs
Memory effects
Composition and barriers
Guards and families
Parallel threads (crux)
phases(g)isGrade.phases g(exposed in Lean for the side-condition)WritesDisjointPhase/NoRAWIntraPhasecome fromEffects.lean
How typing yields absence of data races
Typing is not just bookkeeping — it enforces absence of data races:
gradeOf s = gsummarizessas a phase trace with explicit barriers.- To type
.for_threads s, ruleg_for_threadsrequiresPhasesSafe g. PhasesSafe gpackages per‑phase obligations:WritesDisjointPhaseand the “no cross-thread read‑after‑write within a phase” predicateNoRAWIntraPhasefor every phase ing.- Barriers in
gsplit epochs; safety is required within phases, and barriers provide the happens‑before relation between phases. - Therefore, if
Γ ⊢ .for_threads s ▷ gholds, then runningsin parallel workgroup threads has no data races within each epoch and between epochs separated by barriers.
Theorem spotlight: certified WGSL Blelloch scan end-to-end
What the theorem says (now, for the emitted WGSL)
/-- Certified WGSL emission for the Blelloch scan. -/
theorem WGSL.CertifiedEmit_wgScan {Γ} (env : WGSL.Env) (offs : List Nat) :
let s := Kernel.Examples.WG.IR.wgScanStmt offs;
HasGrade Γ (.for_threads s) (WGSL.eraseGrade (WGSL.gradeOfGen env s)) ∧
WGSL.eraseGrade (WGSL.gradeOfGen env s) ≈ Kernel.Examples.WG.wgScanGrade offs
Plain language summary. For any offsets list offs (e.g. the usual Blelloch schedule) and any WGSL codegen environment env (names/bindings/sizes), let s := WG.IR.wgScanStmt offs. Then:
- The emitted WGSL program (obtained by lowering
swithgradeOfGenand erasing back to effects witheraseGrade) type-checks underfor_threadswith gradeeraseGrade (gradeOfGen env s). By ruleg_for_threads, this enforces the per-phase DRF side-conditions (pairwise-disjoint writes; no cross-thread RAW within a phase), so the kernel is race-free between barriers. - The erased WGSL grade matches the abstract specification
WG.wgScanGrade offsup to normalization (≈), i.e. after fusing adjacent non-empty phases without crossing explicit barriers.
Together, this is an implementation + codegen certificate: the WGSL you run is race-free with the intended barrier structure, and it agrees with the spec grade modulo explicit, semantics-preserving normalization.
Why “up to normalization”?
Grades are words of phases; we make normalization explicit so that proofs never smuggle rewrites into sequencing. Normalization (Effects.Grade.normalize) only
- fuses adjacent non‑empty phases,
- never crosses explicit barrier pairs
[⟨[],[]⟩, ⟨[],[]⟩](seenormalizeList_barrier_*lemmas).
Thus g ≈ h means {normalize g} = {normalize h} as languages. This gives the right extensionality notion for “same phase trace modulo trivial chunking”, while keeping barrier cuts hard.
Where the pieces live (code map)
-
Intermediate representation of the scan (concrete program & safety facts):
Kernel/Examples/Blelloch.lean→namespace WG.IR(hereIRis short for “intermediate representation”)wgScanStmt: builds the sequence of upsweeps, then downsweeps, each ending in a barrier.gradeOf_wgScanStmt: syntax-directed grade for the IR.hasGrade_forThreads_wgScanStmt_upToNorm: IR underfor_threads+ normalization equality to the spec.
-
Specification grade (abstract target):
Kernel/Examples/Blelloch.lean→namespace WGwgScanGrade:gradeUpsweeps offs ⋆ gradeDownsweeps offs.wgScanGrade_safe: all phases in the specification are proven safe for absence of data races.
-
Bridging IR ↔ Spec (normalization + safety):
gradeUpsweepIR_normalizes,gradeDownsweepIR_normalizes: per-stage normalization equalities.WG.IR.wgScanGradeIR_normalizes: IR grade normalizes to the spec.WG.IR.wgScanGradeIR_safe: IR phases are DRF-safe.
-
Arithmetic and affine lemmas discharging the obligations for absence of data races:
Kernel/Lemmas/Affine.leanupsweep_index_distinct,downsweep_index_distinct_both, …- Guard reasoning:
upsweep_guard_mixed_targets_ne, and related lemmas.
-
WGSL backend (lowering + grade preservation + certificate):
WGSL/Codegen.leangradeOfGenanderaseGrade: WGSL-side footprint/erasure.emit_grade_eq_IRandcg_preserves_grade: erasure preservesKernel.gradeOffor the IR.CertifiedEmit_wgScan: this theorem — composes the IR certificate with codegen to certify the emitted WGSL.
-
Concurrency rule and the data-race-freedom check (shared):
Kernel/Typing.leanHasGrade.g_for_threads: requiresPhasesSafe g.PhasesSafe.seq,PhasesSafe.barrier: compositional safety facts.
Proof sketch and how the parts compose
Pipeline (conceptual, end-to-end):
(concrete intermediate representation) (abstract specification)
wgScanStmt offs ─────→ wgScanGrade offs
│ ▲
│ gradeOf (syntax‑directed) │ normalization equality
▼ │ (no barrier crossing)
gradeOf (wgScanStmt offs) ──── normalize ──────┘
and, independently:
HasGrade Γ (wgScanStmt offs) (gradeOf …) ∧ PhasesSafe (gradeOf …)
└── by `g_for_threads` plus lemmas establishing absence of data races (affine reasoning per phase)
- IR certificate: Prove
.for_threads (wgScanStmt offs)with synthesized grade and show that grade normalizes to the spec (hasGrade_forThreads_wgScanStmt_upToNorm), using the affine lemmas plus the barrier-cut normalizer. - Codegen preservation: Show that WGSL lowering/erasure preserves the IR grade (
emit_grade_eq_IR,cg_preserves_grade). - Compose: Conclude
CertifiedEmit_wgScan: the emitted WGSL has theHasGradejudgment and matches the spec up to normalization.
What this buys us. The artifact you run (WGSL) is:
- Race-free per phase under
for_threads(DRF side-conditions hold). - Barrier-faithful (barriers are explicit and never crossed by normalization).
- Behaviorally equal to the abstract scan grade up to normalization.
How to reproduce / inspect in Lean
A few helpful commands while browsing:
-- IR-level certificate (used internally by the WGSL theorem):
#check Kernel.Examples.WG.IR.hasGrade_forThreads_wgScanStmt_upToNorm
-- End-to-end WGSL certificate:
#check WGSL.CertifiedEmit_wgScan
#check WGSL.emit_grade_eq_IR
#eval WGSL.PP.emit (WGSL.buildModule WGSL.emitEnv (Kernel.Examples.WG.IR.wgScanStmt [1,2,4,8]))
To build everything:
lake build Kernel
Current status
- Normalizer fixed points (lists): Abstract upsweep and downsweep chains are fixed points of
normalizeList. SeegradeUpsweeps_normal_listandgradeDownsweeps_normal_list. - Barrier‑cut lemmas:
normalizeList_barrier*ensures runs of non‑empty phases fuse but never cross explicit barrier pairs. - Equivalence between specification and intermediate representation: The grade of the full scan in the IR normalizes to the spec grade (
wgScanGradeIR_normalizes,gradeOf_wgScanStmt_normalizes). - Concrete IR performs the real math: Upsweep/downsweep bodies now issue the same pair of loads/writes that the spec accounts for, so there are no "ghost" temporaries or synthetic adders hidden in codegen. The IR-side phases (
gradeUpsweepIR,gradeDownsweepIR) already contain the two reads plus one/two writes, and the proofs exploit new triple/quad fusers inEffects.lean. - Discharging the obligations for absence of data races: Per-phase
WritesDisjointPhaseand the cross-threadNoRAWIntraPhaseare proven using affine index facts (Kernel/Lemmas/Affine.lean), packaged as…_safelemmas. - WGSL lowering mirrors the IR directly:
WGSL.Codegenno longer relies on ad-hoc expanders such asexpandUpsweepLoadStore?orexpandDownsweepTwoStores?; the standard recursive translator is sufficient because the IR already expresses the load/add/store sequences. This keeps the emitter and the proof of grade preservation in lockstep with the certified IR. - New end-to-end result:
hasGrade_forThreads_wgScanStmt_upToNorm(inKernel/Examples/Blelloch.lean) establishes that the concrete Blelloch scan in the intermediate representation, when run underfor_threads, synthesizes a grade whose normalized denotation equals the specification grade, and theHasGradejudgment ensures absence of data races per phase. This is the core certification that the implementation is race-free with the intended barriers. - Certified WGSL emission:
WGSL.CodegenlowersKernel.Stmtprograms to a tiny WGSL subset, proves grade erasure preservesKernel.gradeOf, and packages the existing scan proof intoCertifiedEmit_wgScan. That theorem states the emitted WGSL forwgScanStmt offsis race-free (viaHasGrade) and its grade matches the spec up to normalization, giving a turnkey artifact for downstream tooling.
What’s next: WebGPU Shading Language emission and running on hardware
With the core absence‑of‑data‑races and normalization story in place, the remaining steps are mechanical and pave the path to execution:
-
Host scaffolding and execution: Provide a small runtime to upload buffers, dispatch the certified WGSL kernel (
WGSL.PP.emit (WGSL.buildModule …)), and validate results (for example, compare against a CPU scan). -
Pipeline integration / artifacts: Hook the emitter into the build so a concrete WGSL file (and maybe SPIR-V) is produced alongside the Lean proof, making it easy to feed into
wgpu/nagatooling. -
(Optional) Minimality: Prove that removing any barrier in the specification or the intermediate representation increases the normalized word or introduces a data race (barrier optimality).
References
- Atkey, Parameterised Notions of Computation (graded/parameterised monads)
- Quantales for effect semantics; languages over free monoids
- Blelloch, Prefix Sums and Their Applications (work‑efficient scans)
Appendix: Key definitions (as implemented)
-
Lang α := Set (Word α)withQuantaleinstance (concat distributes oversSup) -
Phase(alphabet):reads : List Access,writes : List Access -
Grade := Word Phaseseq : Grade → Grade → Grade,barrier : Grade → Grade,normalize : Grade → Grade,denote : Grade → Lang Phase- lemmas:
toList_normalize,normalizeList_stage_pair,normalizeList_barrier_*
-
Obligations:
WritesDisjointPhase : Phase → PropNoRAWIntraPhase : Phase → Prop(cross-thread RAW forbidden within a phase)
-
Concurrency rule:
HasGrade.g_for_threads : (∀ p ∈ phases g, …) → HasGrade Γ (.for_threads body) g
Pointers for code archeology
LanguageQuantale.lean: words, languages, quantale instancesEffects.lean:Phase,Access,Guard,Grade, normalization, safety predicatesKernel/Syntax.lean: abstract syntax tree for the intermediate representation (expressions, loads/stores, guards, barriers, loops)Kernel/Typing.lean: typing rules, synthesisgradeOf,PhasesSafeKernel/Lemmas/Affine.lean: arithmetic facts for disjointness under guardsKernel/Examples/Blelloch.lean: specification and intermediate representation of the Blelloch scan, normalization and absence‑of‑data‑races proofs (including the spotlight lemma) From there, the WGSL backend reuses the exact same inhabitant:CertifiedEmit_wgScan(inWGSL/Codegen.lean) states that the pretty-printed WGSL module produced fromwgScanStmt offscarries the erased grade of the IR proof and that grade still matches the spec up to normalization. In short, the code you can hand to a WebGPU runtime is the one that satisfies the theorem above.