[{"data":1,"prerenderedAt":4},["ShallowReactive",2],{"Q0HemLsX4S":3},"# Hesper\n\n**Write GPU programs in Lean 4. Prove them correct. Run on WebGPU.**\n\n> [!IMPORTANT]\n> **This is Alpha Software.**\n> The APIs, verification features, and compiler are under active development and subject to breaking changes. While core functionality works, this project is primarily for research and experimentation.\n\nHesper is a verified GPU programming framework that brings the power of formal verification to GPU computing. Write type-safe shaders, execute tensor operations, and build graphics applications—all in Lean 4.\n\n```lean\nimport Hesper.WGSL.DSL\n\n-- Type-safe shader expressions with compile-time verification\nlet x : Exp (.scalar .f32) := var \"x\"\nlet y : Exp (.scalar .f32) := var \"y\"\nlet result := sqrt (x * x + y * y)  -- Generates: sqrt(x * x + y * y)\n\n-- Cannot mix types (compile error!)\n-- let wrong := x + (var \"i\" : Exp (.scalar .i32))  ✗ Type error!\n```\n\n## BitNet b1.58 Inference: 125 TPS on M4 Max\n\nHesper includes a complete **BitNet b1.58 2B** inference engine running entirely on WebGPU, achieving **125 tokens/second** on Apple M4 Max:\n\n```\n$ lake exe bitnet-complete --stats\n> Hello, world!\nHello, world! 
I'm a 20-year-old college student...

Performance: 125.6 TPS (8.0 ms/token)
  Model: BitNet b1.58 2B (30 layers, 2560 dim, i2_s ternary weights)
```

**Key optimizations:**
- **Flash Attention**: fused score + online softmax + apply in 1 kernel (3 kernels → 1)
- Ternary weight kernel (i2_s): 2-bit packed weights, addition-only matmul
- Kernel fusion: fused gate+up+ReLU²×mul and fused KV cache write (150 fewer dispatches/token)
- Shared memory F16 matmul for LM head (128K vocab)
- PreparedDispatch graph capture: ~99% pipeline cache hit rate
- Command buffer batching: single GPU submit per token
- KV cache with grouped-query attention (20 heads, 5 KV heads)

**Also: 40 TPS on RTX 4070 Ti (Vulkan)**

See [bitnet.lean](https://github.com/Verilean/bitnet.lean) for the full inference pipeline.

### LoRA Finetuning (Alpaca-style Instruction Tuning)

Hesper supports LoRA finetuning with a **verified backward pass**:

```bash
# Train on Alpaca-format dataset
lake exe alpaca-finetune --model model.gguf --data alpaca_data.json --epochs 50 --rank 8

# Inference with LoRA adapter
lake exe bitnet-complete model.gguf "What is Hesper?" 60 --lora lora_weights.bin
```

**Training features:**
- Complete backward chain: 13/13 ops (attention 7 + FFN 6)
- Verified AD: each backward op numerically checked against CPU spec
- GPU ↔ CPU consistency: all backward kernels match CPU spec (error = 0.0)
- Type-safe backward chain: missing ops cause compile-time error
- AdamW optimizer with gradient clipping, LR scheduling (cosine + warmup)
- GPU-batched forward + backward (1 GPU submit per token)

### Verified Automatic Differentiation

Every backward operation is verified correct:

```bash
$ lake exe verified-ad
  PASS Softmax, RoPE, RMSNorm, ScaledDot, ReLU²×Mul  (numerical gradient check)
  PASS Chain rule composition: error = 0.0
  ✓ All AD verifications PASSED

$ lake exe gpu-vs-cpu-test
  ✓ SoftmaxBackward, RMSNormBackward, RoPEBackward, ReLU²×Mul  (GPU matches CPU spec)

$ lake exe chain-completeness
  ✓ Backward chain is COMPLETE (13/13 ops)
```
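The numerical gradient checks above compare each analytic backward op against a central-difference estimate of the same vector-Jacobian product. A minimal Python sketch of the technique (illustrative only, not Hesper's actual test harness), using softmax as the op:

```python
import math

def softmax(xs):
    # Numerically stable softmax (subtract the max before exponentiating)
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_vjp(xs, v):
    # Analytic backward: Jᵀv for the softmax Jacobian,
    # (Jᵀv)_i = y_i * (v_i - Σ_j v_j y_j)
    y = softmax(xs)
    dot = sum(yi * vi for yi, vi in zip(y, v))
    return [yi * (vi - dot) for yi, vi in zip(y, v)]

def numeric_vjp(f, xs, v, eps=1e-6):
    # Central differences: grad_i ≈ (v·f(x + eps·e_i) - v·f(x - eps·e_i)) / (2·eps)
    grad = []
    for i in range(len(xs)):
        hi = list(xs); hi[i] += eps
        lo = list(xs); lo[i] -= eps
        fh = sum(a * b for a, b in zip(f(hi), v))
        fl = sum(a * b for a, b in zip(f(lo), v))
        grad.append((fh - fl) / (2 * eps))
    return grad

xs, v = [0.1, -0.5, 2.0], [1.0, 0.0, -1.0]
analytic = softmax_vjp(xs, v)
numeric = numeric_vjp(softmax, xs, v)
err = max(abs(a - n) for a, n in zip(analytic, numeric))
assert err < 1e-6  # analytic backward agrees with the numerical estimate
```

The `PASS ... (numerical gradient check)` lines in the output correspond to exactly this kind of comparison, run per backward op.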
## Why Hesper?

Modern GPU programming lacks safety guarantees. Hesper provides:

- **Type Safety**: Shaders are type-checked at compile time, preventing type mismatches
- **Formal Verification**: Prove correctness properties about your GPU programs
- **Verified Training**: Backward ops numerically checked, GPU kernels match CPU specs
- **WebGPU Backend**: Cross-platform GPU access via Dawn (Metal, Vulkan, D3D12)
- **Lean Integration**: Use Lean's powerful theorem proving alongside GPU computation
- **Multi-GPU Support**: Select and coordinate across multiple GPU adapters

## Quick Start

### Prerequisites

- **Platform**: macOS (Metal), Linux (Vulkan), or Windows (D3D12/Vulkan)

### 🐳 Docker Environment (Recommended for Linux/CI)

For a reproducible build environment, especially on Linux, you can use the provided Docker image:

```bash
# Build the image
docker build -t hesper-ci .

# Run build and tests inside container
docker run -it hesper-ci lake test-all
```

### Installation

```bash
# Clone the repository
git clone https://github.com/Verilean/hesper.git
cd hesper

# Build native dependencies (this will take a while on first build)
lake script run buildNative

# Build and run a demo
lake build dsl-basics
./.lake/build/bin/dsl-basics
```

### Your First Hesper Program

Create `MyFirst.lean`:

```lean
import Hesper
import Hesper.WebGPU.Device

def main : IO Unit := do
  -- Initialize WebGPU
  Hesper.init

  -- Get a GPU device
  let device ← Hesper.WebGPU.getDevice

  IO.println "✓ GPU ready!"
```

Build and run:

```bash
lake build myfirst
./.lake/build/bin/myfirst
```

## Features

### 🚀 Portable SIMD CPU Backend (Google Highway)

Hardware-accelerated CPU operations powered by **Google Highway**, providing high-performance SIMD across x86, ARM, and RISC-V:
```lean
import Hesper.Simd
import Hesper.Float32
import Hesper.Float16

-- Float64 (8 bytes): Native Lean Float, NEON 2/op, AVX2 4/op
let a64 := FloatArray.mk #[1.0, 2.0, 3.0, 4.0]
let b64 := FloatArray.mk #[5.0, 6.0, 7.0, 8.0]
let c64 := Hesper.Simd.simdAdd a64 b64

-- Float32 (4 bytes): 2x memory savings, NEON 4/op, AVX2 8/op
let a32 := Float32.fromFloatArray a64
let b32 := Float32.fromFloatArray b64
let c32 := Float32.simdAdd a32 b32

-- Float16 (2 bytes): 4x memory savings, NEON 8/op, AVX2+F16C 8/op
-- Requires ARMv8.2-A FP16 or x86_64 F16C - returns error if unavailable
let hasFP16 ← Float16.hasHardwareSupport
if hasFP16 then
  let a16 ← Float16.fromFloatArray a64
  let b16 ← Float16.fromFloatArray b64
  let c16 ← Float16.simdAdd a16 b16
```

**Features:**
- **Google Highway Integration**: Portable SIMD implementation with runtime dispatch
- **Architecture Support**: NEON (ARM), AVX2/AVX-512 (x86), optional FP16 vector arithmetic
- **Multi-Precision**: Optimized paths for Float64, Float32, and Float16
- **OpenMP Support**: Optional multithreading for large tensor operations

**Zero-Conversion Architecture:**
All operations work directly on raw `ByteArray` with no automatic type conversions. Conversions are explicit only when needed.
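Keeping conversions explicit matters because narrowing is lossy. A small plain-Python illustration (using the standard `struct` module, not Hesper's API) of what a Float64 → Float16 round trip does to the data:

```python
import struct

def to_f16_bytes(xs):
    # Pack Float64 values as IEEE-754 half precision, 2 bytes each
    return struct.pack(f"<{len(xs)}e", *xs)

def from_f16_bytes(buf):
    # Unpack half-precision bytes back to Python floats (Float64)
    n = len(buf) // 2
    return list(struct.unpack(f"<{n}e", buf))

xs = [1.0, 2.0, 3.0, 0.1]
buf = to_f16_bytes(xs)        # 8 bytes instead of 32: the 4x memory savings
ys = from_f16_bytes(buf)

assert ys[:3] == [1.0, 2.0, 3.0]   # small integers survive exactly
assert ys[3] != 0.1                # 0.1 is rounded to the nearest half float
assert abs(ys[3] - 0.1) < 1e-3
```

The same trade-off applies to the `Float16.fromFloatArray` path above: explicit conversion makes the precision loss visible at the call site instead of hiding it behind an implicit coercion.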
### ⚡️ High-Level Parallel API

Inspired by `webgpu-dawn`, Hesper provides an easy-to-use API for data parallelism that handles all GPU boilerplate (buffers, shaders, synchronization) in a single call.

#### parallelFor

Quickly execute a WGSL shader over a `Float` array:

```lean
import Hesper.Compute

-- Multiply each element by 1000 on the GPU
let result ← parallelFor device shader inputData
```

#### Device.compute

Run a computation with multiple named buffers directly on the `Device`:

```lean
device.compute myKernel [("input", inputBuf), ("output", outputBuf)] config
```

### 🎯 Type-Safe Shader DSL

Write WGSL shaders with Lean's type system guaranteeing correctness:

```lean
import Hesper.WGSL.DSL

-- Expressions are typed and checked at compile time
let x : Exp (.scalar .f32) := var "x"
let y : Exp (.scalar .f32) := var "y"

-- Arithmetic operators work naturally
let distance := sqrt (x * x + y * y)

-- Built-in functions
let clamped := Exp.clamp x (lit 0.0) (lit 1.0)
let power := Exp.pow x (lit 2.0)

-- Generate WGSL code
IO.println distance.toWGSL  -- Output: sqrt((x * x) + (y * y))
```

### 🧩 Verified Composable Kernels (Operator Fusion)

Hesper's `VerifiedOpFusion` architecture allows you to compose multiple GPU operations into a single kernel pass while maintaining formal correctness:

```lean
-- Fuses MatMul and ReLU into a single GPU kernel
-- Correctness is proven by construction
let fusedOp := matmulKernel |> reluKernel
```

**Key Advantages:**
- **Zero-Copy Fusion**: Eliminate expensive memory roundtrips between kernels.
- **Formal Correctness**: Each fused kernel is verified against a high-level CPU specification (`spec_forward`).
- **Unified Interface**: Same code runs on GPU (via WGSL) or CPU (via Google Highway) for easy debugging.
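The fusion contract can be sketched with plain functions (a Python illustration of the idea only; Hesper fuses at the WGSL kernel level, and `matmul`/`relu` here are hypothetical CPU references): the fused composite must agree with running the two passes separately against the spec.

```python
def matmul(a, b):
    # Plain CPU reference (the "spec_forward" analogue) for a 2D matrix product
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(m):
    return [[max(0.0, x) for x in row] for row in m]

def fuse(f, g):
    # A fused op applies g while producing f's output, so the intermediate
    # matrix never needs to round-trip through memory as a separate pass
    return lambda *args: g(f(*args))

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, -1.0]]

fused = fuse(matmul, relu)
assert fused(a, b) == relu(matmul(a, b))  # fused result matches the two-pass spec
```

On the GPU this composition happens inside one kernel, which is what eliminates the roundtrip; the equality check above is the correctness property the fused kernel is verified against.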
### 📈 Unified Verified Automatic Differentiation

Hesper's unique architecture unifies **formal verification** with **automatic differentiation** via a shared **Differentiable** interface. This allows the AD engine to treat complex, verified GPU kernels as first-class primitives.

#### The Differentiable Interface

All operations in Hesper—from simple scalar addition to fused ResNet blocks—implement this common trait:

```lean
class Differentiable (I O : Type) where
  /-- Primal execution (forward pass) -/
  forward : I → O
  /-- Adjoint computation (backward pass): a matrix-free vector-Jacobian product (Jᵀv) -/
  backward : I → O → I
```

#### Why it Matters

- **Unified Logic**: Scalar-CPU logic and Tensor-GPU kernels share the same mathematical abstraction.
- **End-to-End Correctness**: By "lifting" `VerifiedOp` instances into the AD tape, Hesper ensures that backpropagation is as formally correct as the forward pass.
- **Zero-Copy Fusion**: The AD engine can calculate gradients across fused kernels (e.g., `MatMul |> ReLU`) without writing intermediate tensors to VRAM.

```lean
-- AD engine automatically dispatches to hand-optimized GPU kernels
let grad := diff (matmul |> relu |> crossEntropy) input
```

**Key Features:**
- **Hybrid AD**: Seamlessly switch between CPU-scalar AD and GPU-tensor AD.
- **Verified Primitives**: Every AD node is backed by a verified `spec_forward` and `spec_backward`.
- **High Performance**: Leverages hand-optimized WGSL and Google Highway SIMD.

### ⚙️ High-Level Optimizers

Train models using state-of-the-art optimizers that integrate with Hesper's verified tensors:

```lean
import Hesper.Optimizer.SGD

-- Configure SGD with momentum
let opt := SGDConfig.default
  |>.withLearningRate 0.01
  |>.withMomentum 0.9

-- Perform optimization step
let (newParams, newState) := opt.step params grads state
```
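The update rule behind a config like the one above can be written out as a short reference sketch (plain Python, not Hesper's implementation): the velocity accumulates gradients, and parameters step against it.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    # v <- momentum * v + g ;  p <- p - lr * v
    new_v = [momentum * v + g for v, g in zip(velocity, grads)]
    new_p = [p - lr * v for p, v in zip(params, new_v)]
    return new_p, new_v

params = [1.0, -2.0]
state = [0.0, 0.0]            # velocity starts at zero
grads = [0.5, -0.5]

params, state = sgd_momentum_step(params, grads, state)
assert state == [0.5, -0.5]   # first step: velocity equals the gradient
```

The `(newParams, newState)` pair returned by `opt.step` mirrors the `(new_p, new_v)` pair here: optimizer state is threaded through explicitly rather than mutated.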
### 🎮 Graphics & Windowing

Build interactive graphics applications with GLFW integration:

```lean
import Hesper.GLFW

def main : IO Unit := do
  Hesper.init

  withGLFW do
    let window ← createWindow 800 600 "Hesper Graphics"
    let device ← Hesper.WebGPU.getDevice
    let surface ← createSurface device window

    -- Render loop
    gameLoop window surface
```

### 🔌 Multi-GPU Support

Enumerate and select GPUs in multi-GPU systems:

```lean
import Hesper.WebGPU.Device

-- List all available GPUs
Hesper.WebGPU.listAdapters

-- Select specific GPU
let device0 ← getDeviceByIndex 0  -- First GPU
let device1 ← getDeviceByIndex 1  -- Second GPU

-- Get adapter information
let info ← getAdapterInfo 0
IO.println s!"GPU: {info.name} (Backend: {info.backendType})"
```

## Examples

### WebGPU Tetris

A full Tetris implementation using GLFW and WebGPU, demonstrating:
- Dynamic shader generation
- Real-time rendering
- Input handling
- Game state management

```bash
lake build tetris
./.lake/build/bin/tetris
```

**Controls**: A/D (move), S (drop), Space (rotate), ESC (exit)

### Matrix Multiplication

High-performance matrix multiplication with subgroup optimizations:

```bash
lake build matmul-demo
./.lake/build/bin/matmul-demo
```

Demonstrates:
- GPU buffer management
- Compute shader execution
- Performance profiling
- Result verification

### SIMD CPU Backend

Multi-precision SIMD operations with hardware acceleration:

```bash
# Run multi-precision test (Float64/Float32/Float16)
lake script run buildSimd
lake build multi-precision
./.lake/build/bin/multi-precision

# Run SIMD benchmarks
lake build simd-bench
./.lake/build/bin/simd-bench
```

Output:
```
Backend: NEON (ARM64) - F64: 2/op, F32: 4/op, FP16

─── Float64 (8 bytes/element) ───
Result: #[6.0, 8.0, 10.0, 12.0] ✓

─── Float32 (4 bytes/element) ───
Result: Float32[4]: [6.0, 8.0, 10.0, 12.0] ✓

─── Float16 (2 bytes/element) ───
FP16 hardware detected!
Result: Float16[4]: [6.0, 8.0, 10.0, 12.0] ✓
```
### Multi-GPU Demo

Enumerate GPUs and create devices from specific adapters:

```bash
lake build multigpu
./.lake/build/bin/multigpu
```

Output:
```
Found 2 GPU adapter(s):
  [0] NVIDIA GeForce RTX 3080 (Backend: Vulkan)
  [1] Intel UHD Graphics 630 (Backend: Vulkan)
✓ Device created from GPU 0
```

### Neural Network Training

Automatic differentiation and gradient descent on GPU:

```bash
lake build nn-gpu-demo
./.lake/build/bin/nn-gpu-demo
```

Features:
- Conv2D layers with verified gradients
- Backpropagation on GPU
- Real-time training visualization

## Building and Testing

### Building the Project

Hesper requires building both native C++ dependencies (Google Dawn) and Lean code.

**Step 1: Build Native Dependencies**

The first build will take 5-15 minutes as it compiles Google Dawn from source:

```bash
# Build the native WebGPU bridge (hesper_native library)
lake script run buildNative
```

This compiles:
- Google Dawn WebGPU implementation
- C++ FFI bridge (`native/bridge.cpp`)
- SIMD CPU backend (`c_src/simd_ops.cpp`)

**Step 2: Build Lean Code**

Once native dependencies are built, compile the Lean libraries and executables:

```bash
# Build the core library
lake build Hesper

# Or build a specific executable
lake build simple-write
```

**Clean Build** (if you encounter issues):

```bash
lake clean
lake script run buildNative
lake build
```

### Testing the Installation
#### 1. Simple GPU Test (Raw WGSL + DSL)

This test verifies that both raw WGSL shaders and DSL-generated shaders execute correctly on your GPU:

```bash
lake build simple-write
./.lake/build/bin/simple-write
```

**Expected output:**
```
╔══════════════════════════════════════╗
║   GPU Double Test (DSL + Raw)        ║
╚══════════════════════════════════════╝

📝 DSL-generated WGSL:
─────────────────────────────────────
@group(0) @binding(0) var<storage, read_write> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    if (idx < arrayLength(&input)) {
        output[idx] = input[idx] * 2.0;
    }
}

🚀 Initializing WebGPU...
  ✓ Created input buffer
  ✓ Wrote input: [1.0, 2.0, 3.0, 4.0]
  ✓ Created output buffer

  🔹 Test 1: Raw WGSL shader
  ✓ Raw WGSL executed

  🔹 Test 2: DSL-generated shader
  ✓ DSL shader executed

📊 Results:
  Input → Expected → Raw WGSL → DSL WGSL
  [0] 1.0 → 2.0 → 2.0 ✓ → 2.0 ✓
  [1] 2.0 → 4.0 → 4.0 ✓ → 4.0 ✓
  [2] 3.0 → 6.0 → 6.0 ✓ → 6.0 ✓
  [3] 4.0 → 8.0 → 8.0 ✓ → 8.0 ✓

✅ SUCCESS: Both shaders work correctly!
   - Raw WGSL shader: ✓
   - DSL-generated shader (ShaderM monad): ✓
   - Both produce identical correct results
```

This test validates:
- ✓ WebGPU initialization and GPU discovery
- ✓ Buffer creation and data transfer (CPU ↔ GPU)
- ✓ Raw WGSL shader compilation and execution
- ✓ DSL shader code generation (ShaderM monad → WGSL)
- ✓ DSL shader execution on GPU
- ✓ Correct data marshalling across the FFI boundary
#### 2. FFI Boundary Tests

Test data conversion across the Lean ↔ C++ FFI boundary:

```bash
lake build ffi-tests
./.lake/build/bin/ffi-tests
```

**Expected output:**
```
╔══════════════════════════════════════╗
║   FFI Boundary Tests                 ║
╚══════════════════════════════════════╝

Test 1: Lean writes data, C++ reads
  ✓ Lean wrote: [1.0, 2.0, 3.0, 4.0]
  ✓ C++ verified byte-level accuracy

Test 2: C++ writes data, Lean reads
  ✓ GPU wrote: [10.0, 20.0, 30.0, 40.0]
  ✓ Lean verified byte-level accuracy

Test 3: Round-trip (Lean → GPU → Lean)
  ✓ Input: [5.0, 10.0, 15.0, 20.0]
  ✓ Output: [10.0, 20.0, 30.0, 40.0]
  ✓ Data integrity preserved

✅ All FFI boundary tests passed!
```

This validates:
- Lean writes ByteArray → C++ reads correct bytes
- C++ writes bytes → Lean reads correct Float values
- Round-trip data integrity across the FFI boundary

#### 3. SIMD CPU Backend Test

Test multi-precision SIMD operations (Float64/Float32/Float16):

```bash
lake script run buildSimd
lake build multi-precision
./.lake/build/bin/multi-precision
```

**Expected output (on ARM64 with FP16 support):**
```
Backend: NEON (ARM64) - F64: 2/op, F32: 4/op, FP16

─── Float64 (8 bytes/element) ───
Result: #[6.0, 8.0, 10.0, 12.0] ✓

─── Float32 (4 bytes/element) ───
Result: Float32[4]: [6.0, 8.0, 10.0, 12.0] ✓

─── Float16 (2 bytes/element) ───
FP16 hardware detected!
Result: Float16[4]: [6.0, 8.0, 10.0, 12.0] ✓
```

### For Contributors: Testing Your Changes

When making changes to Hesper, run these tests to ensure you haven't broken anything:

#### 1. Core FFI Tests
```bash
# Test Lean ↔ C++ data conversion
lake build ffi-tests
./.lake/build/bin/ffi-tests
```

#### 2. GPU Shader Tests
```bash
# Test raw WGSL and DSL shader execution
lake build simple-write
./.lake/build/bin/simple-write
```
#### 3. SIMD Tests
```bash
# Rebuild SIMD library and run tests
lake script run buildSimd
lake build simd-test
./.lake/build/bin/simd-test
```

#### 4. Full Test Suite
```bash
# Run all tests
lake build test-all
./.lake/build/bin/test-all
```

### Troubleshooting

#### Issue: "Build failed: native library not found"
**Solution:** Rebuild the native library:
```bash
lake clean
lake script run buildNative
lake build
```

#### Issue: "No GPU adapters found"
**Solution:** Ensure you have proper GPU drivers:
- **macOS**: No action needed (Metal is built-in)
- **Linux**: Install Vulkan drivers (`vulkan-tools`, `mesa-vulkan-drivers`)
- **Windows**: Install latest GPU drivers with D3D12/Vulkan support

#### Issue: "SIMD library not found"
**Solution:** Build the SIMD backend:
```bash
lake script run buildSimd
```

#### Issue: "FP16 not supported"
**Solution:** This is expected on older hardware. Float16 requires:
- ARM64: ARMv8.2-A with FP16 extension (Apple M1+, AWS Graviton2+)
- x86_64: F16C extension (Intel Ivy Bridge+ / AMD Bulldozer+)

The library will gracefully fall back to Float32 operations.

#### Issue: Dawn build takes too long
**Solution:** Dawn's first build can take 10-15 minutes. Subsequent builds are incremental and much faster. To speed up:
```bash
# Use more CPU cores (adjust -j value)
lake script run buildNative -- -j 16
```

## How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                    Lean 4 Code                               │
│  • Type-safe shader DSL                                      │
│  • Tensor operations                                         │
│  • Formal proofs                                             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              WGSL Code Generation                            │
│  Exp (.scalar .f32) → WGSL shader source                    │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Lean FFI (C++ Bridge)                           │
│  • lean_hesper_* functions                                   │
│  • Resource management via Lean.External                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Google Dawn (WebGPU Native)                     │
│  • Metal (macOS)                                             │
│  • Vulkan (Linux/Windows)                                    │
│  • D3D12 (Windows)                                           │
└─────────────────────────────────────────────────────────────┘
```

### Architecture Layers

1. **DSL Layer**: Type-safe WGSL expression builder with dependent types
2. **Tensor Layer**: High-level operations (matmul, conv2d, pooling)
3. **Compute Layer**: Shader compilation, buffer management, execution
4. **WebGPU Layer**: FFI bindings to Dawn native implementation
5. **Backend Layer**: Platform-specific GPU drivers (Metal/Vulkan/D3D12)
## Project Structure

```
Hesper/
├── Hesper/
│   ├── WGSL/          # Type-safe shader DSL
│   │   ├── Types.lean      # WGSL type system
│   │   ├── Exp.lean        # Expression AST
│   │   └── DSL.lean        # User-facing DSL
│   ├── WebGPU/        # WebGPU bindings
│   │   ├── Device.lean     # GPU device management
│   │   ├── Buffer.lean     # GPU buffers
│   │   ├── Shader.lean     # Shader modules
│   │   ├── Pipeline.lean   # Compute/render pipelines
│   │   └── Errors.lean     # Comprehensive error handling
│   ├── Tensor/        # Tensor operations
│   │   └── MatMul.lean     # Matrix multiplication
│   ├── NN/            # Neural network layers
│   │   └── Conv.lean       # Convolution layers
│   ├── GLFW/          # Windowing and graphics
│   │   └── GLFW.lean       # GLFW bindings
│   ├── Simd.lean      # SIMD Float64 operations
│   ├── Float32.lean   # SIMD Float32 operations
│   ├── Float16.lean   # SIMD Float16 operations
│   └── Compute.lean   # High-level compute API
├── Examples/          # Example programs
│   ├── Tetris.lean         # Full game demo
│   ├── MultiGPU.lean       # Multi-GPU support
│   ├── DSLBasics.lean      # DSL tutorial
│   └── ...
├── native/            # C++ WebGPU bridge
│   ├── bridge.cpp          # FFI implementation
│   └── CMakeLists.txt      # Build configuration
├── c_src/             # SIMD CPU backend
│   └── simd_ops.cpp        # NEON/AVX2 implementations
├── Tests/             # Comprehensive test suite
│   ├── ErrorTests.lean     # Error handling tests
│   ├── ShaderTests.lean    # Shader monad tests
│   └── ...
└── lakefile.lean      # Lake build script
```

## Roadmap

**Current Status**: Early Development (Alpha)

- [x] **Multi-precision SIMD CPU backend (Google Highway)**
- [x] **Architecture detection (NEON/AVX2/F16C)**
- [x] **Comprehensive error handling with structured error types**
- [x] **Complete test suite (error handling, shader monad)**
- [x] **Docker-based CI environment**
- [x] **Verified Composable Kernels (VerifiedOpFusion)**
- [x] **BitNet b1.58 inference engine (125 TPS on M4 Max)**
- [x] **Kernel fusion: fused gate+up+ReLU²×mul, fused KV cache write**
- [x] **KV cache with grouped-query attention**
- [x] **PreparedDispatch graph capture (99%+ cache hit rate)**

In Progress:
- [ ] Comprehensive tensor operation library (GEMM, Conv3D)
- [ ] Gemma 3 / Transformer support
- [ ] Automatic differentiation on GPU kernels
- [ ] Formal proofs of kernel numerical stability
- [ ] Integration with Lean's tactic framework

## Contributing

Hesper is part of the **Verilean** organization's effort to bring verified computing to GPUs.

### How to Contribute

1. **Fork the repository** and create a feature branch
2. **Make your changes** following the existing code style
3. **Run the test suite** to ensure nothing broke:
   ```bash
   # Core FFI boundary tests
   lake build ffi-tests
   ./.lake/build/bin/ffi-tests

   # GPU shader tests (raw WGSL + DSL)
   lake build simple-write
   ./.lake/build/bin/simple-write

   # SIMD tests (if you modified SIMD code)
   lake script run buildSimd
   lake build simd-test
   ./.lake/build/bin/simd-test
   ```
4. **Add tests** for new features (see `Examples/Tests/` for examples)
5. **Submit a pull request** with a clear description of changes

### Testing Guidelines

- **FFI changes**: Always run `ffi-tests` to verify Lean ↔ C++ data marshalling
- **DSL changes**: Run `simple-write` to verify WGSL code generation
- **GPU operations**: Test with real GPU hardware, not just compilation
- **SIMD changes**: Test on both ARM64 (NEON) and x86_64 (AVX2) if possible
- **Cross-platform**: macOS (Metal), Linux (Vulkan), Windows (D3D12/Vulkan)

### Code Organization for Contributors

```
Hesper/
├── Hesper/               # Core library
│   ├── WGSL/            # Type-safe shader DSL
│   ├── WebGPU/          # WebGPU bindings (Device, Buffer, Shader, Pipeline)
│   ├── Compute.lean     # High-level compute API
│   ├── Simd.lean        # SIMD Float64 operations
│   ├── Float32.lean     # SIMD Float32 operations
│   └── Float16.lean     # SIMD Float16 operations
├── Examples/             # Example programs (organized by category)
│   ├── DSL/             # DSL feature demonstrations
│   ├── Compute/         # GPU compute examples
│   ├── MachineLearning/ # Neural network training
│   ├── Graphics/        # GLFW rendering demos
│   ├── SIMD/            # CPU SIMD benchmarks
│   ├── Tests/           # Integration tests
│   └── Utilities/       # Helper utilities
├── Tests/                # Unit tests
│   ├── FFIBoundaryTests.lean  # Lean ↔ C++ data conversion tests
│   └── FusionTest.lean        # Operator fusion tests
├── native/               # C++ WebGPU bridge
│   ├── bridge.cpp       # FFI implementation (lean_hesper_* functions)
│   └── CMakeLists.txt   # Dawn build configuration
├── c_src/                # SIMD CPU backend
│   └── simd_ops.cpp     # NEON/AVX2 implementations
└── lakefile.lean         # Lake build script
```

**Key files for contributors:**
- **`native/bridge.cpp`**: FFI boundary - all Lean ↔ C++ data conversion happens here
- **`Hesper/WGSL/Monad.lean`**: ShaderM monad for imperative shader construction
- **`Hesper/WGSL/Execute.lean`**: Compiles ShaderM → WGSL and executes on GPU
- **`Examples/Tests/SimpleWrite.lean`**: Reference test showing raw WGSL vs DSL execution
- **`Tests/FFIBoundaryTests.lean`**: Reference test for FFI data conversion

### Links

- **Report Issues**: [GitHub Issues](https://github.com/Verilean/hesper/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Verilean/hesper/discussions)
- **Sister Project**: [Sparkle HDL](https://github.com/Verilean/sparkle) - Verified hardware design in Lean 4

## Author

**Junji Hashimoto**

Twitter/X: [@junjihashimoto3](https://twitter.com/junjihashimoto3)

## License

Apache License 2.0 - see LICENSE file for details

## Acknowledgments

- **Google Dawn** for the WebGPU native implementation
- **Lean 4** for the foundation of verified programming
- **WebGPU Working Group** for the standard
- **gpu.cpp (Answer.AI)** for high-level C++ API wrapper inspiration

---

*Write GPU code that's not just fast—make it correct by construction.*