LeanHuffmanCoding
A Lean 4 library that implements Huffman coding and formally proves the algorithm optimal in the same codebase. Originally a course project for IISc, Proofs and Programs 2025.
The defining feature: every runtime function is bridged to a proof-side
counterpart, so the executable code inherits the optimality theorem. It also
emits/consumes the classic Unix pack .z format.
[!Note] Work in progress. Interop against the system
pack/unpackbinary is not yet verified (see Status below).
Features
- Huffman tree construction from
(symbol, frequency)tables. - Formal proof of optimality of the Huffman algorithm (
HuffmanProofs), bridged to the runtime API (Huffman.leastEncodedData_optimal). - A reusable
Codecwith symbol↔code lookup, stream encode/decode, and explicitExcept Stringerrors. Two bit representations:List Bool(spec-faithful):encodeSymbols/decodeBits.- ByteArray (fast, for large data):
encodeBytesBA/decodeBytesBA.
- Convenience wrappers for
StringandByteArray. - Unix
pack.zarchive format (packBytes/unpackBytes,packFile/unpackFile): real magic0x1f 0x1e, canonical MSB-first codes, implicit EOF, and JPEG-style code-length limiting (≤25) so any input is valid.
Build
lake exe cache get # fetch the prebuilt Mathlib cache (first time)
lake build # runtime library + proofs + tests (CI default)
lake build Huffman # just the runtime library
Toolchain is pinned (lean-toolchain = v4.29.0, Mathlib v4.29.0).
Quick start (library)
import Huffman
open Huffman
def table : FrequencyTable Char := [('a', 5), ('b', 2), ('c', 1)]
def run : Except String (List Char) := do
let codec ← buildCodec table
let bits ← encodeSymbols codec ['a', 'b', 'a', 'c']
decodeBits codec bits
pack / unpack .z
import Huffman
open Huffman
#eval packFile "input.txt" "input.txt.z" -- → IO (Except String Unit)
#eval unpackFile "input.txt.z" "out.txt"
Or from the shell via the helper scripts:
lake env lean --run scripts/PackFile.lean input.txt input.txt.z
lake env lean --run scripts/UnpackFile.lean input.txt.z out.txt
See PACK_FORMAT.md for the byte-level format and sources.
Using it as a dependency
In your lakefile.toml:
[[require]]
name = "huffman"
git = "https://github.com/<owner>/LeanHuffmanCoding"
rev = "main"
Depending on the Huffman library does not pull in the proof stack
(HuffmanProofs) or tests.
Status / future work
- ✅ Optimality proof, runtime bridge, ByteArray fast path,
pack.zformat, code-length limiting,#guardtest suite + CI gating (incl.sorrycheck). - ⏳ Verify interop with the system
pack/unpackbinary and add golden.zround-trip tests (blocked on installingpack). - ⏳ Trim remaining linter warnings.
Acknowledgements
Thanks to Professor Siddhartha Gadgil, IISc for guidance throughout the project.