Lean 4 Regex
A regular expression engine written in Lean 4.
This library is based on the Rust regex crate and extended for compatibility with PCRE2 (see Syntax).
Main restrictions:
- focus on unicode data,
- simple api and few configurations,
- BoundedBacktracker as the single regex engine,
- no optimizations.
Installation
Add the following dependency to lakefile.toml:
[[require]]
name = "Regex"
git = "https://github.com/bergmannjg/regex"
Documentation
The main documentation is in Regex.lean
Verification
Some parts of the regex compilation and execution are verified.
- The compilation throws no exceptions.
- All transitions in the resulting NFA are valid (i.e. every transition maps to a valid state).
- Executing the NFA with the bounded backtracker always terminates.
- The matches of the regex are slices of the original string.
Example
Get captures of "∀ (n : Nat), 0 ≤ n" :
def Main : IO Unit := do
let re := regex% r"^\p{Math}\s*.(?<=\()([a-z])[^,]+,\s*(\p{Nd})\s*(\p{Math})\s*\1$"
let captures := Regex.captures "∀ (n : Nat), 0 ≤ n" re
IO.println s!"{captures}"
Output is
fullMatch: '∀ (n : Nat), 0 ≤ n', 0, 22
groups: #[(some ('n', 5, 6)), (some ('0', 15, 16)), (some ('≤', 17, 20))]
Api
- regex% : notation to build the regular expression at compile time
- captures : searches for the first match of the regular expression
Components of regular expression:
- \p{Math} : match all characters with the Math property
- (?<=\() : lookbehind of char '('
- (\p{Nd}) : capturing group of numeric characters
- \1 : backreference to first capturing group
Test
The library is tested with the testdata of the PCRE2 library and the testdata of the Rust Regex crate.
The tests are run with
lake test
License
This project is licensed under the MIT license.
The data in Regex/Unicode/data/ is licensed under the Unicode
License Agreement
(LICENSE-UNICODE).
The .toml files in testdata/rust are licensed under the
(LICENSE-RUST-REGEX).