wireform-html
wireform-html implements the HTML5 parsing algorithm and the DOM-style tools
you need on top of it: CSS selector queries, a streaming rewriter, incremental
parsing, and typed ToHTML/FromHTML deriving. Use it when you scrape or
transform HTML in Haskell and want spec-correct tree construction (validated
against html5lib, 1,779 / 1,779 tests passing) plus fast byte-level rewriting
without building a full DOM.
Key features
| Capability | Module | Why it matters |
|---|---|---|
| HTML5 tree builder | HTML.Parse, HTML.DOM | Spec-compliant documents and fragments |
| CSS Selectors Level 4 | HTML.Selector, HTML.DOM | Query parsed trees with familiar selector syntax |
| Streaming rewriter | HTML.Rewriter | lol-html-style transforms in one pass, bounded memory |
| Incremental parsing | HTML.DOM | Feed HTML chunks as they arrive |
| Template Haskell deriving | HTML.Class, HTML.Derive | deriveHTML with wireform-derive annotations; Generic defaults for simple cases |
| C SIMD scanner | cbits/fast_html.c | Vectorized tag and text scanning on hot paths |
Basic usage
Parse a document and query with CSS selectors
Build a document once, then use selector strings the same way you would in browser DevTools. The parser also builds a pre-order element index so repeated queries stay fast.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import Data.ByteString (ByteString)
import HTML.DOM
  ( parseDocument
  , documentElement
  , querySelectorAll
  , getAttribute
  )

extractLinkHrefs :: ByteString -> [Text]
extractLinkHrefs html =
  let doc = parseDocument html
      root = documentElement doc
      links = querySelectorAll root "a[href]"
  in map (\node -> fromMaybe "" (getAttribute node "href")) links
```

For a single match, querySelector returns Maybe Node. Pre-parse selectors
with HTML.Selector.parseSelector when you run the same query many times.
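
One way to picture the pre-order element index mentioned above: number the elements in document order and keep a map from tag name to those positions, so a tag query becomes a lookup instead of a tree walk. The sketch below is a toy model under that assumption. `Node`, `buildIndex`, and `positionsOf` are hypothetical names, and wireform-html's real index may be organized differently.

```haskell
import qualified Data.Map.Strict as Map

-- Toy element tree: a tag name plus child elements (no text or attributes).
-- Hypothetical type, not the HTML.DOM Node.
data Node = Node
  { tag :: String
  , children :: [Node]
  }

-- Elements in document (pre-order) sequence: parent first, then children.
preorder :: Node -> [Node]
preorder n = n : concatMap preorder (children n)

-- Hypothetical index: tag name -> pre-order positions, ascending.
type Index = Map.Map String [Int]

-- Built once per document. The list is inserted back to front so that
-- each smaller position is prepended, leaving every entry ascending.
buildIndex :: Node -> Index
buildIndex root =
  Map.fromListWith (++)
    [ (tag n, [i]) | (i, n) <- reverse (zip [0 ..] (preorder root)) ]

-- A tag query is then a map lookup rather than a fresh traversal.
positionsOf :: String -> Index -> [Int]
positionsOf name = Map.findWithDefault [] name
```

The cost of the walk is paid once in `buildIndex`; every later `positionsOf` call reuses it, which is the general reason repeated queries on an indexed tree stay cheap.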
Streaming rewriter: mutate HTML in one pass
The rewriter fires callbacks when CSS selectors match and writes transformed output without materializing a tree. Memory scales with nesting depth and the number of registered selectors, not document size. This is the right tool for adding attributes, rewriting tags, or redacting text in large HTML files.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import HTML.Rewriter
  ( buildRewriter
  , onElement
  , setElemAttr
  , rewrite
  )
import HTML.Selector (parseSelector)

addNoopener :: ByteString -> IO ByteString
addNoopener input = do
  -- Partial patterns: acceptable here because the selector is a literal
  -- known to parse; handle the Left case for user-supplied selectors.
  let Right sel = parseSelector "a[target=_blank]"
  let Right rw = buildRewriter $ onElement sel $ \er ->
        setElemAttr er "rel" "noopener noreferrer"
  rewrite rw input
```

Register handlers with onText, onComment, and onDoctype for non-element
nodes. Element mutation helpers include setTagName, replaceElement,
prependToElement, and removeElement.
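
To make the memory claim concrete, here is a toy single-pass rewriter in plain Haskell. It is not the HTML.Rewriter implementation or API: `Token`, `Handler`, `rewriteTokens`, and `noopener` are hypothetical names, and the input is a pre-tokenized list rather than bytes. The point it illustrates is that the only state carried between tokens is the stack of open tag names.

```haskell
-- Toy model only: hypothetical types, not the HTML.Rewriter API.

type Attr = (String, String)

data Token
  = Open String [Attr]   -- start tag with attributes
  | Close String         -- end tag
  | Txt String           -- text run
  deriving (Eq, Show)

-- A handler pairs a match test with an edit for matching start tags.
-- The test sees the open ancestors (innermost first), the tag name,
-- and the tag's original attributes.
data Handler = Handler
  { matches :: [String] -> String -> [Attr] -> Bool
  , edit :: [Attr] -> [Attr]
  }

-- Set an attribute, replacing any existing value.
setAttr :: String -> String -> [Attr] -> [Attr]
setAttr key value attrs = (key, value) : filter ((/= key) . fst) attrs

-- One pass over the token stream. The only state is the stack of open
-- tag names, so memory follows nesting depth rather than input size.
-- Simplification: every Open is assumed to have a matching Close.
rewriteTokens :: [Handler] -> [Token] -> [Token]
rewriteTokens handlers = go []
  where
    go _ [] = []
    go stack (Open name attrs : rest) =
      Open name (foldl applyIf attrs handlers) : go (name : stack) rest
      where
        applyIf acc h
          | matches h stack name attrs = edit h acc
          | otherwise = acc
    go stack (Close name : rest) = Close name : go (drop 1 stack) rest
    go stack (tok : rest) = tok : go stack rest

-- Example handler: the toy equivalent of a[target=_blank].
noopener :: Handler
noopener = Handler
  { matches = \_ name attrs ->
      name == "a" && lookup "target" attrs == Just "_blank"
  , edit = setAttr "rel" "noopener noreferrer"
  }
```

Each output token is produced before the rest of the input is examined, so `rewriteTokens [noopener]` can run over a lazily produced stream. Every registered handler is tested on every start tag, which is why the number of selectors also shows up in the cost.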
Typed HTML fragments
When your output is structured data rendered as HTML, derive ToHTML with the
Template Haskell deriver and emit with encodeHTMLTyped.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DerivingStrategies #-}  -- needed for `deriving stock`
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE OverloadedStrings #-}
import GHC.Generics (Generic)
import Data.Text (Text)
import HTML.Class (ToHTML, encodeHTMLTyped)
import HTML.Derive (deriveHTML)

data Person = Person
  { name :: !Text
  , role :: !Text
  } deriving stock (Generic)

-- Generate the HTML instances for Person at compile time.
$(deriveHTML ''Person)

renderPerson :: Person -> Text
renderPerson = encodeHTMLTyped
```

For simple cases with no wire-format customization, Generic defaults also
work: derive Generic and declare two empty instances, instance ToHTML Person
and instance FromHTML Person.
Incremental parsing
For network streams or chunked file reads, create a Parser with newParser,
call feedParser for each chunk, and finish with finishParser. The tree
builder carries incomplete tag fragments across chunk boundaries.
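
To show what carrying an incomplete tag fragment across a chunk boundary involves, here is a toy model in plain Haskell. It is not wireform-html's parser or its API: `Scanner`, `feed`, `finish`, and `scanChunks` are hypothetical names, and the tokenizer only splits on `<` and `>`.

```haskell
-- Toy model only: hypothetical names, not the HTML.DOM incremental API.

data Token = Tag String | Text String
  deriving (Eq, Show)

-- The scanner state is just the unconsumed tail of earlier chunks.
newtype Scanner = Scanner String

newScanner :: Scanner
newScanner = Scanner ""

-- Feed one chunk: emit every token that is complete so far and keep
-- the incomplete remainder (for example "<a hr") for the next call.
feed :: Scanner -> String -> ([Token], Scanner)
feed (Scanner pending) chunk = go (pending ++ chunk)
  where
    go "" = ([], Scanner "")
    go input@('<' : rest) =
      case break (== '>') rest of
        (body, '>' : after) ->
          let (toks, s) = go after in (Tag body : toks, s)
        _ -> ([], Scanner input)   -- tag not closed yet: carry it over
    go input =
      case break (== '<') input of
        (txt, after@('<' : _)) ->
          let (toks, s) = go after in (Text txt : toks, s)
        _ -> ([], Scanner input)   -- text may continue in the next chunk

-- End of input: whatever is still pending is flushed as text.
finish :: Scanner -> [Token]
finish (Scanner "") = []
finish (Scanner pending) = [Text pending]

-- Run a list of chunks through the scanner, in order.
scanChunks :: [String] -> [Token]
scanChunks = go newScanner
  where
    go s [] = finish s
    go s (c : cs) = let (toks, s') = feed s c in toks ++ go s' cs
```

Splitting the input mid-tag, as in `scanChunks ["<a hr", "ef=\"x\">hi</", "a>"]`, gives the same tokens as `scanChunks ["<a href=\"x\">hi</a>"]`, because the unfinished `<a hr` and `</` pieces wait in the scanner until the closing `>` arrives.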
Performance
All numbers were measured on a 29 KB HTML document (100-item product catalog) with GHC 9.8.4 on Apple Silicon. Cross-language references are lol-html (Cloudflare’s Rust streaming rewriter) and lexbor (a C HTML engine).
Tokenizer and tree builder
| Operation | wireform-html | lol-html (Rust) | lexbor (C) |
|---|---|---|---|
| Tokenize (one-shot) | 1091 MB/s | 886 MiB/s | n/a |
| Tree build (one-shot) | 316 MB/s | 305 MB/s (70% target) | 192 MB/s |
| Tree build (incremental, 4 KB) | 145 MB/s | n/a | 195 MB/s |
wireform-html’s tokenizer runs faster than lol-html’s tag scanner. The one-shot tree builder is 1.6x faster than lexbor (C). Incremental parsing with 4 KB chunks trails lexbor, at 145 MB/s versus 195 MB/s.
Streaming rewriter
| Operation | wireform-html | lol-html (Rust) |
|---|---|---|
| Selector matching (5 handlers) | 205 MB/s | 181-228 MiB/s |
| Sparse mutation (1 match in ~600 tags) | 425 MB/s | 541 MiB/s |
| Full mutations (tag rename + attr + text) | 149 MB/s | 228 MiB/s |
The selector-matching path is competitive with lol-html. Sparse mutations (the common case for rewriters that touch a few elements in a large document) run at 425 MB/s. The full-mutation path is the one benchmark still behind lol-html’s dual-parser architecture.
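
The tables above quote wireform-html in MB/s and lol-html in MiB/s, so the figures are not directly comparable as printed. The sketch below normalizes them, assuming MB means 10^6 bytes and MiB means 2^20 bytes; the source does not say which convention its MB/s figures use, so read the results as approximate. `mibToMb`, `relativeToLolHtml`, and `ratios` are names introduced here for illustration.

```haskell
-- Assumption: MB = 10^6 bytes, MiB = 2^20 bytes. The source does not
-- state which convention its "MB/s" columns follow.

-- Convert a throughput in MiB/s to MB/s.
mibToMb :: Double -> Double
mibToMb mib = mib * 1048576 / 1000000

-- Ratio of a wireform-html figure (MB/s) to a lol-html figure (MiB/s).
-- Values above 1 mean wireform-html is faster.
relativeToLolHtml :: Double -> Double -> Double
relativeToLolHtml wireformMb lolMib = wireformMb / mibToMb lolMib

-- (label, wireform-html MB/s, lol-html MiB/s), copied from the tables.
rows :: [(String, Double, Double)]
rows =
  [ ("Tokenize (one-shot)", 1091, 886)
  , ("Sparse mutation", 425, 541)
  , ("Full mutations", 149, 228)
  ]

-- Normalized ratio for every row.
ratios :: [(String, Double)]
ratios = [ (label, relativeToLolHtml w l) | (label, w, l) <- rows ]
```

Under that assumption the tokenizer comes out at about 1.17x lol-html, sparse mutation at about 0.75x, and full mutations at about 0.62x, so the unit difference narrows the tokenizer lead and widens the mutation gaps slightly.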
CSS selectors and querySelectorAll
| Selector | wireform-html | lexbor (C) | JSDOM |
|---|---|---|---|
| div | 0.7 µs | 7.7 µs | 150 µs |
| div.item | 0.9 µs | 8.6 µs | 18 µs |
| div.item span.name | 2 µs | 13.4 µs | 34 µs |
| div:first-child | 1 µs | 7.9 µs | 26 µs |
| :nth-child(2n+1) | 2 µs | 20.9 µs | 33 µs |
| :not(.item) | 4 µs | 16.3 µs | 28 µs |
| [id] | 6 µs | 7.9 µs | 170 µs |
| div.catalog > div + div | 6 µs | 10.1 µs | 44 µs |
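By the table above, wireform-html’s indexed DOM is roughly 1.3-11x faster
than lexbor and 7-214x faster than JSDOM on these querySelectorAll workloads.
Against lexbor, the gap is widest on div and :nth-child(2n+1) (about 10-11x)
and narrowest on [id] and div.catalog > div + div (under 2x). On structural
pseudo-classes such as :nth-child, the index avoids full-tree traversal.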
Custom allocation harness, GHC 9.8.4, Apple Silicon. lexbor 3.0.0 numbers
from bench/native/html_bench.c on the same machine and input.
Notable modules
| Module | Role |
|---|---|
| HTML.Parse | HTML5 tokenizer and tree-construction algorithm |
| HTML.DOM | Zipper-based Node API, parseDocument, selector queries |
| HTML.Selector | CSS Selectors Level 4 parser and matcher |
| HTML.Rewriter | Single-pass streaming rewriter with selector callbacks |
| HTML.Class / HTML.Derive | ToHTML / FromHTML and annotation-driven TH |
| HTML.Encode | Serialize DOM nodes back to HTML bytes |
| HTML.TagId | Intern table for hot tag-name comparisons |