wireform-html

wireform-html implements the HTML5 parsing algorithm and the DOM-style tools you need on top of it: CSS selector queries, a streaming rewriter, incremental parsing, and typed ToHTML/FromHTML deriving. Use it when you scrape or transform HTML in Haskell and want spec-correct tree construction (validated against html5lib, 1,779 / 1,779 tests passing) plus fast byte-level rewriting without building a full DOM.

| Capability | Module | Why it matters |
| --- | --- | --- |
| HTML5 tree builder | HTML.Parse, HTML.DOM | Spec-compliant documents and fragments |
| CSS Selectors Level 4 | HTML.Selector, HTML.DOM | Query parsed trees with familiar selector syntax |
| Streaming rewriter | HTML.Rewriter | lol-html-style transforms in one pass, bounded memory |
| Incremental parsing | HTML.DOM | Feed HTML chunks as they arrive |
| Template Haskell deriving | HTML.Class, HTML.Derive | deriveHTML with wireform-derive annotations; Generic defaults for simple cases |
| C SIMD scanner | cbits/fast_html.c | Vectorized tag and text scanning on hot paths |

Parse a document and query with CSS selectors

Build a document once, then use selector strings the same way you would in browser DevTools. The parser also builds a pre-order element index so repeated queries stay fast.

    {-# LANGUAGE OverloadedStrings #-}

    import Data.ByteString (ByteString)
    import Data.Maybe (fromMaybe)
    import Data.Text (Text)
    import HTML.DOM
      ( parseDocument
      , documentElement
      , querySelectorAll
      , getAttribute
      )

    -- Collect the href value of every <a> element that carries one.
    extractLinkHrefs :: ByteString -> [Text]
    extractLinkHrefs html =
      let doc = parseDocument html
          root = documentElement doc
          links = querySelectorAll root "a[href]"
       in map (\node -> fromMaybe "" (getAttribute node "href")) links

For a single match, querySelector returns Maybe Node. Pre-parse selectors with HTML.Selector.parseSelector when you run the same query many times.
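
One way to picture a pre-order element index is a flat list in document order where each entry records the size of its subtree, so the descendants of any element form one contiguous slice. The sketch below is a toy model under that assumption; `Elem`, `Entry`, and the functions are hypothetical names for illustration and are not wireform-html's internal representation.

```haskell
-- Toy element tree: a tag name plus child elements.
data Elem = Elem String [Elem]

-- One index entry: the tag name and the size of the subtree rooted here
-- (the element itself plus all of its descendants).
data Entry = Entry
  { entryTag :: String
  , subtreeSize :: Int
  } deriving (Eq, Show)

-- Flatten the tree in pre-order (document order).
preorderIndex :: Elem -> [Entry]
preorderIndex (Elem tag kids) =
  let kidEntries = concatMap preorderIndex kids
   in Entry tag (1 + length kidEntries) : kidEntries

-- The descendants of the element at position i are the contiguous
-- slice that follows it, so no tree walk is needed.
descendantsAt :: Int -> [Entry] -> [Entry]
descendantsAt i idx =
  case drop i idx of
    [] -> []
    (e : rest) -> take (subtreeSize e - 1) rest

-- Positions of descendants of element i that have the given tag name.
descendantsByTag :: String -> Int -> [Entry] -> [Int]
descendantsByTag tag i idx =
  [ pos
  | (pos, e) <- zip [i + 1 ..] (descendantsAt i idx)
  , entryTag e == tag
  ]

-- Small document: html > (head > title), (body > (div > a, span), a)
sample :: Elem
sample =
  Elem "html"
    [ Elem "head" [Elem "title" []]
    , Elem "body"
        [ Elem "div" [Elem "a" [], Elem "span" []]
        , Elem "a" []
        ]
    ]
```

With this layout, `descendantsByTag "a" 3 (preorderIndex sample)` scans only the four entries under `body` (position 3) and never touches `head`.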

Streaming rewriter: mutate HTML in one pass

The rewriter fires callbacks when CSS selectors match and writes transformed output without materializing a tree. Memory scales with nesting depth and the number of registered selectors, not document size. This is the right tool for adding attributes, rewriting tags, or redacting text in large HTML files.

    {-# LANGUAGE OverloadedStrings #-}

    import Data.ByteString (ByteString)
    import HTML.Rewriter
      ( buildRewriter
      , onElement
      , setElemAttr
      , rewrite
      )
    import HTML.Selector (parseSelector)

    -- Add rel="noopener noreferrer" to every link that opens a new tab.
    addNoopener :: ByteString -> IO ByteString
    addNoopener input = do
      -- Both patterns are partial: a Left from either call fails at runtime.
      let Right sel = parseSelector "a[target=_blank]"
      let Right rw = buildRewriter $
            onElement sel $ \er ->
              setElemAttr er "rel" "noopener noreferrer"
      rewrite rw input

Register handlers with onText, onComment, and onDoctype for non-element nodes. Element mutation helpers include setTagName, replaceElement, prependToElement, and removeElement.
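
To illustrate why a single-pass rewriter needs memory proportional to nesting depth rather than document size, here is a minimal sketch over a toy token stream. It is not how HTML.Rewriter is implemented; `Token`, `rewriteStream`, and the two handlers are hypothetical names, and the model ignores void elements and mismatched end tags.

```haskell
-- Toy token stream; a real tokenizer would produce these from bytes.
data Token
  = Open String [(String, String)] -- start tag with attributes
  | Close String                   -- end tag
  | Chars String                   -- text node
  deriving (Eq, Show)

-- Rewrite in one left-to-right pass. The only state kept is the stack of
-- currently open tag names (innermost first), so memory grows with
-- nesting depth, and output is produced lazily as input is consumed.
rewriteStream :: ([String] -> Token -> Token) -> [Token] -> [Token]
rewriteStream handler = go []
  where
    go _ [] = []
    go stack (tok : rest) =
      let stack' = case tok of
            Open tag _ -> tag : stack
            Close _ -> drop 1 stack
            Chars _ -> stack
       in handler stack tok : go stack' rest

-- Handler: set rel on links that open a new tab, replacing any old rel.
addNoopenerTok :: [String] -> Token -> Token
addNoopenerTok _ (Open "a" attrs)
  | lookup "target" attrs == Just "_blank" =
      Open "a" (("rel", "noopener noreferrer") : filter ((/= "rel") . fst) attrs)
addNoopenerTok _ tok = tok

-- Handler: redact text whose innermost open element is <code>.
redactCode :: [String] -> Token -> Token
redactCode ("code" : _) (Chars _) = Chars "[redacted]"
redactCode _ tok = tok
```

The handler receives the ancestor stack, which is enough to decide simple tag and descendant matches without ever holding more than the current path in memory.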

When your output is structured data rendered as HTML, derive ToHTML with the Template Haskell deriver and emit with encodeHTMLTyped.

    {-# LANGUAGE DeriveGeneric #-}
    {-# LANGUAGE DerivingStrategies #-} -- required for `deriving stock`
    {-# LANGUAGE OverloadedStrings #-}
    {-# LANGUAGE TemplateHaskell #-}

    import Data.Text (Text)
    import GHC.Generics (Generic)
    import HTML.Class (ToHTML, encodeHTMLTyped)
    import HTML.Derive (deriveHTML)

    data Person = Person
      { name :: !Text
      , role :: !Text
      } deriving stock (Generic)

    -- Derive ToHTML for Person with the Template Haskell deriver.
    $(deriveHTML ''Person)

    renderPerson :: Person -> Text
    renderPerson = encodeHTMLTyped

For simple cases with no wire-format customization, Generic defaults also work: add deriving Generic and write two empty instance declarations, instance ToHTML Person and instance FromHTML Person.
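
For readers unfamiliar with how an empty instance can work at all, the following sketch shows the general GHC mechanism (DefaultSignatures plus GHC.Generics) on a toy class. `ToFields` and `GFields` are hypothetical names chosen for illustration; they are not wireform-html's ToHTML machinery, and the toy only handles single-constructor records whose fields are all String.

```haskell
{-# LANGUAGE DefaultSignatures #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE TypeOperators #-}

import GHC.Generics

-- Toy class: list a record's fields as (name, value) pairs.
class ToFields a where
  toFields :: a -> [(String, String)]
  -- Default used when an instance body is left empty.
  default toFields :: (Generic a, GFields (Rep a)) => a -> [(String, String)]
  toFields = gFields . from

-- Worker class that walks the generic representation.
class GFields f where
  gFields :: f p -> [(String, String)]

-- Datatype and constructor metadata: look through the wrapper.
instance GFields f => GFields (D1 c f) where
  gFields (M1 x) = gFields x

instance GFields f => GFields (C1 c f) where
  gFields (M1 x) = gFields x

-- Product of fields: concatenate left and right.
instance (GFields f, GFields g) => GFields (f :*: g) where
  gFields (l :*: r) = gFields l ++ gFields r

-- A single String field: pair the selector name with the value.
instance Selector s => GFields (S1 s (K1 i String)) where
  gFields m@(M1 (K1 v)) = [(selName m, v)]

data Person = Person
  { name :: String
  , role :: String
  } deriving (Generic)

-- Empty instance: the Generic default supplies toFields.
instance ToFields Person
```

Here `toFields (Person "Ada" "admin")` evaluates to `[("name","Ada"),("role","admin")]` without any hand-written instance body.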

For network streams or chunked file reads, create a Parser with newParser, call feedParser for each chunk, and finish with finishParser. The tree builder carries incomplete tag fragments across chunk boundaries.
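
Carrying an incomplete tag across a chunk boundary can be sketched with a toy scanner that follows the same new/feed/finish shape. This is an illustration only: `Scanner`, `newScanner`, `feedScanner`, and `finishScanner` are hypothetical stand-ins, not the signatures of newParser, feedParser, or finishParser, and the toy ignores comments and `>` inside attribute values.

```haskell
-- Toy incremental state.
data Scanner = Scanner
  { pending :: String    -- unfinished tag text carried to the next chunk
  , tagsSeen :: [String] -- completed tags, most recent first
  }

newScanner :: Scanner
newScanner = Scanner "" []

-- Feed one chunk: prepend the leftover from the previous chunk, extract
-- every complete <...> tag, and keep any unterminated tag as pending.
feedScanner :: Scanner -> String -> Scanner
feedScanner (Scanner left seen) chunk = go seen (left ++ chunk)
  where
    go acc input =
      case dropWhile (/= '<') input of
        [] -> Scanner "" acc -- only text remained, nothing to carry
        tagStart ->
          case break (== '>') tagStart of
            (_, []) -> Scanner tagStart acc -- tag not closed yet: carry it
            (body, _ : rest) -> go (drop 1 body : acc) rest

-- Finish: return tags in document order. An unterminated tag still
-- pending at end of input is dropped in this toy.
finishScanner :: Scanner -> [String]
finishScanner = reverse . tagsSeen

-- Convenience wrapper: run a list of chunks through the scanner.
scanChunks :: [String] -> [String]
scanChunks = finishScanner . foldl feedScanner newScanner
```

Splitting the input mid-tag gives the same result as feeding it whole: `scanChunks ["<di", "v><a hre", "f=\"/x\">hi</a", "></div>"]` and `scanChunks ["<div><a href=\"/x\">hi</a></div>"]` both return `["div","a href=\"/x\"","/a","/div"]`.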

All numbers were measured on a 29 KB HTML document (a 100-item product catalog) with GHC 9.8.4 on Apple Silicon. The cross-language references are lol-html (Cloudflare’s Rust streaming rewriter) and lexbor (a C HTML parser and CSS selector engine). The tables mix MB/s and MiB/s; 1 MiB/s is about 1.049 MB/s.

| Operation | wireform-html | lol-html (Rust) | lexbor (C) |
| --- | --- | --- | --- |
| Tokenize (one-shot) | 1091 MB/s | 886 MiB/s | n/a |
| Tree build (one-shot) | 316 MB/s | 305 MB/s (70% target) | 192 MB/s |
| Tree build (incremental, 4 KB) | 145 MB/s | n/a | 195 MB/s |

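wireform-html’s tokenizer runs faster than lol-html’s tag scanner (1091 MB/s versus 886 MiB/s, about 929 MB/s). The one-shot tree builder is about 1.6x faster than lexbor (316 MB/s versus 192 MB/s). Incremental parsing with 4 KB chunks trails lexbor, at 145 MB/s versus 195 MB/s (about 74% of lexbor’s throughput).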
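
Because the table mixes MB/s (10^6 bytes per second) and MiB/s (2^20 bytes per second), it helps to normalize before comparing. The helper names below are illustrative only; the inputs are the figures from the table above.

```haskell
-- Convert MiB/s (2^20 bytes per second) to MB/s (10^6 bytes per second).
mibToMb :: Double -> Double
mibToMb x = x * 1048576 / 1000000

-- Throughput ratio of a over b, both in the same unit.
ratio :: Double -> Double -> Double
ratio a b = a / b

-- lol-html tokenizer: 886 MiB/s is about 929.04 MB/s.
lolHtmlTokenizeMb :: Double
lolHtmlTokenizeMb = mibToMb 886

-- wireform-html tokenizer versus lol-html, same unit: about 1.17x.
tokenizeRatio :: Double
tokenizeRatio = ratio 1091 lolHtmlTokenizeMb

-- One-shot tree build versus lexbor: 316 / 192, about 1.65x.
treeBuildRatio :: Double
treeBuildRatio = ratio 316 192

-- Incremental tree build versus lexbor: 145 / 195, about 0.74x.
incrementalRatio :: Double
incrementalRatio = ratio 145 195
```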

| Operation | wireform-html | lol-html (Rust) |
| --- | --- | --- |
| Selector matching (5 handlers) | 205 MB/s | 181-228 MiB/s |
| Sparse mutation (1 match in ~600 tags) | 425 MB/s | 541 MiB/s |
| Full mutations (tag rename + attr + text) | 149 MB/s | 228 MiB/s |

The selector-matching path is competitive with lol-html: 205 MB/s falls within lol-html’s 181-228 MiB/s range. Sparse mutations (the common case for rewriters that touch a few elements in a large document) run at 425 MB/s, behind lol-html’s 541 MiB/s. The full-mutation path shows the widest gap, at 149 MB/s against 228 MiB/s for lol-html’s dual-parser architecture.

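| Selector | wireform-html | lexbor (C) | JSDOM |
| --- | --- | --- | --- |
| div | 0.7 µs | 7.7 µs | 150 µs |
| div.item | 0.9 µs | 8.6 µs | 18 µs |
| div.item span.name | 2 µs | 13.4 µs | 34 µs |
| div:first-child | 1 µs | 7.9 µs | 26 µs |
| :nth-child(2n+1) | 2 µs | 20.9 µs | 33 µs |
| :not(.item) | 4 µs | 16.3 µs | 28 µs |
| [id] | 6 µs | 7.9 µs | 170 µs |
| div.catalog > div + div | 6 µs | 10.1 µs | 44 µs |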

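wireform-html’s indexed DOM is faster than both references on every querySelector workload in the table: roughly 1.3x to 11x faster than lexbor and 7x to 214x faster than JSDOM. Against lexbor, the gap is widest on div and :nth-child(2n+1) (about 10x to 11x), where the index avoids full-tree traversal, and narrowest on [id] and div.catalog > div + div (under 2x).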

Custom allocation harness, GHC 9.8.4, Apple Silicon. lexbor 3.0.0 numbers from bench/native/html_bench.c on the same machine and input.

| Module | Role |
| --- | --- |
| HTML.Parse | HTML5 tokenizer and tree-construction algorithm |
| HTML.DOM | Zipper-based Node API, parseDocument, selector queries |
| HTML.Selector | CSS Selectors Level 4 parser and matcher |
| HTML.Rewriter | Single-pass streaming rewriter with selector callbacks |
| HTML.Class / HTML.Derive | ToHTML / FromHTML and annotation-driven TH |
| HTML.Encode | Serialize DOM nodes back to HTML bytes |
| HTML.TagId | Intern table for hot tag-name comparisons |