wireform-parquet

wireform-parquet implements the Apache Parquet columnar file format. Parquet is the on-disk format behind most data warehouses and lakehouse table formats (Iceberg, Delta Lake, Hudi). Use this package when you need to read or write Parquet files directly in Haskell, with support for the encodings and compression codecs that real-world writers emit.

Key features

Full read and write via Parquet.HighLevel and lower-level page APIs
All major encodings: PLAIN, dictionary, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, and the RLE/bit-packed hybrid used for dictionary index pages
Compression codecs behind Cabal flags: Snappy, Zstd, LZ4, Gzip, and Brotli
Bloom filters and page indexes for sub-row-group predicate pruning
Modular column encryption (AES-GCM) per the Parquet Modular Encryption spec
Predicate pushdown over footer statistics, page indexes, and bloom filters
Nested columns (lists, maps, structs) via Parquet.Nested
Arrow bridge for typed record batches through Parquet.Arrow
Template Haskell deriver via Parquet.Derive
Interop-tested against pyarrow
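
To make the bloom-filter pruning mentioned in the feature list concrete, here is a minimal sketch of a split-block bloom filter probe as described in the Parquet format specification. It is not the Parquet.BloomFilter API: the `Block` type and the names `maskFor`, `blockIndex`, `insertHash`, and `mightContain` are hypothetical, the filter is modelled as a plain non-empty list of blocks, and the 64-bit hash is assumed to be computed elsewhere (Parquet uses XXH64 with seed 0).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word32, Word64)

-- Hypothetical model: one block is exactly eight 32-bit words (256 bits).
type Block = [Word32]

-- Salt constants from the Parquet split-block bloom filter specification.
salts :: [Word32]
salts =
  [ 0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d
  , 0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31 ]

-- One bit per word, chosen by the top five bits of (x * salt).
maskFor :: Word32 -> Block
maskFor x = [1 `shiftL` fromIntegral ((x * s) `shiftR` 27) | s <- salts]

-- The upper 32 bits of the hash select the block.
blockIndex :: Int -> Word64 -> Int
blockIndex numBlocks h =
  fromIntegral (((h `shiftR` 32) * fromIntegral numBlocks) `shiftR` 32)

-- Set the mask bits for a hash in its block (the filter must be non-empty).
insertHash :: Word64 -> [Block] -> [Block]
insertHash h blocks =
  [ if i == target then zipWith (.|.) b m else b
  | (i, b) <- zip [0 ..] blocks ]
  where
    target = blockIndex (length blocks) h
    m = maskFor (fromIntegral h)  -- the lower 32 bits drive the mask

-- False means "definitely absent"; True means "possibly present".
mightContain :: Word64 -> [Block] -> Bool
mightContain h blocks =
  and (zipWith (\w m -> w .&. m == m) block (maskFor (fromIntegral h)))
  where
    block = blocks !! blockIndex (length blocks) h
```

A reader can skip a row group for an equality predicate whenever `mightContain` returns False for the hash of the probed value; a True result only means the column chunk still has to be scanned.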

Basic usage

Most callers start with the high-level decode API for in-memory bytes, or openParquetReader for mmap-aware streaming over large files on disk:

import           Data.ByteString    (ByteString)  -- strict bytes assumed here
import qualified Data.Vector        as V
import qualified Parquet.HighLevel  as PH
import qualified Parquet.Read       as PR
import qualified Parquet.Types      as PT

readParquetBytes :: ByteString -> IO ()
readParquetBytes bytes =
  case PH.decodeParquet PH.defaultReadOptions bytes of
    Left err ->
      putStrLn err
    Right pf ->
      let fm = PR.pfFooter pf
      in putStrLn $
           "rows="
             ++ show (PT.fmNumRows fm)
             ++ " rowGroups="
             ++ show (V.length (PT.fmRowGroups fm))

readParquetFile :: FilePath -> IO ()
readParquetFile path = do
  result <- PR.openParquetReader path
  case result of
    Left err ->
      putStrLn err
    Right (pf, _rowGroupIter) ->
      let fm = PR.pfFooter pf
      in putStrLn $
           "rows="
             ++ show (PT.fmNumRows fm)
             ++ " rowGroups="
             ++ show (V.length (PT.fmRowGroups fm))

For writing, pass an Arrow-shaped schema and column batches to encodeParquet:

import           Data.ByteString   (ByteString)  -- strict bytes assumed here
import qualified Data.Vector       as V
import qualified Parquet.HighLevel as PH

writeParquet :: PH.Schema -> [V.Vector PH.ColumnData] -> ByteString
writeParquet schema rowGroups =
  PH.encodeParquet PH.defaultWriteOptions schema rowGroups

When you need projection or filter pushdown without loading every column, use Parquet.Predicate and Parquet.Aggregate together with the Arrow bridge (Parquet.Arrow) or the cross-format Wireform.Columnar facade.
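
As an illustration of what filter pushdown over footer statistics buys you, the sketch below prunes row groups using only per-column min/max values. Everything in it is hypothetical and deliberately independent of the package: `ColStats`, `Pred`, `mightMatch`, and `survivingRowGroups` are stand-ins rather than the Parquet.Predicate API, and only a single Int64 column is modelled.

```haskell
import Data.Int (Int64)

-- Hypothetical stand-in for the footer statistics of one Int64 column
-- in one row group; Nothing means the writer omitted the statistic.
data ColStats = ColStats
  { csMin :: Maybe Int64
  , csMax :: Maybe Int64
  } deriving (Show, Eq)

-- A tiny predicate language over that single column.
data Pred
  = ColEq Int64
  | ColLt Int64
  | ColGt Int64
  | And Pred Pred
  | Or Pred Pred
  deriving (Show, Eq)

-- True when the row group might hold a matching row and must be read.
-- Missing statistics are treated conservatively as "might match".
mightMatch :: ColStats -> Pred -> Bool
mightMatch st p = case p of
  ColEq v -> maybe True (<= v) (csMin st) && maybe True (>= v) (csMax st)
  ColLt v -> maybe True (< v) (csMin st)
  ColGt v -> maybe True (> v) (csMax st)
  And a b -> mightMatch st a && mightMatch st b
  Or a b  -> mightMatch st a || mightMatch st b

-- Indices of the row groups that survive pruning.
survivingRowGroups :: Pred -> [ColStats] -> [Int]
survivingRowGroups p stats =
  [i | (i, st) <- zip [0 ..] stats, mightMatch st p]
```

Statistics can only rule row groups out, never in, so a surviving row group still needs its rows filtered after decoding.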

Performance

XXH64 hash: C kernel vs pure Haskell

Input size	C kernel	pure Haskell	ratio (pure / C)
8 B	12.1 ns	10.3 ns	0.85x
64 B	18.1 ns	42.1 ns	2.37x
1 KiB	79.9 ns	457 ns	5.71x
64 KiB	4387 ns	28291 ns	6.45x

Last run 2026-06-27 11:45:59 UTC. ghc-9.8.4 on darwin-aarch64, criterion 1.6.5.

The C kernel pulls ahead of the pure Haskell fallback from 64 bytes up, and its lead grows with input size, reaching 6.45x at 64 KiB. At 8 bytes the pure path is slightly faster (10.3 ns vs 12.1 ns) because call overhead dominates at that size. The C path is the default when the FFI is available. These figures cover the hash kernel only; page-level decode throughput depends heavily on encoding (PLAIN, DELTA_BINARY_PACKED, RLE/bitpacked) and compression codec.

The chart and table above are regenerated by wireform-stats from wireform-parquet/bench-results/summary/parquet-xxh64-c-vs-pure.json, the same source the README chart is built from.
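
For readers curious what the pure Haskell side of this comparison has to compute, here is a compact, unoptimised sketch of XXH64 written from the public xxHash algorithm description. It is not the package's fallback implementation; the names `xxh64`, `readLE`, `roundAcc`, `mergeAcc`, and `avalanche` are hypothetical, and the code favours readability over the strictness and unboxing a benchmarked version would need.

```haskell
import Data.Bits (rotateL, shiftL, shiftR, xor, (.|.))
import qualified Data.ByteString as BS
import Data.List (foldl')
import Data.Word (Word64)

-- The five 64-bit primes from the xxHash specification.
p1, p2, p3, p4, p5 :: Word64
p1 = 0x9E3779B185EBCA87
p2 = 0xC2B2AE3D27D4EB4F
p3 = 0x165667B19E3779F9
p4 = 0x85EBCA77C2B2AE63
p5 = 0x27D4EB2F165667C5

-- Read n bytes starting at offset off as a little-endian integer.
readLE :: Int -> Int -> BS.ByteString -> Word64
readLE n off bs =
  foldl' (\acc i -> (acc `shiftL` 8) .|. fromIntegral (BS.index bs (off + i)))
         0 [n - 1, n - 2 .. 0]

roundAcc :: Word64 -> Word64 -> Word64
roundAcc acc lane = ((acc + lane * p2) `rotateL` 31) * p1

mergeAcc :: Word64 -> Word64 -> Word64
mergeAcc h v = ((h `xor` roundAcc 0 v) * p1) + p4

-- Final bit mixing.
avalanche :: Word64 -> Word64
avalanche h0 =
  let h1 = (h0 `xor` (h0 `shiftR` 33)) * p2
      h2 = (h1 `xor` (h1 `shiftR` 29)) * p3
  in h2 `xor` (h2 `shiftR` 32)

-- xxh64 seed input
xxh64 :: Word64 -> BS.ByteString -> Word64
xxh64 seed bs = avalanche (tailBytes stripesEnd (hInit + fromIntegral len))
  where
    len = BS.length bs
    stripesEnd = (len `div` 32) * 32
    -- Four accumulators each consume one 8-byte lane of a 32-byte stripe.
    step (a, b, c, d) off =
      ( roundAcc a (readLE 8 off bs)
      , roundAcc b (readLE 8 (off + 8) bs)
      , roundAcc c (readLE 8 (off + 16) bs)
      , roundAcc d (readLE 8 (off + 24) bs)
      )
    hInit
      | len >= 32 =
          let (a, b, c, d) =
                foldl' step (seed + p1 + p2, seed + p2, seed, seed - p1)
                       [0, 32 .. stripesEnd - 32]
              h = (a `rotateL` 1) + (b `rotateL` 7)
                    + (c `rotateL` 12) + (d `rotateL` 18)
          in foldl' mergeAcc h [a, b, c, d]
      | otherwise = seed + p5
    -- Fold in the remaining bytes: 8 at a time, then 4, then single bytes.
    tailBytes off h
      | len - off >= 8 =
          tailBytes (off + 8)
            (((h `xor` roundAcc 0 (readLE 8 off bs)) `rotateL` 27) * p1 + p4)
      | len - off >= 4 =
          tailBytes (off + 4)
            (((h `xor` (readLE 4 off bs * p1)) `rotateL` 23) * p2 + p3)
      | len - off >= 1 =
          tailBytes (off + 1)
            (((h `xor` (readLE 1 off bs * p5)) `rotateL` 11) * p1)
      | otherwise = h
```

With seed 0 and empty input this should reproduce the published reference value 0xEF46DB3751D8E999, which makes a convenient smoke test. Every step is 64-bit multiply, rotate, and xor on small lanes, so a loop like this leans heavily on the compiler to avoid boxing.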

Notable modules

Module	Purpose
`Parquet.HighLevel`	`encodeParquet`, `decodeParquet`, `WriteOptions`, `ReadOptions`
`Parquet.Read`	`loadParquetFilePath`, `openParquetReader`, column chunk decoders
`Parquet.Write`	Page encoders, row group assembly, `buildParquetFile`
`Parquet.Footer`	Thrift-encoded footer parse and emit
`Parquet.Page` / `Parquet.PageIndex`	Data page headers and per-page statistics
`Parquet.BloomFilter`	Split-block bloom filter decode
`Parquet.Encryption`	Column-level and footer encryption (PME, AES-GCM)
`Parquet.Predicate`	Statistics and bloom-filter predicate evaluation
`Parquet.Aggregate`	`count(*)`, `count(col)`, `min`, `max` from footer stats
`Parquet.Arrow`	Parquet columns to Arrow `ColumnArray` bridge
`Parquet.Derive`	Template Haskell deriver with `wireform-derive` annotations

Interop

The reader handles files produced by pyarrow, parquet-cpp, and arrow-rs, including dictionary-encoded strings, delta-packed integers, and BYTE_STREAM_SPLIT floats. Cross-language round-trip tests live in the package probe suite.
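
Of the encodings named above, BYTE_STREAM_SPLIT is the simplest to show in a few lines. The sketch below follows the layout in the Parquet encoding specification and is not taken from the package: `byteStreamSplit` and `byteStreamJoin` are hypothetical names, and the input is assumed to be the raw little-endian bytes of fixed-width values (4 bytes per FLOAT, 8 per DOUBLE). Trailing bytes that do not fill a whole value are dropped.

```haskell
import qualified Data.ByteString as BS

-- Encode: stream k collects byte k of every value, and the streams are
-- concatenated in order. Grouping the same byte position of neighbouring
-- floats together tends to help the compressor that runs afterwards.
byteStreamSplit :: Int -> BS.ByteString -> BS.ByteString
byteStreamSplit width raw =
  BS.pack
    [ BS.index raw (v * width + k)
    | k <- [0 .. width - 1]
    , v <- [0 .. n - 1]
    ]
  where
    n = BS.length raw `div` width

-- Decode: invert the transposition to recover the original value bytes.
byteStreamJoin :: Int -> BS.ByteString -> BS.ByteString
byteStreamJoin width enc =
  BS.pack
    [ BS.index enc (k * n + v)
    | v <- [0 .. n - 1]
    , k <- [0 .. width - 1]
    ]
  where
    n = BS.length enc `div` width
```

For two 4-byte values with bytes 1 to 4 and 5 to 8, the encoded stream is 1, 5, 2, 6, 3, 7, 4, 8, and `byteStreamJoin 4` restores the original order.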