wireform-parquet

wireform-parquet implements the Apache Parquet columnar file format. Parquet is the on-disk format behind most data warehouses and lakehouse table formats (Iceberg, Delta Lake, Hudi). Use this package when you need to read or write Parquet files directly in Haskell, with support for the encodings and compression codecs that real-world writers emit.

  • Full read and write via Parquet.HighLevel and lower-level page APIs
  • All major encodings: PLAIN, dictionary, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, and hybrid RLE index pages
  • Compression codecs behind Cabal flags: Snappy, Zstd, LZ4, Gzip, and Brotli
  • Bloom filters and page indexes for sub-row-group predicate pruning
  • Modular column encryption (AES-GCM) per the Parquet Modular Encryption spec
  • Predicate pushdown over footer statistics, page indexes, and bloom filters
  • Nested columns (lists, maps, structs) via Parquet.Nested
  • Arrow bridge for typed record batches through Parquet.Arrow
  • Template Haskell deriver via Parquet.Derive
  • Interop-tested against pyarrow

Most callers start with the high-level decode API for in-memory bytes, or openParquetReader for mmap-aware streaming over large files on disk:

import Data.ByteString (ByteString)
import qualified Data.Vector as V
import qualified Parquet.HighLevel as PH
import qualified Parquet.Read as PR
import qualified Parquet.Types as PT

-- Decode a Parquet file already held in memory and print footer totals.
readParquetBytes :: ByteString -> IO ()
readParquetBytes bytes =
  case PH.decodeParquet PH.defaultReadOptions bytes of
    Left err ->
      putStrLn err
    Right pf ->
      let fm = PR.pfFooter pf
       in putStrLn $
            "rows="
              ++ show (PT.fmNumRows fm)
              ++ " rowGroups="
              ++ show (V.length (PT.fmRowGroups fm))

-- Open a file on disk and print the same totals. The second tuple
-- component (bound as _rowGroupIter) is not used in this example.
readParquetFile :: FilePath -> IO ()
readParquetFile path = do
  result <- PR.openParquetReader path
  case result of
    Left err ->
      putStrLn err
    Right (pf, _rowGroupIter) ->
      let fm = PR.pfFooter pf
       in putStrLn $
            "rows="
              ++ show (PT.fmNumRows fm)
              ++ " rowGroups="
              ++ show (V.length (PT.fmRowGroups fm))

For writing, pass an Arrow-shaped schema and column batches to encodeParquet:

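import Data.ByteString (ByteString)
import qualified Data.Vector as V
import qualified Parquet.HighLevel as PH

-- Encode one vector of column data per row group into Parquet bytes.
writeParquet :: PH.Schema -> [V.Vector PH.ColumnData] -> ByteString
writeParquet schema rowGroups =
  PH.encodeParquet PH.defaultWriteOptions schema rowGroups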

When you need projection or filter pushdown without loading every column, use the predicate and aggregate modules together with the Arrow bridge or the cross-format Wireform.Columnar facade.
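The idea behind statistics-based pushdown can be sketched independently of the library. The example below is a minimal illustration, not the Parquet.Predicate API: ColStats, Pred, mightMatch, and pruneRowGroups are hypothetical names, and the statistics are simplified to one Int64 column with a min and a max per row group. A row group is skipped only when its min/max range proves that no row can match.

```haskell
import Data.Int (Int64)

-- Hypothetical, simplified per-row-group statistics for one Int64 column.
-- These are illustrative types, not wireform-parquet's own.
data ColStats = ColStats
  { csMin :: Int64
  , csMax :: Int64
  } deriving (Show, Eq)

-- A tiny predicate language over that single column.
data Pred
  = PEq Int64
  | PLt Int64
  | PGt Int64
  | PAnd Pred Pred
  | POr Pred Pred
  deriving (Show)

-- | True when some row in the group might satisfy the predicate.
-- False means the whole row group can be skipped without reading it.
mightMatch :: ColStats -> Pred -> Bool
mightMatch s p = case p of
  PEq v    -> csMin s <= v && v <= csMax s
  PLt v    -> csMin s < v
  PGt v    -> csMax s > v
  PAnd a b -> mightMatch s a && mightMatch s b
  POr a b  -> mightMatch s a || mightMatch s b

-- | Indexes of the row groups that cannot be ruled out by statistics.
pruneRowGroups :: Pred -> [ColStats] -> [Int]
pruneRowGroups p stats =
  [ i | (i, s) <- zip [0 ..] stats, mightMatch s p ]
```

The check is conservative: a True result only means the row group has to be decoded and filtered row by row, which is why finer-grained structures such as page indexes and bloom filters can prune further.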

Figure: Parquet XXH64 hash, C kernel vs pure Haskell across input sizes (8 B, 64 B, 1 KiB, 64 KiB). Time in ns; lower is better.
| Input size | C kernel | pure Haskell | ratio (pure / C) |
|---|---|---|---|
| 8 B | 12.1 ns | 10.3 ns | 0.85x |
| 64 B | 18.1 ns | 43.0 ns | 2.37x |
| 1 KiB | 79.9 ns | 457 ns | 5.71x |
| 64 KiB | 4387 ns | 28291 ns | 6.45x |

Last run 2026-06-27 11:45:59 UTC. ghc-9.8.4 on darwin-aarch64, criterion 1.6.5.

The C kernel pulls ahead of the pure Haskell fallback from 64 bytes up, and the gap widens with input size: 2.37x at 64 B, 6.45x at 64 KiB. At 8 bytes the pure path wins slightly (0.85x) due to call overhead. The C path is the default when the FFI is available. These numbers cover hashing only; page-level decode throughput depends heavily on encoding (PLAIN, DELTA_BINARY_PACKED, RLE/bitpacked) and compression codec.

The chart and table above are regenerated by wireform-stats from wireform-parquet/bench-results/summary/parquet-xxh64-c-vs-pure.json, the same source the README chart is built from.
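For context on why this hash matters: in the Parquet format specification, XXH64 is the hash function used by split-block bloom filters. The sketch below shows how a 64-bit hash is turned into a block index and a per-block bit mask, following the specification's pseudocode and salt constants. It is not the Parquet.BloomFilter implementation; the list-based Block type and all function names are assumptions chosen for readability, and the filter is assumed to have at least one block.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word32, Word64)

-- Salt constants from the Parquet split-block bloom filter specification.
salts :: [Word32]
salts =
  [ 0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d
  , 0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31
  ]

-- | One 256-bit block as eight 32-bit words (illustrative representation).
type Block = [Word32]

-- | Mask for the low 32 bits of a hash: word i gets exactly one bit set,
-- selected by the top 5 bits of x * salt[i] (multiplication wraps mod 2^32).
blockMask :: Word32 -> Block
blockMask x = [ 1 `shiftL` fromIntegral ((x * s) `shiftR` 27) | s <- salts ]

-- | Choose a block from the high 32 bits of the hash.
blockIndex :: Int -> Word64 -> Int
blockIndex numBlocks h =
  fromIntegral (((h `shiftR` 32) * fromIntegral numBlocks) `shiftR` 32)

-- | An all-zero filter with n blocks.
emptyFilter :: Int -> [Block]
emptyFilter n = replicate n (replicate 8 0)

-- | Set the mask bits for hash h in its target block.
insertHash :: Word64 -> [Block] -> [Block]
insertHash h blocks =
  [ if i == target then zipWith (.|.) b mask else b
  | (i, b) <- zip [0 ..] blocks
  ]
  where
    target = blockIndex (length blocks) h
    mask   = blockMask (fromIntegral h)

-- | False means "definitely absent"; True means "possibly present".
mightContain :: Word64 -> [Block] -> Bool
mightContain h blocks =
  and (zipWith (\w m -> w .&. m == m) block mask)
  where
    block = blocks !! blockIndex (length blocks) h
    mask  = blockMask (fromIntegral h)
```

Every probe hashes the value first, so the cost of the hash on short inputs (the 8 B and 64 B rows) bounds the cost of a bloom filter check on small keys.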

| Module | Purpose |
|---|---|
| Parquet.HighLevel | encodeParquet, decodeParquet, WriteOptions, ReadOptions |
| Parquet.Read | loadParquetFilePath, openParquetReader, column chunk decoders |
| Parquet.Write | Page encoders, row group assembly, buildParquetFile |
| Parquet.Footer | Thrift-encoded footer parse and emit |
| Parquet.Page / Parquet.PageIndex | Data page headers and per-page statistics |
| Parquet.BloomFilter | Split-block bloom filter decode |
| Parquet.Encryption | Column-level and footer encryption (PME, AES-GCM) |
| Parquet.Predicate | Statistics and bloom-filter predicate evaluation |
| Parquet.Aggregate | count(*), count(col), min, max from footer stats |
| Parquet.Arrow | Parquet columns to Arrow ColumnArray bridge |
| Parquet.Derive | Template Haskell deriver with wireform-derive annotations |

The reader handles files produced by pyarrow, parquet-cpp, and arrow-rs, including dictionary-encoded strings, delta-packed integers, and BYTE_STREAM_SPLIT floats. Cross-language round-trip tests live in the package probe suite.
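As an illustration of one of the encodings named above, BYTE_STREAM_SPLIT rearranges fixed-width values so that byte 0 of every value comes first, then byte 1 of every value, and so on, which tends to compress better for floating-point data. The sketch below shows that transform and its inverse over raw bytes. It is not the package's decoder: byteStreamSplit and byteStreamJoin are hypothetical names, and the code assumes a positive width and an input length that is a multiple of it.

```haskell
import qualified Data.ByteString as BS

-- | Scatter a buffer of fixed-width values into `width` byte streams,
-- concatenated in order: all byte-0s, then all byte-1s, and so on.
-- Assumes width > 0 and a length that is a multiple of width.
byteStreamSplit :: Int -> BS.ByteString -> BS.ByteString
byteStreamSplit width input =
  BS.pack
    [ BS.index input (v * width + k)
    | k <- [0 .. width - 1]
    , v <- [0 .. n - 1]
    ]
  where
    n = BS.length input `div` width

-- | Inverse transform: gather byte k of value v from stream k and
-- interleave the streams back into contiguous values.
byteStreamJoin :: Int -> BS.ByteString -> BS.ByteString
byteStreamJoin width encoded =
  BS.pack
    [ BS.index encoded (k * n + v)
    | v <- [0 .. n - 1]
    , k <- [0 .. width - 1]
    ]
  where
    n = BS.length encoded `div` width
```

With a width of 4, two values with bytes 1 2 3 4 and 5 6 7 8 are stored as 1 5 2 6 3 7 4 8, and byteStreamJoin 4 restores the original order. A production decoder would work on unboxed buffers rather than building an intermediate list.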