wireform-orc

wireform-orc implements the Apache ORC columnar file format. ORC is the storage format behind Hive and many Hadoop-era data lakes, optimized for large sequential scans with lightweight indexes per stripe. Use this package when you need to read ORC files in Haskell, evaluate predicates from stripe statistics, or bridge ORC columns into Arrow record batches.

  • Stripe-level read with lazy footer and stripe footer parsing
  • RLE v1 and v2 decoders for integer, boolean, and string columns
  • Compression (Snappy, Zstd, LZ4) with configurable block sizes
  • Bloom filters for equality predicate pruning at the stripe level
  • Statistics-based pushdown via ORC.Statistics
  • Stripe encryption for encrypted ORC files
  • Row indexes for sub-stripe seek within a column
  • Arrow bridge through ORC.Arrow and the Wireform.Columnar facade
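To make the RLE bullet concrete, here is a minimal sketch of ORC's integer RLE v1 layout for unsigned values, written from the public ORC specification rather than from this package's source. The names `readVarint`, `readLiterals`, and `decodeRleV1` are hypothetical and are not part of the wireform-orc API; the real decoders presumably work on strict byte buffers, not lists.

```haskell
import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Int (Int8)
import Data.Word (Word8)

-- | Read one base-128 varint (7 payload bits per byte, low group first).
-- Returns the value and the unconsumed bytes.
readVarint :: [Word8] -> Either String (Integer, [Word8])
readVarint = go 0 0
  where
    go :: Int -> Integer -> [Word8] -> Either String (Integer, [Word8])
    go _ _ [] = Left "truncated varint"
    go shift acc (b : rest)
      | testBit b 7 = go (shift + 7) acc' rest -- continuation bit set
      | otherwise = Right (acc', rest)
      where
        acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)

-- | Read exactly @k@ varints.
readLiterals :: Int -> [Word8] -> Either String ([Integer], [Word8])
readLiterals 0 bs = Right ([], bs)
readLiterals k bs = do
  (v, bs1) <- readVarint bs
  (vs, bs2) <- readLiterals (k - 1) bs1
  Right (v : vs, bs2)

-- | Decode an unsigned integer RLE v1 stream (hypothetical helper).
-- Header 0x00..0x7f: a run of (header + 3) values, followed by a signed
-- delta byte and a varint base. Header 0x80..0xff: (256 - header)
-- literal varints. Signed columns additionally zigzag-encode each
-- varint, which this sketch does not handle.
decodeRleV1 :: [Word8] -> Either String [Integer]
decodeRleV1 [] = Right []
decodeRleV1 (h : rest)
  | h < 0x80 = do
      (deltaByte, rest1) <- case rest of
        [] -> Left "truncated run: missing delta byte"
        (d : r) -> Right (d, r)
      (base, rest2) <- readVarint rest1
      let n = fromIntegral h + 3 :: Int
          -- Reinterpret the byte as a signed 8-bit delta.
          delta = fromIntegral (fromIntegral deltaByte :: Int8) :: Integer
      (take n (iterate (+ delta) base) ++) <$> decodeRleV1 rest2
  | otherwise = do
      let n = 256 - fromIntegral h :: Int
      (vals, rest1) <- readLiterals n rest
      (vals ++) <$> decodeRleV1 rest1
```

With this sketch, `decodeRleV1 [0x61, 0x00, 0x07]` yields one hundred 7s and `decodeRleV1 [0xfb, 0x02, 0x03, 0x06, 0x07, 0x0b]` yields `[2, 3, 6, 7, 11]`, matching the worked examples in the ORC specification.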

Load an ORC file from disk and inspect its stripe layout:

import qualified Data.Vector as V
import qualified ORC.Read as OR
import qualified ORC.Types as OT

inspectOrcFile :: FilePath -> IO ()
inspectOrcFile path = do
  result <- OR.loadORCFilePath path
  case result of
    Left err ->
      putStrLn err
    Right orcFile -> do
      let footer = OR.ofFooter orcFile
      putStrLn $
        "stripes="
          ++ show (V.length (OT.orcStripes footer))
          ++ " rows="
          ++ show (OT.orcNumberOfRows footer)

Decode a single integer column from the first stripe:

import Data.Int (Int64)

readIntColumn :: OR.ORCFile -> Either String (V.Vector (Maybe Int64))
readIntColumn orcFile =
  OR.readColumn orcFile 0 0 0

For filtered scans, build a predicate with Columnar.Predicate and pass it to ORC.streamStripesFilteredIter or the unified Wireform.Columnar.decodeFilteredIter entry point.
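The idea behind statistics-based filtering can be sketched independently of the package. The example below assumes each stripe carries min/max statistics for one integer column and keeps only the stripes that might contain a matching row. All types and functions here (`IntStats`, `Pred`, `mayMatch`, `keepStripes`) are hypothetical illustrations and are not the Columnar.Predicate or ORC.Statistics API.

```haskell
-- Hypothetical min/max statistics for one integer column in one stripe.
data IntStats = IntStats
  { statsMin :: Int
  , statsMax :: Int
  }

-- Hypothetical predicate language over that column.
data Pred
  = Equals Int
  | LessThan Int
  | GreaterThan Int
  | And Pred Pred
  | Or Pred Pred

-- | True when the stripe may contain a matching row. False only when
-- the statistics prove that no row can match, so pruning is always safe
-- but not always complete.
mayMatch :: IntStats -> Pred -> Bool
mayMatch s@(IntStats lo hi) p = case p of
  Equals v -> lo <= v && v <= hi
  LessThan v -> lo < v
  GreaterThan v -> hi > v
  And a b -> mayMatch s a && mayMatch s b
  Or a b -> mayMatch s a || mayMatch s b

-- | Indices of the stripes that still need to be decoded.
keepStripes :: Pred -> [(Int, IntStats)] -> [Int]
keepStripes p stripes = [i | (i, s) <- stripes, mayMatch s p]
```

For three stripes covering 0..9, 10..19, and 20..29, `keepStripes (Equals 15)` keeps only stripe 1. The evaluator is deliberately conservative: a stripe that passes may still contain no matching rows, so a row-level filter is still needed after decoding.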

Module            Purpose
ORC.Read          loadORCFilePath, openORCReader, column decoders
ORC.Write         Stripe and file writer
ORC.Footer        File footer parse and metadata
ORC.Stripe        Stripe footer protobuf and stream layout
ORC.Statistics    Column statistics predicate evaluator
ORC.BloomFilter   Bloom filter decode and membership checks
ORC.Encryption    Stripe-level encryption
ORC.RowIndex      Row index decode for sub-stripe seek
ORC.Arrow         ORC to Arrow column bridge
ORC.Derive        Template Haskell deriver
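As a rough illustration of what a Bloom filter membership check does for equality pruning, here is a toy filter built only on `base`. It is a sketch under stated assumptions: the `Bloom` type and its functions are hypothetical, the bit set is a plain `Integer`, and the hash is FNV-1a, whereas the ORC specification describes a Murmur3-based scheme. It does not reflect the ORC.BloomFilter API or the on-disk encoding.

```haskell
import Data.Bits (setBit, shiftR, testBit, xor, (.&.))
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Hypothetical toy filter: a bit set, its size in bits (must be > 0),
-- and the number of probe positions per key.
data Bloom = Bloom
  { bloomBits :: Integer
  , bloomSize :: Int
  , bloomHashes :: Int
  }

emptyBloom :: Int -> Int -> Bloom
emptyBloom size hashes = Bloom 0 size hashes

-- | 64-bit FNV-1a over the characters of the key (stand-in hash only).
fnv1a :: String -> Word64
fnv1a = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- | Probe positions via double hashing: (h1 + i * h2) mod size.
positions :: Bloom -> String -> [Int]
positions b key =
  [ fromIntegral ((h1 + fromIntegral i * h2) `mod` fromIntegral (bloomSize b))
  | i <- [1 .. bloomHashes b]
  ]
  where
    h = fnv1a key
    h1 = h .&. 0xffffffff
    h2 = h `shiftR` 32

bloomInsert :: String -> Bloom -> Bloom
bloomInsert key b = b {bloomBits = foldl' setBit (bloomBits b) (positions b key)}

-- | False means the key is definitely absent; True means "possibly present".
bloomMember :: String -> Bloom -> Bool
bloomMember key b = all (testBit (bloomBits b)) (positions b key)
```

A reader can skip a stripe for an equality predicate when `bloomMember` returns False for the probed value. A True result only means the stripe has to be decoded and filtered row by row, since Bloom filters admit false positives but never false negatives.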

ORC files written by Hive, Spark, and Trino round-trip through the reader when they use the standard RLE and compression layouts exercised by the package test fixtures.