Written by romain
ParquetDiff: a lightweight tool to compare Parquet Schemas
ParquetDiff is a small utility to identify differences between Parquet schemas.
It can be used as a command-line tool or embedded directly into an application as a dependency.
ParquetDiff is useful for analyzing Parquet directories with multiple partitions, detecting schema differences across files quickly and efficiently.
A common use case is a data pipeline that appends data to a Parquet directory on a regular basis (hourly, daily, etc.).
Over time, data schemas may evolve due to changes in input data or application logic. ParquetDiff helps track and validate these schema changes:
output.parquet/
|-- date=2025-03-25 #(generated by run 1)
|   |-- part-0000.snappy.parquet
|   |-- part-0001.snappy.parquet
|-- date=2025-03-26 #(generated by run 2)
|   |-- part-0000.snappy.parquet
|   |-- part-0001.snappy.parquet
|-- date=2025-03-27 #(generated by run 3, with a new column)
|   |-- part-0000.snappy.parquet
....
Spotting this kind of change across partitions is what ParquetDiff was designed for. You can find more details on how to use it in the project repository: https://github.com/romibuzi/ParquetDiff.
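
To make the idea of a schema diff concrete, here is a minimal Python sketch of the comparison involved. It is not ParquetDiff's API or implementation, only an illustration. Schemas are represented as plain dictionaries mapping column names to type names, and the `country` column, the function names `diff_schemas` and `diff_consecutive`, and the sample types are all hypothetical. In a real setting the schemas would be read from the Parquet files themselves (for example with `pyarrow.parquet.read_schema`).

```python
from typing import Dict, List, Tuple

# Simplified, hypothetical schema representation: column name -> type name.
Schema = Dict[str, str]


def diff_schemas(old: Schema, new: Schema) -> Dict[str, list]:
    """Return the columns added, removed, or retyped between two schemas."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(
        (col, old[col], new[col])
        for col in set(old) & set(new)
        if old[col] != new[col]
    )
    return {"added": added, "removed": removed, "changed": changed}


def diff_consecutive(partitions: Dict[str, Schema]) -> List[Tuple[str, str, dict]]:
    """Compare each partition with the next one (sorted by name) and keep
    only the pairs whose schemas differ."""
    names = sorted(partitions)
    results = []
    for prev, curr in zip(names, names[1:]):
        diff = diff_schemas(partitions[prev], partitions[curr])
        if any(diff.values()):
            results.append((prev, curr, diff))
    return results


# Made-up schemas mirroring the directory layout above:
# run 3 introduces a new (hypothetical) column.
partitions = {
    "date=2025-03-25": {"id": "int64", "name": "string"},
    "date=2025-03-26": {"id": "int64", "name": "string"},
    "date=2025-03-27": {"id": "int64", "name": "string", "country": "string"},
}

changes = diff_consecutive(partitions)
for prev, curr, diff in changes:
    print(f"{prev} -> {curr}: {diff}")
```

Comparing consecutive partitions, rather than every pair, keeps the report short and points at the run where the schema changed.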