ParquetDiff: a lightweight tool to compare Parquet schemas

ParquetDiff is a small utility to identify differences between Parquet schemas.

It can be used as a command-line tool or embedded directly into an application as a dependency.

ParquetDiff is useful for analyzing Parquet directories with multiple partitions, where it quickly detects schema differences across files.

A common use case is a data pipeline that appends data to a Parquet directory on a regular basis (hourly, daily, etc.).

Over time, data schemas may evolve due to changes in input data or application logic, and ParquetDiff helps track and validate these schema changes. In the layout below, for example, the third run introduces a new column:

output.parquet/
|-- date=2025-03-25  # generated by run 1
|   |-- part-0000.snappy.parquet
|   |-- part-0001.snappy.parquet
|-- date=2025-03-26  # generated by run 2
|   |-- part-0000.snappy.parquet
|   |-- part-0001.snappy.parquet
|-- date=2025-03-27  # generated by run 3, with a new column
    |-- part-0000.snappy.parquet
    ...

Detecting this kind of schema drift between partitions is what ParquetDiff was designed for. You can find more details on how to use it in the project repository: https://github.com/romibuzi/ParquetDiff.
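
To make the idea of a schema diff across partitions concrete, here is a minimal conceptual sketch in Python. It is not ParquetDiff's API or implementation. It assumes each partition's schema has already been reduced to a plain mapping of column name to type name, and the column names (id, name, country) and the helper names diff_schemas and diff_partitions are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a schema is modeled as {column name: type name}.
Schema = Dict[str, str]


def diff_schemas(left: Schema, right: Schema) -> Dict[str, list]:
    """Return the columns added, removed, or retyped going from left to right."""
    added = sorted(set(right) - set(left))
    removed = sorted(set(left) - set(right))
    changed = sorted(
        (col, left[col], right[col])
        for col in set(left) & set(right)
        if left[col] != right[col]
    )
    return {"added": added, "removed": removed, "changed": changed}


def diff_partitions(schemas: Dict[str, Schema]) -> Dict[Tuple[str, str], Dict[str, list]]:
    """Compare each partition with the next one (sorted by partition name)
    and keep only the pairs whose schemas differ."""
    names: List[str] = sorted(schemas)
    report = {}
    for prev, curr in zip(names, names[1:]):
        delta = diff_schemas(schemas[prev], schemas[curr])
        if any(delta.values()):
            report[(prev, curr)] = delta
    return report


# Hypothetical schemas mirroring the layout above: run 3 adds a column.
partitions = {
    "date=2025-03-25": {"id": "int64", "name": "string"},
    "date=2025-03-26": {"id": "int64", "name": "string"},
    "date=2025-03-27": {"id": "int64", "name": "string", "country": "string"},
}

report = diff_partitions(partitions)
for (prev, curr), delta in report.items():
    print(f"{prev} -> {curr}: {delta}")
```

With real files, the per-partition mappings would have to be built from the Parquet footers first, for instance with a reader such as pyarrow.parquet.read_schema. The sketch only compares consecutive partitions, which is enough to locate the run that introduced a change.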