Parquet – Performance Benchmark


If you are following today’s trend of building an efficient Modern Data Stack, you are probably already aware of the Parquet format and the efficient data storage and retrieval it offers.

It is a column-oriented storage format similar to ORC, and it also provides efficient data compression and encoding schemes, resulting in enhanced performance for both batch and interactive use cases.

When paired with an in-memory columnar format such as Apache Arrow, it also enables zero-copy reads for fast data access without serialization overhead.
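To make the compression and encoding options concrete, here is a minimal sketch using PyArrow to write and read a Parquet file; the file name and column values are made up for illustration.

```python
# Minimal sketch: write a Parquet file with an explicit compression codec and
# dictionary encoding, then read a single column back as an Arrow table.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": range(1_000), "city": ["Sydney", "Melbourne"] * 500})

table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    "sample.parquet",          # assumed output path
    compression="snappy",      # column-level compression codec
    use_dictionary=True,       # dictionary encoding for repeated values
)

# Reading back returns an Arrow table without a row-by-row deserialization step.
table_back = pq.read_table("sample.parquet", columns=["city"])
print(table_back.num_rows)
```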

There are a few Python libraries available to convert CSV to Parquet or to store data in the Parquet format. Notable ones are

  1. Pandas (itself)
  2. PyArrow
  3. FastParquet – offers compression benefits

In this blog post, I have compared the above three Python libraries in a Python notebook uploaded to GitHub. PyArrow proved to deliver excellent performance for both reading a CSV/Parquet file and writing to Parquet, even though FastParquet offers some additional compression benefits.
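The comparison can be reproduced along the following lines; this is only a simplified sketch (the input path is assumed, and the timings are whatever your machine produces, not the figures from the notebook). Pandas delegates Parquet I/O to either the pyarrow or the fastparquet engine.

```python
# Simplified benchmark sketch: time Parquet writes and reads with both engines.
import time
import pandas as pd

csv_path = "sample_data.csv"          # assumed input file
df = pd.read_csv(csv_path)

for engine in ("pyarrow", "fastparquet"):
    out_path = f"sample_{engine}.parquet"

    start = time.perf_counter()
    df.to_parquet(out_path, engine=engine, compression="snappy")
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    pd.read_parquet(out_path, engine=engine)
    read_secs = time.perf_counter() - start

    print(f"{engine}: write {write_secs:.3f}s, read {read_secs:.3f}s")
```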

The sample data used for this demo can be found here.

Results

Apache Parquet has been around for a while, but with the rise of Databricks it has gained considerable popularity.

As a matter of fact, both Azure Databricks and Azure Synapse support PyArrow for implementing a Delta Lakehouse.

Databricks

By default, PyArrow is enabled for High Concurrency clusters and for workspaces enabled for Unity Catalog. To enable it on other cluster types, spark.sql.execution.arrow.pyspark.enabled needs to be set to true in the Spark configuration.
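A minimal sketch of turning this on from a notebook session (here `spark` is the SparkSession that Databricks pre-creates for you):

```python
# Enable Arrow-based conversion for this session if the cluster does not
# already have it on by default.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# toPandas() and createDataFrame(pandas_df) now use Arrow where the schema allows it.
pdf = spark.range(1_000_000).toPandas()
```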

The following conversions are unsupported and worth noting –

  1. MapType
  2. ArrayType of TimestampType
  3. Nested StructType
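As a hedged illustration of what happens with such a column: on runtimes where MapType is not Arrow-convertible, toPandas() falls back to the regular (non-Arrow) conversion path, controlled by the fallback setting below. The column names are made up, and the exact behaviour depends on your Spark/Databricks runtime version.

```python
# Hypothetical example: converting a DataFrame with a MapType column.
from pyspark.sql import functions as F

spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

df = spark.range(3).withColumn("props", F.create_map(F.lit("key"), F.col("id")))
pdf = df.toPandas()   # falls back to non-Arrow conversion where MapType is unsupported
```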

Azure Synapse

PyArrow ships as part of the Azure Synapse Spark runtimes, so no additional installation is required there.

Delta Lake

What is Delta Lake?

Delta Lake is currently one of the most popular data storage formats, with increasing market share and more companies adopting it. While Delta Lake deserves a separate blog post of its own, the most important capability it adds on top of Parquet is support for ACID properties.

Using plain Parquet, updates can still happen, but with some minor challenges. With the Delta format, every update is tracked in the Delta transaction log, as sketched below.
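Here is a minimal sketch (the table path and column are placeholders) of how an update against a Delta table is recorded in the _delta_log, using the delta-spark Python API on a cluster where Delta Lake is configured:

```python
# Write a small Delta table, apply an in-place update, and rely on the
# transaction log to record the change as a new commit.
from delta.tables import DeltaTable

path = "/tmp/delta/orders"   # placeholder path

# Initial write: Parquet data files plus a _delta_log directory.
(spark.range(5)
      .withColumnRenamed("id", "order_id")
      .write.format("delta").mode("overwrite").save(path))

# An in-place update; Delta appends a new JSON commit to _delta_log.
DeltaTable.forPath(spark, path).update(
    condition="order_id = 3",
    set={"order_id": "30"},
)

print(spark.read.format("delta").load(path).count())
```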

From my experience designing and implementing Delta Lake on Azure using Azure Databricks, the requirements below were achieved with great performance (a combined sketch of items 2 and 6 follows the list) –

  1. Less storage capacity than storing CSV files
  2. Partitioning – reading non-partition-key attributes that are not included in the predicate statement
  3. ACID transactions – keeping track of the latest updates to the data
  4. Read speed – a definite improvement over the native Parquet format
  5. Uplifting from a Data Lake to a Delta Lake
  6. Streaming data using Databricks Auto Loader – Azure, AWS and GCP
  7. Interoperability between Azure Databricks and Azure Synapse
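The following is a hedged sketch of items 2 and 6 above: a partitioned Delta write, and streaming ingestion with Databricks Auto Loader ("cloudFiles") into the same location. The storage account, container names, schema/checkpoint paths and the order_date column are assumptions for illustration, not my production setup.

```python
# Hypothetical paths on ADLS Gen2.
raw_path = "abfss://landing@mystorage.dfs.core.windows.net/orders/"
delta_path = "abfss://curated@mystorage.dfs.core.windows.net/orders_delta/"

# 1) Batch: write a Delta table partitioned by a date column.
(spark.read.format("csv").option("header", "true").load(raw_path)
      .write.format("delta")
      .partitionBy("order_date")
      .mode("overwrite")
      .save(delta_path))

# 2) Streaming: Auto Loader incrementally picks up new files from the landing zone.
(spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", delta_path + "_schema")
      .load(raw_path)
      .writeStream.format("delta")
      .option("checkpointLocation", delta_path + "_checkpoint")
      .outputMode("append")
      .start(delta_path))
```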

I would love for you to share what you have achieved using the Delta Lake / Parquet format.

Happy Architecting …… 🙂
