Read Parquet files using Pandas read_parquet
The read_parquet function in Pandas allows you to read Parquet files into a DataFrame.
It can read a Parquet file from either a local file path or a URL, and it provides several extra options for loading and handling the data from the file.
What is a Parquet File?
Parquet, also known as Apache Parquet, is a column-oriented data storage format optimized for working with large datasets: it minimizes file size while maximizing efficiency when querying the data.
Parquet is also designed to be a self-describing file format. The schema is embedded in the data, which makes reading and writing the data more flexible and robust.
Parquet files are not plain text; they are binary files.
One of the advantages of Parquet files is that they can contain many different types of data, including complex nested structures, and can handle data compression and encoding schemes effectively.
Now, let’s create a Parquet file as sample data to use in this tutorial:
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
df.to_parquet('people.parquet')
The code above creates a DataFrame from sample data, then writes the DataFrame to a Parquet file named ‘people.parquet’ using the to_parquet function.
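Because the schema is embedded in the file, you can inspect it without loading the data into a DataFrame. Here is a minimal sketch using PyArrow directly (assuming the pyarrow package is installed and ‘people.parquet’ was created as shown above):
import pyarrow.parquet as pq

# Open the file lazily; this reads only the footer, not the data
parquet_file = pq.ParquetFile('people.parquet')

# The embedded schema: column names and their types
print(parquet_file.schema)

# File-level metadata: number of rows, row groups, and more
print(parquet_file.metadata)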
Syntax and Parameters
The syntax for using the read_parquet function is as follows:
pandas.read_parquet(path, engine='auto', columns=None, **kwargs)
Now let’s break down these parameters:
- path: A string representing the file path or URL from where the parquet file will be read. Both local and remote (via a valid URL scheme) files can be accessed.
- engine: The engine to use for reading the files. The options are ‘auto’, ‘pyarrow’, or ‘fastparquet’. ‘auto’ lets Pandas decide which engine to use.
- columns: If not None, only these columns will be read from the file. This can significantly optimize the reading process if you only need specific columns.
- kwargs: Extra options that are passed to the engine. Check the respective engine’s documentation for more information on these options.
For instance, PyArrow supports additional options such as memory_map and buffer_size.
The memory_map option is a boolean that specifies whether to memory-map the file when reading it, which can improve performance in some scenarios.
The buffer_size option sets the size of the read buffer, which can also influence read performance when used wisely.
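As a sketch of how such engine-specific options are passed through (assuming the pyarrow engine is installed; the 64 KB buffer size is an arbitrary example value):
import pandas as pd

# Extra keyword arguments are forwarded to the underlying engine
df = pd.read_parquet(
    'people.parquet',
    engine='pyarrow',
    memory_map=True,       # memory-map the file instead of buffering it
    buffer_size=64 * 1024  # read buffer size in bytes (0 disables buffering)
)
print(df.head())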
The read_parquet function returns a DataFrame object, which contains the data read from the file.
How to Read Parquet Files with Pandas
You can read a Parquet file using the read_parquet function by passing the file path to the function like this:
import pandas as pd

df = pd.read_parquet('people.parquet')
print(df)
Output:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
In this example, the function reads the file and returns a DataFrame.
The indices (0, 1, 2, 3) are automatically generated by Pandas.
Understanding Different Engine Options
As we discussed earlier, the read_parquet function supports two engines: PyArrow and fastparquet.
To use them, you need to install them first:
pip install fastparquet pyarrow
To specify the engine while reading a Parquet file, you can pass it as an argument like this:
import pandas as pd

df_pyarrow = pd.read_parquet('people.parquet', engine='pyarrow')
df_fastparquet = pd.read_parquet('people.parquet', engine='fastparquet')
print('Using PyArrow')
print(df_pyarrow)
print('Using fastparquet')
print(df_fastparquet)
Output:
Using PyArrow
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
Using fastparquet
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
In these examples, we’re specifying the engine to be used for reading the Parquet file. While the resulting data is the same, there can be differences in performance.
Let’s measure that:
import pandas as pd
import time

start_time = time.time()
df_pyarrow = pd.read_parquet('people.parquet', engine='pyarrow')
end_time = time.time()
pyarrow_time = end_time - start_time
print('Time taken using PyArrow: {} seconds'.format(pyarrow_time))

start_time = time.time()
df_fastparquet = pd.read_parquet('people.parquet', engine='fastparquet')
end_time = time.time()
fastparquet_time = end_time - start_time
print('Time taken using fastparquet: {} seconds'.format(fastparquet_time))
Output:
Time taken using PyArrow: 0.0388336181640625 seconds
Time taken using fastparquet: 0.12515997886657715 seconds
Keep in mind that these results are for reading only 4 records and 3 columns from the Parquet file, so they largely reflect each engine’s fixed overhead rather than real-world read performance.
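For a more meaningful comparison, time the engines on a larger file. Here is a sketch with a synthetic dataset (the file name ‘big_sample.parquet’, the one-million-row count, and the columns are made up for illustration; both engines are assumed to be installed):
import time

import numpy as np
import pandas as pd

# Build a larger file so the timing reflects actual read work
big_df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'value': np.random.rand(1_000_000),
    'label': np.random.choice(['a', 'b', 'c'], size=1_000_000),
})
big_df.to_parquet('big_sample.parquet')

# Time each engine on the same file
for engine in ('pyarrow', 'fastparquet'):
    start = time.time()
    pd.read_parquet('big_sample.parquet', engine=engine)
    print('{}: {:.4f} seconds'.format(engine, time.time() - start))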
Specifying Columns to Load
The read_parquet function also allows you to specify which columns to load from the Parquet file. This can be particularly useful when working with large datasets with many columns.
You can specify the columns to load by passing a list of column names to the columns parameter of the read_parquet function:
import pandas as pd

df = pd.read_parquet('people.parquet', columns=['Name', 'Age'])
print(df)
Output:
    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
In this example, we’re only loading the ‘Name’ and ‘Age’ columns from the Parquet file. As you can see from the output, the resulting DataFrame only contains these two columns.
Working with Partitioned Parquet Files
Partitioning is a technique where data is split across multiple files based on the values of one or more columns.
This technique can significantly speed up queries and data loads as it limits the amount of data that needs to be scanned.
Each partition is stored as a separate subdirectory within the output directory, and the subdirectory names encode the values of the partitioning columns.
For example, if you partitioned your data based on the ‘year’ and ‘month’ columns, the directory structure would look like this:
partitioned_people/
├── year=2021/
│   ├── month=01/
│   │   ├── part.0.parquet
│   │   ├── part.1.parquet
│   │   └── ...
│   └── month=02/
│       ├── part.0.parquet
│       ├── part.1.parquet
│       └── ...
└── ...
In this example, each subdirectory represents a unique combination of the ‘year’ and ‘month’ values. Inside each subdirectory, you will find the Parquet files containing the data corresponding to that partition.
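To produce a layout like this yourself, you can pass partition_cols to to_parquet. A brief sketch (the ‘year’ and ‘month’ columns here are hypothetical additions to the sample data, and the pyarrow engine is assumed to be installed):
import pandas as pd

# Hypothetical data that includes the partitioning columns
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'year': [2021, 2021, 2021, 2021],
    'month': ['01', '01', '02', '02'],
})

# Writes one subdirectory per unique (year, month) combination
df.to_parquet('partitioned_people', partition_cols=['year', 'month'])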
When reading partitioned Parquet data, Pandas read_parquet treats each file within the directory tree as a piece of the dataset and concatenates them into one large DataFrame, reconstructing the partitioning columns (such as ‘year’ and ‘month’) from the directory names.
Here’s how you can read this partitioned data:
import pandas as pd

df = pd.read_parquet('partitioned_people')
print(df)
The read_parquet function will automatically detect the partitioning scheme and read the data from all the Parquet files within the subdirectories. The resulting DataFrame (df in the example above) will contain the consolidated data from all the partitions.
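You can also read only a subset of the partitions by pushing a simple filter down to the engine, so non-matching subdirectories are skipped entirely. Here is a sketch assuming the pyarrow engine and the layout shown above (note that the type of a partition value, for example 2021 versus '2021', depends on how the engine infers it from the directory names):
import pandas as pd

# Only partitions matching year=2021 are scanned; the rest are skipped
df_2021 = pd.read_parquet(
    'partitioned_people',
    engine='pyarrow',
    filters=[('year', '=', 2021)]
)
print(df_2021)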
Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html