Read Parquet files using Pandas read_parquet

The read_parquet function in Pandas allows you to read Parquet files into a DataFrame.

It can read a Parquet file from either a local file path or a URL, and it offers several additional options for controlling how the data from the file is loaded and handled.

What is a Parquet File?

Parquet, also known as Apache Parquet, is a column-oriented data storage format optimized for working with large data sets: it minimizes file size while maximizing efficiency when querying the data.

Parquet is also designed to be a self-describing file format. The schema is embedded in the data, which makes reading and writing the data more flexible and robust.
Parquet files are not plain text; they are binary files.

One of the advantages of Parquet files is that they can contain many different types of data, including complex nested structures, and can handle data compression and encoding schemes effectively.
Now, let’s create a Parquet file as sample data to use in this tutorial:

import pandas as pd
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
df.to_parquet('people.parquet')

The code above creates a DataFrame from sample data, then writes it to a Parquet file named ‘people.parquet’ using the to_parquet function. (Writing Parquet requires either the pyarrow or fastparquet package to be installed.)
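
As a side note on the compression support mentioned earlier, to_parquet also exposes a compression parameter. Here is a minimal sketch (the file name ‘people_gzip.parquet’ is just for illustration):

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32]
})

# 'snappy' is the default codec; 'gzip' usually gives a smaller file at the cost of speed
df.to_parquet('people_gzip.parquet', compression='gzip')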

Syntax and Parameters

The syntax for using the read_parquet function is as follows:

pandas.read_parquet(path, engine='auto', columns=None, **kwargs)

Now let’s break down these parameters:

  1. path: A string representing the file path or URL from which the Parquet file will be read. Both local and remote (via a valid URL scheme) files can be accessed, as shown in the example after this list.
  2. engine: The engine to use for reading the files. The options are ‘auto’, ‘pyarrow’, or ‘fastparquet’. ‘auto’ lets Pandas decide which engine to use.
  3. columns: If not None, only these columns will be read from the file. This can significantly optimize the reading process if you only need specific columns.
  4. kwargs: Extra options that are passed to the engine. Check the respective engine’s documentation for more information on these options.
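
For instance, the path can point to a remote file rather than a local one. A small sketch (the URL below is a placeholder, not a real dataset; depending on the scheme, remote reads may also require fsspec or s3fs to be installed):

import pandas as pd

# Read a Parquet file directly from a remote location (placeholder URL)
df = pd.read_parquet('https://example.com/data/people.parquet')
print(df.head())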

As for the kwargs, the PyArrow engine supports additional options such as memory_map and buffer_size.

The memory_map option is a boolean that specifies whether to use memory-mapped file reading (mmap), which can improve performance in some scenarios.

The buffer_size option sets the size of the read buffer, which can also influence read performance when tuned sensibly.
The read_parquet function returns a DataFrame object, which contains the data read from the file.
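
Engine-specific options such as memory_map and buffer_size can be passed straight through read_parquet as keyword arguments. A minimal sketch, assuming the pyarrow engine is installed (whether a given option is accepted can depend on your pyarrow version):

import pandas as pd

# Extra keyword arguments are forwarded to the underlying engine (here: pyarrow)
df = pd.read_parquet(
    'people.parquet',
    engine='pyarrow',
    memory_map=True,    # use memory-mapped file reading
    buffer_size=65536   # read buffer size in bytes
)
print(df)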

How to Read Parquet Files with Pandas

You can read a Parquet file with the read_parquet function by passing the file path to it like this:

import pandas as pd
df = pd.read_parquet('people.parquet')
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

In this example, the function reads the file and returns a DataFrame.
The indices (0, 1, 2, 3) are automatically generated by Pandas.
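
If you would rather use one of the columns as the index, you can set it after reading. A small illustration:

import pandas as pd

df = pd.read_parquet('people.parquet')

# Use the 'Name' column as the index instead of the default RangeIndex
df = df.set_index('Name')
print(df)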

Understanding Different Engine Options

As we discussed earlier, the read_parquet function supports two engines: PyArrow and fastparquet.

To use them, you need to install them first:

pip install fastparquet pyarrow

To specify the engine while reading a Parquet file, you can pass it as an argument like this:

import pandas as pd
df_pyarrow = pd.read_parquet('people.parquet', engine='pyarrow')
df_fastparquet = pd.read_parquet('people.parquet', engine='fastparquet')
print('Using PyArrow')
print(df_pyarrow)
print('Using fastparquet')
print(df_fastparquet)

Output:

Using PyArrow
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
Using fastparquet
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

In these examples, we’re specifying the engine to be used for reading the Parquet file. While the resulting data is the same, there can be differences in performance.

Let’s measure that:

import pandas as pd
import time

start_time = time.time()
df_pyarrow = pd.read_parquet('people.parquet', engine='pyarrow')
end_time = time.time()
pyarrow_time = end_time - start_time
print('Time taken using PyArrow: {} seconds'.format(pyarrow_time))

start_time = time.time()
df_fastparquet = pd.read_parquet('people.parquet', engine='fastparquet')
end_time = time.time()
fastparquet_time = end_time - start_time
print('Time taken using fastparquet: {} seconds'.format(fastparquet_time))

Output:

Time taken using PyArrow: 0.0388336181640625 seconds
Time taken using fastparquet: 0.12515997886657715 seconds

These timings are for reading only 4 records and 3 columns from the Parquet file, so treat them as illustrative; the numbers will vary with dataset size and environment.
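
For a more meaningful comparison, you could repeat the measurement on a larger synthetic file. A rough sketch (the row count and column contents are invented for illustration):

import time

import numpy as np
import pandas as pd

# Build a larger synthetic dataset and write it to Parquet
big_df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'value': np.random.rand(1_000_000),
    'group': np.random.choice(['a', 'b', 'c'], size=1_000_000),
})
big_df.to_parquet('big_sample.parquet')

# Time each engine on the same file
for engine in ('pyarrow', 'fastparquet'):
    start = time.time()
    pd.read_parquet('big_sample.parquet', engine=engine)
    print('{}: {:.4f} seconds'.format(engine, time.time() - start))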

Specifying Columns to Load

The read_parquet function also allows you to specify which columns to load from the Parquet file. Because Parquet is a columnar format, the columns you skip are never read from disk, which can be particularly useful when working with large datasets that have many columns.
You can specify the columns to load by passing a list of column names to the columns parameter of the read_parquet function:

import pandas as pd
df = pd.read_parquet('people.parquet', columns=['Name', 'Age'])
print(df)

Output:

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

In this example, we’re only loading the ‘Name’ and ‘Age’ columns from the Parquet file. As you can see from the output, the resulting DataFrame only contains these two columns.

Working with Partitioned Parquet Files

Partitioning is a technique where data is split across multiple files based on the values of one or more columns.

This technique can significantly speed up queries and data loads as it limits the amount of data that needs to be scanned.
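
To make the examples below concrete, here is a sketch of how such a partitioned dataset could be written with to_parquet (the ‘year’ and ‘month’ columns and their values are invented for illustration, and the exact file names created inside each partition depend on the engine):

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'year': [2021, 2021, 2021, 2021],
    'month': ['01', '01', '02', '02'],
})

# One subdirectory is created per unique (year, month) combination
df.to_parquet('partitioned_people', partition_cols=['year', 'month'])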

Each partition is represented as a separate subdirectory within the output directory, and the subdirectory names reflect the values of the partitioning columns.

For example, if you partitioned your data based on the ‘year’ and ‘month’ columns, the directory layout would follow this structure:

partitioned_people/
└── year=2021/
    ├── month=01/
    │   ├── part.0.parquet
    │   ├── part.1.parquet
    │   └── ...
    └── month=02/
        ├── part.0.parquet
        ├── part.1.parquet
        └── ...

In this example, each subdirectory represents a unique combination of the ‘year’ and ‘month’ values. Inside each subdirectory, you will find the Parquet files containing the data corresponding to that partition.

When reading partitioned Parquet data, read_parquet reads all the files under the directory as a single dataset, reconstructs the partition columns (‘year’ and ‘month’ in this example) from the subdirectory names, and returns one combined DataFrame.
Here’s how you can read this partitioned data:

import pandas as pd
df = pd.read_parquet('partitioned_people')
print(df)

The read_parquet function will automatically detect the partitioning scheme and read the data from all the Parquet files within the subdirectories. The resulting DataFrame (df in the example above) will contain the consolidated data from all the partitions.
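
If you only need some of the partitions, you can pass a filters argument, which is forwarded to the engine so that non-matching partitions are skipped. A sketch based on the hypothetical ‘year’/‘month’ partitioning above:

import pandas as pd

# Only read the partitions where year == 2021; other partitions are not scanned
df_2021 = pd.read_parquet(
    'partitioned_people',
    filters=[('year', '=', 2021)]
)
print(df_2021)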

Further Reading

https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
