Reading Pickle Files in Pandas using read_pickle

read_pickle in Pandas allows you to load pickled Pandas objects.

It can load objects such as DataFrames and Series that were saved with the Pandas to_pickle method.

In this tutorial, we’ll cover its syntax, load pickle files into DataFrames, and benchmark its read performance under different compression algorithms.

 

 

Pandas read_pickle Syntax and Parameters

The basic syntax for read_pickle is as follows:

pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
  • filepath_or_buffer: The location of the pickled object. This can be a string path, a path object, or a file-like object opened in binary mode. A URL is also accepted.
  • compression: The type of compression to use, if any. By default, it’s set to ‘infer’, which means the method will try to infer the compression type from the file extension. Supported compression types include ‘bz2’, ‘gzip’, ‘xz’, and ‘zip’; recent Pandas versions also accept ‘zstd’ and ‘tar’.
  • storage_options: This is a dict parameter which is relevant if you’re using specific storage connection settings, especially when working with remote storage like S3 or GCS.
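Since filepath_or_buffer also accepts a file-like object, here is a minimal sketch of a round trip through an in-memory buffer. The sample DataFrame and variable names are made up for illustration, and the sketch assumes a Pandas version in which to_pickle can write to a binary buffer.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"Name": ["Alex", "John"], "Age": [25, 30]})

# Write the pickle into an in-memory binary buffer instead of a file on disk
buffer = io.BytesIO()
df.to_pickle(buffer)

# Rewind the buffer so read_pickle starts at the first byte
buffer.seek(0)
restored = pd.read_pickle(buffer)

# True when the restored DataFrame matches the original
print(restored.equals(df))
```

With a buffer there is no file extension to inspect, so the default compression='infer' treats the data as uncompressed.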

 

Risks of Unpickling Data from Untrusted Sources

It’s crucial to understand the potential dangers associated with unpickling data, especially when the source of that data is untrusted.

  1. Arbitrary Code Execution: The pickled data can contain executable code.
    Unpickling from an untrusted source can run arbitrary code, potentially harming your system or compromising sensitive data.
  2. Denial of Service (DoS): A specially crafted pickled file can cause your application to crash or hang, leading to a Denial of Service attack.

Never unpickle data received from untrusted or unauthenticated sources.
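To see why this matters, the following sketch uses only the standard pickle module and a harmless payload. The Demo class is hypothetical; its __reduce__ method tells the unpickler to call print, but an attacker could name any other callable in the same way.

```python
import pickle


class Demo:
    """Hypothetical class whose pickle instructs the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls print(...) with this argument
        return (print, ("this line was printed by pickle.loads",))


payload = pickle.dumps(Demo())

# Loading the bytes runs the call stored in the payload
result = pickle.loads(payload)

# The loaded "object" is just whatever the call returned (None for print)
print(result)
```

read_pickle relies on the same unpickling machinery, so a malicious .pkl file can trigger calls like this as soon as it is loaded.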

 

How to read a pickle file

You can use the read_pickle function to read a pickle file in Pandas like this:

import pandas as pd
df = pd.read_pickle('sample_data.pkl')
print(df)

Output:

   Name  Age  Salary
0  Alex   25   50000
1  John   30   60000
2  Jane   28   55000

Here, we loaded the DataFrame stored in the “sample_data.pkl” file.

The DataFrame, as shown in the output, has three columns: ‘Name’, ‘Age’, and ‘Salary’ and three entries for demonstration.
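If you don’t have a pickle file at hand, the sketch below creates one first and then reads it back. It assumes the same three sample rows shown above and writes to a temporary directory so that nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Sample data matching the demonstration output
df = pd.DataFrame({
    "Name": ["Alex", "John", "Jane"],
    "Age": [25, 30, 28],
    "Salary": [50000, 60000, 55000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_data.pkl")
    df.to_pickle(path)            # save the DataFrame as a pickle
    loaded = pd.read_pickle(path) # load it back into memory

print(loaded)
# Column dtypes survive the round trip, unlike with a CSV export
print(loaded.dtypes)
```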

 

Read Compressed Pickle

Pandas natively supports several compression protocols:

  • gzip: An extensively used compression method, particularly suitable for textual data.
  • bz2: Another compression method that often provides a better compression ratio than gzip, albeit at a slightly slower speed.
  • xz: Provides one of the best compression ratios, although it can be much slower than the other methods.
  • zip: Widely known and used, it is also supported by Pandas for both pickling and reading.

Assuming you have pickled and compressed a DataFrame using one of the supported methods, you can read the compressed file directly using read_pickle by specifying the appropriate compression type.

For gzip compression:

df_gzip = pd.read_pickle('dataframe.pkl.gz', compression='gzip')

For bz2 compression:

df_bz2 = pd.read_pickle('dataframe.pkl.bz2', compression='bz2')

For xz compression:

df_xz = pd.read_pickle('dataframe.pkl.xz', compression='xz')

For zip compression:

df_zip = pd.read_pickle('dataframe.pkl.zip', compression='zip')

One of the handy features is that if you omit the compression parameter when calling read_pickle, Pandas will try to infer the compression based on the file extension.
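As a rough illustration of that inference, the sketch below writes the same DataFrame under several extensions and reads each file back without passing compression at all. The file names and the small DataFrame are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

results = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for ext in (".pkl", ".pkl.gz", ".pkl.bz2", ".pkl.xz", ".pkl.zip"):
        path = os.path.join(tmp_dir, "dataframe" + ext)
        # compression defaults to 'infer' for both writing and reading
        df.to_pickle(path)
        results[ext] = pd.read_pickle(path).equals(df)

# Each extension maps to True when the round trip preserved the data
print(results)
```

If a file has an unusual extension, inference cannot help and you need to pass compression explicitly.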

 

Read first Row or n Rows from Pickle

Unlike CSV or other textual formats, pickled files are not designed for partial reading.

The primary mechanism with pickles is all or nothing. Once the data is loaded, you can easily access the first row.

Load the Pickle and Access the First Row:

df = pd.read_pickle('dataframe.pkl')
first_row = df.iloc[0]
print(first_row)

Output:

A    1
B    4
Name: 0, dtype: int64

The output showcases the values from the first row of our sample DataFrame. Here, we’ve used the iloc property of the DataFrame.

Alternative Method using head():

Pandas DataFrames have a built-in method called head(), which returns the first n rows of the DataFrame.

first_row_with_head = df.head(1)
print(first_row_with_head)

Output:

   A  B
0  1  4

To retrieve more rows, pass a larger n. For example, df.head(10) returns the first 10 rows.

Output (for n=3, as an example):

   A  B
0  1  4
1  2  5
2  3  6

 

Benchmark for Different Compression Algorithms

The Python code below creates sample pickle files with different compression methods and benchmarks their read times using read_pickle:

import pandas as pd
import time

data = {'A': range(1, 100001), 'B': range(100001, 1, -1)}
df = pd.DataFrame(data)

# Pickle with different compressions
df.to_pickle("dataframe.pkl")  # No compression
df.to_pickle("dataframe_gzip.pkl.gz", compression='gzip')
df.to_pickle("dataframe_bz2.pkl.bz2", compression='bz2')
df.to_pickle("dataframe_xz.pkl.xz", compression='xz')
df.to_pickle("dataframe_zip.pkl.zip", compression='zip')

# Measure load times
files = ["dataframe.pkl", "dataframe_gzip.pkl.gz", "dataframe_bz2.pkl.bz2", "dataframe_xz.pkl.xz", "dataframe_zip.pkl.zip"]
compression_methods = ["No Compression", "gzip", "bz2", "xz", "zip"]

for file, method in zip(files, compression_methods):
    # perf_counter is a monotonic, high-resolution clock suited to timing
    start_time = time.perf_counter()
    _ = pd.read_pickle(file)
    elapsed_time = time.perf_counter() - start_time
    print(f"Reading time with {method}: {elapsed_time:.4f} seconds")

Output:

Reading time with No Compression: 0.0932 seconds
Reading time with gzip: 0.7555 seconds
Reading time with bz2: 4.9183 seconds
Reading time with xz: 2.1486 seconds
Reading time with zip: 0.7317 seconds

As you can see, the uncompressed pickle is by far the fastest to read. Among the compressed formats, gzip and zip are the fastest.

The slowest to read in this run is bz2. xz is slower than gzip and zip, yet it produced the smallest file when the pickles were created.
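Read time is only half of the trade-off, so it can help to compare file sizes as well. The following sketch writes the same sample DataFrame with each method into a temporary directory and reports the size of every file. Actual sizes depend on your data and Pandas version, so no output is shown here.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": range(1, 100001), "B": range(100001, 1, -1)})

# File name per method; the extension selects the compression
targets = {
    "No Compression": "dataframe.pkl",
    "gzip": "dataframe.pkl.gz",
    "bz2": "dataframe.pkl.bz2",
    "xz": "dataframe.pkl.xz",
    "zip": "dataframe.pkl.zip",
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for method, name in targets.items():
        path = os.path.join(tmp_dir, name)
        df.to_pickle(path)
        sizes[method] = os.path.getsize(path)  # size in bytes

# Print the methods from smallest to largest file
for method, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{method}: {size / 1024:.1f} KiB")
```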

 

Error Handling and Troubleshooting

One of the common errors when unpickling in Pandas is the unsupported pickle protocol issue.

This arises when the file was written with a pickle protocol that the Python version reading it does not support. Protocol 5, for example, was introduced in Python 3.8, so older interpreters cannot read it.

The error message might look something like: ValueError: unsupported pickle protocol: 5.

Solutions

Upgrade Python: If the error is due to an older Python version, consider upgrading to a newer one that supports the required protocol.

Re-Pickle with a Lower Protocol: If you have access to the environment where the data was originally pickled, you can load it there and re-pickle it with a lower protocol. For example:

df.to_pickle("dataframe_lower_protocol.pkl", protocol=4)

Use a Virtual Environment: If you need to maintain multiple Python versions or library versions, consider using tools like venv or conda to create isolated environments.

General Troubleshooting Tips

Always check the versions of Python and Pandas when facing such issues. This can be done using:

import sys
import pandas as pd

print(sys.version)
print(pd.__version__)
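It can also help to check which protocol a file was written with before trying to load it. The sketch below uses a hypothetical helper named pickle_protocol that reads only the first two bytes of the file. It assumes an uncompressed pickle written with protocol 2 or later, since those start with a marker byte followed by the protocol number.

```python
import os
import pickle
import tempfile


def pickle_protocol(path):
    """Return the protocol number declared at the start of an uncompressed pickle.

    Hypothetical helper for illustration; returns None if no marker is found.
    """
    with open(path, "rb") as fh:
        header = fh.read(2)
    # Protocol 2+ pickles begin with the PROTO opcode (0x80) and the version
    if len(header) == 2 and header[0] == 0x80:
        return header[1]
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    with open(path, "wb") as fh:
        pickle.dump({"A": [1, 2, 3]}, fh, protocol=4)
    found = pickle_protocol(path)

print(f"Protocol used by the file: {found}")
print(f"Highest protocol this Python can read: {pickle.HIGHEST_PROTOCOL}")
```

If the file’s protocol is higher than pickle.HIGHEST_PROTOCOL, loading it in the current interpreter will fail with the error described above.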

 

Resource

https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html
