Reading Pickle Files in Pandas using read_pickle
read_pickle in Pandas allows you to load pickled Pandas objects.
It can load data such as DataFrames and Series that were saved using the Pandas to_pickle method.
In this tutorial, we’ll uncover its syntax, load pickle files into DataFrames, and benchmark its performance under different compression algorithms.
Pandas read_pickle Syntax and Parameters
The basic syntax for read_pickle is as follows:
pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
- filepath_or_buffer: The path to the file that contains the pickled object. This can be a string representing the file path, a path object, or a file-like object opened in binary mode. URLs are also accepted.
- compression: The type of compression the file uses, if any. By default, it’s set to 'infer', which means the method will try to infer the compression type from the file extension. Supported compression types include 'bz2', 'gzip', 'xz', and 'zip'. Pass None to read the file without decompression.
- storage_options: This is a dict parameter which is relevant if you’re using specific storage connection settings, especially when working with remote storage like S3 or GCS.
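To illustrate the filepath_or_buffer parameter, here is a minimal sketch that passes a file-like object instead of a path. The sample DataFrame and the in-memory buffer are assumptions made for this illustration; they are not part of the original tutorial.

```python
import io

import pandas as pd

# Hypothetical sample data used only for this illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Write the pickle into an in-memory binary buffer instead of a file on disk.
buffer = io.BytesIO()
df.to_pickle(buffer)

# Rewind to the start so read_pickle reads from the first byte.
buffer.seek(0)
restored = pd.read_pickle(buffer)

# True when the restored DataFrame matches the original.
print(restored.equals(df))
```

With a buffer there is no file extension to inspect, so the default compression='infer' treats the data as uncompressed.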
Risks of Unpickling Data from Untrusted Sources
It’s crucial to understand the potential dangers associated with unpickling data, especially when the source of that data is untrusted.
- Arbitrary Code Execution: Pickled data can instruct Python to call arbitrary functions while it is being loaded. Unpickling from an untrusted source can therefore run arbitrary code, potentially harming your system or compromising sensitive data.
- Denial of Service (DoS): A specially crafted pickled file can cause your application to crash or hang, leading to a Denial of Service attack.
Never unpickle data received from untrusted or unauthenticated sources.
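One way to act on this advice is to refuse to load a file unless it matches a checksum you obtained through a trusted channel. The sketch below is only an illustration: the helper names sha256_of_file and read_pickle_if_trusted are hypothetical, and a matching checksum only confirms that the file is the one you expected. It does not make an untrusted pickle safe.

```python
import hashlib
import hmac
import os
import tempfile

import pandas as pd


def sha256_of_file(path):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_pickle_if_trusted(path, expected_sha256):
    """Load the pickle only if its digest matches the expected value."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("Checksum mismatch: refusing to unpickle " + path)
    return pd.read_pickle(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trusted.pkl")
    pd.DataFrame({"A": [1, 2, 3]}).to_pickle(path)

    # In practice this digest would come from the party that produced the file.
    known_digest = sha256_of_file(path)
    df = read_pickle_if_trusted(path, known_digest)

    # A wrong digest is rejected before any unpickling happens.
    rejected = False
    try:
        read_pickle_if_trusted(path, "0" * 64)
    except ValueError:
        rejected = True

print(df)
print("Rejected wrong digest:", rejected)
```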
How to read a pickle file
You can use the read_pickle function to read a pickle file in Pandas like this:
import pandas as pd

df = pd.read_pickle('sample_data.pkl')
print(df)
Output:
   Name  Age  Salary
0  Alex   25   50000
1  John   30   60000
2  Jane   28   55000
Here, we loaded the DataFrame stored in the “sample_data.pkl” file.
The DataFrame shown in the output has three columns ('Name', 'Age', and 'Salary') and three rows for demonstration.
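The tutorial does not show how sample_data.pkl was created. The sketch below assumes it was written with to_pickle from a DataFrame holding the values shown in the output; the temporary directory is used here only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Assumed contents of the sample file, taken from the output shown above.
df = pd.DataFrame(
    {
        "Name": ["Alex", "John", "Jane"],
        "Age": [25, 30, 28],
        "Salary": [50000, 60000, 55000],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_data.pkl")
    df.to_pickle(path)             # write the DataFrame to disk
    loaded = pd.read_pickle(path)  # read it back

print(loaded)
print(loaded.dtypes)
```

Because a pickle stores the object itself, the column dtypes and the index come back unchanged, with no parsing options needed.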
Read Compressed Pickle
Pandas natively supports several compression protocols:
- gzip: An extensively used compression method, particularly suitable for textual data.
- bz2: Another compression method that often provides a better compression ratio than gzip, albeit at a slightly slower speed.
- xz: Provides one of the best compression ratios, although it can be much slower than the other methods.
- zip: Widely known and used, it is also supported by Pandas for both pickling and reading.
Assuming you have pickled and compressed a DataFrame using one of the supported methods, you can read the compressed file directly using read_pickle by specifying the appropriate compression type.
For gzip compression:
df_gzip = pd.read_pickle('dataframe.pkl.gz', compression='gzip')
For bz2 compression:
df_bz2 = pd.read_pickle('dataframe.pkl.bz2', compression='bz2')
For xz compression:
df_xz = pd.read_pickle('dataframe.pkl.xz', compression='xz')
For zip compression:
df_zip = pd.read_pickle('dataframe.pkl.zip', compression='zip')
One handy feature is that if you omit the compression parameter when calling read_pickle, Pandas will try to infer the compression based on the file extension.
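The following sketch shows extension-based inference next to an explicit compression argument. The file names and sample data are assumptions chosen for this illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data used only for this illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    # The .gz extension lets both to_pickle and read_pickle infer gzip.
    inferred_path = os.path.join(tmp, "dataframe.pkl.gz")
    df.to_pickle(inferred_path)
    from_inferred = pd.read_pickle(inferred_path)

    # An extension Pandas does not recognize needs the compression stated explicitly.
    explicit_path = os.path.join(tmp, "dataframe.bin")
    df.to_pickle(explicit_path, compression="gzip")
    from_explicit = pd.read_pickle(explicit_path, compression="gzip")

    # gzip files start with the two magic bytes 0x1f 0x8b.
    with open(inferred_path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"

print(from_inferred.equals(df), from_explicit.equals(df), is_gzip)
```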
Read first Row or n Rows from Pickle
Unlike CSV or other textual formats, pickled files are not designed for partial reading.
The primary mechanism with pickles is all or nothing. Once the data is loaded, you can easily access the first row.
Load the Pickle and Access the First Row:
df = pd.read_pickle('dataframe.pkl')
first_row = df.iloc[0]
print(first_row)
Output:
A    1
B    4
Name: 0, dtype: int64
The output showcases the values from the first row of our sample DataFrame. Here, we’ve used the iloc property of the DataFrame.
Alternative Method using head():
Pandas DataFrames have a built-in method called head(), which returns the first n rows of the DataFrame.
first_row_with_head = df.head(1)
print(first_row_with_head)
Output:
   A  B
0  1  4
If you want to retrieve the first 10 rows, you'll use df.head(10).
Output (for n=3, as an example):
   A  B
0  1  4
1  2  5
2  3  6
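To make the two approaches above runnable end to end, here is a small sketch. It assumes the sample DataFrame has columns A and B with the values shown in the outputs, and it writes the file to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Assumed sample data, matching the values in the outputs above.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataframe.pkl")
    df.to_pickle(path)
    # The whole object is loaded; there is no option to read only some rows.
    loaded = pd.read_pickle(path)

first_row = loaded.iloc[0]      # a Series indexed by column name
first_row_df = loaded.head(1)   # a one-row DataFrame
first_three = loaded.head(3)    # the first three rows

print(type(first_row).__name__, type(first_row_df).__name__)
```

The main practical difference is the return type: iloc[0] gives a Series, while head(1) keeps the DataFrame shape.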
Benchmark for Different Compression Algorithms
Below is Python code that creates sample pickle files with different compressions and benchmarks the reading times using read_pickle:
import time

import pandas as pd

data = {'A': range(1, 100001), 'B': range(100001, 1, -1)}
df = pd.DataFrame(data)

# Pickle with different compressions
df.to_pickle("dataframe.pkl")  # No compression
df.to_pickle("dataframe_gzip.pkl.gz", compression='gzip')
df.to_pickle("dataframe_bz2.pkl.bz2", compression='bz2')
df.to_pickle("dataframe_xz.pkl.xz", compression='xz')
df.to_pickle("dataframe_zip.pkl.zip", compression='zip')

# Measure load times
files = ["dataframe.pkl", "dataframe_gzip.pkl.gz", "dataframe_bz2.pkl.bz2", "dataframe_xz.pkl.xz", "dataframe_zip.pkl.zip"]
compression_methods = ["No Compression", "gzip", "bz2", "xz", "zip"]

for file, method in zip(files, compression_methods):
    # perf_counter is a monotonic, high-resolution clock suited to timing short operations
    start_time = time.perf_counter()
    _ = pd.read_pickle(file)
    elapsed_time = time.perf_counter() - start_time
    print(f"Reading time with {method}: {elapsed_time:.4f} seconds")
Output:
Reading time with No Compression: 0.0932 seconds
Reading time with gzip: 0.7555 seconds
Reading time with bz2: 4.9183 seconds
Reading time with xz: 2.1486 seconds
Reading time with zip: 0.7317 seconds
As you can see, the uncompressed file loads fastest, and among the compressed formats gzip and zip are the quickest to read.
The slowest to read is bz2. xz falls in between, yet when creating the pickle files it produced the smallest one. Exact timings will vary with your machine and data.
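The size comparison mentioned above can be checked with a sketch like the following. It assumes the same sample data as the benchmark and writes the files to a temporary directory; the sizes you get depend on your data and library versions.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": range(1, 100001), "B": range(100001, 1, -1)})

# Label -> (file name, compression argument passed to to_pickle)
targets = {
    "No Compression": ("dataframe.pkl", None),
    "gzip": ("dataframe.pkl.gz", "gzip"),
    "bz2": ("dataframe.pkl.bz2", "bz2"),
    "xz": ("dataframe.pkl.xz", "xz"),
    "zip": ("dataframe.pkl.zip", "zip"),
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, (filename, compression) in targets.items():
        path = os.path.join(tmp, filename)
        df.to_pickle(path, compression=compression)
        sizes[label] = os.path.getsize(path)  # size in bytes

# Print from smallest to largest file.
for label, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{label}: {size / 1024:.1f} KiB")
```

Reading the sizes alongside the load times helps you decide whether a smaller file is worth the slower read.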
Error Handling and Troubleshooting
One of the common errors when unpickling in Pandas is the unsupported pickle protocol issue.
This arises when the file was written with a pickle protocol newer than the one supported by the Python version reading it. Protocol 5, for example, requires Python 3.8 or later.
The error message might look something like: ValueError: unsupported pickle protocol: 5.
Solutions
Upgrade Python: If the error is due to an older Python version, consider upgrading to a newer one that supports the required protocol.
Re-Pickle with a Lower Protocol: If you have access to the environment where the data was originally pickled, you can re-pickle it specifying a lower protocol. For example:
df.to_pickle("dataframe_lower_protocol.pkl", protocol=4)
Use a Virtual Environment: If you need to maintain multiple Python versions or library versions, consider using tools like venv or conda to create isolated environments.
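Before choosing one of these solutions, it can help to confirm which protocol a file actually uses. The sketch below relies on the fact that pickles written with protocol 2 or later begin with the byte 0x80 followed by the protocol number. The helper name detect_pickle_protocol is hypothetical, and the check only works on uncompressed files. The demonstration writes a plain dictionary with the standard pickle module; a file written by to_pickle without compression starts the same way.

```python
import os
import pickle
import tempfile


def detect_pickle_protocol(path):
    """Return the protocol number declared at the start of an uncompressed
    pickle file, or None if the file has no protocol header."""
    with open(path, "rb") as handle:
        header = handle.read(2)
    # Protocol 2+ pickles start with the PROTO opcode (0x80) and a version byte.
    if len(header) == 2 and header[0] == 0x80:
        return header[1]
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    with open(path, "wb") as handle:
        pickle.dump({"A": [1, 2, 3]}, handle, protocol=4)
    found = detect_pickle_protocol(path)

# Highest protocol the running interpreter can read.
supported = pickle.HIGHEST_PROTOCOL
print(f"File protocol: {found}, highest supported here: {supported}")

if found is not None and found > supported:
    print("This interpreter cannot read the file; upgrade Python or re-pickle it.")
```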
General Troubleshooting Tips
Always check the versions of Python and Pandas when facing such issues. This can be done using:
import sys

import pandas as pd

print(sys.version)
print(pd.__version__)
Resource
https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.