Export Pandas DataFrame to Pickle file using to_pickle() function
The `to_pickle()` function in Pandas lets you serialize (pickle) a DataFrame or Series object to the pickle file format.
This is useful when you want to save the current state of a DataFrame or Series and restore it later without any loss of data or metadata.
Pandas to_pickle() Syntax and Parameters
The syntax of the `to_pickle()` function is as follows:
```python
DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
Series.to_pickle(path, compression='infer', protocol=5, storage_options=None)
```
Here are the parameters:
- path: The file path where the pickled object will be stored. Can be a string representing a file path or a Python file-like object.
- compression (default='infer'): Specifies the on-disk compression to use. Options include 'infer', 'bz2', 'gzip', 'xz', 'zstd', 'zip', and None. If set to 'infer', Pandas determines the compression from the filename extension (like '.gz' for gzip).
- protocol (default=5): The pickling protocol version to use. By default, it uses protocol 5, which was introduced in Python 3.8. You can set it to a lower number for compatibility with older Python versions.
- storage_options (default=None): Extra options for the storage connection, useful when saving to remote or platform-specific storage solutions.
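For example, `storage_options` can pass credentials when writing to remote storage. Here is a minimal sketch, assuming a DataFrame `df` and the optional `s3fs` dependency; the bucket name and credential values are placeholders:
```python
# Hypothetical example: the bucket and credentials are placeholders,
# and writing to s3:// paths requires the optional s3fs package.
df.to_pickle(
    "s3://my-bucket/data.pkl",
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)
```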
Serialize DataFrame to Pickle File
Let’s look at a simple usage of `to_pickle()`:
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df.to_pickle("sample.pkl")
# Output: The file "sample.pkl" is created in the current directory.
```
Here, we created a sample DataFrame `df` and used the `to_pickle()` method to save it as a pickle file named “sample.pkl” in the current directory.
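To verify the round trip, you can read the file back with `read_pickle()`; the restored DataFrame should compare equal to the original:
```python
restored = pd.read_pickle("sample.pkl")
print(restored.equals(df))  # True: data and dtypes survive the round trip
```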
Serialize Series (Column) to Pickle File
First, let’s start by creating a sample Series:
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name="Sample Series")
```
You can use the `to_pickle()` method of the Series to serialize it to a pickle file:
```python
s.to_pickle("sample_series.pkl")
```
This code will generate a file “sample_series.pkl” in the current directory.
You can use the `read_pickle()` function to read the serialized Series back into a Pandas Series object:
```python
loaded_series = pd.read_pickle("sample_series.pkl")
print(loaded_series)
```
Output:
```
0    1
1    2
2    3
3    4
4    5
Name: Sample Series, dtype: int64
```
As you can see, all the attributes of the Series, such as its `name` and `dtype`, are preserved during this process.
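To see why this matters, compare with a text format: a round trip through CSV loses type information that pickle keeps. A small sketch illustrating the difference with a datetime Series:
```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-01-01", "2023-06-15"]), name="dates")

s.to_pickle("dates.pkl")
print(pd.read_pickle("dates.pkl").dtype)  # datetime64[ns]: dtype preserved

s.to_csv("dates.csv")
print(pd.read_csv("dates.csv")["dates"].dtype)  # object: dates come back as strings
```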
Supported Compression Types
The Pandas `to_pickle()` method allows you to choose among various compression types to potentially reduce the file size.
This not only saves storage but can also speed up IO operations when reading and writing data.
The supported compression types are:
1. infer
- This is the default compression type.
- Pandas will determine the compression protocol from the file extension. For instance, if you save your file as “data.gz”, Pandas will infer and use the gzip compression.
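For example, saving with a recognized extension is enough to trigger compression:
```python
# The .gz extension makes Pandas infer gzip compression automatically
df.to_pickle("data.gz")
```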
2. bz2
- Uses the BZ2 compression algorithm.
- Generally offers a good balance between compression ratio and speed.
df.to_pickle("data.bz2", compression="bz2")
3. gzip
- Utilizes the GZIP file format.
- Known for its widespread use and compatibility.
df.to_pickle("data.gz", compression="gzip")
4. xz
- Employs the LZMA algorithm found in the XZ file format.
- Offers high compression ratios.
df.to_pickle("data.xz", compression="xz")
5. zstd
- Stands for “Zstandard”.
- It’s a real-time compression algorithm, providing high compression ratios and maintaining a fast decompression speed.
df.to_pickle("data.zst", compression="zstd")
6. zip
- Uses the ZIP archive format, one of the most widely used compression formats.
```python
df.to_pickle("data.zip", compression="zip")
```
Compression Comparison (Which is the best?)
Let’s generate compressed files from a sample DataFrame to see which is the best compression algorithm.
```python
import pandas as pd
import numpy as np
import os

num_rows = 1000000
num_cols = 100
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)

compression_types = ['bz2', 'gzip', 'xz', 'zip', 'zstd']
compressed_file_paths = []
for compression in compression_types:
    file_path = f"df_{compression}"
    df.to_pickle(file_path, compression=compression)
    compressed_file_paths.append(file_path)

# Report the size of each compressed file
for path in compressed_file_paths:
    print(f"{path}: {os.path.getsize(path) / 1e6:.0f} MB")
```
The output file sizes are:
- bz2: 745 MB
- gzip: 732 MB
- xz: 712 MB
- zip: 732 MB
- zstd: 731 MB
For this dataset, `xz` compression is the best choice because it produces the smallest file.
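Size is only half the story; write speed also differs widely between algorithms. Reusing `df` and `compression_types` from the snippet above, a quick sketch to compare write times:
```python
import time

for compression in compression_types:
    start = time.perf_counter()
    df.to_pickle(f"df_{compression}", compression=compression)
    print(f"{compression}: {time.perf_counter() - start:.1f} s")
```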
to_pickle() Backwards Compatibility
The `protocol` parameter allows you to control which protocol version to use for serialization.
Understanding these protocols is vital when considering backward compatibility, especially if you’re sharing your pickled data across different Python environments.
Supported Protocols
- Protocol version 0: The original “human-readable” protocol, backward compatible with earlier versions of Python. It is a text-based format and less efficient than the newer binary formats.
- Protocol version 1: An older binary format, also compatible with earlier versions of Python.
- Protocol version 2: Introduced in Python 2.3. Provides much more efficient pickling of new-style classes.
- Protocol version 3: Introduced in Python 3.0. Added support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol for Python 3.0–3.7.
- Protocol version 4: Introduced in Python 3.4. Added support for very large objects, pickling more kinds of objects, and some data format optimizations.
- Protocol version 5: Introduced in Python 3.8. Added support for out-of-band buffer data and speedups for large in-band data.
Specifying the Protocol
When saving a DataFrame or Series using `to_pickle()`, the protocol can be specified as follows:
```python
df.to_pickle("data.pkl", protocol=4)
```
In the above example, the DataFrame `df` will be pickled using protocol version 4. By default, `to_pickle()` uses protocol version 5.
If you are working in an environment with multiple Python versions, always double-check and explicitly set the protocol for clarity.
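For instance, you can check which protocols the current interpreter supports with the standard `pickle` module before deciding:
```python
import pickle

print(pickle.HIGHEST_PROTOCOL)  # 5 on Python 3.8 and later
print(pickle.DEFAULT_PROTOCOL)  # the pickle module's own default
```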
Serialize Large DataFrame using different protocols
Let’s start by generating a large DataFrame for demonstration purposes.
After that, we’ll serialize the DataFrame using the `to_pickle()` method with each protocol from 0 to 5.
```python
import pandas as pd
import numpy as np

# Create a large DataFrame with random data
num_rows = 1000000
num_cols = 100
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)

file_paths = []
for protocol in range(6):
    file_path = f"df_protocol_{protocol}.pkl"
    df.to_pickle(file_path, protocol=protocol)
    file_paths.append(file_path)
```
The output file sizes are:
- Protocol 0: 824 MB
- Protocol 1: 1.15 GB
- Protocol 2: 1.15 GB
- Protocol 3: 762 MB
- Protocol 4: 762 MB
- Protocol 5: 762 MB
As you can see, different protocols have different serialization efficiencies, with protocols 3, 4, and 5 being the most efficient for this dataset.
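Note that you never need to pass the protocol when loading: the protocol version is stored in the pickle stream itself, so `read_pickle()` detects it automatically:
```python
# Loading is uniform regardless of which protocol wrote the file
df0 = pd.read_pickle("df_protocol_0.pkl")
df5 = pd.read_pickle("df_protocol_5.pkl")
print(df0.equals(df5))  # True
```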
Limitations
While pickling provides a convenient way to serialize and deserialize data, it does come with its set of limitations, particularly related to data size.
Memory Usage
When you pickle a large DataFrame or when reading a large pickle file, the entire object is loaded into memory.
This means you need enough RAM to hold the entire object, which can be an issue with very large DataFrames.
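A quick way to gauge whether an object will fit comfortably in RAM is to check its in-memory footprint before pickling; loading the resulting file will need roughly that much memory again:
```python
# Approximate in-memory size of the DataFrame, in megabytes
print(df.memory_usage(deep=True).sum() / 1e6)
```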
Pickle File Size
While the file size of a pickled object depends on the protocol version and compression used, pickle files can become large, especially with huge DataFrames.
While pickling is a robust method for serializing and deserializing data in Python, it’s essential to be aware of these limitations and potential workarounds, especially when dealing with large datasets.
To overcome these limitations, you can pickle to a socket (over a network).
Instead of writing the serialized object to a file on disk, you write it directly to a network socket.
Pickling to a Socket
Pickling over a socket allows for real-time transfer of data between a client and a server.
This is an approach I have used in practice to overcome the problem of huge pickle files on disk.
It can be especially valuable in scenarios like streaming analytics, where data is continually generated and needs to be processed immediately. Keep in mind that unpickling data from an untrusted source can execute arbitrary code, so only use this technique between machines you control.
Advantages of pickling to a socket:
- Reduced I/O Overhead: Writing to and reading from disk files involves I/O operations, which can be a performance bottleneck, especially if the data size is large or the storage medium is slow. Transferring data over sockets eliminates this disk I/O overhead.
- Dynamic Interactions: When transmitting data over a network socket, it’s possible to establish dynamic interactions between the sender and receiver. For instance, a client can send data, and the server can process it and immediately send back a response or result.
First, set up a socket connection:
One Python script acts as the server, waiting for incoming connections.
Another script acts as the client, connecting to the server and sending data.
Then, serialize the DataFrame:
Instead of calling `to_pickle()` with a file path, you pass a file-like object created from the socket (via its `makefile()` method).
Let’s walk through a basic example where the server waits for incoming pickled DataFrames and the client sends one.
Server:
```python
import pandas as pd
import socket

# Set up the server socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('localhost', 65432))
    s.listen()
    conn, addr = s.accept()
    with conn:
        print('Connected by', addr)
        # Load the DataFrame directly from the socket connection
        df_received = pd.read_pickle(conn.makefile('rb'))
        print(df_received)
```
Client:
```python
import pandas as pd
import socket

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Connect to the server and send the DataFrame
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(('localhost', 65432))
    # Close the file wrapper explicitly so its buffer is flushed to the socket
    with s.makefile('wb') as f:
        df.to_pickle(f)
```
Output:
```
Connected by ('127.0.0.1', 13630)
   A  B
0  1  a
1  2  b
2  3  c
```
To run the example, start the server script first. Once it’s waiting for connections, run the client script.
The client will serialize the DataFrame and send it over the socket connection, which the server then reads directly into a Pandas DataFrame.
Resources
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.to_pickle.html