Export Pandas DataFrame to Pickle file using to_pickle() function

The to_pickle function in Pandas lets you serialize (pickle) a DataFrame or Series object to the pickle file format.

This is useful when you want to save the current state of a DataFrame or Series and retrieve it later without any loss of data or metadata.


Pandas to_pickle Syntax and Parameters

The syntax of the to_pickle function is as follows:

DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
Series.to_pickle(path, compression='infer', protocol=5, storage_options=None)

Here are the parameters:

  • path: The file path where the pickled object will be stored. Can be a string representing a file path or a Python file-like object.
  • compression (default='infer'): Specifies the on-disk compression to use. Options include 'infer', 'bz2', 'gzip', 'xz', 'zstd', 'zip', and None. If set to 'infer', Pandas determines the compression from the filename extension (like '.gz' for gzip).
  • protocol (default=5): The pickling protocol version to use. Protocol 5 was introduced in Python 3.8; set a lower number for compatibility with older Python versions.
  • storage_options (default=None): Extra options for storage connection, useful when saving to some remote or platform-specific storage solutions.
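
For example, a call combining these parameters might look like this (the filename here is just for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# Compression is inferred from the ".gz" extension; protocol 4 keeps the
# file readable on Python 3.4 and later
df.to_pickle("data.pkl.gz", compression="infer", protocol=4)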


Serialize DataFrame to Pickle File

Let’s look at a simple usage of to_pickle.

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df.to_pickle("sample.pkl")

# Output: The file "sample.pkl" is created in the current directory.

Here, we created a sample DataFrame df and used the to_pickle method to save it as a pickle file named “sample.pkl” in the current directory.
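
You can load the file back with read_pickle to confirm the round trip:

loaded_df = pd.read_pickle("sample.pkl")
print(loaded_df)

The loaded DataFrame matches the original, including the index and column dtypes.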


Serialize Series (Column) to Pickle File

First, let’s start by creating a sample Series:

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], name="Sample Series")

You can use the to_pickle method of the Series to serialize it to a pickle file.

s.to_pickle("sample_series.pkl")

This code will generate a file “sample_series.pkl” in the current directory.

You can use the read_pickle function to read the serialized Series back into a Pandas Series object:

loaded_series = pd.read_pickle("sample_series.pkl")
print(loaded_series)

Output:

0    1
1    2
2    3
3    4
4    5
Name: Sample Series, dtype: int64

As you can see, all the attributes of the Series, such as the name and dtype, are preserved during this process.
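
If you want to check the round trip programmatically, Pandas ships a testing helper for exactly this purpose:

pd.testing.assert_series_equal(s, loaded_series)  # raises AssertionError on any mismatch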


Supported Compression Types

The Pandas to_pickle method lets you choose among several compression types to potentially reduce the file size.

This not only saves storage but can also speed up IO operations when reading and writing data.

The supported compression types are:

1. infer

  • This is the default compression type.
  • Pandas will determine the compression protocol from the file extension. For instance, if you save your file as “data.gz”, Pandas will infer and use the gzip compression.
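For example:
df.to_pickle("data.gz")  # Pandas infers gzip from the ".gz" extension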

2. bz2

  • Uses the BZ2 compression algorithm.
  • Generally offers a good balance between compression ratio and speed.
df.to_pickle("data.bz2", compression="bz2")

3. gzip

  • Utilizes the GZIP file format.
  • Known for its widespread use and compatibility.
df.to_pickle("data.gz", compression="gzip")

4. xz

  • Employs the LZMA algorithm found in the XZ file format.
  • Offers high compression ratios.
df.to_pickle("data.xz", compression="xz")

5. zstd

  • Stands for “Zstandard”.
  • It’s a real-time compression algorithm, providing high compression ratios and maintaining a fast decompression speed.
df.to_pickle("data.zst", compression="zstd")

6. zip

  • One of the most widely used archive formats, with near-universal tool support.

df.to_pickle("data.zip", compression="zip")

Compression Comparison (Which is the best?)

Let’s generate compressed files from a sample DataFrame to see which is the best compression algorithm.

import pandas as pd
import numpy as np
import os

# Create a large DataFrame with random data
num_rows = 1000000
num_cols = 100
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)

# Write one pickle per compression type; the filenames have no extension,
# so the compression is passed explicitly instead of relying on 'infer'
compression_types = ['bz2', 'gzip', 'xz', 'zip', 'zstd']
compressed_file_paths = []
for compression in compression_types:
    file_path = f"df_{compression}"
    df.to_pickle(file_path, compression=compression)
    compressed_file_paths.append(file_path)
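
The script imports os so that we can check the resulting file sizes; here is a small sketch continuing the script above:

# Print each file's size in megabytes
for file_path in compressed_file_paths:
    size_mb = os.path.getsize(file_path) / 1024 ** 2
    print(f"{file_path}: {size_mb:.0f} MB")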

The output file sizes are:

bz2: 745 MB
gzip: 732 MB
xz: 712 MB
zip: 732 MB
zstd: 731 MB

In this test, xz comes out best, producing the smallest file. Note that the margins are narrow because random data compresses poorly.
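
When reading these files back, pass the compression explicitly, since the extension-less filenames above give 'infer' nothing to work with:

df_loaded = pd.read_pickle("df_xz", compression="xz")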


to_pickle() Backwards Compatibility

The protocol parameter allows you to control which protocol version to use for serialization.

Understanding these protocols is vital when considering backward compatibility, especially if you’re sharing your pickled data across different Python environments.

Supported Protocols

  • Protocol version 0: The original “human-readable” protocol, backward compatible with earlier versions of Python.
    It is a text-based format and less efficient than the newer binary formats.
  • Protocol version 1: An older binary format.
  • Protocol version 2: Introduced in Python 2.3. Provides much more efficient pickling of new-style classes.
  • Protocol version 3: Introduced in Python 3.0. Added support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol for Python 3.0-3.7.
  • Protocol version 4: Introduced in Python 3.4. Added support for very large objects, pickling more kinds of objects, and some data format optimizations.
  • Protocol version 5: Introduced in Python 3.8. Added support for out-of-band buffer data (PEP 574), improving the handling of large objects such as bytes and memoryview.

Specifying the Protocol

When saving a DataFrame or Series using to_pickle, the protocol can be defined as follows:

df.to_pickle("data.pkl", protocol=4)

In the above example, the DataFrame df will be pickled using protocol version 4.

By default, to_pickle uses protocol version 5.

If you are working in an environment with multiple Python versions, always double-check and explicitly set the protocol for clarity.
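
A quick way to check what your interpreter supports, using the standard library pickle module:

import pickle

print(pickle.DEFAULT_PROTOCOL)   # default protocol for this Python version (5 on 3.8+)
print(pickle.HIGHEST_PROTOCOL)   # highest protocol this interpreter can use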


Serialize Large DataFrame Using Different Protocols

Let’s start by generating a large DataFrame for demonstration purposes.

After that, we’ll serialize the DataFrame using the to_pickle method with different protocols ranging from 0 to 5.

import pandas as pd
import numpy as np

# Create a large dataframe with random data
num_rows = 1000000
num_cols = 100
data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)

file_paths = []
for protocol in range(6):
    file_path = f"df_protocol_{protocol}.pkl"
    df.to_pickle(file_path, protocol=protocol)
    file_paths.append(file_path)

The output file sizes are:

Protocol 0: 824 MB
Protocol 1: 1.15 GB
Protocol 2: 1.15 GB
Protocol 3: 762 MB
Protocol 4: 762 MB
Protocol 5: 762 MB

As you can see, different protocols have different serialization efficiencies, with protocols 3, 4, and 5 being the most efficient for this dataset.
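
Also note that a pickle file records the protocol it was written with, so read_pickle needs no protocol argument:

df_loaded = pd.read_pickle("df_protocol_4.pkl")  # protocol detected automatically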


Limitations

While pickling provides a convenient way to serialize and deserialize data, it comes with limitations, particularly around data size.

Memory Usage

When you pickle a large DataFrame or when reading a large pickle file, the entire object is loaded into memory.

This means you need enough RAM to hold the entire object, which can be an issue with very large DataFrames.

Pickle File Size

While the file size of a pickled object depends on the protocol version and compression used, pickle files can become large, especially with huge DataFrames.

While pickling is a robust method for serializing and deserializing data in Python, it’s essential to be aware of these limitations and potential workarounds, especially when dealing with large datasets.

One workaround for the file-size problem is to pickle to a socket over the network.

Instead of writing the serialized object to a file on disk, you write it directly to a network socket.


Pickling to a Socket

Pickling over a socket allows for real-time transfer of data between a client and a server.

This is a pattern I have used in practice to get around the huge-pickle-file problem.

This can be especially valuable in scenarios like streaming analytics, where data is continually generated and needs to be processed immediately.

Advantages of pickling to a socket:

  • Reduced I/O Overhead: Writing to and reading from disk files involves I/O operations, which can be a performance bottleneck, especially if the data size is large or the storage medium is slow. Transferring data over sockets eliminates this disk I/O overhead.
  • Dynamic Interactions: When transmitting data over a network socket, it’s possible to establish dynamic interactions between the sender and receiver.
    For instance, a client can send data, and the server can process it and immediately send back a response or result.


First, set up a socket connection:

One Python script acts as the server, waiting for incoming connections.

Another script acts as the client, connecting to the server and sending data.

Then, serialize the DataFrame:

Instead of using to_pickle with a file path, you pass the socket’s file handle to the method.

Let’s walk through a basic example where the server waits for incoming pickled DataFrames and the client sends one.

Server:

import pandas as pd
import socket

# Set up the server socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('localhost', 65432))
    s.listen()
    conn, addr = s.accept()

    with conn:
        print('Connected by', addr)
        # Load the DataFrame directly from the socket connection
        with conn.makefile('rb') as f:
            df_received = pd.read_pickle(f)

print(df_received)

Client:

import pandas as pd
import socket

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Connect to the server and send the DataFrame; wrapping the socket in a
# file object ensures the buffered pickle bytes are flushed on close
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(('localhost', 65432))
    with s.makefile('wb') as f:
        df.to_pickle(f)

Output:

Connected by ('127.0.0.1', 13630)
   A  B
0  1  a
1  2  b
2  3  c

To run the example, start the server script first. Once it’s waiting for connections, run the client script.

The client will serialize the DataFrame and send it over the socket connection, which the server then reads directly into a Pandas DataFrame.


Resources

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.to_pickle.html
