Export Python Pandas DataFrame to CSV file using to_csv

The to_csv function allows you to write data from a Pandas DataFrame to a CSV file.

This is incredibly useful when you need to save, share, or export your data for further use or analysis in a different environment or application.

It comes with several parameters to help you customize the process according to your needs.
For example, you can choose the delimiter to separate your data, decide whether to include headers and indices, select which columns to include, and more.

In this tutorial, we’ll dig into the details of these parameters and how to use them effectively.

Basic Syntax and Parameters

The basic syntax for using the to_csv function in Pandas is as follows:

DataFrame.to_csv(path_or_buf=None, sep=',', index=True, header=True, columns=None, mode='w', compression='infer', encoding=None, decimal='.', quotechar='"', lineterminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None)

Let’s discuss the key parameters in detail:

  • path_or_buf: This parameter is used to specify the file path or object where the data should be written. If none is specified, the output is returned as a string.
  • sep: This is the delimiter to use between fields in the resulting output. The default value is a comma (‘,’).
  • index: This is a boolean value used to write row names (index). If True, it writes the index as the first column in the CSV file. The default is True.
  • header: This is used to write out the column names. If True, it writes the column names in the CSV file. The default is True.
  • columns: This parameter allows you to specify which columns to write. Pass a list of column names you wish to include in the output.
  • mode: This is used to specify how the file is opened. The default is ‘w’, which stands for write mode.
  • compression: This specifies the compression mode for the output CSV file. If ‘infer’, then it uses gzip, bz2, zip, or xz if path_or_buf is a string ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ respectively.
  • encoding: This specifies the text encoding to use when writing the file.
  • decimal: This is a string indicating the character recognized as a decimal separator. The default is ‘.’.
  • quotechar: This is a one-character string used to quote fields containing special characters, such as the delimiter or the quotechar itself, or fields that contain new-line characters. The default is ‘"’.
  • lineterminator: This is the character that is used to terminate lines produced in the output.
  • chunksize: If specified, Pandas will write the output in chunks of this size.
  • date_format: This is used to specify the date format for any date data in the file.
  • doublequote: This controls how quotechar characters inside fields are handled. If True, the character is doubled. If False, the escapechar is used as a prefix to the quotechar; if no escapechar is set either, an error is raised when a quotechar appears in a field.
  • escapechar: This specifies a one-character string used to escape the delimiter when quoting is set to csv.QUOTE_NONE, or to escape the quotechar when doublequote is False.
These parameters control all aspects of the CSV writing process and allow you to manage your output in detail.
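
As noted for path_or_buf above, omitting the path makes to_csv return the CSV output as a string instead of writing a file. A minimal sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# With no path argument, to_csv returns the CSV output as a string
csv_string = df.to_csv(index=False)
print(csv_string)
```

This is handy for quick inspection, logging, or passing CSV text to another API without touching the filesystem.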


How to use a custom delimiter?

You can use the sep parameter to specify a different character as the field separator:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data.csv', sep=';')

The content of data.csv:

,Name,Age,Country
0,John,28,USA
1,Anna,24,UK
2,Peter,22,Canada

In the above code, we first created a DataFrame. We then used the to_csv function to write the DataFrame to a CSV file, but we specified sep=';' to use the semicolon as the field separator instead of the default comma.
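
To load a semicolon-separated file back into a DataFrame, pass the same sep value to pd.read_csv. A small round-trip sketch (writing without the index for simplicity):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22]})
df.to_csv('data.csv', sep=';', index=False)

# Pass the same delimiter to read_csv so the file is parsed correctly
df_back = pd.read_csv('data.csv', sep=';')
```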


How to remove the index column?

By default, the to_csv function in Pandas includes the DataFrame’s index as the first column in the output CSV file.

If you want to write your DataFrame to a CSV file without the index, you can do this by setting the index parameter to False. Here is how:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_no_index.csv', index=False)

The content of data_no_index.csv:

Name,Age,Country
John,28,USA
Anna,24,UK
Peter,22,Canada

In the code above, we created a DataFrame and then wrote it to a CSV file, setting index=False to exclude the index from the output.

As a result, the CSV file starts directly with the column names and the corresponding data, and the index column has been removed.


Customizing Column Headers

When exporting your DataFrame to a CSV file, you may want to change the column headers.

You can do this by passing a list of new column names to the header parameter of the to_csv function:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
new_headers = ['First Name', 'Age in Years', 'Country of Origin']
df.to_csv('data_custom_headers.csv', header=new_headers)

The content of data_custom_headers.csv:

,First Name,Age in Years,Country of Origin
0,John,28,USA
1,Anna,24,UK
2,Peter,22,Canada

In this code, we first created a DataFrame, then a list of new column headers. We then exported the DataFrame to a CSV file using the to_csv function, passing the new headers list to the header parameter.
This method allows you to customize the column headers in your CSV file, giving you the flexibility to make the data more understandable and relevant to your needs.


How to write specific columns?

Sometimes, you may only want to export certain columns from your DataFrame to a CSV file. Pandas makes this easy with the columns parameter of the to_csv function.

You can pass a list of the column names that you want to include in the output. Here’s an example:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada'],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
df.to_csv('data_specific_columns.csv', columns=['Name', 'Country'])

The content of data_specific_columns.csv:

,Name,Country
0,John,USA
1,Anna,UK
2,Peter,Canada

In the code above, we created a DataFrame with four columns: ‘Name’, ‘Age’, ‘Country’, and ‘Salary’. When exporting the DataFrame to a CSV file, we only included the ‘Name’ and ‘Country’ columns.

We accomplished this by passing a list of these column names to the columns parameter of the to_csv function.


Controlling CSV File Size with Compression

The compression parameter allows you to specify a compression method for the output CSV file. Here is an example of how you can use it:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'] * 10000,
        'Age': [28, 24, 22] * 10000,
        'Country': ['USA', 'UK', 'Canada'] * 10000}
df = pd.DataFrame(data)
df.to_csv('data_compressed.zip', compression='zip')

In this code, we first created a large DataFrame. Then, we exported it to a compressed CSV file using the to_csv function, specifying ‘zip’ as the compression method with the compression parameter.
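
Because the default compression='infer' keys off the file extension, you can also just give the path a recognized suffix such as .gz and skip the parameter entirely; pd.read_csv infers the compression the same way when loading the file back. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22]})

# compression='infer' (the default) picks gzip from the .gz suffix
df.to_csv('data_compressed.csv.gz', index=False)

# read_csv infers the compression from the suffix as well
df_back = pd.read_csv('data_compressed.csv.gz')
```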


How to handle encoding issues?

The encoding parameter allows you to specify the encoding while writing the CSV file.
Here’s an example of how to use it:

import pandas as pd
data = {'Name': ['Jöhn', 'Ánna', 'Péter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_utf8.csv', encoding='utf-8')

The content of data_utf8.csv:

,Name,Age,Country
0,Jöhn,28,USA
1,Ánna,24,UK
2,Péter,22,Canada

In the code above, our DataFrame contains special characters in the ‘Name’ column.

When writing the DataFrame to a CSV file, we specified encoding='utf-8' to ensure that these special characters are handled correctly.
In general, ‘utf-8’ is a safe choice for encoding as it can handle any character in the Unicode standard. However, there are many different types of encodings that you could use, depending on your specific requirements.
Remember to always specify the encoding when reading the CSV file back into a DataFrame with pd.read_csv, using the same encoding you used to write the file.
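
For example, reading the file back with the same encoding preserves the accented characters:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jöhn', 'Ánna', 'Péter'],
                   'Age': [28, 24, 22]})
df.to_csv('data_utf8.csv', index=False, encoding='utf-8')

# Use the same encoding when reading the file back
df_back = pd.read_csv('data_utf8.csv', encoding='utf-8')
```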


How to use different decimal separators?

The decimal parameter allows you to set the decimal separator when writing CSV files using the to_csv function.

Here’s an example of how to use the decimal parameter:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Score': [75.5, 80.6, 90.1]}
df = pd.DataFrame(data)
df.to_csv('data_decimal.csv', sep=';', decimal=',')

The content of data_decimal.csv:

;Name;Score
0;John;75,5
1;Anna;80,6
2;Peter;90,1

In the above code, our DataFrame contains numeric values with decimal points in the ‘Score’ column. When writing the DataFrame to a CSV file, we specified decimal=',' to use the comma as the decimal separator.

Also, we have used a semicolon as the field separator (sep=';') to avoid confusion with the decimal comma.
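
To read such a file back correctly, pass the same sep and decimal values to pd.read_csv; otherwise the scores would come back as strings. A minimal round-trip sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Score': [75.5, 80.6]})
df.to_csv('data_decimal.csv', sep=';', decimal=',', index=False)

# Pass the same separator and decimal character when reading back
df_back = pd.read_csv('data_decimal.csv', sep=';', decimal=',')
```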


Exporting with Quotes and Escape Characters

In some cases, your data may contain special characters that need to be handled carefully when writing to a CSV file.

These could include the delimiter character, quotes, or escape characters within the data values.

The to_csv function in pandas provides several parameters to control how these are handled: sep, quotechar, and escapechar.
Here is an example of how you can use these parameters:

import pandas as pd
data = {'Name': ['John, Jr.', 'Anna "The Queen"', 'Peter|The Great'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_special_chars.csv', sep='|', quotechar='"', escapechar='\\', index=False)

The content of data_special_chars.csv:

Name|Age|Country
John, Jr.|28|USA
"Anna ""The Queen"""|24|UK
"Peter|The Great"|22|Canada

In the above code, our DataFrame contains values with commas, quotes, and pipe characters. We used the pipe character (‘|’) as the field separator with sep='|'. Since the comma is no longer the delimiter, ‘John, Jr.’ does not need quoting at all.

With the default quoting behavior, fields that contain the delimiter or the quote character are wrapped in the quotechar ('"'), and embedded quotes are doubled because doublequote defaults to True. The escapechar='\\' we passed is not actually used here; it only comes into play when quoting is set to csv.QUOTE_NONE or doublequote is False.
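
If you want escaping instead of quoting, you can disable quoting entirely with quoting=csv.QUOTE_NONE, in which case escapechar is required to escape any delimiter that appears inside a field. A sketch of that variant:

```python
import csv
import pandas as pd

df = pd.DataFrame({'Name': ['Peter|The Great'], 'Age': [22]})

# With QUOTE_NONE, fields are never quoted, so the delimiter
# inside a value is escaped with escapechar instead
out = df.to_csv(sep='|', quoting=csv.QUOTE_NONE, escapechar='\\', index=False)
print(out)
```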


How to write CSV files with different line terminators?

The lineterminator parameter allows you to specify how each line is terminated. Let’s say we want an empty line between rows, which means using two newline characters ('\n\n'):

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_line_terminator.csv', lineterminator='\n\n')

The content of data_line_terminator.csv (visualized as a raw string):

,Name,Age,Country

0,John,28,USA

1,Anna,24,UK

2,Peter,22,Canada

In this code, we’ve set lineterminator='\n\n', which adds an empty line after each row when writing the DataFrame to the CSV file.


Working with Dates and Times

The date_format parameter allows you to control the format of the date columns the way you want.

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Date': pd.date_range(start='1/1/2022', periods=3, freq='D')}
df = pd.DataFrame(data)
df.to_csv('data_date_format.csv', date_format='%Y-%m-%d %H:%M:%S')

The content of data_date_format.csv:

,Name,Date
0,John,2022-01-01 00:00:00
1,Anna,2022-01-02 00:00:00
2,Peter,2022-01-03 00:00:00

In the above code, our DataFrame contains dates in the ‘Date’ column.

When writing the DataFrame to a CSV file, we specified date_format='%Y-%m-%d %H:%M:%S'. This formats the dates in a year-month-day hour:minute:second format.
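
When loading such a file back, you can ask read_csv to parse the column into datetimes again with parse_dates. A small round-trip sketch (writing dates only, without the time component):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Date': pd.date_range(start='1/1/2022', periods=3, freq='D')})
df.to_csv('data_date_format.csv', index=False, date_format='%Y-%m-%d')

# parse_dates converts the column back to datetime64 on read
df_back = pd.read_csv('data_date_format.csv', parse_dates=['Date'])
```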


Exporting Large Datasets

The chunksize parameter allows you to specify the number of rows to be written at a time.
Here’s an example of how you can use the chunksize parameter:

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'] * 100000,
        'Age': [28, 24, 22] * 100000,
        'Country': ['USA', 'UK', 'Canada'] * 100000}
df = pd.DataFrame(data)
df.to_csv('large_data.csv', chunksize=1000)

In the above code, we’ve created a large DataFrame. When writing the DataFrame to a CSV file, we specified chunksize=1000 to write 1000 rows at a time.

This means Pandas will write the DataFrame to the file in chunks of 1000 rows, helping to keep memory usage under control.


Multiple Pandas DataFrames to a Single CSV File

If you want to append a DataFrame to an existing CSV file or write multiple DataFrames to a CSV file, you can open it in append mode by setting the mode parameter to mode='a'.

By default, mode='w', which means the file is opened for writing, and any existing content is overwritten.

Here’s an example of how you can append multiple DataFrames to a single CSV file:

import pandas as pd
data1 = {'Name': ['John', 'Anna', 'Peter'],
         'Age': [28, 24, 22]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Michael', 'Sarah', 'Jessica'],
         'Age': [30, 27, 25]}
df2 = pd.DataFrame(data2)

df1.to_csv('data_append.csv', index=False)
df2.to_csv('data_append.csv', mode='a', header=False, index=False)

The content of data_append.csv:

Name,Age
John,28
Anna,24
Peter,22
Michael,30
Sarah,27
Jessica,25

In the above code, we first wrote df1 to a CSV file. We then appended df2 to the same file with mode='a'. We also set header=False when appending df2 to prevent writing the column headers again.


How to Write CSV Files to a Remote Location

In recent versions of Pandas, the to_csv() function can write directly to remote locations such as 's3://bucket/key.csv' when an fsspec-compatible filesystem library is installed (s3fs for Amazon S3, gcsfs for Google Cloud Storage, adlfs for Azure Blob Storage). Alternatively, you can use the cloud provider’s own SDK, such as boto3 for Amazon S3.
Here is an example of how you can write a DataFrame to a CSV file in an Amazon S3 bucket using boto3:

import pandas as pd
import boto3
from io import StringIO
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22]}
df = pd.DataFrame(data)
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.resource('s3', aws_access_key_id='YOUR_ACCESS_KEY',
                    aws_secret_access_key='YOUR_SECRET_KEY')

# write dataframe to a csv file in the s3 bucket
s3.Object('your-bucket-name', 'df_s3.csv').put(Body=csv_buffer.getvalue())

In the above code, we first convert our DataFrame to a CSV format string using the to_csv() method and an instance of StringIO as the file buffer.

We then create an S3 resource object and use the put() method to write the string to a file in our S3 bucket.
Remember to replace 'YOUR_ACCESS_KEY' and 'YOUR_SECRET_KEY' with your actual AWS access key and secret key, and 'your-bucket-name' with the name of your S3 bucket.


The Wider Compatibility of to_csv

At the telecom company I was working for, data was the heart of our operation – every connection, call, text, or data transfer generated a data point.

With over 60 million subscribers, each generating multiple events per day, you can imagine the sheer size of the datasets we were dealing with.

There was a time when we were asked to create a comprehensive report on customer usage trends. This task was enormous, as it involved wrangling and analyzing a dataset with over a billion records. However, the data analysis was just one part of the challenge.

The bigger issue was how to effectively share these findings with the rest of the team, particularly non-technical managers who needed to make strategic decisions based on this data.

I had cleaned and prepared the dataset in a Pandas DataFrame, comprising over a billion rows and about a dozen columns, from call timestamps to data usage. But when it came time to export this data, I had a decision to make: which format should I choose?

Excel was an option, but given the dataset’s size, it wasn’t feasible. Excel has a limit of about 1 million rows per sheet and our dataset was over a thousand times larger than that.

JSON was another option. However, JSON would have resulted in a complicated nested structure that would be difficult for the managers to parse and understand using their familiar tools.

That’s where Pandas to_csv came into play. Our managers were used to dealing with CSV files. They had a suite of tools, some homegrown, some off-the-shelf like Tableau, which they used for slicing, dicing, and visualizing the data. However, these tools only accepted CSV inputs.

So, I chose to use the to_csv function to export the dataframe. This function handled the size of the data gracefully and allowed me to export the file in chunks. In fact, the CSV file I exported ended up being about 16GB in size.

The choice of CSV proved to be a lifesaver. The managers were able to directly import this file into their tools and extract meaningful insights.

The ease of use and compatibility of CSV files led to a 50% reduction in the time spent pre-processing the data and a 30% reduction in queries and confusion about data format and structure.

Using Pandas to_csv ensured that the insights we derived from our data could be easily understood, utilized, and actioned upon by everyone in the company, even those without a technical background. It was a powerful demonstration of how the right choice of data format can make a world of difference.

 

Further Reading

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

