Export Python Pandas DataFrame to CSV file using to_csv
The to_csv function allows you to write data from a Pandas DataFrame to a CSV file.
This is incredibly useful when you need to save, share, or export your data for further use or analysis in a different environment or application.
It comes with several parameters to help you customize the process according to your needs.
For example, you can choose the delimiter to separate your data, decide whether to include headers and indices, select which columns to include, and more.
In this tutorial, we’ll dig deeper into the details of these parameters and how to effectively use them.
- 1 Basic Syntax and Parameters
- 2 How to use a custom delimiter?
- 3 How to remove the index column?
- 4 Customizing Column Headers
- 5 How to write specific columns?
- 6 Controlling CSV File Size with Compression
- 7 How to handle encoding issues?
- 8 How to use different decimal separators?
- 9 Exporting with Quotes and Escape Characters
- 10 How to write CSV files with different line terminators?
- 11 Working with Dates and Times
- 12 Exporting Large Datasets
- 13 Multiple Pandas DataFrames to a Single CSV File
- 14 How to Write CSV Files to a Remote Location
- 15 The Wider Compatibility of to_csv
- 16 Further Reading
Basic Syntax and Parameters
The basic syntax for using the to_csv function in Pandas is as follows:
DataFrame.to_csv(path_or_buf=None, sep=',', index=True, header=True, columns=None, mode='w', compression='infer', encoding=None, decimal='.', quotechar='"', lineterminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None)
Let’s discuss the key parameters in detail:
- path_or_buf: Specifies the file path or object where the data should be written. If None, the output is returned as a string.
- sep: The delimiter to use between fields in the resulting output. The default is a comma (',').
- index: A boolean that controls whether row names (the index) are written. If True, the index is written as the first column in the CSV file. The default is True.
- header: Controls whether the column names are written. If True, the column names are written in the CSV file. The default is True.
- columns: Lets you specify which columns to write. Pass a list of the column names you wish to include in the output.
- mode: Specifies how the file is opened. The default is 'w', which stands for write mode.
- compression: Specifies the compression mode for the output CSV file. If 'infer', gzip, bz2, zip, or xz is used when path_or_buf is a string ending in '.gz', '.bz2', '.zip', or '.xz' respectively.
- encoding: Specifies the encoding to use when writing the file.
- decimal: A string indicating the character recognized as the decimal separator. The default is '.'.
- quotechar: A one-character string used to quote fields that contain special characters, such as the delimiter or the quotechar itself, or fields that contain new-line characters. The default is '"'.
- lineterminator: The character or character sequence used to terminate lines in the output.
- chunksize: If specified, Pandas writes the output in chunks of this many rows.
- date_format: The format string to apply to any datetime values in the file.
- doublequote: Controls how occurrences of quotechar inside a field are handled. If True, the character is doubled; if False, escapechar must be set and is used as a prefix to the quotechar.
- escapechar: A one-character string used to escape the delimiter when quoting is set to csv.QUOTE_NONE, and the quotechar when doublequote is False.
These parameters control all aspects of the CSV writing process and allow you to manage your output in detail.
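One detail worth calling out before the examples: when path_or_buf is left as None (the default), to_csv doesn't write a file at all. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# With path_or_buf=None (the default), to_csv returns the CSV text
# as a plain string instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

This string form is handy when you want to pass CSV data to another API or buffer without touching the filesystem.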
How to use a custom delimiter?
You can use the sep parameter to specify a different character as the field separator:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data.csv', sep=';')
The content of data.csv:
;Name;Age;Country
0;John;28;USA
1;Anna;24;UK
2;Peter;22;Canada
In the above code, we first created a DataFrame. We then used the to_csv function to write it to a CSV file, specifying sep=';' to use the semicolon as the field separator instead of the default comma.
How to remove the index column?
By default, the to_csv function in Pandas includes the DataFrame's index as the first column in the output CSV file. If you want to write your DataFrame to a CSV file without the index, you can set the index parameter to False. Here is how:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_no_index.csv', index=False)
The content of data_no_index.csv:
Name,Age,Country
John,28,USA
Anna,24,UK
Peter,22,Canada
In the code above, we created a DataFrame and then wrote it to a CSV file, setting index=False to exclude the index from the output.
As a result, the CSV file starts directly with the column names and the corresponding data, and the index column has been removed.
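Conversely, if you do keep the index in the file, you can restore it when reading the data back. A small sketch (the file name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 22]})

# Write with the default index=True: the index becomes the first,
# unnamed column of the file
df.to_csv('data_with_index.csv')

# index_col=0 tells read_csv to treat that first column as the
# index again instead of as ordinary data
restored = pd.read_csv('data_with_index.csv', index_col=0)
print(restored.index.tolist())
```

Without index_col=0, the old index would come back as an extra "Unnamed: 0" data column.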
Customizing Column Headers
When exporting your DataFrame to a CSV file, you may want to change the column headers.
You can do this by passing a list of new column names to the header parameter of the to_csv function:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
new_headers = ['First Name', 'Age in Years', 'Country of Origin']
df.to_csv('data_custom_headers.csv', header=new_headers)
The content of data_custom_headers.csv:
,First Name,Age in Years,Country of Origin
0,John,28,USA
1,Anna,24,UK
2,Peter,22,Canada
In this code, we first created a DataFrame, then a list of new column headers. We then exported the DataFrame to a CSV file using the to_csv function, passing the new headers list to the header parameter.
This method allows you to customize the column headers in your CSV file, giving you the flexibility to make the data more understandable and relevant to your needs.
How to write specific columns?
Sometimes, you may only want to export certain columns from your DataFrame to a CSV file. Pandas makes this easy with the columns parameter of the to_csv function.
You can pass a list of the column names that you want to include in the output. Here’s an example:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada'],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
df.to_csv('data_specific_columns.csv', columns=['Name', 'Country'])
The content of data_specific_columns.csv:
,Name,Country
0,John,USA
1,Anna,UK
2,Peter,Canada
In the code above, we created a DataFrame with four columns: ‘Name’, ‘Age’, ‘Country’, and ‘Salary’. When exporting the DataFrame to a CSV file, we only included the ‘Name’ and ‘Country’ columns.
We accomplished this by passing a list of these column names to the columns parameter of the to_csv function.
Controlling CSV File Size with Compression
The compression parameter allows you to specify a compression method for the output CSV file. Here is an example of how you can use it:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'] * 10000,
        'Age': [28, 24, 22] * 10000,
        'Country': ['USA', 'UK', 'Canada'] * 10000}
df = pd.DataFrame(data)
df.to_csv('data_compressed.zip', compression='zip')
In this code, we first created a large DataFrame. Then, we exported it to a compressed CSV file using the to_csv function, specifying 'zip' as the compression method with the compression parameter.
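Because compression defaults to 'infer', you usually don't even need to name the codec: the file extension is enough. A short sketch of the round trip (the '.gz' file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 22]})

# With compression='infer' (the default), the '.gz' extension alone
# is enough to produce a gzip-compressed CSV
df.to_csv('data_compressed.csv.gz', index=False)

# read_csv infers the compression from the extension the same way
restored = pd.read_csv('data_compressed.csv.gz')
print(restored.shape)
```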
How to handle encoding issues?
The encoding parameter allows you to specify the encoding to use while writing the CSV file.
Here’s an example of how to use it:
import pandas as pd

data = {'Name': ['Jöhn', 'Ánna', 'Péter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_utf8.csv', encoding='utf-8')
The content of data_utf8.csv:
,Name,Age,Country
0,Jöhn,28,USA
1,Ánna,24,UK
2,Péter,22,Canada
In the code above, our DataFrame contains special characters in the ‘Name’ column.
When writing the DataFrame to a CSV file, we specified encoding='utf-8' to ensure that these special characters are handled correctly.
In general, ‘utf-8’ is a safe choice for encoding as it can handle any character in the Unicode standard. However, there are many different types of encodings that you could use, depending on your specific requirements.
Remember to always specify the encoding when reading the CSV file back into a DataFrame with pd.read_csv, using the same encoding you used to write the file.
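A quick sketch of that round trip (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jöhn', 'Ánna', 'Péter'], 'Age': [28, 24, 22]})
df.to_csv('data_utf8_roundtrip.csv', index=False, encoding='utf-8')

# Reading the file back with the same encoding preserves the
# accented characters; a mismatched encoding would mangle them
restored = pd.read_csv('data_utf8_roundtrip.csv', encoding='utf-8')
print(restored['Name'].tolist())
```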
How to use different decimal separators?
The decimal parameter allows you to set the decimal separator when writing CSV files with the to_csv function.
Here's an example of how to use the decimal parameter:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Score': [75.5, 80.6, 90.1]}
df = pd.DataFrame(data)
df.to_csv('data_decimal.csv', sep=';', decimal=',')
The content of data_decimal.csv:
;Name;Score
0;John;75,5
1;Anna;80,6
2;Peter;90,1
In the above code, our DataFrame contains numeric values with decimal points in the 'Score' column. When writing the DataFrame to a CSV file, we specified decimal=',' to use the comma as the decimal separator.
Also, we used a semicolon as the field separator (sep=';') to avoid confusion with the decimal comma.
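The same settings matter when reading the file back: read_csv also accepts sep and decimal, and without them the scores would come back as strings like '75,5'. A small round-trip sketch (file name illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Score': [75.5, 80.6]})
df.to_csv('data_decimal_roundtrip.csv', sep=';', decimal=',', index=False)

# read_csv needs the same sep and decimal settings, otherwise the
# scores would be parsed as strings instead of floats
restored = pd.read_csv('data_decimal_roundtrip.csv', sep=';', decimal=',')
print(restored['Score'].tolist())
```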
Exporting with Quotes and Escape Characters
In some cases, your data may contain special characters that need to be handled carefully when writing to a CSV file.
These could include the delimiter character, quotes, or escape characters within the data values.
The to_csv function in Pandas provides several parameters to control how these are handled: sep, quotechar, and escapechar.
Here is an example of how you can use these parameters:
import pandas as pd

data = {'Name': ['John, Jr.', 'Anna "The Queen"', 'Peter|The Great'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_special_chars.csv', sep='|', quotechar='"', escapechar='\\', index=False)
The content of data_special_chars.csv:
Name|Age|Country
John, Jr.|28|USA
"Anna ""The Queen"""|24|UK
"Peter|The Great"|22|Canada
In the above code, our DataFrame contains values with commas, quotes, and pipe characters. We used the pipe character ('|') as the field separator with sep='|', specified double quotes to enclose fields that contain special characters with quotechar='"', and set the backslash as the escape character with escapechar='\\'. With the default quoting behavior, a field is only quoted if it contains the separator, the quote character, or a newline, and embedded quote characters are doubled; the escapechar mainly comes into play when quoting is disabled with csv.QUOTE_NONE.
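The quoting behavior itself is controlled by the quoting parameter, which takes the constants from Python's standard csv module. For example, csv.QUOTE_ALL forces every field into quotes, not just the ones containing special characters:

```python
import csv
import pandas as pd

df = pd.DataFrame({'Name': ['John, Jr.', 'Anna "The Queen"'], 'Age': [28, 24]})

# csv.QUOTE_ALL wraps every field in the quote character, including
# purely numeric fields that would otherwise be left bare
quoted = df.to_csv(index=False, quoting=csv.QUOTE_ALL)
print(quoted)
```

The other constants (csv.QUOTE_MINIMAL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE) work the same way.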
How to write CSV files with different line terminators?
The lineterminator parameter allows you to specify how each line is terminated. Let's say we want an empty line between rows, which means two newline characters ('\n\n'):
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'Country': ['USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
df.to_csv('data_line_terminator.csv', lineterminator='\n\n')
The content of data_line_terminator.csv:

,Name,Age,Country

0,John,28,USA

1,Anna,24,UK

2,Peter,22,Canada
In this code, we've set lineterminator='\n\n', which adds an empty line after each row when writing the DataFrame to the CSV file.
Working with Dates and Times
The date_format parameter allows you to control the format of date columns in the output.
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Date': pd.date_range(start='1/1/2022', periods=3, freq='D')}
df = pd.DataFrame(data)
df.to_csv('data_date_format.csv', date_format='%Y-%m-%d %H:%M:%S')
The content of data_date_format.csv:
,Name,Date
0,John,2022-01-01 00:00:00
1,Anna,2022-01-02 00:00:00
2,Peter,2022-01-03 00:00:00
In the above code, our DataFrame contains dates in the ‘Date’ column.
When writing the DataFrame to a CSV file, we specified date_format='%Y-%m-%d %H:%M:%S'. This formats the dates in year-month-day hour:minute:second format.
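Keep in mind that CSV has no date type, so the dates become plain strings in the file. When reading the data back, you can ask read_csv to convert them to datetime64 again with parse_dates (file name illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range(start='1/1/2022', periods=3, freq='D')})
df.to_csv('data_dates_roundtrip.csv', index=False, date_format='%Y-%m-%d')

# parse_dates tells read_csv to convert the column back to
# datetime64 instead of leaving it as strings
restored = pd.read_csv('data_dates_roundtrip.csv', parse_dates=['Date'])
print(restored['Date'].dt.day.tolist())
```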
Exporting Large Datasets
The chunksize parameter allows you to specify the number of rows to be written at a time.
Here's an example of how you can use the chunksize parameter:
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'] * 100000,
        'Age': [28, 24, 22] * 100000,
        'Country': ['USA', 'UK', 'Canada'] * 100000}
df = pd.DataFrame(data)
df.to_csv('large_data.csv', chunksize=1000)
In the above code, we've created a large DataFrame. When writing it to a CSV file, we specified chunksize=1000 to write 1000 rows at a time. This means Pandas writes the DataFrame to the file in chunks of 1000 rows, helping to keep memory usage under control.
Multiple Pandas DataFrames to a Single CSV File
If you want to append a DataFrame to an existing CSV file, or write multiple DataFrames to a single CSV file, you can open the file in append mode by setting the mode parameter to mode='a'. By default, mode='w', which means the file is opened for writing and any existing content is overwritten.
Here’s an example of how you can append multiple DataFrames to a single CSV file:
import pandas as pd

data1 = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 22]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Michael', 'Sarah', 'Jessica'], 'Age': [30, 27, 25]}
df2 = pd.DataFrame(data2)

df1.to_csv('data_append.csv', index=False)
df2.to_csv('data_append.csv', mode='a', header=False, index=False)
The content of data_append.csv:
Name,Age
John,28
Anna,24
Peter,22
Michael,30
Sarah,27
Jessica,25
In the above code, we first wrote df1 to a CSV file. We then appended df2 to the same file with mode='a'. We also set header=False when appending df2 to prevent writing the column headers again.
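The same pattern generalizes to any number of DataFrames with a loop, writing the header only on the first pass. A sketch (the list of frames and the file name are illustrative):

```python
import pandas as pd

frames = [pd.DataFrame({'Name': ['John'], 'Age': [28]}),
          pd.DataFrame({'Name': ['Anna'], 'Age': [24]}),
          pd.DataFrame({'Name': ['Peter'], 'Age': [22]})]

# Write mode and header only for the first frame; every later
# frame is appended with mode='a' and header=False
for i, frame in enumerate(frames):
    frame.to_csv('data_combined.csv',
                 mode='w' if i == 0 else 'a',
                 header=(i == 0),
                 index=False)

combined = pd.read_csv('data_combined.csv')
print(combined['Name'].tolist())
```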
How to Write CSV Files to a Remote Location
While the to_csv() function doesn't directly support writing to remote locations, you can accomplish this by using additional libraries that let Python interact with cloud storage, such as boto3 for Amazon S3, gcsfs for Google Cloud Storage, or azure-storage-blob for Azure Blob Storage.
Here is an example of how you can write a DataFrame to a CSV file in an Amazon S3 bucket using boto3:
import pandas as pd
import boto3
from io import StringIO

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 22]}
df = pd.DataFrame(data)

csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.resource('s3',
                    aws_access_key_id='YOUR_ACCESS_KEY',
                    aws_secret_access_key='YOUR_SECRET_KEY')

# write dataframe to a csv file in the s3 bucket
s3.Object('your-bucket-name', 'df_s3.csv').put(Body=csv_buffer.getvalue())
In the above code, we first convert our DataFrame to a CSV-format string using the to_csv() method with an instance of StringIO as the file buffer. We then create an S3 resource object and use the put() method to write the string to a file in our S3 bucket.
Remember to replace 'YOUR_ACCESS_KEY' and 'YOUR_SECRET_KEY' with your actual AWS access key and secret key, and 'your-bucket-name' with the name of your S3 bucket.
The Wider Compatibility of to_csv
At the telecom company I was working for, data was the heart of our operation – every connection, call, text, or data transfer generated a data point.
With over 60 million subscribers, each generating multiple events per day, you can imagine the enormity of the datasets we were dealing with.
There was a time when we were asked to create a comprehensive report on customer usage trends. This task was enormous, as it involved wrangling and analyzing a dataset with over a billion records. However, the data analysis was just one part of the challenge.
The bigger issue was how to effectively share these findings with the rest of the team, particularly non-technical managers who needed to make strategic decisions based on this data.
I had cleaned and prepared the dataset in a Pandas DataFrame, comprising billions of rows and about a dozen columns, from timestamps of calls to data usage. But when it came time to export this data, I had a decision to make: which format should I choose?
Excel was an option, but given the dataset’s size, it wasn’t feasible. Excel has a limit of about 1 million rows per sheet and our dataset was over a thousand times larger than that.
JSON was another option. However, JSON would have resulted in a complicated nested structure that would be difficult for the managers to parse and understand using their familiar tools.
That’s where Pandas to_csv
came into play. Our managers were used to dealing with CSV files. They had a suite of tools, some homegrown, some off-the-shelf like Tableau, which they used for slicing, dicing, and visualizing the data. However, these tools only accepted CSV inputs.
So, I chose to use the to_csv function to export the DataFrame. This function handled the size of the data gracefully and allowed me to export the file in chunks. In fact, the CSV file I exported ended up being about 16GB in size.
The choice of CSV proved to be a lifesaver. The managers were able to directly import this file into their tools and extract meaningful insights.
The ease of use and compatibility of CSV files led to a 50% reduction in the time spent pre-processing the data and a 30% reduction in queries and confusion regarding data format and structure.
Using Pandas to_csv ensured that the insights we derived from our data could be easily understood, utilized, and acted upon by everyone in the company, even those without a technical background. It was a powerful demonstration of how the right choice of data format can make a world of difference.
Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.