Export Python Pandas DataFrame to Excel using to_excel

The to_excel function allows you to export a Pandas DataFrame to an Excel file. It’s highly beneficial for every data scientist.

This tutorial will give you a comprehensive understanding of how the to_excel function works, and how you can customize its behavior to suit your needs.

 

 

Basic Syntax and Parameters

The basic syntax for the to_excel function in Pandas is as follows:

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', 
                   float_format=None, columns=None, header=True, index=True, 
                   index_label=None, startrow=0, startcol=0, engine=None, 
                   merge_cells=True, encoding=None, inf_rep='inf', 
                   verbose=True, freeze_panes=None)

Let’s briefly discuss the most commonly used parameters:

  • excel_writer: This is the target file name or ExcelWriter object.
  • sheet_name: Name of the sheet in the Excel file. Defaults to ‘Sheet1’.
  • na_rep: This is the string representation of NaN to use.
  • float_format: Format to use for floating point numbers.
  • columns: Columns to write to the Excel file.
  • header: Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
  • index: Write row names (index).
  • startrow & startcol: Upper left cell row and column to dump DataFrame.
  • engine: Write engine to use, ‘openpyxl’ or ‘xlsxwriter’.

The to_excel function provides robust capabilities for customizing your data export to Excel. Next, we will dive into examples of how to use it effectively.

 

Exporting Pandas DataFrame to Excel file

First, let’s import Pandas and create a DataFrame as a starting point:

import pandas as pd
# import numpy for sample data
import numpy as np
df = pd.DataFrame({
   'A': 1.,
   'B': pd.Timestamp('20130620'),
   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   'D': np.array([3] * 4, dtype='int32'),
   'E': pd.Categorical(["test", "train", "test", "train"]),
   'F': 'foo'
})
print(df)

Output:

     A          B    C  D      E    F
0  1.0 2013-06-20  1.0  3   test  foo
1  1.0 2013-06-20  1.0  3  train  foo
2  1.0 2013-06-20  1.0  3   test  foo
3  1.0 2013-06-20  1.0  3  train  foo

Here, we’ve created a DataFrame with a mixture of numeric types, strings, and categorical data.

We can use the to_excel function to write the DataFrame to an Excel sheet like this:

df.to_excel("example.xlsx")

This will write the DataFrame df to an Excel file named example.xlsx. By default, both the header and index are included in the exported Excel file. Here’s what the resulting Excel file looks like:

  |   A          B    C  D      E    F
--|--------------------------------------
0 | 1.0 2013-06-20  1.0  3   test  foo
1 | 1.0 2013-06-20  1.0  3  train  foo
2 | 1.0 2013-06-20  1.0  3   test  foo
3 | 1.0 2013-06-20  1.0  3  train  foo

It’s important to note that the index of the DataFrame has been written as the first column in the Excel file.

The rest of the columns in the Excel file (A to F) correspond to the DataFrame’s columns.
In this case, the to_excel function creates a new file. If the file name that already exists, the function will overwrite the contents of the existing file.

 

Selecting the Engine for Writing

Pandas to_excel function allows you to choose the engine used to write to Excel files.  You can use the engine parameter to specify the engine you want to use.

Here are the engines:

  1. openpyxl
  2. xlsxwriter

Let’s write our DataFrame using the xlsxwriter engine:

df.to_excel("example_xlsxwriter.xlsx", engine='xlsxwriter')

This will create an Excel file named example_xlsxwriter.xlsx using the xlsxwriter engine.

The choice of engine can affect the capabilities available for writing to Excel files, like adding charts, conditional formatting, etc.
Remember, the chosen engine needs to be installed in your Python environment, so make sure to install it using pip if it’s not already installed.

pip install xlsxwriter

 

Write Multiple Pandas DataFrames to Multiple sheets

To write multiple pandas DataFrames to multiple sheets of an Excel file, you can use the ExcelWriter object, which allows you to specify the name of the sheet for each DataFrame.

Let’s create two different DataFrames and write them to two different sheets in the same Excel workbook:

df_sheet1 = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'qux'],
    'B': ['alpha', 'beta', 'gamma', 'delta'],
    'C': np.random.rand(4),
    'D': np.random.randint(10, size=4)
})

df_sheet2 = pd.DataFrame({
    'E': ['apple', 'banana', 'cherry', 'date'],
    'F': ['elephant', 'fox', 'giraffe', 'hippo'],
    'G': np.random.rand(4),
    'H': np.random.randint(10, size=4)
})

# create an ExcelWriter object
with pd.ExcelWriter("example_multiple_dataframes.xlsx") as writer:
    df_sheet1.to_excel(writer, sheet_name='FirstSheet')
    df_sheet2.to_excel(writer, sheet_name='SecondSheet')

In the resulting Excel file, you will find two sheets named ‘FirstSheet’ and ‘SecondSheet’, each containing data from the respective DataFrame:

FirstSheet:

   |  A  |   B   |        C       |  D  
-- |---------------------------------------
0  | foo | alpha | 0.618012313202 |  6  
1  | bar | beta  | 0.912971647978 |  8  
2  | baz | gamma | 0.344552180436 |  0  
3  | qux | delta | 0.982718734163 |  9  

SecondSheet:

   |   E   |    F    |        G       |  H  
-- |------------------------------------------
0  | apple | elephant| 0.329873331859 |  7  
1  | banana| fox     | 0.751442750918 |  3  
2  | cherry| giraffe | 0.510241492448 |  2  
3  | date  | hippo   | 0.920831290307 |  0  

 

Handling Missing Data with na_rep parameter

You can use the na_rep parameter to represent the missing values when exporting a DataFrame to an Excel file.

Let’s create a DataFrame with some missing values to demonstrate this:

df_na = pd.DataFrame({
   'A': 1.,
   'B': pd.Timestamp('20130620'),
   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   'D': np.array([3] * 4, dtype='int32'),
   'E': pd.Categorical(["test", "train", np.nan, "train"]),
   'F': 'foo'
})
print(df_na)

Output:

     A          B    C  D      E    F
0  1.0 2013-06-20  1.0  3   test  foo
1  1.0 2013-06-20  1.0  3  train  foo
2  1.0 2013-06-20  1.0  3    NaN  foo
3  1.0 2013-06-20  1.0  3  train  foo

You can see that there is a NaN value in the ‘E’ column. When we write this DataFrame to an Excel file, we can choose how to represent this NaN value.

df_na.to_excel("example_na_rep.xlsx", na_rep='NULL')

In the Excel file, the NaN value in the ‘E’ column is replaced with ‘NULL’.

 

Formatting Floating Point Numbers

The float_format parameter of the to_excel function provides control over formatting floating point numbers.

This parameter takes a formatting string or a function that will dictate how floating point numbers are converted to strings. Let’s create a DataFrame with floating numbers:

df_float = pd.DataFrame({
    'A': 1.123456789,
    'B': np.array([3.1415926535] * 4, dtype='float64'),
}, index=list('abcd'))
print(df_float)

Output:

          A         B
a  1.123457  3.141593
b  1.123457  3.141593
c  1.123457  3.141593
d  1.123457  3.141593

For example, if we want all floating point numbers in our DataFrame to be rounded to two decimal places, we can use a formatting string as follows:

df_float.to_excel("example_float_format.xlsx", float_format="%.2f")

In the resulting Excel file, all floating point numbers are represented with two decimal places.

The data in the Excel file should look like this:

  |   A    B
--|-----------------
a | 1.12 3.14
b | 1.12 3.14
c | 1.12 3.14
d | 1.12 3.14

 

Choosing DataFrame Columns for Excel Export

You can use the columns parameter of the to_excel function to export specific columns of the DataFrame. Consider a DataFrame with several columns:

df_multi = pd.DataFrame({
   'A': pd.date_range(start='2020-01-01', periods=4),
   'B': pd.Series(range(4), dtype='float32'),
   'C': np.array([3] * 4, dtype='int32'),
   'D': pd.Categorical(["test", "train", "check", "go"]),
   'E': 'foo'
})
print(df_multi)

Output:

           A    B  C      D    E
0 2020-01-01  0.0  3   test  foo
1 2020-01-02  1.0  3  train  foo
2 2020-01-03  2.0  3  check  foo
3 2020-01-04  3.0  3     go  foo

If we want to export only the ‘A’ and ‘D’ columns to the Excel file, we can specify these columns in the to_excel function:

df_multi.to_excel("example_select_columns.xlsx", columns=['A', 'D'])

The resulting Excel file will only contain the ‘A’ and ‘D’ columns from the DataFrame:

  |        A     D
--|-----------------
0 | 2020-01-01 test
1 | 2020-01-02 train
2 | 2020-01-03 check
3 | 2020-01-04 go

 

Customizing Column Headers

The header parameter of the to_excel function allows you to customize the column headers of your Excel file.

You can pass a list of strings to the header parameter, which are assumed to be aliases for the column names.
Let’s consider our previously created DataFrame, and say we want to change the column headers while exporting to an Excel file:

print(df_multi)

Output:

           A    B  C      D    E
0 2020-01-01  0.0  3   test  foo
1 2020-01-02  1.0  3  train  foo
2 2020-01-03  2.0  3  check  foo
3 2020-01-04  3.0  3     go  foo

We can change the column headers as follows:

df_multi.to_excel("example_custom_headers.xlsx", header=['Date', 'Value_1', 'Value_2', 'Category', 'Label'])

In the resulting Excel file, the column headers are replaced with the names provided in the list:

  |       Date Value_1 Value_2 Category Label
--|-------------------------------------------
0 | 2020-01-01     0.0      3     test   foo
1 | 2020-01-02     1.0      3    train   foo
2 | 2020-01-03     2.0      3    check   foo
3 | 2020-01-04     3.0      3       go   foo

 

Manipulating Indexes

The index parameter allows you to control whether the DataFrame’s indexes should be included in the resulting Excel file.

By default, the index is written to the Excel file. If you want to exclude it, you can set index=False.
Let’s take an example DataFrame:

df_index = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'qux'],
    'B': ['alpha', 'beta', 'gamma', 'delta'],
    'C': np.random.rand(4)
}, index=['I1', 'I2', 'I3', 'I4'])
print(df_index)

Output:

      A      B         C
I1  foo  alpha  0.647687
I2  bar   beta  0.377563
I3  baz  gamma  0.013572
I4  qux  delta  0.658623

We can export this DataFrame to an Excel file without the index:

df_index.to_excel("example_no_index.xlsx", index=False)

The resulting Excel file will just contain the data, without the index:

  A        B         C
foo    alpha  0.647687
bar    beta   0.377563
baz    gamma  0.013572
qux    delta  0.658623

 

Controlling Excel Output Location

The to_excel function provides options to control where in the Excel sheet your DataFrame starts being written.

You can use the startrow and startcol parameters to adjust the position where your DataFrame is.
Let’s write a DataFrame to an Excel file, but this time we will start at the third row and the second column:

df_position = pd.DataFrame({
    'A': ['foo', 'bar', 'baz'],
    'B': ['alpha', 'beta', 'gamma']
})
df_position.to_excel("example_output_location.xlsx", startrow=2, startcol=1, index=False)

In the resulting Excel file, the data from the DataFrame starts from the third row (rows are 0-indexed) and the second column (columns are 0-indexed):

  |  1  |    2   |    3   |
--|------------------------
1 |     |        |        |
2 |     |        |        |
3 |     |  A     |  B     |
4 |     |  foo   |  alpha |
5 |     |  bar   |  beta  |
6 |     |  baz   |  gamma |

As you can see, the DataFrame df_position is written starting from the cell B3 in Excel notation (row 3, column 2).

 

Controlling Cell Merging using to_excel

By default, cells are merged when dealing with a DataFrame that uses MultiIndex.

However, you can use the merge_cells parameter to control whether cells with the same value are merged in the resulting Excel file or not. Let’s create a DataFrame with MultiIndex:

df_merge = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=[['a', 'a', 'b', 'b'], ['foo', 'bar', 'foo', 'bar']])
print(df_merge)

Output:

       A  B   C
a foo  1  5   9
  bar  2  6  10
b foo  3  7  11
  bar  4  8  12

If we write this DataFrame to an Excel file with merge_cells=True, the cells with the same index value will be merged:

df_merge.to_excel("example_merge_cells_true.xlsx", merge_cells=True)

However, if we set merge_cells=False, the cells will not be merged:

df_merge.to_excel("example_merge_cells_false.xlsx", merge_cells=False)

Output:

  |    1  |  2  |  3  |  4  |
--|---------------------------
a |  foo  |  1  |  5  |  9  |
a |  bar  |  2  |  6  |  10 |
b |  foo  |  3  |  7  |  11 |
b |  bar  |  4  |  8  |  12 |

In the first case, the ‘a’ and ‘b’ index values will span two rows each in the resulting Excel file, as they are repeated for the ‘foo’ and ‘bar’ sub-index.

In the second case, the ‘a’ and ‘b’ values will appear in each row without being merged.

 

Further Reading

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html

Leave a Reply

Your email address will not be published. Required fields are marked *