Read Excel files using Pandas read_excel

Pandas read_excel is a function in the Python Pandas library that allows us to read Excel files in Python and convert them into a DataFrame object.

The read_excel function can import Excel files with different extensions such as .xls, .xlsx, .xlsm, and .ods.

 

 

The engine parameter

The engine parameter in the read_excel function specifies the engine to use for reading the Excel file. The options are ‘xlrd’, ‘openpyxl’, ‘odf’, and ‘xlsb’.

The ‘xlrd’ supports old-style Excel files (.xls), openpyxl supports newer Excel file formats (.xlsx), ‘odf’ supports OpenDocument format (.ods), and ‘xlsb’ supports binary Excel format’.

If you tried to read an Excel file without installing the required engine, you’ll get an ImportError.
First, you should install the engine you need, in our case, we have a (.xlsx) file, so we need to install openpyxl.

pip install openpyxl

If you need to read a (.ods) file, you should install odfpy:

pip install odfpy

Now, let’s read our file:

df = pd.read_excel('school_data.xlsx', engine='openpyxl')
print(df)

Output:

    Student_ID   Name   Grade  Score
0            1  Alice      10     85
1            2    Bob       9     92
2            3  Carol       8     78
...

 

Read an Excel Sheet

In some cases, we may want to read a single sheet from an Excel file with multiple sheets. To do this, we specify the sheet_name parameter in the read_excel function:

df = pd.read_excel('school_data.xlsx', sheet_name='Students')
print(df)

Output:

    Student_ID   Name   Grade  Score
0            1  Alice      10     85
1            2    Bob       9     92
2            3  Carol       8     78
...

In the above example, we read the first sheet ‘Students’ from our school data Excel file.

 

 

Read multiple sheets

To read multiple sheets from an Excel file, you pass a list of sheets to the sheet_name parameter. The result will be a dictionary where the keys are the sheet names and the values are the DataFrames.

all_sheets_df = pd.read_excel('school_data.xlsx', sheet_name=['Students', 'Teachers'])
print(all_sheets_df['Students'])
print(all_sheets_df['Teachers'])

Output:

Students DataFrame:

    Student_ID   Name   Grade  Score
0            1  Alice      10     85
1            2    Bob       9     92
2            3  Carol       8     78
...

   Teacher_ID    Name   Subject
0           1    Alex   English
1           2  Brenda      Math
2           3  Charles   Science
...

Here, we’ve read two sheets ‘Students’ and ‘Teachers’ from our school data Excel file.

 

Selecting Headers

By default, pandas assumes that the first row is the header. However, we can use the ‘header’ parameter to specify the row(s) that make up the header:

df = pd.read_excel('school_data.xlsx', header=0)
print(df)

Output:

    Student_ID   Name   Grade  Score
0            1  Alice      10     85
1            2    Bob       9     92
2            3  Carol       8     78
...

Here, we specified header=0, which indicates that we’ve chosen the first row (Pandas is 0-indexed) as the header row.

Read without Header

If you want the first row to be treated not as a header, you can set the header=None.

df = pd.read_excel('school_data.xlsx', header=None) 
print(df)

Output:

             0      1       2      3
    Student_ID   Name   Grade  Score
0            1  Alice      10     85
1            2    Bob       9     92
2            3  Carol       8     78
...

That means the first row will be considered as a data row, not as a header.

 

Setting the DataFrame index_col

The ‘index_col’ parameter allows us to specify a column to use as the row labels for the DataFrame. Let’s set the ‘Student_ID’ column as the index:

df = pd.read_excel('school_data.xlsx', index_col='Student_ID')
print(df)

Output:

            Name   Grade  Score
Student_ID                     
1          Alice      10     85
2            Bob       9     92
3          Carol       8     78
...

In this example, we set ‘Student_ID’ as our DataFrame’s index, which now serves as the row labels.

 

Skip Rows

You can use the skiprows parameter to skip rows when reading an Excel file. You can pass an integer to skip a certain number of rows from the top, or a list of row indices to skip specific rows:

df = pd.read_excel('school_data.xlsx', skiprows=1)
print(df)

Output:

    1  Alice   10  85
0   2    Bob    9  92
1   3  Carol    8  78
...

In this case, we skipped the first row of the DataFrame by setting skiprows=1. Therefore, what was previously the second row is now treated as the header row.

df = pd.read_excel('school_data.xlsx',skiprows=[1,3])
print(df)
In the above code, the first row and the third row were skipped.

 

Conditional Skip

The skiprows parameter also accepts a callable function, allowing for condition-based row skipping.

The function is evaluated against the rows index, and rows for which the function returns True are skipped:

# Function to skip even rows
skip_func = lambda x: x % 2 == 0

df = pd.read_excel('school_data.xlsx', skiprows=skip_func)
print(df)

Output:

   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           3  Carol      8     78
...

We’ve skipped every even row by defining a function skip_func that returns True if the row index is even.

 

Limiting Rows to Read

You can use the nrows parameter when you want to read a certain number of rows from the Excel file. For instance, to read only the first five rows:

df = pd.read_excel('school_data.xlsx', nrows=5)
print(df)

Output:

   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
3           4  David     10     88
4           5  Emily      9     81

By setting nrows=5, we limit the DataFrame to only the first five rows of the Excel data.

 

Extracting Specific Columns (Skip columns)

You can use the usecols parameter if you’re interested in specific columns from the Excel file. This can be a list of column names or a callable function returning True for columns to parse:

df = pd.read_excel('school_data.xlsx', usecols=['Name', 'Score'])
print(df)

Output:

    Name  Score
0  Alice     85
1    Bob     92
2  Carol     78
...

In this case, we only read the ‘Name’ and ‘Score’ columns from the Excel file. This can save a lot of memory if you’re dealing with large datasets and only need a subset of the columns.

Also, you can pass a function to usecols like this:

def column_filter(col_name):
    return col_name.startswith('S')

df = pd.read_excel('school_data.xlsx', usecols=column_filter)

Output:

   Student_ID  Score
0           1     85
1           2     92
2           3     78
...

Here we only returned columns starting with the ‘S’ character.

 

Controlling Data Type

By default, Pandas tries to infer the best data type for each column. But sometimes, you may want to enforce a specific type, which can be done using the ‘dtype’ parameter.

This can be a single type or a dictionary mapping column names to data types:

df = pd.read_excel('school_data.xlsx', dtype={'Score': 'int'})
print(df.dtypes)

Output:

Student_ID     int64
Name          object
Grade          int64
Score          int32
dtype: object

Here, we specified that the ‘Score’ column should be treated as an integer (int32). The resulting DataFrame verifies this data type.

 

Handle Boolean Data

Sometimes, Excel files may contain Boolean data represented as non-standard values like ‘Yes/No’ or ‘Y/N’.

The ‘true_values’ and ‘false_values’ parameters allow you to specify which values should be treated as True or False:

df = pd.read_excel('school_data.xlsx', true_values=['Y'], false_values=['N'])
print(df)

Output:

   Student_ID   Name  Grade  Score  Passed
0           1  Alice     10     85    True
1           2    Bob      9     92    True
2           3  Carol      8     78   False
...

In the example above, ‘Y’ in the ‘Passed’ column is interpreted as True, and ‘N’ is interpreted as False.

 

Converters parameter

The converters parameter lets you apply a function to specific columns during the import process. This is useful for performing transformations on data as it’s being loaded:

# Function to capitalize names
capitalize = lambda x: x.capitalize()

df = pd.read_excel('school_data.xlsx', converters={'Name': capitalize})
print(df)

Output:

   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...

Here, the ‘Name’ column values are capitalized during the import process, thanks to the applied ‘capitalize’ function.

 

Parsing Date Columns

The ‘parse_dates’ parameter helps convert date-like data into datetime objects:

df = pd.read_excel('school_data.xlsx', parse_dates=['DOB'])
print(df.dtypes)

Output:

Student_ID             int64
Name                  object
Grade                  int64
DOB           datetime64[ns]
dtype: object

In this example, we’ve parsed the ‘DOB’ column as a datetime object. This can be very useful in some cases like searching between two dates or finding rows where the date is greater than or less than a specific date.

 

Handling non-US date formats

Pandas’ read_excel function uses a US date format (mm/dd/yyyy) by default when parsing dates. However, you might encounter Excel files where dates are in a non-US format, such as dd/mm/yyyy.
For correctly parsing non-US date formats, we must first load the date as string type, and then use pd.to_datetime with the correct format:

df = pd.read_excel('students.xlsx', dtype={'Enrollment Date': str})
df['Enrollment Date'] = pd.to_datetime(df['Enrollment Date'], format='%d/%m/%Y')
print(df['Enrollment Date'])

Output:

0   2022-09-01
1   2022-09-02
2   2022-09-03
3   2022-09-04
Name: Enrollment Date, dtype: datetime64[ns]

In this example, ‘Enrollment Date’ is correctly parsed as a date column in the ‘dd/mm/yyyy’ format after loading the data.

 

Handle NaN Values

By default, the read_excel function interprets empty cells as NaN., but what will happen if your cells contain another value instead of empty?

You can specify how Pandas should handle cells with missing values or NaNs by using the ‘na_values’ parameter:

df = pd.read_excel('school_data.xlsx', na_values='N/A')
print(df)

Output:

   Student_ID   Name  Grade  Score
0           1  Alice     10   85.0
1           2    Bob      9   92.0
2           3  Carol      8    NaN
...

In this example, wherever ‘N/A’ is found in the Excel data, it’s replaced with a NaN value in the DataFrame.

These are just a few of the many powerful features that the ‘pandas read_excel’ function provides.

Each parameter gives you finer control over how data is imported from your Excel files into a pandas DataFrame.

Leave a Reply

Your email address will not be published. Required fields are marked *