Read Excel files using Pandas read_excel
Pandas read_excel is a function in the Python Pandas library that allows us to read Excel files in Python and convert them into a DataFrame object.
The read_excel function can import Excel files with different extensions such as .xls, .xlsx, .xlsm, and .ods.
- 1 The engine parameter
- 2 Read an Excel Sheet
- 3 Read multiple sheets
- 4 Selecting Headers
- 5 Setting the DataFrame index_col
- 6 Skip Rows
- 7 Conditional Skip
- 8 Limiting Rows to Read
- 9 Extracting Specific Columns (Skip columns)
- 10 Controlling Data Type
- 11 Handle Boolean Data
- 12 Converters parameter
- 13 Parsing Date Columns
- 14 Handling non-US date formats
- 15 Handle NaN Values
The engine parameter
The engine parameter in the read_excel function specifies which engine to use for reading the Excel file. The options are ‘xlrd’, ‘openpyxl’, ‘odf’, and ‘pyxlsb’.
The ‘xlrd’ engine supports old-style Excel files (.xls), ‘openpyxl’ supports the newer Excel formats (.xlsx and .xlsm), ‘odf’ supports the OpenDocument format (.ods), and ‘pyxlsb’ supports the binary Excel format (.xlsb).
If you try to read an Excel file without installing the required engine, you’ll get an ImportError.
First, you should install the engine you need, in our case, we have a (.xlsx) file, so we need to install openpyxl.
pip install openpyxl
If you need to read a (.ods) file, you should install odfpy:
pip install odfpy
Now, let’s read our file:
```python
import pandas as pd

df = pd.read_excel('school_data.xlsx', engine='openpyxl')
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...
```
Read an Excel Sheet
In some cases, we may want to read a single sheet from an Excel file with multiple sheets. To do this, we specify the sheet_name parameter in the read_excel function:
```python
df = pd.read_excel('school_data.xlsx', sheet_name='Students')
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...
```
In the above example, we read the first sheet ‘Students’ from our school data Excel file.
Read multiple sheets
To read multiple sheets from an Excel file, you pass a list of sheets to the sheet_name parameter. The result will be a dictionary where the keys are the sheet names and the values are the DataFrames.
```python
all_sheets_df = pd.read_excel('school_data.xlsx', sheet_name=['Students', 'Teachers'])
print(all_sheets_df['Students'])
print(all_sheets_df['Teachers'])
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...
   Teacher_ID     Name  Subject
0           1     Alex  English
1           2   Brenda     Math
2           3  Charles  Science
...
```
Here, we’ve read two sheets ‘Students’ and ‘Teachers’ from our school data Excel file.
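If you want every sheet without listing them by name, you can pass sheet_name=None, which loads the whole workbook into a dictionary keyed by sheet name. The sketch below first builds a small two-sheet workbook so it is self-contained (the file name demo_school.xlsx and the data are invented for this example):

```python
import pandas as pd

# Build a tiny two-sheet workbook so the sketch is self-contained
# (file name and data are invented for this example)
students = pd.DataFrame({'Student_ID': [1, 2], 'Name': ['Alice', 'Bob']})
teachers = pd.DataFrame({'Teacher_ID': [1], 'Name': ['Alex']})
with pd.ExcelWriter('demo_school.xlsx') as writer:
    students.to_excel(writer, sheet_name='Students', index=False)
    teachers.to_excel(writer, sheet_name='Teachers', index=False)

# sheet_name=None loads every sheet into a dict of DataFrames
all_sheets = pd.read_excel('demo_school.xlsx', sheet_name=None)
print(list(all_sheets.keys()))
```

This is handy when you don’t know the sheet names in advance, since the dictionary keys tell you what the workbook contains.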
Selecting Headers
By default, pandas assumes that the first row is the header. However, we can use the ‘header’ parameter to specify the row(s) that make up the header:
```python
df = pd.read_excel('school_data.xlsx', header=0)
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...
```
Here, we specified header=0, which selects the first row (Pandas is 0-indexed) as the header row.
Read without Header
If you don’t want the first row to be treated as a header, set header=None:
```python
df = pd.read_excel('school_data.xlsx', header=None)
print(df)
```
Output:
```
            0      1      2      3
0  Student_ID   Name  Grade  Score
1           1  Alice     10     85
2           2    Bob      9     92
3           3  Carol      8     78
...
```
That means the first row will be considered as a data row, not as a header.
Setting the DataFrame index_col
The ‘index_col’ parameter allows us to specify a column to use as the row labels for the DataFrame. Let’s set the ‘Student_ID’ column as the index:
```python
df = pd.read_excel('school_data.xlsx', index_col='Student_ID')
print(df)
```
Output:
```
             Name  Grade  Score
Student_ID
1           Alice     10     85
2             Bob      9     92
3           Carol      8     78
...
```
In this example, we set ‘Student_ID’ as our DataFrame’s index, which now serves as the row labels.
Skip Rows
You can use the skiprows parameter to skip rows when reading an Excel file. You can pass an integer to skip a certain number of rows from the top, or a list of row indices to skip specific rows:
```python
df = pd.read_excel('school_data.xlsx', skiprows=1)
print(df)
```
Output:
```
   1  Alice  10  85
0  2    Bob   9  92
1  3  Carol   8  78
...
```
In this case, we skipped the first row of the file by setting skiprows=1. Therefore, what was previously the second row is now treated as the header row.
To skip specific rows, pass a list of row positions instead:

```python
df = pd.read_excel('school_data.xlsx', skiprows=[1, 3])
print(df)
```

Here, the rows at positions 1 and 3 are skipped, while the header row (position 0) is kept.
Conditional Skip
The skiprows parameter also accepts a callable function, allowing for condition-based row skipping.
The function is evaluated against each row index, and rows for which the function returns True are skipped:
```python
# Function to skip even-numbered rows (the header row at index 0 is kept)
skip_func = lambda x: x > 0 and x % 2 == 0

df = pd.read_excel('school_data.xlsx', skiprows=skip_func)
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           3  Carol      8     78
...
```
We’ve skipped every even-numbered data row by defining a function skip_func that returns True for the row indices to skip.
Limiting Rows to Read
You can use the nrows parameter when you want to read a certain number of rows from the Excel file. For instance, to read only the first five rows:
```python
df = pd.read_excel('school_data.xlsx', nrows=5)
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
3           4  David     10     88
4           5  Emily      9     81
```
By setting nrows=5, we limit the DataFrame to only the first five rows of the Excel data.
Extracting Specific Columns (Skip columns)
You can use the usecols parameter if you’re interested in specific columns from the Excel file. This can be a list of column names or a callable function returning True for columns to parse:
```python
df = pd.read_excel('school_data.xlsx', usecols=['Name', 'Score'])
print(df)
```
Output:
```
    Name  Score
0  Alice     85
1    Bob     92
2  Carol     78
...
```
In this case, we only read the ‘Name’ and ‘Score’ columns from the Excel file. This can save a lot of memory if you’re dealing with large datasets and only need a subset of the columns.
Also, you can pass a function to usecols like this:
```python
def column_filter(col_name):
    return col_name.startswith('S')

df = pd.read_excel('school_data.xlsx', usecols=column_filter)
print(df)
```
Output:
```
   Student_ID  Score
0           1     85
1           2     92
2           3     78
...
```
Here we only returned columns starting with the ‘S’ character.
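usecols also accepts a string of Excel-style column letters, such as 'A:C' or 'A,C'. Here is a self-contained sketch (the file name demo_cols.xlsx and its data are invented for this example):

```python
import pandas as pd

# Invented sample file for this sketch
pd.DataFrame({'Student_ID': [1, 2], 'Name': ['Alice', 'Bob'],
              'Score': [85, 92]}).to_excel('demo_cols.xlsx', index=False)

# 'A,C' keeps the first and third spreadsheet columns
df = pd.read_excel('demo_cols.xlsx', usecols='A,C')
print(list(df.columns))
```

This form is convenient when you know the column positions in the spreadsheet but not the header names.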
Controlling Data Type
By default, Pandas tries to infer the best data type for each column. But sometimes, you may want to enforce a specific type, which can be done using the ‘dtype’ parameter.
This can be a single type or a dictionary mapping column names to data types:
```python
df = pd.read_excel('school_data.xlsx', dtype={'Score': 'int32'})
print(df.dtypes)
```
Output:
```
Student_ID     int64
Name          object
Grade          int64
Score          int32
dtype: object
```
Here, we specified that the ‘Score’ column should be treated as an integer (int32). The resulting DataFrame verifies this data type.
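Another common use of dtype is keeping identifier-like columns as text so that values such as zero-padded codes survive the import. A self-contained sketch (the file name demo_zip.xlsx and the data are invented; in real files where such cells are stored as numbers, pandas would infer an integer type and drop the leading zero):

```python
import pandas as pd

# Invented sample file: zip codes with a leading zero
pd.DataFrame({'Name': ['Alice', 'Bob'],
              'Zip': ['02115', '10001']}).to_excel('demo_zip.xlsx', index=False)

# dtype=str keeps the column as text, so the leading zero survives
df = pd.read_excel('demo_zip.xlsx', dtype={'Zip': str})
print(df['Zip'].tolist())
```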
Handle Boolean Data
Sometimes, Excel files may contain Boolean data represented as non-standard values like ‘Yes/No’ or ‘Y/N’.
The ‘true_values’ and ‘false_values’ parameters allow you to specify which values should be treated as True or False:
```python
df = pd.read_excel('school_data.xlsx', true_values=['Y'], false_values=['N'])
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score  Passed
0           1  Alice     10     85    True
1           2    Bob      9     92    True
2           3  Carol      8     78   False
...
```
In the example above, ‘Y’ in the ‘Passed’ column is interpreted as True, and ‘N’ is interpreted as False.
Converters parameter
The converters parameter lets you apply a function to specific columns during the import process. This is useful for performing transformations on data as it’s being loaded:
```python
# Function to capitalize names
capitalize = lambda x: x.capitalize()

df = pd.read_excel('school_data.xlsx', converters={'Name': capitalize})
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10     85
1           2    Bob      9     92
2           3  Carol      8     78
...
```
Here, the ‘Name’ column values are capitalized during the import process, thanks to the applied ‘capitalize’ function.
Parsing Date Columns
The ‘parse_dates’ parameter helps convert date-like data into datetime objects:
```python
df = pd.read_excel('school_data.xlsx', parse_dates=['DOB'])
print(df.dtypes)
```
Output:
```
Student_ID             int64
Name                  object
Grade                  int64
DOB           datetime64[ns]
dtype: object
```
In this example, we’ve parsed the ‘DOB’ column as a datetime object. This can be very useful in some cases like searching between two dates or finding rows where the date is greater than or less than a specific date.
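As a sketch of that use case, once a column is datetime64 you can filter rows with ordinary comparisons (the data below is invented, shaped like a DOB column after parse_dates):

```python
import pandas as pd

# Invented dates, as they'd look after parse_dates
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                   'DOB': pd.to_datetime(['2005-03-14', '2006-07-02', '2004-11-30'])})

# Keep students born on or after 2005-01-01
recent = df[df['DOB'] >= '2005-01-01']
print(recent['Name'].tolist())
```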
Handling non-US date formats
Pandas’ read_excel function assumes a US-style date format (mm/dd/yyyy) by default when parsing ambiguous dates. However, you might encounter Excel files where dates are in a non-US format, such as dd/mm/yyyy.
To parse non-US date formats correctly, first load the date column as a string, then convert it with pd.to_datetime and an explicit format:
```python
df = pd.read_excel('students.xlsx', dtype={'Enrollment Date': str})
df['Enrollment Date'] = pd.to_datetime(df['Enrollment Date'], format='%d/%m/%Y')
print(df['Enrollment Date'])
```
Output:
```
0   2022-09-01
1   2022-09-02
2   2022-09-03
3   2022-09-04
Name: Enrollment Date, dtype: datetime64[ns]
```
In this example, ‘Enrollment Date’ is correctly parsed as a date column in the ‘dd/mm/yyyy’ format after loading the data.
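If you don’t want to spell out the full format string, pd.to_datetime also accepts dayfirst=True as a looser alternative:

```python
import pandas as pd

# Day-first date strings, as they'd look when loaded with dtype=str
dates = pd.Series(['01/09/2022', '15/09/2022'])

# dayfirst=True interprets the first field as the day
parsed = pd.to_datetime(dates, dayfirst=True)
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

An explicit format= is still the safer choice when you know it, since dayfirst is only a hint for ambiguous values.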
Handle NaN Values
By default, the read_excel function interprets empty cells as NaN. But what happens if your cells contain a placeholder value instead of being empty?
You can specify how Pandas should handle cells with missing values or NaNs by using the ‘na_values’ parameter:
```python
df = pd.read_excel('school_data.xlsx', na_values='N/A')
print(df)
```
Output:
```
   Student_ID   Name  Grade  Score
0           1  Alice     10   85.0
1           2    Bob      9   92.0
2           3  Carol      8    NaN
...
```
In this example, wherever ‘N/A’ is found in the Excel data, it’s replaced with a NaN value in the DataFrame.
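na_values also accepts a list, so several placeholder strings can be mapped to NaN at once. A self-contained sketch (the file name demo_na.xlsx and the placeholder values are invented for this example):

```python
import pandas as pd

# Invented file with two different missing-value placeholders
pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
              'Score': ['85', 'N/A', 'missing']}).to_excel('demo_na.xlsx', index=False)

# Both placeholders become NaN on import
df = pd.read_excel('demo_na.xlsx', na_values=['N/A', 'missing'])
print(df['Score'].isna().sum())
```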
These are just a few of the many powerful features that the ‘pandas read_excel’ function provides.
Each parameter gives you finer control over how data is imported from your Excel files into a pandas DataFrame.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.