Read HTML tables using Pandas read_html function
You can use the `read_html` function to scrape HTML tables directly from a website and convert them into DataFrames.
The `read_html` function takes a string containing a URL or a file path to an HTML file, extracts all the tables within that HTML page, and returns a list of DataFrames. Each DataFrame corresponds to one table on the page.
Under the hood, `read_html` uses the lxml, BeautifulSoup (beautifulsoup4), and html5lib libraries to parse the HTML page: by default it tries lxml first and falls back to BeautifulSoup with html5lib. You need these parsers installed in your Python environment to use the function:

```
!pip install lxml beautifulsoup4 html5lib
```
1. Syntax and Parameters
2. Extract HTML Tables using Pandas read_html
3. Using Table HTML Attributes
4. Scrape Using Regular Expressions
5. Handling Inconsistent Table Structures
6. Setting Custom NA Values
7. Set the Index
8. Setting the Header
9. Skipping Rows
10. Transforming Content
11. Extracting Links
12. pandas.read_html() Limitations
13. Further Reading
Syntax and Parameters
The `read_html` function in Pandas is quite flexible and accepts several parameters to control the data extraction. The syntax of the function is as follows:

```python
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None,
                 skiprows=None, attrs=None, parse_dates=False, thousands=',',
                 encoding=None, decimal='.', converters=None, na_values=None,
                 keep_default_na=True, displayed_only=True)
```
Here’s a brief explanation of the parameters:
- `io`: a string that can be a URL, a file path, or raw HTML content.
- `match`: a regular expression that a table must match to be extracted. The default is `'.+'`, which matches all tables.
- `flavor`: the parsing engine to use under the hood. The default is `None`, which tries lxml first and falls back to BeautifulSoup with html5lib; you can also specify a flavor such as `'html5lib'` explicitly.
- `header`: an integer or a sequence of integers defining the row(s) to use as the column names. The default is `None`, which infers the column names from the table header row.
- `index_col`: an integer or a sequence of integers defining the column(s) to set as the index (a MultiIndex for multiple columns).
- `skiprows`: an integer or a sequence of integers specifying the row(s) to skip.
- `attrs`: a dictionary of HTML attributes that the table tags must contain.
- `parse_dates`: whether to automatically parse dates in the table(s). Defaults to `False`.
- `converters`, `na_values`: parameters for transforming values and defining NA values, respectively.
It is important to note that the `read_html` function returns a list of DataFrames.
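If you want to experiment without a website or file, `read_html` also accepts raw HTML wrapped in a file-like object (newer pandas versions expect literal HTML to be wrapped in `StringIO` rather than passed as a bare string). A minimal sketch with an illustrative two-table snippet, assuming at least one parser (lxml or BeautifulSoup) is installed:

```python
import io
import pandas as pd

# Two small tables in one HTML snippet; read_html returns one
# DataFrame per <table> element, in document order.
html = """
<table>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
<table>
  <tr><th>X</th></tr>
  <tr><td>9</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))
print(len(tables))                 # 2
print(tables[0].columns.tolist())  # ['A', 'B']
```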
Extract HTML Tables using Pandas read_html
The `pandas.read_html` function allows you to extract tables from a local HTML file, a URL, or any file-like object that contains HTML.
Extracting Table from a Local HTML File
Assuming you have an HTML file (sample.html) in the same directory as your Python script, and it contains two simple tables, you can read it with `read_html`:

```python
import pandas as pd

file_path = "sample.html"
tables = pd.read_html(file_path)
```
Let’s check the number of tables and display the first one:
```python
print("Number of tables in file: ", len(tables))
print("First table:")
print(tables[0])
```
Output:
```
Number of tables in file:  2
First table:
   Header 1  Header 2  Header 3
0  Data 1.1  Data 1.2  Data 1.3
1  Data 2.1  Data 2.2  Data 2.3
2  Data 3.1  Data 3.2  Data 3.3
```
The output will display the number of tables in the HTML file and the content of the first table.
Extracting Data from a URL
We’ll extract data from the ‘List of countries and dependencies by population’ from Wikipedia.
```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url)
```
The `pd.read_html(url)` call returns a list of DataFrames. Here's how we can check the number of tables and display the first one:
```python
print("Number of tables on site: ", len(tables))
print("First table:")
print(tables[0])
```
Output:
```
Number of tables on site:  3
First table:
    0                                                  1
0 NaN  This article needs to be updated. The reason g...
```
As you can see, the output is the first table from the page. You might notice it’s not the table we’re interested in (the countries by population).
This is because the page we are scraping contains multiple tables. You could specify the proper list index, but there is also a more targeted way to select the table you want: matching its HTML attributes.
Using Table HTML Attributes
If tables have HTML attributes, such as `id` or `class`, you can use these to extract a particular table. The `attrs` parameter in `read_html` accepts a dictionary of attributes to match; the function then extracts only the tables whose attributes match.
Let's see an example of how to use `attrs`:

```python
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, attrs={"class": "wikitable"})
df = tables[0]
```
It’s a more targeted way of table extraction compared to extracting all tables and then picking out the one you need.
Remember, attribute names and values are case-sensitive and must match the HTML source exactly. But what if the table has no attributes and we still want a targeted way to select it?
Scrape Using Regular Expressions
The Pandas `read_html` function provides the `match` parameter, which lets us use regular expressions to select specific tables based on their content.
This is especially useful when the webpage or file has numerous tables all without attributes, and we want to filter the tables based on certain criteria.
If you take a look at the Wikipedia page we’re scraping, you’ll notice that the table we need to scrape is the only table that contains the word ‘United States’:
```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, match='United States')
```
To check the number of tables that matched the regular expression and display the first table:
```python
print("Number of matched tables on site: ", len(tables))
print("First matched table:")
print(tables[0])
```
Output:
```
Number of matched tables on site:  1
First matched table:
     Rank  ...  Notes
...
```
This is the table that contains countries and their population details.
You need to use a valid index to access the specific DataFrame. If no table matches the regular expression, a `ValueError` will be raised.
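Both behaviors can be seen in a small self-contained sketch using an inline, illustrative HTML table:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>City</th></tr>
  <tr><td>Cairo</td></tr>
</table>
"""

# A pattern that matches the table's text returns it as usual.
tables = pd.read_html(io.StringIO(html), match="Cairo")
print(len(tables))  # 1

# A pattern that matches no table raises ValueError.
try:
    pd.read_html(io.StringIO(html), match="no such text")
except ValueError as err:
    print("ValueError:", err)
```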
Handling Inconsistent Table Structures
When dealing with HTML tables, you might encounter tables with merged cells or nested tables (tables within tables). Let's see how the `read_html` function handles these.
Merged Cells
For horizontally merged cells (cells spanning multiple columns) or vertically merged cells (cells spanning multiple rows), Pandas repeats the cell value across the spanned columns or rows.
Here’s an HTML table with a horizontally merged cell:
```html
<table>
  <tr>
    <td colspan="2">Merged</td>
    <td>Regular</td>
  </tr>
  <tr>
    <td>A</td>
    <td>B</td>
    <td>C</td>
  </tr>
</table>
```
When we use `read_html` to parse this HTML:

```python
df = pd.read_html('sample.html')[0]
print(df)
```
Output:
```
        0       1        2
0  Merged  Merged  Regular
1       A       B        C
```
As you can see, the merged cell value (“Merged”) has been repeated across the two spanned columns.
And this is an HTML table with a vertically merged cell:
```html
<table>
  <tr>
    <td rowspan="2">Merged</td>
    <td>Regular</td>
  </tr>
  <tr>
    <td>C</td>
  </tr>
</table>
```
When we use `read_html` to parse this HTML:

```python
df = pd.read_html('sample.html')[0]
print(df)
```
Output:
```
        0        1
0  Merged  Regular
1  Merged        C
```
In this case, the merged cell value (“Merged”) is repeated across the two spanned rows.
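The same behavior can be verified without separate files by parsing the two snippets inline (a sketch mirroring the tables above):

```python
import io
import pandas as pd

colspan_html = """
<table>
  <tr><td colspan="2">Merged</td><td>Regular</td></tr>
  <tr><td>A</td><td>B</td><td>C</td></tr>
</table>
"""

rowspan_html = """
<table>
  <tr><td rowspan="2">Merged</td><td>Regular</td></tr>
  <tr><td>C</td></tr>
</table>
"""

df_col = pd.read_html(io.StringIO(colspan_html))[0]
df_row = pd.read_html(io.StringIO(rowspan_html))[0]

# The spanned value is repeated across the spanned columns / rows.
print(df_col.iloc[0].tolist())  # ['Merged', 'Merged', 'Regular']
print(df_row[0].tolist())       # ['Merged', 'Merged']
```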
Nested Tables
For nested tables, `read_html` automatically separates the outer and inner tables; each becomes a separate DataFrame in the returned list.
Assume you have nested HTML tables (Nested HTML sample).
When we run `read_html` against it, it returns two tables, the outer table and the inner table:

```python
tables = pd.read_html('nested_tables.html')
print(tables[0])
```
Output:
```
  Header 1  Header 2                                         Header 3
0   Data 1    Data 2  Nested Header 1 Nested Header 2 Nested Data ...
1   Data 3    Data 4                                           Data 5
```
This is the outer table.
For the inner table:
```python
tables = pd.read_html('nested_tables.html')
print(tables[1])
```
Output:
```
  Nested Header 1 Nested Header 2
0   Nested Data 1   Nested Data 2
1   Nested Data 3   Nested Data 4
```
Setting Custom NA Values
Some tables mark missing data with placeholder words rather than leaving cells empty. The `na_values` parameter in the `read_html` function allows you to specify such custom NA values; the function replaces them with `NaN` in the resulting DataFrames.
Consider an HTML table where the null cells are represented by the words ‘Empty’ and ‘Blank’ (na_values sample).

Here's an example of how to use `na_values`:

```python
na_values = ['Empty', 'Blank']
tables = pd.read_html('na_values.html', na_values=na_values)
```
Output:
```
  Header 1 Header 2 Header 3
0   Data 1      NaN   Data 3
1   Data 4      NaN   Data 6
2   Data 7   Data 8   Data 9
```
Now, when Pandas encounters any of the values specified in `na_values` while parsing the HTML tables, it treats them as `NaN`.
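A self-contained version of the same idea, using an inline table with hypothetical placeholder words:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Header 1</th><th>Header 2</th></tr>
  <tr><td>Data 1</td><td>Empty</td></tr>
  <tr><td>Blank</td><td>Data 2</td></tr>
</table>
"""

# 'Empty' and 'Blank' are treated as missing values during parsing.
tables = pd.read_html(io.StringIO(html), na_values=['Empty', 'Blank'])
df = tables[0]
print(int(df.isna().sum().sum()))  # 2
```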
Set the Index
When using `read_html`, you can specify the `index_col` parameter to set a column as the index of the DataFrame directly during the table extraction process. This saves you the extra step of calling `set_index` afterwards.
The `index_col` parameter takes an integer or a sequence of integers representing the column number(s) to set as the index. Column numbers start from 0.
Here’s how you use it:
```python
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, index_col=1, match='Notes')
df = tables[0]
```
Here we used `match='Notes'` because the table we want to work with is the only table that contains that word.
Now, ‘Country’ is set as the index of the DataFrame directly at the time of table extraction. You can use country names to access specific rows:
```python
print(df.loc['United States'])
```
Output:
```
Country / Dependency                          Country / Dependency
Rank                 Rank                                                        3
Population           Numbers                                             334933000
                     % of the world                                            NaN
Date                 Date                                               25 Jun 2023
Source (official or from the United Nations)  Source (official or from the United Nations)
                                              National population clock[7]
Notes                Notes                                                     [d]
```
By using `index_col` in `read_html`, you can streamline your data extraction and preparation process.
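Here is a minimal self-contained sketch of the same technique, with an inline table and illustrative data (not the Wikipedia table):

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Egypt</td><td>Cairo</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

# Column 0 becomes the index at extraction time,
# so rows can be looked up by country name directly.
df = pd.read_html(io.StringIO(html), index_col=0)[0]
print(df.loc['Egypt', 'Capital'])  # Cairo
```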
Setting the Header
In `read_html`, you can use the `header` parameter to specify which row(s) in the table to use as the column header. This is particularly useful when the table header is not in the first row or when the table has multiple header rows.
The `header` parameter accepts an integer or a list of integers representing the row number(s) to use as the header. Remember, row numbers start from 0.
Let’s assume that the table’s header is in the second row (index 1):
```python
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, header=1, match='Notes')
df = tables[0]
```
This makes the second row of the table the header.
In the case of multiple header rows, you can pass a list of integers, which produces MultiIndex columns:

```python
tables = pd.read_html(url, header=[0, 1])
df = tables[0]
```
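With two header rows, the resulting columns form a two-level MultiIndex. A small sketch with an inline, illustrative table:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Group</th><th>Group</th></tr>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""

# Rows 0 and 1 together become a two-level column MultiIndex.
df = pd.read_html(io.StringIO(html), header=[0, 1])[0]
print(df.columns.nlevels)  # 2
print(df.columns.tolist())
```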
Skipping Rows
The `skiprows` parameter in `read_html` lets you specify which rows to skip.
It takes an integer or a sequence of integers representing the row number(s) to skip.
Here's an example:

```python
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, skiprows=[0], match='Notes')
df = tables[0]
```

The resulting DataFrame will not include the first row of the table. Note that a plain integer is interpreted as the *number* of rows to skip, so `skiprows=0` would skip nothing; a list (`[0]`) skips row 0 specifically.
You can also skip multiple non-consecutive rows by passing a list:
```python
tables = pd.read_html(url, skiprows=[0, 2])
```
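A self-contained sketch of the effect, using an inline table with illustrative rows. Whether the skipped row is the header row or the first data row depends on how the header is inferred, so here we simply compare row counts:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>H1</th></tr>
  <tr><td>row 0</td></tr>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
</table>
"""

df_all = pd.read_html(io.StringIO(html))[0]
df_skipped = pd.read_html(io.StringIO(html), skiprows=[0])[0]
print(len(df_all), len(df_skipped))  # 3 2
```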
Transforming Content
You might want to remove a unit of measurement, convert strings to integers, or apply some other transformation. You can achieve this using the `converters` parameter in `read_html`.
The `converters` parameter accepts a dictionary where the keys are column names or column numbers and the values are functions that are applied to each value in the respective column.
Consider an HTML table with percentages (converters sample).
Let’s define a function to remove the ‘%’ from the Percentage column:
```python
def convert_percent(val):
    new_val = val.replace('%', '')
    return float(new_val)

tables = pd.read_html('converters.html', converters={'Percentage': convert_percent})
df = tables[0]
```
Output:
```
  Header 1 Header 2  Percentage
0   Data 1   Data 2        25.0
1   Data 3   Data 4        50.0
2   Data 5   Data 6        75.0
```
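The same converter can be tried without the sample file by parsing inline HTML (illustrative data). Converters receive the raw cell text before type inference:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>Name</th><th>Percentage</th></tr>
  <tr><td>A</td><td>25%</td></tr>
  <tr><td>B</td><td>50%</td></tr>
</table>
"""

def convert_percent(val):
    # Strip the '%' sign and convert the remaining text to float.
    return float(val.replace('%', ''))

df = pd.read_html(io.StringIO(html),
                  converters={'Percentage': convert_percent})[0]
print(df['Percentage'].tolist())  # [25.0, 50.0]
```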
Extracting Links
The `extract_links` parameter in the `read_html` function allows you to extract hyperlinks from the HTML table data.
The parameter accepts several values: `None`, `"all"`, `"header"`, `"body"`, and `"footer"`.
- `None`: the default option. No links are extracted, and the link texts are included as regular cell values.
- `"all"`: links are extracted from all parts of the table. Each cell containing a link is represented as a tuple whose first element is the cell text and whose second element is the link.
- `"header"`: only links in the table header (inside `<th>` elements) are extracted.
- `"body"`: only links in the table body (inside `<td>` elements) are extracted.
- `"footer"`: only links in the table footer (inside `<tfoot>` elements) are extracted.
Consider an HTML table which contains links in the header, body, and footer (extract links sample):

```python
tables = pd.read_html('extract_links.html', extract_links='header')
df = tables[0]
```
Output:
```
  (Header Link, https://example.com) (Header 2, None) (Header 3, None)
0                          Body Link           Data 2           Data 3
1                             Data 4           Data 5           Data 6
2                             Data 7           Data 8      Footer Link
```
To extract links from the body only:

```python
tables = pd.read_html('extract_links.html', extract_links='body')
df = tables[0]
```
Output:
```
                        Header Link        Header 2        Header 3
0  (Body Link, https://example.com)  (Data 2, None)  (Data 3, None)
1                    (Data 4, None)  (Data 5, None)  (Data 6, None)
2                            Data 7          Data 8     Footer Link
```
To extract footer links only:
```python
tables = pd.read_html('extract_links.html', extract_links='footer')
df = tables[0]
```
Output:
```
      Header Link        Header 2                             Header 3
0       Body Link          Data 2                               Data 3
1          Data 4          Data 5                               Data 6
2  (Data 7, None)  (Data 8, None)  (Footer Link, https://example.com)
```
To extract all links:

```python
tables = pd.read_html('extract_links.html', extract_links='all')
df = tables[0]
```

Output:

```
  (Header Link, https://example.com)  ...                     (Header 3, None)
0   (Body Link, https://example.com)  ...                       (Data 3, None)
1                     (Data 4, None)  ...                       (Data 6, None)
2                     (Data 7, None)  ...  (Footer Link, https://example.com)
```
Please note, the `extract_links` feature is available starting from Pandas version 1.5.0.
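A minimal inline sketch of `extract_links="all"` (illustrative table and URLs; requires pandas 1.5 or newer). Header cells become tuple column labels, and body cells become `(text, link)` tuples:

```python
import io
import pandas as pd

html = """
<table>
  <thead>
    <tr><th><a href="https://example.com/h">Header</a></th></tr>
  </thead>
  <tbody>
    <tr><td><a href="https://example.com/b">Body</a></td></tr>
  </tbody>
</table>
"""

df = pd.read_html(io.StringIO(html), extract_links="all")[0]
print(df.iloc[0, 0])
print(tuple(df.columns[0]))
```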
pandas.read_html() Limitations
Here are a few other potential limitations of `pandas.read_html`:
- No dynamic content: `read_html` cannot handle dynamically loaded content. If the website's content changes based on user interaction or is loaded after the initial page load (like infinite scrolling), `read_html` won't be able to capture this data.
- No form submission or authentication: if the page requires user interaction, like form submissions or authentication, `read_html` cannot perform these actions.
- Limited CSS selector support: it doesn't support specific or complex CSS selectors, so if you need to select parts of the webpage that aren't strictly tables, `read_html` may not be sufficient.
- No multi-page scraping: if the data you are interested in is spread across multiple pages (pagination), `read_html` won't be able to automatically traverse these pages.
- Cannot handle non-tabular data: if the data you're interested in is not in a tabular format (e.g. paragraphs, headings, lists), `read_html` won't be able to extract it.
- Headers and cookies: web scraping libraries like requests, Beautiful Soup, and Selenium provide more control over HTTP headers and cookies, which can be essential when dealing with sites that require certain headers or cookies to be set.
- robots.txt respect: web scraping frameworks like Scrapy respect the rules set in the robots.txt file of a website. This is an important aspect of ethical web scraping, but it is not something handled by `pandas.read_html`.
Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.