Read HTML tables using Pandas read_html function

You can use the read_html function to scrape HTML tables directly from a website and convert them into DataFrames.

The read_html function takes a URL, a path to a local HTML file, or a string/file-like object containing HTML; it extracts all the tables contained within that HTML page and returns a list of DataFrames.

Each DataFrame corresponds to a table on the HTML page.
By default, the read_html function relies on the lxml, BeautifulSoup (bs4), and html5lib libraries to parse the HTML page: it tries lxml first and falls back to BeautifulSoup with html5lib if lxml fails.

This means you need at least one of these parsers installed in your Python environment to use this function.

!pip install lxml beautifulsoup4 html5lib


Syntax and Parameters

The read_html function in Pandas is quite flexible and allows several parameters to control the data extraction. The syntax of the function is as follows:

pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None)

Here’s a brief explanation of the parameters:

  • io: This parameter takes a string which could either be a URL, a file path, or HTML content.
  • match: This is a regular expression that the table should match to be extracted. The default value is ‘.+’, which means that all tables are extracted.
  • flavor: The parsing engine to use under the hood. The default is None, which tries ‘lxml’ first and falls back to BeautifulSoup with ‘html5lib’ if lxml fails. You can also specify a flavor explicitly: ‘lxml’, ‘bs4’, or ‘html5lib’.
  • header: This parameter takes an integer or a sequence of integers that define the row(s) to use as the column names. The default is None which means the column names are inferred from the table header row.
  • index_col: This takes an integer or sequence of integers defining the column(s) to set as the index (a MultiIndex if multiple columns are given).
  • skiprows: This parameter takes an integer or a sequence of integers for skipping the specified row(s).
  • attrs: This is a dictionary of HTML attributes that the table tags should contain.
  • parse_dates: This parameter is for automatically parsing dates in the table(s). It defaults to False.
  • converters, na_values: These parameters transform column values and define custom NA values, respectively.

It is important to note that the read_html function returns a list of DataFrames.


Extract HTML Tables using Pandas read_html

The pandas.read_html function allows you to extract tables from a local HTML file, URL, or any file-like object that contains HTML.

Extracting Tables from a Local HTML File

Assume you have an HTML file (sample.html) in the same directory as your Python script, and that it contains two simple tables.
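For reference, such a file might look like this. The first table below matches the output shown later; the second table’s contents are hypothetical:

<table>
<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>
<tr><td>Data 1.1</td><td>Data 1.2</td><td>Data 1.3</td></tr>
<tr><td>Data 2.1</td><td>Data 2.2</td><td>Data 2.3</td></tr>
<tr><td>Data 3.1</td><td>Data 3.2</td><td>Data 3.3</td></tr>
</table>

<table>
<tr><th>Column A</th><th>Column B</th></tr>
<tr><td>Value 1</td><td>Value 2</td></tr>
</table>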

Now, let’s read the HTML tables from that page using read_html:
import pandas as pd
file_path = "sample.html"
tables = pd.read_html(file_path)

Let’s check the number of tables and display the first one:

print("Number of tables in file: ", len(tables))
print("First table:")
print(tables[0])

Output:

Number of tables in file:  2
First table:
   Header 1  Header 2  Header 3
0  Data 1.1  Data 1.2  Data 1.3
1  Data 2.1  Data 2.2  Data 2.3
2  Data 3.1  Data 3.2  Data 3.3

The output shows the number of tables in the HTML file and the content of the first table.

Extracting Data from a URL

We’ll extract data from the ‘List of countries and dependencies by population’ page on Wikipedia.

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url)

The pd.read_html(url) function returns a list of DataFrames. Here’s how we can check the number of tables and display the first table:

print("Number of tables on site: ", len(tables))
print("First table:")
print(tables[0])

Output:

Number of tables on site:  3
First table:
    0                                                  1
0 NaN  This article needs to be updated. The reason g...

As you can see, the output is the first table from the page, and it’s not the table we’re interested in (the countries by population).

This is because the page we are scraping contains multiple tables. One option is to find the correct list index, as shown below; a more targeted option is to match the table’s HTML attributes, covered in the next section.
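For example, you could inspect each extracted table’s shape to locate the large population table (a quick sketch; the exact list index can change as the page is edited):

for i, table in enumerate(tables):
    print(i, table.shape)

# Pick the table whose shape matches the data you expect.
# The index below is hypothetical; verify it against the printed shapes.
df = tables[1]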


Using Table HTML Attributes

If tables have HTML attributes, such as id or class, you can use these to extract a particular table.
The attrs parameter in read_html accepts a dictionary of attributes to match. The function then only extracts tables with matching attributes.
Let’s see an example of how to use attrs:

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, attrs={"class": "wikitable"})
df = tables[0]

It’s a more targeted way of extracting a table than pulling all tables and picking out the one you need.
Remember, attribute names and values are case-sensitive and must match those in the HTML source exactly. But what if the table has no attributes and we still want a targeted way to select it?


Scrape Using Regular Expressions

The Pandas read_html function provides the match parameter, which lets us use regular expressions to match specific tables based on their content.

This is especially useful when the webpage or file has numerous tables, none of which have attributes, and we want to filter the tables based on their content.
If you take a look at the Wikipedia page we’re scraping, you’ll notice that the table we need is the only one that contains the text ‘United States’:

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, match='United States')

To check the number of tables that matched the regular expression and display the first table:

print("Number of matched tables on site: ", len(tables))
print("First matched table:")
print(tables[0])

Output:

Number of matched tables on site:  1
First matched table:
    Rank  ... Notes   
...

This is the table that contains countries and their population details.
You need to use a valid index to access a specific DataFrame. If no table matches the regular expression, read_html raises a ValueError.
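A defensive sketch that handles the no-match case (the pattern ‘No Such Text’ is just a placeholder that matches nothing on the page):

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
try:
    tables = pd.read_html(url, match='No Such Text')  # placeholder pattern
except ValueError:
    # read_html raises ValueError when no table matches the pattern
    print("No matching table found")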


Handling Inconsistent Table Structures

When dealing with HTML tables, you might encounter tables with merged cells or nested tables (tables within tables). Let’s see how the read_html function will handle these.

Merged Cells

For horizontally merged cells (cells spanning multiple columns) or vertically merged cells (cells spanning multiple rows), Pandas repeats the cell value across the spanned columns or rows.

Here’s an HTML table with a horizontally merged cell:

<table>
<tr>
<td colspan="2">Merged</td>
<td>Regular</td>
</tr>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
</table>

When we use read_html to parse this HTML:

df = pd.read_html('sample.html')[0]
print(df)

Output:

        0       1        2
0  Merged  Merged  Regular
1       A       B        C

As you can see, the merged cell value (“Merged”) has been repeated across the two spanned columns.

And this is an HTML table with a vertically merged cell:

<table>
<tr>
<td rowspan="2">Merged</td>
<td>Regular</td>
</tr>
<tr>
<td>C</td>
</tr>
</table>

When we use read_html to parse this HTML:

df = pd.read_html('sample.html')[0]
print(df)

Output:

        0        1
0  Merged  Regular
1  Merged        C

In this case, the merged cell value (“Merged”) is repeated across the two spanned rows.

Nested Tables

For nested tables, read_html automatically separates the outer and inner tables. Each table becomes a separate DataFrame in the returned list.

Assume you have nested HTML tables (Nested HTML sample).
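Such a file might look like this (a minimal sketch consistent with the outputs below):

<table>
<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>
<tr>
<td>Data 1</td>
<td>Data 2</td>
<td>
<table>
<tr><th>Nested Header 1</th><th>Nested Header 2</th></tr>
<tr><td>Nested Data 1</td><td>Nested Data 2</td></tr>
<tr><td>Nested Data 3</td><td>Nested Data 4</td></tr>
</table>
</td>
</tr>
<tr><td>Data 3</td><td>Data 4</td><td>Data 5</td></tr>
</table>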

When we run read_html against it, it returns two tables, the outer table and the inner table:

tables = pd.read_html('nested_tables.html')
print(tables[0])

Output:

  Header 1 Header 2                                           Header 3
0   Data 1   Data 2  Nested Header 1  Nested Header 2  Nested Data ...
1   Data 3   Data 4                                             Data 5

This is the outer table.

For the inner table:

tables = pd.read_html('nested_tables.html')
print(tables[1])

Output:

  Nested Header 1 Nested Header 2
0   Nested Data 1   Nested Data 2
1   Nested Data 3   Nested Data 4


Setting Custom NA Values

The na_values parameter in the read_html function allows you to specify custom values that should be treated as missing. The function replaces these with NaN in the resulting DataFrames.

Consider an HTML table where the null cells are represented by the words ‘Empty’ and ‘Blank’ (na_values sample).
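Such a file might look like this (a sketch consistent with the output below):

<table>
<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>
<tr><td>Data 1</td><td>Empty</td><td>Data 3</td></tr>
<tr><td>Data 4</td><td>Blank</td><td>Data 6</td></tr>
<tr><td>Data 7</td><td>Data 8</td><td>Data 9</td></tr>
</table>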

Here’s an example of how to use na_values:

na_values = ['Empty', 'Blank']
tables = pd.read_html('na_values.html', na_values=na_values)
print(tables[0])

Output:

  Header 1 Header 2 Header 3
0   Data 1      NaN   Data 3
1   Data 4      NaN   Data 6
2   Data 7   Data 8   Data 9

Now, when Pandas encounters any of the specified values in na_values while parsing the HTML tables, it will treat them as NaN.


Set the Index

When using read_html, you can specify the index_col parameter to set a column as the index of the DataFrame directly during the table extraction process.

This can save you the extra step of calling set_index afterwards.
The index_col parameter takes an integer or a sequence of integers representing the column number(s) to set as the index. Column numbers start from 0.
Here’s how you use it:

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, index_col=1, match='Notes')
df = tables[0]

Here we used the match='Notes' because the table we want to work with is the only table that contains that word.

Now, the ‘Country / Dependency’ column is set as the index of the DataFrame directly at the time of table extraction. You can use country names to access specific rows:

print(df.loc['United States'])

Output:

Country / Dependency                          Country / Dependency                        
Rank                                          Rank                                                                       3
Population                                    Numbers                                                            334933000
                                              % of the world                                                           NaN
Date                                          Date                                                             25 Jun 2023
Source (official or from the United Nations)  Source (official or from the United Nations)    National population clock[7]
Notes                                         Notes                                                                    [d]

By using index_col in read_html, you can streamline your data extraction and preparation process.


Setting the Header

In read_html, you can use the header parameter to specify which row(s) in the table to use as the column header. This can be particularly useful when the table header is not in the first row or when the table has multiple header rows.
The header parameter accepts an integer or a list of integers representing the row number(s) to use as the header. Remember, row numbers start from 0.
Let’s assume that the table’s header is in the second row (index 1):

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, header=1, match='Notes')
df = tables[0]

This will use the second row of the table as the header.

In the case of multiple headers, you can pass a list of integers:

tables = pd.read_html(url, header=[0, 1])
df = tables[0]
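Passing a list of header rows produces a MultiIndex for the columns; a quick check:

print(df.columns)
# With header=[0, 1], each column label is a tuple
# built from the two header rows (a MultiIndex).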


Skipping Rows

The skiprows parameter in read_html lets you specify which rows to skip.
It takes an integer (the number of rows to skip from the top of the table) or a sequence of integers (the 0-based indices of the rows to skip).
Here’s an example:

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
tables = pd.read_html(url, skiprows=1, match='Notes')
df = tables[0]

The resulting DataFrame will not include the first row of the table.

You can also skip multiple non-consecutive rows by passing a list:

tables = pd.read_html(url, skiprows=[0, 2])


Transforming Content

You might want to remove a unit of measurement, convert strings to integers, or apply some other transformation.

You can achieve this using the converters parameter in read_html.
The converters parameter accepts a dictionary where the keys are either column names or column numbers, and the values are functions that will be applied to each value in the respective column.
Consider an HTML table with percentages (converters sample):
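A sketch of what converters.html might contain, consistent with the output below:

<table>
<tr><th>Header 1</th><th>Header 2</th><th>Percentage</th></tr>
<tr><td>Data 1</td><td>Data 2</td><td>25%</td></tr>
<tr><td>Data 3</td><td>Data 4</td><td>50%</td></tr>
<tr><td>Data 5</td><td>Data 6</td><td>75%</td></tr>
</table>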

Let’s define a function to remove the ‘%’ from the Percentage column:

def convert_percent(val):
    new_val = val.replace('%', '')
    return float(new_val)

tables = pd.read_html('converters.html', converters={'Percentage': convert_percent})
df = tables[0]
print(df)

Output:

  Header 1 Header 2  Percentage
0   Data 1   Data 2        25.0
1   Data 3   Data 4        50.0
2   Data 5   Data 6        75.0


Extracting Links

The extract_links parameter in the read_html function allows for extracting hyperlinks from the HTML table data.

The parameter accepts several values: None, "all", "header", "body", and "footer".

  1. None: This is the default option. No links are extracted, and the link texts are included as regular cell values.
  2. “all”: Links are extracted from all parts of the table. Each cell containing a link is represented as a tuple where the first element is the cell text and the second element is the link.
  3. “header”: Only links in the table header (inside <th> elements) are extracted.
  4. “body”: Only links in the table body (inside <td> elements) are extracted.
  5. “footer”: Only links in the table footer (inside <tfoot> elements) are extracted.

Consider an HTML table that contains links in the header, body, and footer (extract links sample):
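Such a file might look like this (a sketch consistent with the outputs below; https://example.com is a placeholder URL):

<table>
<thead>
<tr><th><a href="https://example.com">Header Link</a></th><th>Header 2</th><th>Header 3</th></tr>
</thead>
<tbody>
<tr><td><a href="https://example.com">Body Link</a></td><td>Data 2</td><td>Data 3</td></tr>
<tr><td>Data 4</td><td>Data 5</td><td>Data 6</td></tr>
</tbody>
<tfoot>
<tr><td>Data 7</td><td>Data 8</td><td><a href="https://example.com">Footer Link</a></td></tr>
</tfoot>
</table>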

To extract links from the header only:

tables = pd.read_html('extract_links.html', extract_links='header')
df = tables[0]

Output:

  (Header Link, https://example.com) (Header 2, None) (Header 3, None)
0                          Body Link           Data 2           Data 3
1                             Data 4           Data 5           Data 6
2                             Data 7           Data 8      Footer Link

To extract links from the body only:

tables = pd.read_html('extract_links.html', extract_links='body')
df = tables[0]

Output:

                        Header Link        Header 2        Header 3
0  (Body Link, https://example.com)  (Data 2, None)  (Data 3, None)
1                    (Data 4, None)  (Data 5, None)  (Data 6, None)
2                            Data 7          Data 8     Footer Link

To extract footer links only:

tables = pd.read_html('extract_links.html', extract_links='footer')
df = tables[0]

Output:

      Header Link        Header 2                            Header 3
0       Body Link          Data 2                              Data 3
1          Data 4          Data 5                              Data 6
2  (Data 7, None)  (Data 8, None)  (Footer Link, https://example.com)

To extract all links:

tables = pd.read_html('extract_links.html', extract_links='all')
df = tables[0]

Output:

  (Header Link, https://example.com)  ...                    (Header 3, None)
0   (Body Link, https://example.com)  ...                      (Data 3, None)
1                     (Data 4, None)  ...                      (Data 6, None)
2                     (Data 7, None)  ...  (Footer Link, https://example.com)

Please note, the extract_links feature is available starting from Pandas version 1.5.0.


pandas.read_html() Limitations

Here are a few potential limitations of pandas.read_html:

  1. No dynamic content: read_html cannot handle dynamically loaded content. If the content of the website changes based on user interaction or is loaded after the initial page load (like infinite scrolling), then read_html won’t be able to capture this data.
  2. No form submission or authentication: If the page requires user interaction, like form submissions or authentication, read_html cannot perform these actions.
  3. Limited CSS Selector Support: It doesn’t support specific or complex CSS selectors. So if you need to select very specific parts of the webpage that aren’t strictly tables, then read_html may not be sufficient.
  4. No multi-page scraping: If the data you are interested in is spread across multiple pages (pagination), read_html won’t be able to automatically traverse these pages.
  5. Cannot handle non-tabular data: If the data you’re interested in is not in a tabular format (e.g. paragraphs, headings, lists, etc.), read_html won’t be able to extract it.
  6. Headers and Cookies: Web scraping libraries like requests, Beautiful Soup, and Selenium provide more control over HTTP headers and cookies, which can be essential when dealing with sites that require certain headers or cookies to be set (see the sketch after this list).
  7. Robots.txt Respect: Web scraping tools like Scrapy respect the rules set in the robots.txt file of the website. This is an important aspect of ethical web scraping, but is not something handled by pandas.read_html.
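For example, you can fetch the page yourself with the requests library and custom headers, then hand the HTML to read_html (a sketch; the header value is a placeholder, and requests must be installed):

import io
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
headers = {"User-Agent": "my-scraper/1.0"}  # placeholder header value

# Fetch the page with custom headers, then parse the tables from the
# downloaded HTML. Wrapping the text in StringIO treats it as a file-like
# object, which read_html accepts.
response = requests.get(url, headers=headers)
tables = pd.read_html(io.StringIO(response.text), match='United States')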


Further Reading

https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
