Overcoming pandas.read_html Limitations using Python web scraping

The pandas read_html function is an extremely useful tool for quickly extracting HTML tables from web pages.

It allows you to pull tabular data from HTML content in just one line of code.
However, read_html has some limitations.

This tutorial will guide you through some of these challenges and provide solutions to overcome them.

For the purposes of this tutorial, we'll extract data from a sample HTML file served locally (http://localhost/test.html in the code below).
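
As a baseline, here is what the one-line approach looks like against that file; read_html fetches the page itself, so no other library is needed:

import pandas as pd

# read_html returns a list of DataFrames, one per <table> it finds in the page
dfs = pd.read_html('http://localhost/test.html')
print(dfs[0])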

Navigating Dynamic Content (JavaScript Interaction)

Web pages often contain dynamic content that is loaded or changed by JavaScript after the initial page load, so the table you want may not be present in the HTML that a simple request returns.

In that case, we can combine Selenium and BeautifulSoup: Selenium drives a real browser to trigger the JavaScript, and BeautifulSoup parses the resulting HTML.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'
driver = webdriver.Firefox()
driver.get(url)

# Click the button that loads the remaining rows via JavaScript
button = driver.find_element(By.ID, 'loadMoreButton')
button.click()

# Parse the rendered HTML after the click and extract the first table
html_content = driver.page_source
driver.quit()
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

Output

   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3
3       a4      b4      c4
4       a5      b5      c5

In this script, Selenium is used not just to open the web page but also to interact with it.

The loadMoreButton element is clicked to load additional rows into the table, and the rendered HTML is then parsed with BeautifulSoup and passed to pandas.read_html.
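
If the extra rows are injected asynchronously, reading page_source immediately after the click can race against the JavaScript. Here is a minimal sketch using Selenium's explicit waits; the table id dataTable and the initial row count of 3 are assumptions about the sample page, not part of the original example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd

driver = webdriver.Firefox()
driver.get('http://localhost/test.html')
driver.find_element(By.ID, 'loadMoreButton').click()

# Wait up to 10 seconds until the (assumed) #dataTable contains more than 3 rows
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, '#dataTable tr')) > 3
)

dfs = pd.read_html(driver.page_source)
driver.quit()
print(dfs[0])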

Form Submission and Authentication

Another limitation of pandas.read_html is that it does not support form submissions or authentication. Both of these tasks can be performed using Selenium and BeautifulSoup.
Here is an example of form submission and authentication:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'
driver = webdriver.Firefox()
driver.get(url)

# Fill in the login form and submit it
username = driver.find_element(By.ID, 'username')
password = driver.find_element(By.ID, 'password')
username.send_keys('test_user')
password.send_keys('test_password')
login_button = driver.find_element(By.ID, 'login')
login_button.click()

# Parse the page shown after logging in and extract the first table
html_content = driver.page_source
driver.quit()
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

Output

   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

In the above script, Selenium is used to automate the process of logging into the website. The script finds the elements for the username and password fields, inputs the login credentials, then clicks the login button.

Once the script is authenticated, it fetches the page’s HTML content and extracts the HTML table data.
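
If the login form submits a plain POST request without any JavaScript, a full browser is not strictly necessary; a requests.Session can do the same job. Here is a minimal sketch, assuming the form posts fields named username and password to a /login endpoint (both are assumptions about the sample site):

import requests
import pandas as pd

session = requests.Session()

# Field names and the /login endpoint are assumptions about the sample site
session.post('http://localhost/login',
             data={'username': 'test_user', 'password': 'test_password'})

# The session keeps any authentication cookies for subsequent requests
response = session.get('http://localhost/test.html')
dfs = pd.read_html(response.text)
print(dfs[0])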

Complex CSS Selectors

While pandas.read_html works great with well-structured tables, it has no way to target a table by its position in the surrounding markup; that is exactly what complex CSS selectors are for.

BeautifulSoup comes in handy for this, allowing us to use CSS selectors to navigate and search the parse tree.
Here’s an example of using CSS selectors with BeautifulSoup:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

# Select tables with class 'table-class' that are direct children of <div class="content">
table = soup.select('div.content > table.table-class')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

Output

   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

In this script, BeautifulSoup’s select function is used with a CSS selector that matches table elements with the class table-class that are direct children of a div with the class content; the first match is then handed to pandas.read_html.

This level of precision in targeting specific parts of the HTML content is not achievable with pandas.read_html alone.
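
For comparison, pandas.read_html does accept an attrs argument that filters tables by their own HTML attributes, but it cannot express parent-child relationships the way a CSS selector can:

import pandas as pd

# Matches any table whose class attribute is 'table-class', wherever it sits
# in the document; the parent <div class="content"> cannot be expressed here
dfs = pd.read_html('http://localhost/test.html', attrs={'class': 'table-class'})
print(dfs[0])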

Multi-page Scraping

A common issue with web scraping is dealing with paginated content. pandas.read_html does not have built-in support for automatically navigating through multiple pages.

We can use Scrapy, a powerful Python scraping framework, to handle this:

import scrapy
import pandas as pd

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://localhost/test.html']

    def parse(self, response):
        # Extract the table on the current page
        dfs = pd.read_html(response.text)
        print(dfs[0])

        # Follow the pagination link, if there is one
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Output

Page 1
   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3
...
Page 2
   Column1 Column2 Column3
0       a4      b4      c4
1       a5      b5      c5
2       a6      b6      c6
...
...

In the above script, a Scrapy spider is created to navigate through multiple pages. The parse method is called for the initial URL and for each subsequent page.

It extracts the HTML table from the current page using pandas.read_html and then finds the link to the next page using a CSS selector. If a next page is found, the spider follows the link and the process repeats.
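
To try the spider without creating a full Scrapy project, it can be launched from a plain script with CrawlerProcess (assuming the MySpider class above is defined in the same file):

from scrapy.crawler import CrawlerProcess

# Run the spider defined above directly, instead of via the 'scrapy crawl' command
process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished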

Handle Non-tabular Data

Another limitation of pandas.read_html is that it only extracts tabular data. If the data you are interested in is stored in another HTML structure, like a list, you will need another tool.

Here’s an example of using BeautifulSoup to extract a list of items:

from bs4 import BeautifulSoup
import requests

url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

# Select every <li> inside <ul class="item-list"> and pull out its text
items = soup.select('ul.item-list > li')
items_text = [item.get_text() for item in items]
print(items_text)

Output

['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']

In this script, BeautifulSoup’s select function is used with a CSS selector that targets all li elements within a ul with the class item-list. The text of each item is extracted into a list.
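
If you still want the result as a DataFrame, the extracted list can be wrapped directly:

import pandas as pd

# Turn the scraped list of strings into a single-column DataFrame
df = pd.DataFrame(items_text, columns=['item'])
print(df)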

HTTP Headers and Cookies

One more limitation of pandas.read_html is that it doesn’t allow controlling HTTP headers or cookies, which are often necessary for accessing certain web pages. We can use the requests library to handle this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'

# Custom headers and cookies to send with the request
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'session_id': '1234567890'}
response = requests.get(url, headers=headers, cookies=cookies)

soup = BeautifulSoup(response.text, "lxml")
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

Output

   Column1 Column2 Column3
0       a1      b1      c1
1       a2      b2      c2
2       a3      b3      c3

In this script, requests.get is used to fetch the HTML content, but this time we pass a dictionary of HTTP headers and a dictionary of cookies.

This allows us to, for example, pretend to be a certain type of browser or maintain a session across multiple requests.

The HTML content is then parsed with BeautifulSoup and the table is extracted with pandas.read_html.
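
When several pages must be fetched with the same headers and cookies, a requests.Session saves you from passing them on every call; cookies set by the server along the way are stored on the session automatically:

import requests
import pandas as pd

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
session.cookies.set('session_id', '1234567890')

# Every request made through the session reuses the headers and cookies above
response = session.get('http://localhost/test.html')
dfs = pd.read_html(response.text)
print(dfs[0])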

Respecting robots.txt

Web scraping should always be performed in a respectful and ethical manner. Most websites include a robots.txt file that states which parts of the website should not be crawled or scraped.

To follow these rules, you can manually check the robots.txt file (typically found at the website’s root, like http://localhost/robots.txt), or use a library like robotexclusionrulesparser to apply them automatically:

from robotexclusionrulesparser import RobotExclusionRulesParser

url = 'http://localhost'
rp = RobotExclusionRulesParser()
rp.fetch(url + '/robots.txt')

# is_allowed(user_agent, url) checks the rules; '*' matches any crawler
can_fetch_page = rp.is_allowed('*', url + '/test.html')
print(can_fetch_page)

Output

True

In this script, we use the RobotExclusionRulesParser to fetch and parse the robots.txt file from the target website.

The is_allowed method is used to check whether a specific page may be scraped, according to the robots.txt rules.

The '*' argument means that we’re checking the rules that apply to all web crawlers. If the output is True, we’re allowed to scrape the page.

Remember, it’s important to follow these rules, both out of courtesy to the website’s operators and to avoid potential legal issues.
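
The standard library can do this check as well, so no third-party package is strictly required; urllib.robotparser exposes a similar interface:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://localhost/robots.txt')
rp.read()

# True if the rules for all user agents ('*') allow fetching the page
print(rp.can_fetch('*', 'http://localhost/test.html'))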
