
Modern Python web scraping using (Beautiful Soup & Selenium)

In this tutorial, we will talk about Python web scraping and how to scrape web pages using multiple libraries such as Beautiful Soup, Selenium, and some other magic tools like PhantomJS.

You’ll learn how to scrape static web pages, dynamic pages (Ajax-loaded content), and iframes, how to get specific HTML elements, how to handle cookies, and much more. You will also learn about scraping traps and how to avoid them.


We will use Python 3.x in this tutorial, so let’s get started.

 

What is Web Scraping

Web scraping is the process of extracting data from the web so that you can analyze it and pull out useful information.

Also, you can store the scraped data in a database or any kind of tabular format such as CSV, XLS, etc, so you can access that information easily.

The scraped data can be passed to a library like NLTK for further processing to understand what the page is talking about.

 

Benefits of Web Scraping

You might wonder: why should I scrape the web when I have Google? Well, we are not reinventing the wheel here. Web scraping is not only for building search engines.

You can scrape your competitor’s web pages, analyze the data, and see which products your competitor’s clients are happy with, based on their responses. All this for FREE.

A successful SEO tool like Moz scrapes and crawls the entire web and processes the data for you, so you can see what people are interested in and how to compete with others in your field to reach the top.

These are just some simple uses. Scraped data can mean making money :).

 


Install Beautiful Soup

I assume that you have some background in Python basics, so let’s install our first Python scraping library which is Beautiful Soup.

To install Beautiful Soup, you can use pip or you can install it from the source.

I’ll install it using pip like this:

$ pip install beautifulsoup4

To check if it’s installed or not, open your editor and type the following:

from bs4 import BeautifulSoup

Then run it:

$ python myfile.py

If it runs without errors, that means Beautiful Soup is installed successfully. Now, let’s see how to use Beautiful Soup.

 


Using Beautiful Soup

Take a look at this simple example, in which we extract the page title using Beautiful Soup:
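A minimal sketch of such a title scraper could look like the following. The helper names `title_from_html` and `fetch_title` are illustrative, not part of Beautiful Soup, and the inline sample page is made up so the snippet runs without a network connection.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def title_from_html(html):
    """Parse an HTML string (or bytes) and return the text of its <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text()


def fetch_title(url):
    """Download a page with urlopen and return its title."""
    html = urlopen(url)                   # connect to the page
    return title_from_html(html.read())   # read the raw HTML and parse it


# Offline demonstration on an inline document.
sample = "<html><head><title>Sample page</title></head><body></body></html>"
print(title_from_html(sample))
```

Calling `fetch_title("https://www.python.org/")` would do the same against a live page; that URL is only a placeholder.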


The result is:

[Image: Get page title]

We use the urlopen function (from urllib.request) to connect to the web page we want, then we read the returned HTML using the html.read() method.

The returned HTML is transformed into a Beautiful Soup object, which has a hierarchical structure.

That means if you need to extract any HTML element, you just need to know the surrounding tags to get it as we will see later.

Handling HTTP Exceptions

urlopen may raise an error for a number of reasons. The server could respond with 404 if the page is not found or 500 if there is an internal server error, so we need to keep the script from crashing by using exception handling like this:
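One way to write that handling is sketched below. The function name `read_page` and its `opener` parameter are assumptions made for illustration: `opener` defaults to `urlopen` and exists only so the function can be tried without a network connection.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def read_page(url, opener=urlopen):
    """Return the raw HTML of `url`, or None if the server replies with an HTTP error."""
    try:
        html = opener(url)
    except HTTPError as err:
        # err.code holds the status code, for example 404 or 500.
        print(f"Server returned HTTP {err.code} for {url}")
        return None
    return html.read()
```

Returning None instead of letting the exception propagate keeps a long scraping run alive when a single page fails.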


Great, what if the server is down or you typed the domain incorrectly?

Handling URL Exceptions

We need to handle this kind of exception as well. It is raised as URLError, so our code will look like this:
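A sketch that covers both cases follows. Note that HTTPError is a subclass of URLError, so it has to be caught first. As before, `read_page` and its `opener` hook are illustrative names, not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def read_page(url, opener=urlopen):
    """Return the raw HTML of `url`, or None when the request fails."""
    try:
        html = opener(url)
    except HTTPError as err:
        # The server was reached but answered with an error status.
        print(f"HTTP error {err.code} for {url}")
        return None
    except URLError as err:
        # The server could not be reached at all (bad domain, server down...).
        print(f"Could not reach {url}: {err.reason}")
        return None
    return html.read()
```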


The last thing we need to check is the returned tag. You may type a tag name incorrectly, or try to scrape a tag that does not exist on the page; in both cases Beautiful Soup returns None, so you need to check for it.

This can be done using a simple if statement like this:
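The check can be sketched as follows, using made-up inline HTML so it runs offline. The function name `first_heading` is illustrative.

```python
from bs4 import BeautifulSoup


def first_heading(html):
    """Return the text of the first <h1>, or None when the page has no <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1")
    if tag is None:
        # The tag is missing (or was mistyped), so there is nothing to read.
        return None
    return tag.get_text(strip=True)


print(first_heading("<body><h1> Hello </h1></body>"))
print(first_heading("<body><p>No heading here</p></body>"))
```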


Great, our scraper is doing a good job. We are now able to scrape the whole page or a specific tag.

 

Scrape HTML Tags using Class Attribute

Now let’s try to be selective by scraping some HTML elements based on their CSS classes.

The Beautiful Soup object has a function called findAll (spelled find_all in current releases, with findAll kept as an alias) which extracts or filters elements based on their attributes.

We can filter all h2 elements whose class is “widget-title” like this:

tags = res.findAll("h2", {"class": "widget-title"})

Then we can use a for loop to iterate over them and do whatever we need with each one.

So our code will look like this:
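Here is a small sketch of that loop on an inline sample document (the markup is made up for illustration). It uses `find_all` and `get_text`, the current spellings of findAll and getText.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2 class="widget-title">First post</h2>
  <h2 class="other">Not a post title</h2>
  <h2 class="widget-title">Second post</h2>
</div>
"""

res = BeautifulSoup(html, "html.parser")
tags = res.find_all("h2", {"class": "widget-title"})

titles = [tag.get_text() for tag in tags]
for tag in tags:
    print(tag.get_text())   # inner text only
    print(tag)              # the whole tag, markup included
```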


This code returns all h2 tags whose class is widget-title; these tags hold the home page post titles.

We use the getText function to print only the inner content of each tag. If you don’t use getText, you’ll get the whole tag, markup included.

Check the difference.

This is the output when we use getText():

[Image: Scrape using gettext]

And this is the output without getText():

[Image: Scrape without gettext]

 


Scrape HTML Tags using findAll

We saw how the findAll function filters tags by class, but that is not all it can do.

To match several tag names at once, pass them as a list:

tags = res.findAll(["span", "a", "img"])

This code gets all span, anchor, and image tags from the scraped HTML.

Also, you can extract tags that have these classes:

tags = res.findAll("a", {"class": ["url", "readmorebtn"]})

This code extracts all anchor tags that have the “readmorebtn” or the “url” class.

You can filter the content based on the inner text itself using the text argument like this:

tags = res.findAll(text="Python Programming Basics with Examples")

The findAll function returns all elements that match the specified attributes, but if you want to return one element only, you can use the limit parameter or use the find function which returns the first element only.
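The filters above can be compared side by side on a small made-up document. This is only a sketch: it uses `find_all` (the current name of findAll) and the `string=` argument, which is the newer spelling of `text=`.

```python
from bs4 import BeautifulSoup

html = """
<p><span>one</span> <a class="url" href="/a">A</a>
<a class="readmorebtn" href="/b">B</a> <a href="/c">C</a>
<img src="x.png"></p>
"""
res = BeautifulSoup(html, "html.parser")

# A list of names matches any of those tags.
mixed = res.find_all(["span", "a", "img"])

# A list of classes matches tags carrying at least one of them.
by_class = res.find_all("a", {"class": ["url", "readmorebtn"]})

# string= filters on the exact inner text.
by_text = res.find_all(string="C")

# limit=1 and find() both stop at the first match.
limited = res.find_all("a", limit=1)
first = res.find("a")
```

Note that `find_all(..., limit=1)` still returns a list, while `find` returns the element itself (or None).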

 

Find nth Child Using Beautiful Soup

The Beautiful Soup object has many powerful features. For example, you can reach nested elements directly like this:

tags = res.span.findAll("a")

This line will get the first span element on the Beautiful Soup object then scrape all anchor elements under that span.

What if you need to get the nth-child?

You can use the select function like this:

tag = res.find("nav", {"id": "site-navigation"}).select("a")[3]

This line gets the nav element with id “site-navigation” then we grab the fourth anchor tag from that nav element.
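A quick sketch on a made-up navigation block shows the indexing at work. The `:nth-of-type` selector in the last line is an alternative added here for comparison, not something the example above relies on.

```python
from bs4 import BeautifulSoup

html = """
<nav id="site-navigation">
  <a href="/">Home</a> <a href="/linux">Linux</a>
  <a href="/python">Python</a> <a href="/about">About</a>
</nav>
"""
res = BeautifulSoup(html, "html.parser")
nav = res.find("nav", {"id": "site-navigation"})

fourth = nav.select("a")[3]   # index 3 is the fourth match
same = res.select_one("#site-navigation a:nth-of-type(4)")

print(fourth.get_text())
```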

Beautiful Soup is a powerful library!!

 

Find Tags using Regex

In a previous tutorial, we talked about regular expressions and saw how powerful regex is for identifying common patterns such as emails, URLs, and much more.

Luckily, Beautiful Soup supports this: you can pass regex patterns to match specific tags.

Imagine that you want to scrape links that match a specific pattern, such as internal links or certain external links, or scrape images that reside in a specific path.

The regex engine makes such jobs easy.
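As a sketch, a compiled pattern can be passed as an attribute value. The image paths below are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<img src="../uploads/photo_1.png">
<img src="../uploads/logo.png">
<img src="../uploads/photo_2.jpg">
<img src="../images/photo_3.png">
"""
res = BeautifulSoup(html, "html.parser")

# Match src values that sit in ../uploads/, start with photo_ and end in .png.
pattern = re.compile(r"\.\./uploads/photo_.*\.png$")
tags = res.find_all("img", {"src": pattern})
sources = [tag["src"] for tag in tags]
print(sources)
```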


These lines scrape all PNG images that sit under ../uploads/ and whose names start with photo_.

This is just a simple example to show you the power of regular expressions combined with Beautiful Soup.

 


Scraping JavaScript

Suppose that the page you need to scrape shows a loading page that redirects you to the required page without changing the URL, or that some pieces of the page load their content using Ajax.

Our scraper won’t get any of that content, since it doesn’t run the JavaScript required to load it.

Your browser does run JavaScript and loads such content normally, and that is exactly what we will do using our second scraping library, Selenium.

The Selenium library doesn’t include its own browser. To make it work, you need to install a web driver for a third-party browser, in addition to the browser itself of course.

You can choose from Chrome, Firefox, Safari, or Edge.

If you install one of these drivers, let’s say Chrome’s, Selenium opens an instance of the browser and loads your page; then you can scrape or interact with it.

Using ChromeDriver with Selenium

First, you should install the selenium library like this:

$ pip install selenium

Then you should download the Chrome driver (ChromeDriver) and add it to your system PATH.

Now you can load your page like this:
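A minimal sketch, assuming Chrome is installed and ChromeDriver is on your PATH. The function names are illustrative. `page_title` accepts any Selenium driver, and `title_with_chrome` shows how to create and close a Chrome instance around it.

```python
def page_title(driver, url):
    """Load `url` in the given Selenium driver and return the page title."""
    driver.get(url)
    return driver.title


def title_with_chrome(url):
    """Open Chrome through ChromeDriver, read the title, then close the browser."""
    # Imported here so page_title() stays usable without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        return page_title(driver, url)
    finally:
        driver.quit()   # always release the browser window
```

Calling `title_with_chrome("https://www.python.org/")` would open a Chrome window, load the page, and return its title; the URL is a placeholder.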


The output looks like this:

[Image: Scrape Pages Using Selenium Chrome Driver]

Pretty simple, right?

We didn’t interact with page elements, so we haven’t seen the power of Selenium yet. Just wait for it.

 

Using Selenium+PhantomJS

You might like working with browser drivers, but many more people prefer running code in the background without watching it in action.

For this purpose, there is an awesome tool called PhantomJS that loads your page and runs your code without opening any browser window.

PhantomJS enables you to interact with scraped page cookies and JavaScript without a headache.

Also, you can use it like Beautiful Soup to scrape pages and elements inside those pages.

Download PhantomJS and put it in your PATH so we can use it as a web driver with Selenium.

Now, let’s scrape the web using Selenium with PhantomJS the same way we did with the Chrome web driver.
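Selenium 4 no longer ships a PhantomJS driver, so the sketch below swaps in headless Chrome, which also runs without a visible window. It is not the PhantomJS code itself. The names `HEADLESS_FLAGS` and `open_headless_chrome` and the window size are assumptions chosen for illustration.

```python
HEADLESS_FLAGS = ("--headless", "--window-size=1280,800")


def open_headless_chrome(flags=HEADLESS_FLAGS):
    """Start Chrome without a visible window and return the driver."""
    # Imported here so this module loads even where Selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in flags:
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


def page_title(driver, url):
    """Load `url` and return its title; works with any Selenium driver."""
    driver.get(url)
    return driver.title
```

A typical run would call `driver = open_headless_chrome()`, then `page_title(driver, some_url)`, and finally `driver.quit()`.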


The result is:

[Image: Selenium Web Scraping]

Awesome!! It works very well.

You can access elements in many ways such as:


All of these functions return only one element; to return multiple elements, use the plural elements variants like this:

Selenium page_source

You can use the power of Beautiful Soup on the returned content from selenium by using page_source like this:
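The idea can be sketched as a helper that takes an already opened Selenium driver. The name `headings_from_driver` and the default tag are illustrative choices.

```python
from bs4 import BeautifulSoup


def headings_from_driver(driver, tag_name="h2"):
    """Parse the browser's rendered HTML and return the text of every `tag_name`."""
    # page_source holds the HTML after the browser has run the page's JavaScript.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(tag_name)]
```

Because `page_source` is taken from the live browser, content injected by JavaScript is available to Beautiful Soup here.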


The result is:

[Image: Selenium page_source]

As you can see, PhantomJS makes it super easy when scraping HTML elements. Let’s see more.

 

Scrape iframe Content Using Selenium

Your scraped page may contain an iframe that contains data.

If you try to scrape a page that contains an iframe, you won’t get the iframe content; you need to scrape the iframe source.

You can use Selenium to scrape iframes by switching to the frame you want to scrape.
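A sketch of the switch, written against any Selenium driver. The function name `iframe_source` is illustrative, and selecting the frame by index is just one option (a frame name or a located element also works).

```python
def iframe_source(driver, url, index=0):
    """Load `url`, switch into the iframe at `index`, and return that frame's HTML."""
    driver.get(url)
    driver.switch_to.frame(index)            # an index, a name, or a WebElement
    try:
        return driver.page_source            # now refers to the iframe document
    finally:
        driver.switch_to.default_content()   # go back to the top-level page
```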


The result is:

[Image: Scraping iframe Content Using Selenium]

Check the current URL, it’s the iframe URL, not the original page.

 


Scrape iframe Content Using Beautiful Soup

You can get the URL of the iframe by using the find function, then you can scrape that URL.
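A sketch of the first half, finding the iframe URLs, is shown below on made-up markup. The helper name `iframe_urls` is illustrative; `urljoin` turns a relative src into an absolute URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Return the absolute URL of every <iframe> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, frame["src"])
            for frame in soup.find_all("iframe", src=True)]


page = '<body><iframe src="/embed/widget.html"></iframe><iframe></iframe></body>'
print(iframe_urls(page, "https://example.com/articles/1"))
```

Each returned URL can then be downloaded and parsed like any other page.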


Awesome!! Here we use another technique where we scrape the iframe content from within a page.

 

Handle Ajax Calls Using (Selenium+ PhantomJS)

You can use Selenium to scrape content after you make your Ajax calls, for example by clicking a button that fetches the content you need to scrape. Check the following example:


The result is:

[Image: Handle Ajax Calls]

Here we scrape a page that contains a button. We click that button, which makes the Ajax call and gets the text, then we save a screenshot of the page.

There is one little thing here: the wait time.

We assumed that the page needs no more than 2 seconds to load fully, but a fixed wait is not a good solution. The server can take more time or your connection could be slow; there are many possible reasons.

 

Wait for Ajax Calls to Complete Using PhantomJS

The best solution is to check for the existence of an HTML element on the final page. If it exists, the Ajax call has finished successfully.

Check this example:
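A sketch using Selenium's explicit waits follows. The element ids passed in (`button_id`, `status_id`), the screenshot path, and both function names are hypothetical; the driver can be any Selenium driver you have already opened.

```python
def wait_for_text(driver, element_id, expected, timeout=10):
    """Block until the element with `element_id` contains `expected`.

    Raises Selenium's TimeoutException if that does not happen within
    `timeout` seconds.
    """
    # Imported here so this module loads even where Selenium is missing.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    condition = EC.text_to_be_present_in_element((By.ID, element_id), expected)
    return WebDriverWait(driver, timeout).until(condition)


def click_and_capture(driver, url, button_id, status_id, image_path):
    """Click an Ajax button, wait for 'HTTP 200 OK', then save a screenshot."""
    from selenium.webdriver.common.by import By

    driver.get(url)
    driver.find_element(By.ID, button_id).click()
    wait_for_text(driver, status_id, "HTTP 200 OK")
    driver.save_screenshot(image_path)
```

Unlike a fixed sleep, the wait returns as soon as the text appears and only fails after the full timeout.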


The result is:

[Image: Wait for Ajax Calls to Complete]

Here we click on an Ajax button which makes a REST call and returns the JSON result.

We wait up to 10 seconds for the text of a div element to become “HTTP 200 OK”, then we save the result page as an image as shown.

You can check for many things like:

URL change using EC.url_changes()

New opened window using EC.new_window_is_opened()

Changes in title using EC.title_is()

If you have any page redirections, you can see if there is a change in title or URL to check for it.

There are many conditions to check for; this is just one example to show you how much power you have.

Cool!!

 

Handling Cookies

Sometimes, when you write your scraping code, it’s very important to take care of cookies for the site you are scraping.

Maybe you need to delete the cookies, or maybe you need to save them in a file and use them for later connections.

There are a lot of scenarios out there, so let’s see how to handle cookies.

To retrieve cookies for the currently visited site, you can call the get_cookies() function like this:


The result is:

[Image: Python web scraping Handling Cookies]

To delete cookies, you can use the delete_all_cookies() function like this:
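The cookie operations mentioned above (retrieve, save to a file, reuse later, delete) can be sketched as small helpers around a Selenium driver. The helper names and the JSON file format are assumptions for illustration; `get_cookies`, `add_cookie`, and `delete_all_cookies` are the driver methods doing the work.

```python
import json


def save_cookies(driver, path):
    """Write the cookies of the currently visited site to a JSON file."""
    cookies = driver.get_cookies()   # a list of dicts (name, value, domain...)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cookies, handle, indent=2)
    return cookies


def load_cookies(driver, path):
    """Add previously saved cookies back to the driver.

    The driver should already be on a page of the cookies' domain.
    """
    with open(path, encoding="utf-8") as handle:
        for cookie in json.load(handle):
            driver.add_cookie(cookie)


def clear_cookies(driver):
    """Remove every cookie the browser holds for the current session."""
    driver.delete_all_cookies()
```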


 

Traps to Avoid

The most disappointing thing while scraping a website is data that does not appear in your output even though it is clearly visible in the browser. Or the web server rejects a submitted form that looks perfectly fine. Or, even worse, your IP gets blocked by a website for no stated reason.

These are some of the most common and most difficult problems to solve, not because they are unexpected but because they come with no error messages or traces to work from.

Next, we will discuss the most common obstacles you may face while scraping. This information may help you solve an error, or even prevent a problem before you run into it.

Act Like a Human

The basic challenge is that websites that do not want to be scraped can already tell real humans from scrapers in various ways, such as using CAPTCHAs.

Even though those websites use sophisticated techniques and are difficult to fool, a few changes to your script can make it look more like a human.

Headers Adjustment

One of the best ways to set headers is to use the Requests library. HTTP headers are a group of attributes that your code sends every time it performs a request to a web server.

HTTP defines many obscure header types, most of which are not widely used. The next seven fields, however, are regularly used by almost all major browsers when initializing any connection:


Next are the default headers used by the usual Python scraper library, urllib:


Basically, these two headers are the only settings that truly matter, so it is a good idea to set them to something other than the defaults.
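The sketch below first shows the User-agent that urllib announces by default, then builds a request with replacement headers. It uses the standard library's urllib rather than Requests so that it runs without extra packages. The header values in `BROWSER_HEADERS` are hypothetical examples; copy real ones from your own browser.

```python
from urllib.request import Request, build_opener

# What urllib announces by default: a User-agent of the form "Python-urllib/3.x".
default_headers = dict(build_opener().addheaders)
print(default_headers)

# Hypothetical browser-like values, for illustration only.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def browser_like_request(url):
    """Build (but do not send) a request that carries browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


request = browser_like_request("https://example.com/")
print(request.get_header("User-agent"))
```

Passing the returned object to `urlopen(request)` sends it with those headers instead of the defaults.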

JavaScript and Cookies Handling

Handling cookies correctly solves a lot of scraping issues. Websites that use cookies to track your progress through the site may also use them to spot scrapers with abnormal behavior (for example, browsing too many pages or submitting forms too quickly) and prevent them from scraping the website.

If your browser cookie is passing your identity to the website, then solutions such as changing your IP address, or even closing and reopening your connection to the website, may be useless and a waste of time.

Cookies also matter while scraping because you need to hold a cookie and pass it from one page to the next to stay logged in to a website. Some websites will ask for a new version of the cookie each time instead of asking you to log in again.

If you are scraping a single website or only a few, it’s recommended to examine the cookies those websites generate and decide which ones your scraper needs to handle.

EditThisCookie is one of the most popular Chrome extensions for seeing how cookies are set and changed as you browse a website.

It’s All About Time

If you are the kind of person who does everything quickly, that might not work while scraping. Some highly protected websites may ban you from submitting forms, downloading information, or even browsing the website if you do it remarkably faster than a normal person would. Sometimes, in order to go fast, you have to slow down.

To avoid getting noticed and blocked, keep requests and page loads to a minimum. If you can, also leave a few seconds between each request and the next one; this may solve your problems. You can add two extra lines to your code like the following:
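At its simplest, the pause is `import time` followed by `time.sleep(3)` between requests. The sketch below wraps that in a loop. The function name, the 3-second default, and the `sleep` hook (which defaults to `time.sleep`) are assumptions for illustration.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=3, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive requests.

    `fetch` is whatever function downloads a page in your scraper.
    """
    results = []
    for position, url in enumerate(urls):
        if position:                 # no pause before the very first request
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```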

Common Form Security Features

If your code is trying to create a lot of user accounts and spam all of the website’s members, then you are in big trouble.

Web forms that handle account logins and creation pose a high security risk if they are an easy target for casual scraping. So many website owners use these forms to limit scraper access to their websites.

Input Fields with Hidden Value

Sometimes HTML forms contain hidden fields, whose values can be read by the browser but are not shown to the user unless the user looks at the page’s source code. Sometimes, these hidden fields can protect from spam.

Hidden fields can be used to block web scraping in one of the following two ways:

  1. The hidden field can be filled with a randomly generated value which the server expects to be sent back to the form processing page.
    If this value is missing from the submission, the server can assume that the form was not submitted from the website page, but was sent directly from a scraper to the processing page.
    You can overcome this by scraping the form page first, collecting the randomly generated values, and then submitting them to the processing page along with your own data.
  2. The form page can include a hidden field with a name like Username or Email. A careless scraper may fill out the field with some data and send it, regardless of whether the field is hidden from the user or not. In this case, any hidden field with a real value, or a value different from the expected one, may cause the submission to be discarded, and the user may even be banned from the website.
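The first workaround, reading the server-issued hidden values before submitting, can be sketched like this. The form markup, the field names, and the helper name `hidden_fields` are made up for illustration.

```python
from bs4 import BeautifulSoup


def hidden_fields(form_html):
    """Return {name: value} for every named hidden input in the form page."""
    soup = BeautifulSoup(form_html, "html.parser")
    return {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", {"type": "hidden"})
        if field.has_attr("name")
    }


form = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="a1b2c3">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
payload = hidden_fields(form)                          # start from the server-issued values
payload.update(username="alice", password="secret")    # then add your own data
print(payload)
```

The resulting dictionary is what you would send to the form processing page, leaving the hidden values exactly as the server issued them.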

For example, check the Facebook login page below. Even though the form has only three visible fields (Username, Password, and a Submit button), it also sends a lot of information to the backend servers.

Honeypots Avoiding

When it comes to telling useful information from non-useful information, CSS makes life incredibly easy, and sometimes that can be a big problem for web scrapers.

When a field in a website form is hidden from the user via CSS, an ordinary user visiting the website will almost never be able to populate it, because it does not appear in the browser.

So if that field arrives populated with data, there is a high probability that a web scraper filled it in, and the submitted form will be blocked.

The same applies to links, files, images, and any other element on the website that a scraper can read but that is hidden from an ordinary user visiting the website via a browser.

If you visit a hidden link on a website, a server-side script may fire to block your IP, you may be logged out of the website, or the page may take some other severe action to stop any further access.
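A rough heuristic for skipping such traps is sketched below. It only catches elements hidden through an inline style or a hidden input type; rules that live in external stylesheets need a real browser to evaluate (for example, Selenium's `is_displayed()` on a located element). The function names and sample markup are illustrative.

```python
from bs4 import BeautifulSoup


def looks_hidden(tag):
    """Heuristic: hidden input type, or an inline style that hides the element."""
    style = tag.get("style", "").replace(" ", "").lower()
    return (
        tag.get("type") == "hidden"
        or "display:none" in style
        or "visibility:hidden" in style
    )


def visible_links(html):
    """Return the href of every link that is not obviously hidden."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if not looks_hidden(a)]


page = """
<a href="/products">Products</a>
<a href="/trap" style="display: none">Do not follow</a>
"""
print(visible_links(page))
```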

Human Checklist

If you have followed all the previous tips and still keep getting banned by websites without knowing why, try working through the next checklist to solve your problem:

  • JavaScript issue: if you are receiving a blank page from the web server, unexpected data (not like what you have seen in your browser), or missing information, it is most probably caused by JavaScript being executed on the website to build the page.
  • Request sent correctly: if you are trying to submit a form or make a POST request to a website, check the website page to ensure that everything you submit is expected by the website and is in the right format.
    The Chrome Inspector panel lets you view a real POST request sent to the website, so you can make sure that the request your scraper sends looks the same as a human one.
  • Cookies issue: you are trying to log into a website and something goes wrong, such as getting stuck during login or the website ending up in a strange state.
    In that case, check your cookies and ensure that they are carried correctly from page to page and that they are sent to the website with each request.
  • HTTP errors: if you are receiving HTTP errors such as 403 Forbidden, this may show that the website has marked your IP address as a scraper and will not accept any more requests from it.
    One solution is to wait for your IP to be removed from the list, or to get a new IP (for example, by moving to another location).

You can follow the next few tips to avoid getting blocked again:

  • As mentioned previously, ensure that your scraper is not moving through the website too quickly. You can add delays to your scrapers and let them run overnight.
  • Change your HTTP headers.
  • Act like a human, and do not click or access anything that a human would not be able to access.
  • If you find it difficult to gain access to the website, the website administrator can sometimes give you permission to use your scrapers, so try emailing the webmaster or administrator mailbox at <domain name> and ask for permission.

 

Web Scraping VS Web Crawling

We saw how to parse web pages. Some people, however, confuse web scraping with web crawling.

Web scraping is about parsing web pages and extracting data from them for any purpose, as we saw.

Web crawling is about harvesting every link you find and following each of them without limit, for the purpose of indexing, as Google and other search engines do.

 

I hope you find the tutorial useful. Keep coming back.

Thank you.

Mokhtar Ebrahim
I have been working as a Linux system administrator since 2010. I'm responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. I love writing shell and Python scripts to automate my work.

17 thoughts on “Modern Python web scraping using (Beautiful Soup & Selenium)”

  1. getting HTTP error 403: forbidden when scrapping likegeeks – Scrape HTML Tags using Class Attribute

    1. You should try it on a different website.
      This is because of the tight security on the server.
      I’ve changed the example to another URL.

      1. I think if you try to add a ‘user-agent’ while using the ‘requests’ library such as:
        REQUEST_HEADER = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 \
        (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"},

        you can avoid the HTTP 403 error.

  2. Is this a Python3 tutorial?

    from urllib.request import urlopen
    ImportError Traceback (most recent call last)
    in ()
    —-> 1 from urllib.request import urlopen
    ImportError: No module named request

    It’s sad you have to use Windows for this tutorial. Sad.

    1. I’ve tested on Windows, but you should use Python 3.x unless you know the code changes so you can update it.

  3. import from urllib.request import urlopen

    Does work in python3. You need to specify python3 in your instructions.

    1. Yes, it’s a python 3.x code.
      If you are using Python 2.x, you can import it like this:

  4. dear this is very informative but how to solve reCaptcha have any code or trick to bypass reCaptch. hope reply

    1. There is no legal way to bypass ReCaptcha.
      Even illegal ways which cost more money get caught.
