Create your first Python web crawler using Scrapy
This tutorial focuses on Scrapy, one of the best frameworks for web crawling. You will learn the basics of Scrapy and how to create your first web crawler, or spider. The tutorial also demonstrates how to extract and store the scraped data.
Scrapy is a Python framework that you can use to crawl websites and efficiently extract data.
You can use the extracted data for further processing and data mining, store it in spreadsheets, or apply it to any other business need.
Scrapy Architecture
The architecture of Scrapy contains five main components:
- Scrapy Engine
- Scheduler
- Downloader
- Spiders
- Item Pipelines
Scrapy Engine
The Scrapy engine is the main component of Scrapy. It controls the data flow between all other components, passing requests and responses between them, and triggers events when certain actions occur.
Scheduler
The scheduler receives the requests sent by the engine and queues them.
Downloader
The objective of the downloader is to fetch all the web pages and send them to the engine. The engine then sends the web pages to the spider.
Spiders
Spiders are the classes you write to parse websites and extract data.
Item Pipeline
The item pipeline processes the items once the spiders have extracted them, for example to clean, validate, or store them.
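The data flow between these components can be pictured with a small simulation. This is only a sketch in plain Python and not Scrapy's implementation: the FAKE_WEB dictionary, the example.test URLs, and all function names are made up for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
FAKE_WEB = {
    "http://example.test/": ("home", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("page a", []),
    "http://example.test/b": ("page b", ["http://example.test/a"]),
}


def download(url):
    """Downloader: fetch a page (here, just a dictionary lookup)."""
    return FAKE_WEB[url]


def spider_parse(url, page):
    """Spider: yield one item for the page, then the links to follow."""
    text, links = page
    yield {"url": url, "text": text}
    for link in links:
        yield link


def pipeline(item):
    """Item pipeline: post-process each scraped item."""
    item["text"] = item["text"].upper()
    return item


def run_engine(start_url):
    """Engine: move requests, responses, and items between the components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # URLs that were already queued
    items = []
    while scheduler:
        url = scheduler.popleft()
        page = download(url)
        for result in spider_parse(url, page):
            if isinstance(result, dict):
                items.append(pipeline(result))
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return items


items = run_engine("http://example.test/")
print([item["url"] for item in items])
```

Each page is downloaded once, parsed by the spider, and every item passes through the pipeline before it is collected.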
Installing Scrapy
You can simply install Scrapy along with its dependencies by using the Python Package Manager (pip).
Run the following command to install Scrapy in Windows:
pip install scrapy
However, the official Installation guide recommends installing Scrapy in a virtual environment because the Scrapy dependencies may conflict with other Python system packages, which will affect other scripts and tools.
Therefore, we will create a virtual environment to provide an encapsulated development environment.
In this tutorial, we will install a virtual environment first and then continue with the installation of Scrapy.
- Run the following command in the Python Scripts folder to install the virtual environment:
pip install virtualenv
- Now install virtualenvwrapper-win, which lets us create isolated Python virtual environments.
pip install virtualenvwrapper-win
- Add the Scripts folder to your PATH so that you can use the Python commands globally:
set PATH=%PATH%;C:\Users\hp\appdata\local\programs\python\python37-32\scripts
- Create a virtual environment:
mkvirtualenv ScrapyTut
Here, ScrapyTut is the name of our environment.
- Create your project folder and connect it with the virtual environment:
- Bind the virtual environment with the current working directory:
setprojectdir .
- If you want to turn off the virtual environment mode, simply use deactivate as below:
deactivate
- If you want to work again on the project, use the workon command along with the name of your project:
workon ScrapyTut
Now we have our virtual environment; we can continue the installation of Scrapy.
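If you are unsure whether the environment is active, a quick check from Python can help. The following is a minimal sketch; in_virtual_environment is a hypothetical helper name, and the check relies on how venv and virtualenv adjust the interpreter prefixes.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases keep the base install in sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter prefix:", sys.prefix)
```

When the environment is active, the interpreter prefix should point to the environment's folder rather than to the system-wide Python installation.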
- For installation in Windows, you have to download OpenSSL and install it. Choose the regular version that matches your version of Python. Also, install the Visual C++ 2008 redistributables. Otherwise, you will get an error when installing dependencies.
- Add C:\OpenSSL-Win32\bin to the system PATH.
- When installing Scrapy, there are a number of packages that Scrapy depends on, and you have to install them. These packages include pywin32, twisted, zope.interface, lxml and pyOpenSSL.
- In the ScrapyTut directory run the following pip command to install Scrapy:
pip install scrapy
Note that when installing Twisted, you may encounter an error such as:
Microsoft visual c++ 14.0 is required
To fix this error, you will have to install the Visual C++ build tools from Microsoft Build Tools.
After this installation, if you get another error like the following:
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\link.exe' failed with exit status 1158
Simply download the wheel for Twisted that matches your version of Python and place it in your current working directory.
Now run the following command:
pip install Twisted-18.9.0-cp37-cp37m-win32.whl
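Before moving on, it can help to confirm that Python can find Scrapy and the dependencies listed above. The sketch below uses only the standard library; the REQUIRED list and the is_installed helper are illustrative names, and pywin32 is left out because it is Windows-only and has no single import name.

```python
import importlib.util

# Import names of the packages mentioned in the installation steps.
# pyOpenSSL is imported under the name "OpenSSL".
REQUIRED = ["scrapy", "twisted", "zope.interface", "lxml", "OpenSSL"]


def is_installed(module_name):
    """Return True if Python can locate the module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised for dotted names when the parent package is missing.
        return False


status = {name: is_installed(name) for name in REQUIRED}
for name, found in status.items():
    print(f"{name:15} {'ok' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed with pip inside the active virtual environment.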
Now, everything is ready to create our first crawler, so let’s do it.
Create a Scrapy Project
Before writing any Scrapy code, you will have to create a Scrapy project using the startproject command like this:
scrapy startproject myFirstScrapy
That will generate a myFirstScrapy project directory, which contains the scrapy.cfg configuration file and an inner myFirstScrapy folder.
Inside the inner myFirstScrapy folder, we will have the files __init__.py, items.py, middlewares.py, pipelines.py, and settings.py, along with a spiders folder.
The spiders folder contains the spiders.
Create a Spider
After creating the project, navigate to the project directory and generate your spider along with the website URL that you want to crawl by executing the following command:
scrapy genspider jobs www.python.org
The command creates a jobs.py file for the new spider inside the spiders folder.
In the Spiders folder, we can have multiple spiders within the same project.
Now let’s go through the content of our newly created spider. Open the jobs.py file which contains the following code:
import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['www.python.org']
    start_urls = ['http://www.python.org/']

    def parse(self, response):
        pass
Here JobsSpider is a subclass of scrapy.Spider, and 'jobs' is the name of the spider. The allowed_domains list contains the domains that this spider is allowed to crawl.
The start_urls is the URL from where the web crawling will be started, or you can say it is the initial URL where web crawling begins. Then we have the parse method, which parses through the content of the page.
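To make the effect of allowed_domains more concrete, here is a rough approximation of the check in plain Python. It is not Scrapy's own code; is_allowed is a hypothetical helper, and the sketch assumes that a URL passes when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in allowed_domains)


print(is_allowed("https://www.python.org/jobs/", ["www.python.org"]))
print(is_allowed("https://docs.python.org/3/", ["www.python.org"]))
print(is_allowed("https://docs.python.org/3/", ["python.org"]))
```

Under this assumption, listing www.python.org keeps the spider away from docs.python.org, while listing python.org would allow both.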
To crawl the jobs page of our URL, we need to add one more link in the start_urls property as below:
start_urls = ['http://www.python.org/', 'https://www.python.org/jobs/']
As we want to crawl more than one page, it is recommended to subclass the spider from the CrawlSpider class instead of the scrapy.Spider class. For this, you will have to import the following class:
from scrapy.spiders import CrawlSpider
Our class will look like the following:
class JobsSpider(CrawlSpider): …
The next step is to initialize the rules variable. The rules variable defines the navigation rules that the spider will follow when crawling the site. To use the rules object, import the following class:
from scrapy.spiders import Rule
The rules variable contains Rule objects, and each Rule takes arguments such as:
- link_extractor, which is an object of the Link Extractor class. The link_extractor object specifies how to extract links from the crawled URL. For this, you will have to import the Link Extractor class like this:
from scrapy.linkextractors import LinkExtractor
The rule variable will look like the following:
rules = (
    Rule(
        LinkExtractor(allow=(), restrict_css=('.list-recent-jobs',)),
        callback="parse_item",
        follow=True,
    ),
)
- callback is the name of the spider method, given as a string, that is called for each response downloaded from an extracted link. That method is where you access the elements of the page.
- follow is a Boolean that specifies whether links should also be followed from each response extracted with this rule.
The allow argument takes the patterns that a URL must match to be extracted. We left it empty, so it matches every link. Instead, we restricted the extraction with restrict_css, so links are extracted only from the parts of the page that have the CSS class we specified.
The callback parameter specifies the method that will be called when parsing the page. The .list-recent-jobs is the class for all the jobs listed on the page. You can check the class of an item by right-clicking on that item on the web page and selecting Inspect.
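In the example, we called the spider’s parse_item method instead of parse. CrawlSpider uses the parse method internally to implement its rules, so a CrawlSpider subclass should not override it.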
The content of the parse_item method is as follows:
def parse_item(self, response):
    print('Extracting…' + response.url)
This will print Extracting… along with the URL currently being extracted. For example, when the link https://www.python.org/jobs/3698/ is extracted, the output screen shows:
Extracting…https://www.python.org/jobs/3698/
To run the spider, navigate to your project folder and type in the following command:
scrapy crawl jobs
The output will be like the following:
In this example, we set follow=True, which means the crawler keeps extracting and following the links that match the rule on every page it visits until no new matching links remain, that is, when the list of jobs ends.
If you want to get only the print statement, you can use the following command:
scrapy crawl --nolog jobs
The output will be like the following:
Congratulations! You’ve built your first web crawler.
Scrapy Basics
Now we can crawl web pages. Let’s play with the crawled content a little.
Selectors
You can use selectors to select parts of the data from the crawled HTML. Selectors select data from HTML by using XPath and CSS expressions through response.xpath() and response.css() respectively. In the previous example, we used a CSS class to select the data.
Consider the following example, where we declare a string with HTML tags. Using the Selector class, we extract the data in the h1 tag with Selector.xpath:
>>> from scrapy.selector import Selector
>>> body = '<html><body><h1>Heading 1</h1></body></html>'
>>> Selector(text=body).xpath('//h1/text()').get()
'Heading 1'
Items
A spider can return the extracted data as plain Python dicts.
As a more structured alternative, Scrapy provides the Item class. We can use item objects as containers for the scraped data.
Items provide a simple syntax to declare fields. The syntax is like the following:
>>> import scrapy
>>> class Job(scrapy.Item):
...     company = scrapy.Field()
...
The Field object specifies the metadata for each field.
You may have noticed that when we created the Scrapy project, Scrapy created an items.py file in our project directory. We can modify this file to add our items as follows:
import scrapy


class MyfirstscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    location = scrapy.Field()
Here we have added one field. You can import this class in your spider file and fill in the items as follows:
from myFirstScrapy.items import MyfirstscrapyItem

def parse_item(self, response):
    locations = response.css('.text > .listing-company > .listing-location > a::text').extract()
    for location in locations:
        # Create one item per extracted location and hand it to Scrapy.
        item = MyfirstscrapyItem()
        item['location'] = location
        yield item
In the above code, we have used the css method of response to extract the data.
In our web page, we have a div with class text. Inside this div, we have a heading with class listing-company; inside this heading, we have a span tag with class listing-location; and finally, we have an a tag that contains some text. We extract this text using the extract() method.
Finally, we loop through all the extracted values, store each one in a MyfirstscrapyItem, and yield the item.
Instead of doing all this in the crawler, we can also test our crawler by using only one statement while working in the Scrapy shell. We will demonstrate Scrapy shell in a later section.
Item Loaders
An Item Loader populates an Item object with the scraped data. You can also use the item loader to extend the parsing rules.
After creating the loader, we can populate the item fields with the help of selectors.
The syntax for Item loader is as follows:
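from scrapy.loader import ItemLoader
from jobs.items import Job


def parse(self, response):
    l = ItemLoader(item=Job(), response=response)
    # add_xpath is used because the expression is XPath, not CSS;
    # 'company' is the field declared on the Job item.
    l.add_xpath('company', '//li[@class="listing-company"]')
    return l.load_item()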
Scrapy Shell
Scrapy shell is a command line tool that lets the developers test the parser without going through the crawler itself. With Scrapy shell, you can debug your code easily. The main purpose of Scrapy shell is to test the data extraction code.
We use the Scrapy shell to test the data extracted by CSS and XPath expression when performing crawl operations on a website.
You can activate the Scrapy shell from the current project using the shell command:
scrapy shell
If you want to parse a web page, use the shell command along with the link of the page:
scrapy shell https://www.python.org/jobs/3659/
To extract the location of the job, simply run the following command in the shell:
response.css('.text > .listing-company > .listing-location > a::text').extract()
The result will be like this:
Similarly, you can extract any data from the website.
To get the current working URL, you can use the command below:
response.url
This is how you extract all the data in Scrapy. In the next section, we will save this data into a CSV file.
Storing the data
Let’s use the response.css in our actual code. We will store the value returned by this statement into a variable, and after that, we will store this into a CSV file. Use the following code:
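def parse_item(self, response):
    location = response.css('.text > .listing-company > .listing-location > a::text').extract()
    item = MyfirstscrapyItem()
    item['location'] = location
    # 'url' must also be declared in MyfirstscrapyItem: url = scrapy.Field()
    item['url'] = response.url
    yield item
Here we stored the result of response.css into a variable called location. Then we assigned this variable to the location field of the MyfirstscrapyItem item. Because the item also stores the page URL, MyfirstscrapyItem needs a url field next to location; assigning to an undeclared field raises a KeyError. The method is named parse_item so that it matches the callback set in the rules.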
Execute the following command to run your crawler and store the result into a CSV file:
scrapy crawl jobs -o ScrappedData.csv
This will generate a CSV file in the project directory.
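Once the file exists, you can read it back with Python's csv module. The sketch below uses an in-memory sample instead of the real file so that it runs anywhere; the location and url columns are assumed from the item fields, and the two rows are invented.

```python
import csv
import io

# Hypothetical stand-in for ScrappedData.csv. To read the real file, replace
# this with: open('ScrappedData.csv', newline='', encoding='utf-8')
sample = io.StringIO(
    "location,url\n"
    "Remote,https://www.python.org/jobs/1/\n"
    "\"Berlin, Germany\",https://www.python.org/jobs/2/\n"
)

rows = list(csv.DictReader(sample))  # one dict per row, keyed by the header line
for row in rows:
    print(row["location"], "->", row["url"])
```

DictReader handles quoted values, so a location that contains a comma still ends up in a single column.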
Scrapy is a very easy framework for crawling web pages. That was just the beginning. If you liked the tutorial and are hungry for more, tell us in the comments below which Scrapy topic you would like to read about next.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.
More illustrated example required, understood just starting project
We will prepare another article.
Can you write a tutorial on creating a web crawler with pycurl?
Yes we can, but it’s about the demand. You can say that it’s all about the readers.
What do I do now?
(ScrapyTut) (base) C:\scrapytut>pip install Twisted-18.9.0-cp37-cp37m-win32.whl
WARNING: Requirement ‘Twisted-18.9.0-cp37-cp37m-win32.whl’ looks like a filename, but the file does not exist
ERROR: Twisted-18.9.0-cp37-cp37m-win32.whl is not a supported wheel on this platform.
Are you sure about your architecture? Here I’m using 32bit.
You probably have a win_amd64 machine so follow the provided link and switch to an appropriate one from the list.
Hi, running the jobs produces no results. Please help.
Did you follow the steps the same way?
Also, make sure you use the same Python version used.