Web Scraping with Scrapy and Python - A Guide with a Real-World Example
What is Scrapy and what are its features?
Scrapy is a super powerful open-source web scraping tool that can help you extract data with ease.
It works with Python.
As a data extraction expert, I have been using Scrapy for years and it's been my go-to tool when scraping data from any website.
I combine it with other technologies like Selenium, Splash, and more to extract large amounts of data with ease.
So the main features that make Scrapy such a powerful open-source tool are:
It allows you to send concurrent requests to a website and extract data asynchronously (sending multiple requests in parallel).
It connects seamlessly with other technologies like Selenium, Beautiful Soup, Splash, and more. That's the power of Python: simply import the libraries and start using them.
It exports the output easily to CSV, Excel, or JSON, or to databases like MongoDB, MySQL, and others.
The list of features goes on, so let's get started with using this powerful open-source tool with Python.
Installation
Before we install Scrapy, make sure you have Python installed on your PC. You can download it from here: python.org
Now, to install Scrapy along with two helper libraries that make it easier to spot errors and keep the code tidy, run the command below in your terminal:
pip install scrapy pylint autopep8
pylint analyzes your code for errors without running it
autopep8 automatically formats Python code to follow the PEP 8 style
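To confirm the installation worked, you can print the installed Scrapy version:
scrapy version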
Creating a Scraper Folder with Scrapy
Now let's get to the interesting part: creating the spider folder with Scrapy.
In this article, we will scrape the https://books.toscrape.com/ website and get the names of all the books, their prices, and their links.
We will also navigate through multiple pages.
To create a project in Scrapy, run the following command in CMD or your terminal:
scrapy startproject Book_scraping
This will create a folder named Book_scraping.
Now to go inside the folder, run the command:
cd Book_scraping
Now, before we go over the folder structure, let's also create a Scrapy spider, which is where we will write the program that scrapes the data from the website.
The command for creating a spider is:
scrapy genspider spider_name website_domain
To create the Scrapy spider for the Books website, run the following command:
scrapy genspider books_scraper books.toscrape.com
This will create a spider named books_scraper in the Book_scraping/spiders folder for the domain books.toscrape.com.
Now, here is the complete structure of our web scraping project:
.
├── Book_scraping
│ ├── spiders
│ │ ├── __init__.py
│ │ └── books_scraper.py
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ └── settings.py
└── scrapy.cfg
The main files you will interact with the most here are:
books_scraper.py: where we code our main scraper program
settings.py: where we handle settings such as the speed of our bot, making it harder to detect, and extra configurations (a short example follows the list below)
Here's a short description of the other files as well:
scrapy.cfg: contains settings such as the project name, spider details, and other configurations related to deployment.
__init__.py: indicates that the directory should be treated as a Python package.
items.py: can be used to define containers that collect and manage the scraped data.
middlewares.py: allows you to process requests and responses before and after they reach the spider.
pipelines.py: extra operations to be performed on the scraped data, such as validation, cleaning, and storage.
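As a quick illustration of settings.py, here are a few commonly adjusted options; the values below are examples only and not something this tutorial strictly requires:

# settings.py - a few commonly adjusted options (example values only)
BOT_NAME = "Book_scraping"

ROBOTSTXT_OBEY = True        # whether to respect the site's robots.txt rules
CONCURRENT_REQUESTS = 8      # how many requests Scrapy sends in parallel
DOWNLOAD_DELAY = 0.5         # seconds to wait between requests to the same site
USER_AGENT = "Mozilla/5.0 (compatible; book-scraper-demo)"  # example User-Agent string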
Now, enough of the theory. Let's get our hands dirty!
Main Scraper Program
Open the main project folder in your favorite code editor. In this demo, we will use Visual Studio Code.
Then open the file Book_scraping/spiders/books_scraper.py in the editor.
This is how the file should look:
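Since the file is auto-generated, yours should contain boilerplate roughly like this (the exact template can vary slightly between Scrapy versions):

import scrapy


class BooksScraperSpider(scrapy.Spider):
    name = "books_scraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass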
It already has some initial code, which we can tweak to fit our needs.
start_urls lists the starting page(s) you want to scrape data from. We can replace it with a method named start_requests to make it easier to pass multiple URLs to scrape.
allowed_domains is the list of domains the spider is allowed to crawl. You can remove this line if you want to scrape data from multiple websites.
So you can change the code to the following:
import scrapy


class BooksScraperSpider(scrapy.Spider):
    name = "books_scraper"
    allowed_domains = ["books.toscrape.com"]

    # Send the initial request
    def start_requests(self):
        # Request the first page of the website and call the parse function
        # once the response is received
        yield scrapy.Request(url='https://books.toscrape.com/', callback=self.parse)

    def parse(self, response):
        # Print the body of the page
        print(response.body)
So in the above program, we send a request to the first page of the website, and the response is then handled by the parse function.
Now to run the above program, run the following command in the terminal:
scrapy crawl books_scraper
If everything went correctly, you should see the HTML content of the page printed in the terminal.
Amazing!
Now let's parse the HTML content to get just the names of the books, their prices, and their links.
To do this, we need to locate each data point on the page.
We can do that with XPath or CSS selectors. Let's do it with XPath.
Scraping Data with XPath
To build an XPath, open the inspect element tool on the page you want to scrape, locate the element you need, and then write the XPath expression that matches it.
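A handy way to test XPath expressions before putting them in the spider is Scrapy's interactive shell, which fetches a page and gives you a live response object to experiment with:

scrapy shell "https://books.toscrape.com/"

Then, inside the shell, try a selector against the loaded page, for example:

response.xpath("//a[@title]/text()").getall()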
So after finding an XPath for each of the data points you want to extract, here is how the code should look:
import scrapy


class BooksScraperSpider(scrapy.Spider):
    name = "books_scraper"
    allowed_domains = ["books.toscrape.com"]

    # Send the initial request
    def start_requests(self):
        # Request the first page of the website and call the parse function
        # once the response is received
        yield scrapy.Request(url='https://books.toscrape.com/', callback=self.parse)

    def parse(self, response):
        # Extract the book names, prices, and links with XPath
        book_names = response.xpath("//a[@title]/text()").getall()
        book_prices = response.xpath("//div[@class='product_price']/p[1]/text()").getall()
        book_links = response.xpath("//a[@title][@href]/@href").getall()
        print(book_names, book_prices, book_links)
Now run the program again by running:
scrapy crawl books_scraper
and you will get the name of each book, its price, and its link, each returned as a list.
Now, to output the scraped data as items, we can yield a dictionary for each book from the parse method:
    def parse(self, response):
        book_names = response.xpath("//a[@title]/text()").getall()
        book_prices = response.xpath("//div[@class='product_price']/p[1]/text()").getall()
        book_links = response.xpath("//a[@title][@href]/@href").getall()
        # Yield one item per book by pairing up the three lists
        for name, price, link in zip(book_names, book_prices, book_links):
            yield {
                'Book Name': name,
                'Book Price': price,
                'Book Link': link
            }
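One thing to note: on this site, the extracted href values are relative paths rather than full URLs. If you would rather store absolute links, Scrapy's response.urljoin() can resolve them against the current page, for example by swapping the link line for:

'Book Link': response.urljoin(link)  # turns a relative href into an absolute URL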
If you run the scraper again now, you will get the data for each book as a neatly structured item.
Now, you can even output this data easily to a CSV file with Scrapy's built-in feed export.
Just add -o Output_file_name.csv to the current command:
scrapy crawl books_scraper -o Output_Sample.csv
This will create an Output_Sample.csv file in the folder, containing the scraped data.
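The same flag works for other formats as well; Scrapy picks the export format from the file extension, so exporting JSON is just as easy:

scrapy crawl books_scraper -o Output_Sample.json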
Amazing, isn't it?
Navigating through Multiple Pages
Now, to navigate through multiple pages, we can either:
identify the XPath of the "next page" link and follow it from page to page to get the needed data (see the sketch after the code below), or
take the URL pattern that contains the page number, increment the page number, and request each page directly.
Here's how the second approach works:
import scrapy


class BooksScraperSpider(scrapy.Spider):
    name = "books_scraper"
    allowed_domains = ["books.toscrape.com"]

    # Send the initial requests
    def start_requests(self):
        # Iterate through each page number and request that page
        for page_num in range(1, 10):
            page_url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
            yield scrapy.Request(url=page_url, callback=self.parse)

    def parse(self, response):
        book_names = response.xpath("//a[@title]/text()").getall()
        book_prices = response.xpath("//div[@class='product_price']/p[1]/text()").getall()
        book_links = response.xpath("//a[@title][@href]/@href").getall()
        # Yield one item per book by pairing up the three lists
        for name, price, link in zip(book_names, book_prices, book_links):
            yield {
                'Book Name': name,
                'Book Price': price,
                'Book Link': link,
            }
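And for completeness, here is a sketch of the first approach: instead of generating page numbers, the parse method follows the site's own "next" link until there are no pages left. The XPath below assumes the pagination markup used by books.toscrape.com (an li element with class "next" wrapping the link), and with this version start_requests only needs to yield the very first page:

    def parse(self, response):
        book_names = response.xpath("//a[@title]/text()").getall()
        book_prices = response.xpath("//div[@class='product_price']/p[1]/text()").getall()
        book_links = response.xpath("//a[@title][@href]/@href").getall()
        for name, price, link in zip(book_names, book_prices, book_links):
            yield {
                'Book Name': name,
                'Book Price': price,
                'Book Link': link,
            }
        # Follow the "next" pagination link if one exists on this page
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)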
Next Steps
Now that you have learned how Scrapy works, you can try scraping more websites with it.
And if you get blocked while scraping, you can use the anti-bot bypass techniques mentioned here:
https://www.getodata.com/blog/7-steps-to-prevent-getting-blocked-when-web-scraping
Or use a powerful web scraping API like GetOData to get data from any website without getting blocked.