GetOData

Web Scraping using Selenium and Python - 2024 Roadmap


What is Selenium and what are its features?

Selenium is a powerful web scraping and browser automation library that supports multiple languages, including Python and Java. In this article, we will be working with Selenium in Python.

As a data extraction expert, I have been using Selenium for years. It is my go-to tool for scraping data from websites that require JavaScript to render their content, whether I'm extracting data or automating interactions.

The main features that make Selenium such a powerful library are:

  • Dynamic website support: Extract data from and automate websites that require JavaScript to render their content.

  • Browser choice: Use the browser of your choice for automation (Chrome, Firefox, Safari, Edge, etc.).

  • User interaction: Perform any action a real user can perform, such as clicking, filling forms, typing, and logging in.

  • Multi-language support: Use the language of your choice. Selenium supports Python, Java, C#, and more.

The list goes on, so let's get started with using this powerful library in Python.

Installation

Before we install Selenium, make sure you have Python installed on your PC. You can download it from python.org.

Now, to install the Selenium library, run the following command in your terminal:

pip install selenium webdriver-manager

This will also install another library called "webdriver-manager", which simplifies our usage of Selenium.

Without webdriver-manager, you would have to manually download a separate ChromeDriver binary that matches your Chrome browser version, and keep it updated. webdriver-manager handles all of that automatically behind the scenes.

Quickstart

Now, to check that everything has been installed correctly, create a file called Test.py in any folder and run the test script below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time

# webdriver-manager downloads a matching ChromeDriver and returns its path
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.google.com")
time.sleep(5)  # keep the window open for a few seconds

driver.quit()  # end the session and close the browser

When you run this, a new Chrome window will open and navigate to google.com.

Now you can experiment and go to your favorite websites.

Going one step ahead, let's collect the HTML content of the webpage.

We can do that with the "driver.page_source" property, as shown below:

page_source = driver.page_source
print(page_source)

Amazing, isn't it?

You can also read other properties of the page, such as the title (driver.title) and the current URL (driver.current_url).

Scraping Data

Now that we have seen how to get the HTML content of a webpage, we need to extract the data we want from it.

For this, we can use another Python library called Beautiful Soup.

You can install it with the below command:

pip install beautifulsoup4

After getting the page source of any page you like, we can pass it to Beautiful Soup to parse it, and then use the find and find_all methods to get specific elements of the page, such as paragraphs:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')

# find_all returns every matching element on the page
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

Similarly, to get the text of a specific element, such as a div with a particular class name, we can use the "find" method:

specific_element = soup.find('div', {'class': 'specific-class'})
if specific_element:
    print("Specific Element Text:", specific_element.text)
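If you prefer CSS selectors, Beautiful Soup also offers a select method that matches elements the same way a stylesheet would. A minimal sketch, using a small stand-in HTML snippet instead of a real page source:

```python
from bs4 import BeautifulSoup

# A stand-in for a real page_source, just for illustration
html = '<div class="specific-class"><p>Hello</p></div><div class="other">Bye</div>'
soup = BeautifulSoup(html, 'html.parser')

# select takes a CSS selector; 'div.specific-class' matches <div class="specific-class">
for element in soup.select('div.specific-class'):
    print(element.get_text(strip=True))
```

This prints the text of every div with that class, which is handy when you need more complex selectors (nesting, attributes) than find easily expresses.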

And that is how you extract data from a single page. To scrape multiple pages, iterate over a list of URLs and store the results in an output file.
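That loop might be sketched as follows. The URLs, the stand-in HTML, and the CSV layout are hypothetical placeholders; in a real run, each page's HTML would come from driver.page_source after calling driver.get(url):

```python
import csv
from bs4 import BeautifulSoup

def extract_paragraphs(page_source):
    """Parse one page's HTML and return the text of all its paragraphs."""
    soup = BeautifulSoup(page_source, 'html.parser')
    return [p.text for p in soup.find_all('p')]

# Hypothetical stand-ins: in practice, fetch each page with Selenium
# (driver.get(url); html = driver.page_source) before parsing.
pages = {
    "https://example.com/page1": "<p>First paragraph</p><p>Second paragraph</p>",
    "https://example.com/page2": "<p>Another page</p>",
}

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "paragraph"])  # header row
    for url, html in pages.items():
        for text in extract_paragraphs(html):
            writer.writerow([url, text])
```

Keeping the parsing in its own function makes it easy to reuse the same logic for every page in the loop.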

Extra Features (Taking Screenshot, Filling forms and more)

Let's explore some more amazing features of Python Selenium.

Taking a screenshot:

driver.save_screenshot('Page_click.png')

Maximizing the window:

driver.maximize_window()

You can also fill out forms or text inputs by first locating the element.

For example, let's do a Google search for restaurants in Paris:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.google.com")

# Google's search box is a text input named "q"
search_box = driver.find_element(By.NAME, "q")
search_query = "Restaurants in Paris"
search_box.send_keys(search_query)
search_box.send_keys(Keys.RETURN)  # press Enter to submit the search

time.sleep(5)  # wait for the results page to load

driver.quit()


The possibilities are endless: any task you can do manually in a browser, you can do with Selenium.

Bypass Antibot Mechanisms

One thing that haunts even the best data extraction experts is getting blocked.

But no worries, we have a solution:

If you get blocked, one of the best ways to keep the automation running and get the data you need is to use an API that handles the antibot bypass for you.

The GetOData API allows you to extract large amounts of data without getting blocked.

It also has a JS Actions feature that lets you perform user interactions just like you would in Selenium. Check out the service here:

GetOData: The Most Advanced Web Scraping API
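As a rough sketch of how such a scraping API is typically called: you send an HTTP request carrying your API key and the target URL as parameters, and the service returns the rendered page. The endpoint and parameter names below are hypothetical placeholders, not GetOData's actual API; check the provider's documentation for the real ones:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names -- consult the provider's docs
API_ENDPOINT = "https://api.example.com/scrape"
API_KEY = "YOUR_API_KEY"

def build_request_url(target_url, render_js=True):
    """Build a request URL for a scraping API (hypothetical schema)."""
    params = {
        "api_key": API_KEY,
        "url": target_url,
        "render_js": "true" if render_js else "false",
    }
    return API_ENDPOINT + "?" + urlencode(params)

request_url = build_request_url("https://www.google.com")
# The rendered HTML could then be fetched with, e.g., urllib.request.urlopen(request_url)
print(request_url)
```

The render_js-style flag is what replaces Selenium's role here: the API runs the JavaScript on its side and hands you the final HTML.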

And that's it. Thanks for reading, and I hope you find this article useful in your scraping journey.