A 4-Step Method for Web Scraping Data from Any Website (From a Data Extraction Expert)
Got blocked again while web scraping, or looking for the best way to scrape data from any website?
No worries. Here are the best tips to get you unblocked, even if the website uses the best anti-bot mechanisms available.
I have scraped hundreds of websites as a data extraction expert, and I have used these steps on almost every data extraction project I have taken on.
Bonus tip: If you don't want to learn web scraping from scratch and just want to start getting data without fighting anti-bot mechanisms yourself, jump directly to the 4th step.
So let's get started:
Analyze the Website
The first step is to analyze the website you want to scrape.
There are usually three ways a website serves its data:
Static website (JavaScript is not required to view the data)
Dynamic website (JavaScript is required to view the data)
A backend API that fetches the data and is consumed by the frontend
To see whether the data is static or dynamic, disable JavaScript on the page through the browser's developer tools (Inspect Element) and check whether the data still shows up.
Your goal is to find the easiest way to get the data.
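You can also verify this programmatically: fetch the raw HTML with a plain HTTP client and look for a piece of data you can see in the browser. If it's missing, the page almost certainly renders it with JavaScript. A minimal sketch (the URL and expected text are placeholders):

```python
import requests

# Placeholder URL and a piece of data you can see in the browser.
URL = "https://example.com/products"
EXPECTED_TEXT = "Product Name"

# Fetch the raw HTML -- no JavaScript is executed here.
response = requests.get(URL, timeout=10)

if EXPECTED_TEXT in response.text:
    print("Static: the data is already in the raw HTML.")
else:
    print("Dynamic: the data is probably rendered by JavaScript.")
```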
Here is my tech stack, depending on how the website serves its data (a quick sketch of both cases follows):
For static websites:
Beautiful Soup, or
Scrapy
For dynamic websites:
Selenium with Beautiful Soup or Scrapy
Splash with Scrapy
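For instance, a static site usually only needs requests plus Beautiful Soup, while a dynamic one needs a real browser to render the page first. A minimal sketch of both (the URL and CSS selector are placeholders, adjust them to the site's actual markup):

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://example.com/products"  # placeholder

# Static site: the data is already in the HTML response.
html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
for item in soup.select("div.product h2"):  # placeholder selector
    print(item.get_text(strip=True))

# Dynamic site: let a real browser render the page first.
driver = webdriver.Chrome()
driver.get(URL)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```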
Create the Scraping Program and Mimic Human Behavior
For now, you can create a simple program that just fetches the HTML content from the website and works through multiple URLs at the same time.
To avoid getting blocked, the initial setup I do for every project is:
Add a real User-Agent to your program. You can get yours by googling "My User Agent".
Scrape slowly (see the sketch after this list).
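Putting those two tips together, a minimal starting point might look like this (the URLs and User-Agent string are placeholders; use your own):

```python
import time
import requests

# A real browser User-Agent -- replace with your own.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
}

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholders

session = requests.Session()
session.headers.update(HEADERS)

for url in urls:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(3)  # scrape slowly: pause between requests
```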
After running the program, the next steps depend on what status codes the website returns.
If you are getting the HTML content for every page, you don't need the steps below; just output the data and use it as needed.
But with many websites, you will get blocked.
Some issues, like a 429 error (which means "too many requests" in a short amount of time), can be solved by simply slowing your scraper down, for example with the backoff sketched below.
Other errors, where the website has detected that you are a bot and blocked you, are solved by mimicking human behavior.
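A simple way to handle a 429 is to retry with an increasing delay (exponential backoff). A rough sketch:

```python
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry a request with an increasing delay whenever we get a 429."""
    delay = 2  # seconds
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        print(f"Got 429, waiting {delay}s (attempt {attempt + 1})")
        time.sleep(delay)
        delay *= 2  # double the wait each time
    return response
```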
In such cases, here are some more advanced steps you can take to get unblocked:
Add real request headers to your program (all of them). You can use Postman to test which request headers the website requires.
Use different tech: switch from Selenium to Puppeteer or Splash and see which one gets the data correctly.
Use the behind-the-scenes API that feeds the website's frontend, instead of hitting the frontend URL directly. You can check whether one exists by opening the browser's developer tools and heading to the Network tab.
Use proxy rotation services.
Use premium proxies.
Rotate your User-Agents.
If your program performs interactions like clicking or typing, add random delays between actions (several of these tips are combined in the sketch below).
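As an illustration of several of these tips combined, here is a rough sketch that rotates User-Agents and proxies, sends fuller browser headers, and adds random delays. The proxy endpoints and header values are placeholders, not a real provider's:

```python
import random
import time
import requests

# A small pool of real browser User-Agents to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

# Placeholder proxies -- swap in your provider's actual endpoints.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def fetch(url):
    proxy = random.choice(PROXIES)
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        # Real browsers send more than a User-Agent; copy the full set
        # from Postman or your browser's Network tab.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    time.sleep(random.uniform(2, 6))  # random delay to look less bot-like
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```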
Final step: use an API like the GetOData API, which does all of the above steps for you, so you can get the data directly without worrying about blocks or captchas.
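Most scraping APIs of this kind follow the same pattern: you send them the target URL plus your API key, and they return the rendered HTML with the blocking handled upstream. The endpoint and parameter names below are purely illustrative, not GetOData's actual API; check the provider's docs:

```python
import requests

# Illustrative only: the endpoint and parameter names are hypothetical,
# not GetOData's actual API. Check your provider's documentation.
API_ENDPOINT = "https://api.scraping-provider.example/v1/scrape"
API_KEY = "YOUR_API_KEY"

response = requests.get(
    API_ENDPOINT,
    params={"api_key": API_KEY, "url": "https://example.com/products"},
    timeout=60,
)
html = response.text  # rendered HTML, blocks and captchas handled upstream
```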
Handling Captchas and Complex Anti-bot Mechanisms
Captchas are considered one of the biggest hurdles in web scraping.
But no worries; let's see how we can solve this issue.
There are two types of captchas:
Soft captchas: these show up only when the website detects that you are a bot or a program.
They can be solved by going back to the second step mentioned above.
Hard captchas: these do not care whether you are a human or a bot; they are displayed every time before you can access the content.
The best way to solve hard captchas is to use real human workers or AI solutions that solve the captcha for you, at a very low cost.
Here are the two services I usually use for bypassing captchas:
2Captcha: https://2captcha.com/
CapSolver: https://www.capsolver.com/
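As an example, here is a rough sketch of 2Captcha's classic in.php/res.php flow for a reCAPTCHA v2. The site key and page URL are placeholders; see their docs for the current API and other captcha types:

```python
import time
import requests

API_KEY = "YOUR_2CAPTCHA_KEY"
SITE_KEY = "the-pages-recaptcha-site-key"  # placeholder, found in the page HTML
PAGE_URL = "https://example.com/login"     # placeholder

# 1. Submit the captcha job.
submit = requests.post("http://2captcha.com/in.php", data={
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": SITE_KEY,
    "pageurl": PAGE_URL,
    "json": 1,
}, timeout=30).json()
captcha_id = submit["request"]

# 2. Poll until a worker returns the solution token.
while True:
    time.sleep(5)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY,
        "action": "get",
        "id": captcha_id,
        "json": 1,
    }, timeout=30).json()
    if result["status"] == 1:
        token = result["request"]
        break

# 3. Submit the token with the page's form (the g-recaptcha-response field).
print("Solved token:", token)
```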
Last Resort
Sometimes, even after doing everything you can, you may still get blocked by the website.
This can happen because websites can collect thousands of data points about you, such as:
Device type
The location you are accessing the data from
The time you are accessing it
Your mouse movements and typing speed
and so much more...
If even one of these looks suspicious, they can block your request.
So in this case, the best way to still get the data you need is to use an API that manages the anti-bot mechanisms for you, like the GetOData API covered in the final step above.
Hope you found this article useful on your scraping journey!
Feel free to ask me any questions here or on Twitter: https://twitter.com/SwapBuilds
and I will get back ASAP. Thanks for reading!