Scraping Websites with Python, Selenium and Tor

Ashhadul Islam on 2021-11-19

The |Big Data| Heist

Your beautiful scraper was running fine when you left. You came back and found this

Google guards its data notoriously well, and it is not uncommon to see a page like this when you send too many consecutive requests to Google.

How many times have you automated a script to run repeatedly in order to scrape a website, taken a coffee break, and come back to find that the website has blocked you? It sets you back by hours, causes frustration, and delays your progress.

Most websites have a defense mechanism against back-to-back requests coming from the same IP address. This is done to quash denial-of-service attacks before they can slow the website down.
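To make the idea concrete, here is a minimal sketch of how such a defense might work on the server side: a sliding-window counter keyed by client IP. The class name, the limits, and the IP addresses are all illustrative assumptions; real sites use a variety of (usually undisclosed) rules.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP.

    Hypothetical illustration only; not any particular site's implementation.
    """

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests from this IP: block it
        hits.append(now)
        return True
```

Note that the counter is kept per IP address: a request arriving from a different address starts with a clean slate, which is exactly the property the rest of this article relies on.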

One way to circumvent this problem is to space out our requests across hours or days. But in most cases, we do not have that kind of time.
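For reference, spacing requests out amounts to little more than sleeping between fetches. The sketch below assumes a caller-supplied `fetch` function (hypothetical; it could wrap any HTTP client) and delay bounds chosen purely for illustration.

```python
import random
import time


def fetch_spaced(urls, fetch, min_delay=5.0, max_delay=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` and the delay bounds are assumptions for illustration; `sleep`
    is injectable so the pacing logic can be exercised without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With an average pause of 10 seconds, 1,000 pages would take close to three hours of waiting alone, which is why this approach rarely scales.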

Tor

Tor directs Internet traffic through a free, worldwide, volunteer overlay network, consisting of more than six thousand relays, for concealing a user’s location and usage from anyone conducting network surveillance or traffic analysis.

The Tor Browser allows us to connect to the internet and send requests from different IP addresses, so the destination never learns the actual origin of the request. Every time you connect through Tor, it is as if you are assigned a new IP address.

The www is your oyster

We just need to control the Tor Browser through code. That is where Selenium comes into the picture.

Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well. — Selenium

Let us now see how we can use Selenium, Python, and Tor to access different websites in a macOS environment.
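In broad strokes, one common way to combine the three is to point a Selenium-driven Firefox at Tor's local SOCKS proxy. The sketch below is a rough outline under stated assumptions, not necessarily the exact setup described in this article: it drives a regular Firefox rather than the Tor Browser binary itself, it assumes Tor is already running locally and listening on port 9150 (the Tor Browser default; a standalone tor daemon usually listens on 9050), and the helper names are hypothetical.

```python
TOR_SOCKS_HOST = "127.0.0.1"
TOR_SOCKS_PORT = 9150  # assumption: Tor Browser default; standalone tor often uses 9050


def tor_proxy_preferences(host=TOR_SOCKS_HOST, port=TOR_SOCKS_PORT):
    """Return Firefox preferences that route traffic through a SOCKS5 proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,
        # Resolve hostnames through the proxy too, so DNS lookups do not leak.
        "network.proxy.socks_remote_dns": True,
    }


def build_tor_firefox(preferences=None):
    """Start a Firefox session configured with the given proxy preferences.

    Requires Selenium and a working Firefox/geckodriver installation.
    """
    # Imported lazily so the preference helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in (preferences or tor_proxy_preferences()).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

With Tor running, calling `build_tor_firefox()` and then `driver.get("https://check.torproject.org")` on the returned driver is a quick way to check whether traffic is actually leaving through Tor; remember to call `driver.quit()` afterwards.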

Firefox