Scraping Websites with Python, Selenium, and Tor
The |Big Data| Heist
Google guards its data notoriously well, and it is not uncommon to be shown a block page when you send too many consecutive requests to Google.
How many times have you automated a script to run repeatedly in order to scrape a website, taken a coffee break, and then come back to find that the website has blocked your advances? It sets you back by hours, causes frustration, and delays your progress.
Most websites have a defense mechanism against back-to-back requests coming from the same IP address. This lets them quash denial-of-service attacks before they can slow the website down.
One way to get around this problem is to space out our requests across hours or days. But in most cases, we do not have that kind of time.
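To make the cost of that approach concrete, here is a minimal sketch of spacing out requests with a random pause between them. The helper name `fetch_politely`, its parameters, and the default delays are all assumptions made for illustration; `fetch` stands in for whatever function actually downloads a page.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is any callable that takes a URL and returns a result
    (hypothetical here; plug in your own download function).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With these default delays of 2 to 5 seconds, a thousand pages would spend close to an hour just waiting, and a site with stricter limits would need far longer pauses than that.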
Tor offers another way. It directs Internet traffic through a free, worldwide, volunteer overlay network of more than six thousand relays, concealing a user’s location and usage from anyone conducting network surveillance or traffic analysis.
The Tor Browser allows us to connect to the internet and send requests from different IP addresses, so the destination never knows the actual origin of the request. Every time you connect through Tor, it is as if you are assigned a new IP address.
We just need to control the Tor Browser through code. That is where Selenium comes into the picture.
Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well. — Selenium
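As a rough preview of how the two pieces can fit together, the sketch below points a Selenium-driven Firefox at Tor's local SOCKS proxy. It is one possible setup, not necessarily the one used in the rest of this article. It assumes the Tor Browser is already running and listening on its default SOCKS port 9150 (a standalone `tor` daemon usually listens on 9050 instead), and that Selenium 4 and a Firefox driver are installed. The function names `tor_proxy_prefs` and `make_tor_driver` are made up for this example.

```python
def tor_proxy_prefs(host="127.0.0.1", port=9150):
    """Firefox preferences that send traffic through a local Tor SOCKS proxy.

    The host and port are assumptions: 9150 is the Tor Browser default,
    9050 is the usual port for a standalone tor daemon.
    """
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,
        "network.proxy.socks_remote_dns": True,  # resolve DNS through Tor too
    }


def make_tor_driver(host="127.0.0.1", port=9150):
    """Start a Firefox session whose traffic goes through the Tor proxy."""
    # Imported lazily so tor_proxy_prefs() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in tor_proxy_prefs(host, port).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Calling `driver = make_tor_driver()` followed by `driver.get("https://check.torproject.org")` is a quick way to check whether the browser session is really going through Tor; remember to call `driver.quit()` afterwards.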
Let us now see how we can use Selenium, Python, and Tor to access different websites in a macOS environment.