Sometimes a project requires crawling information from the web. Websites, however, generally do not welcome crawlers, for obvious reasons, and many implement mechanisms to detect and block them. In this post, I will explain how to crawl websites without exposing your own identity and how, if the crawler does get blocked, it can change its identity and bypass the block. To do this, we will route traffic through the Tor network and expose it to the crawler via Privoxy. The crawler itself will be a simple one built with Scrapy.
Installing and Configuring Tor with Privoxy
Now, let’s install Tor and Privoxy. On Debian/Ubuntu, you should be able to install them using the commands below:
sudo apt-get update
sudo apt-get install tor tor-geoipdb privoxy
Configuring Tor
If you just want to route traffic through Tor, you don’t need to edit anything. However, if you’d like to control Tor from a script, for example to request a new identity automatically, you need to set the control port and a control password. First, generate a hash of your secure password using the command below (replace PASSWORDHERE with your password):
tor --hash-password PASSWORDHERE
Next, copy the generated hash and add the lines below to the end of /etc/tor/torrc (replace GENERATEDHASH with the generated hash):
ControlPort 9051
HashedControlPassword GENERATEDHASH
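Once Tor is running and the stem library is installed (both of which we do below), you can optionally verify that the control port and password are accepted by authenticating from a short Python script. This is a minimal sketch, assuming the settings above and that you replace PASSWORDHERE with your actual password:

from stem.control import Controller

# Connect to Tor's control port and authenticate with the password
# whose hash was placed in /etc/tor/torrc
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='PASSWORDHERE')
    print('Authenticated; Tor version:', controller.get_version())

If authentication fails, double-check the ControlPort and HashedControlPassword lines and restart Tor.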
Configuring Privoxy
With your favorite editor, add the lines below to the end of /etc/privoxy/config:
forward-socks5t / 127.0.0.1:9050 .

# Optional
keep-alive-timeout 600
default-server-timeout 600
socket-timeout 600
Now that everything is configured, all you have to do is start the services by running:
sudo service privoxy start
sudo service tor start
To test that everything is working properly, curl http://ifconfig.me to get your current IP, then repeat the request through Tor and through the Privoxy proxy. Your direct IP must be different from the other two.
curl http://ifconfig.me  # get your current IP
torify curl http://ifconfig.me  # test Tor
curl -x 127.0.0.1:8118 https://ifconfig.me  # test Privoxy
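If you prefer to run the same check from Python (closer to what the crawler will be doing), a short script using the requests library can compare your direct IP with the IP seen through Privoxy. This is only a sketch; it assumes requests is installed (we install it below with the other dependencies) and that ifconfig.me/ip returns the caller's address as plain text:

import requests

# Privoxy listens on 8118 and forwards to Tor's SOCKS port (9050)
proxies = {
    'http': 'http://127.0.0.1:8118',
    'https': 'http://127.0.0.1:8118',
}

direct_ip = requests.get('https://ifconfig.me/ip').text.strip()
proxied_ip = requests.get('https://ifconfig.me/ip', proxies=proxies).text.strip()

print('Direct IP:', direct_ip)
print('Via Tor/Privoxy:', proxied_ip)
assert direct_ip != proxied_ip, 'traffic does not appear to go through Tor'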
Crawling using Scrapy with Tor
Scrapy is a great Python framework for building crawlers: it is easy to use and offers great customization. We will be using it in this post; however, the method generally applies to other languages and frameworks as well. Let’s create a new project and a spider using Scrapy:
python3 -m venv venv  # create a virtual environment
source venv/bin/activate  # activate it
pip install -U scrapy stem requests[socks]  # install dependencies
scrapy startproject mokha; cd mokha  # create a project
scrapy genspider ifconfig ifconfig.me  # create a spider
mkdir mokha/middlewares; touch mokha/middlewares/__init__.py  # create a middlewares package
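For reference, the project layout after these commands should look roughly like the following (the exact files generated by Scrapy's template may vary slightly between versions). Note that the new middlewares package lives inside the inner mokha package so that Scrapy can import it as mokha.middlewares:

mokha/                  # project root, contains scrapy.cfg
├── scrapy.cfg
└── mokha/              # project package
    ├── __init__.py
    ├── items.py
    ├── middlewares.py  # default module generated by Scrapy (unused here)
    ├── middlewares/    # our new package; ProxyMiddleware.py goes here
    │   └── __init__.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── ifconfig.py

Because the middlewares directory is a package (it has an __init__.py), Python resolves mokha.middlewares to it rather than to the default middlewares.py module.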
To activate the middleware, add the following lines at the end of settings.py:
DOWNLOADER_MIDDLEWARES = {
    'mokha.middlewares.ProxyMiddleware.ProxyMiddleware': 543,
}
Create ProxyMiddleware.py inside the mokha/middlewares package and place the following code in it. The function new_tor_identity simply sends a signal to the Tor controller asking it to issue us a new identity. Make sure to change the password PASSWORDHERE to the one you used earlier when configuring Tor. You can call the function from either process_request or process_response: if you want a new identity for every request, call it in the former; if you only want a new identity in certain situations (e.g. you have been blocked), call it from process_response after verifying that you have been blocked.
from stem import Signal
from stem.control import Controller
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


def new_tor_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='PASSWORDHERE')
        controller.signal(Signal.NEWNYM)


class ProxyMiddleware(HttpProxyMiddleware):
    def process_response(self, request, response, spider):
        # Get a new identity depending on the response
        if response.status != 200:
            new_tor_identity()
            # Don't let the duplicate filter drop the retried request
            request.dont_filter = True
            return request
        return response

    def process_request(self, request, spider):
        # Get a new identity for every request
        # (comment this out if you only want a new identity through process_response)
        new_tor_identity()
        # Route the request through the Privoxy proxy
        request.meta['proxy'] = 'http://127.0.0.1:8118'
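Before wiring the middleware into Scrapy, you can sanity-check the identity switch on its own: request a new identity, fetch your IP through Privoxy, and repeat. The sketch below reuses new_tor_identity under the same assumptions as above (Tor and Privoxy running, stem and requests installed, PASSWORDHERE replaced):

import time
import requests
from stem import Signal
from stem.control import Controller

PROXIES = {'http': 'http://127.0.0.1:8118', 'https': 'http://127.0.0.1:8118'}

def new_tor_identity():
    # Ask the Tor controller to switch to a new circuit (and hence exit IP)
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='PASSWORDHERE')
        controller.signal(Signal.NEWNYM)

for _ in range(3):
    ip = requests.get('https://ifconfig.me/ip', proxies=PROXIES).text.strip()
    print('Current exit IP:', ip)
    new_tor_identity()
    time.sleep(10)  # give Tor time to build a new circuit

Tor does not guarantee a different exit node for every NEWNYM signal, so two consecutive IPs may occasionally match, but over a few iterations the address should change.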
Lastly, implement the spider (mokha/spiders/ifconfig.py). This is a very simple proof-of-concept crawler: all it does is query a single page and log our IP.
import scrapy


class IfconfigSpider(scrapy.Spider):
    name = 'ifconfig'
    allowed_domains = ['ifconfig.me']
    start_urls = ['http://ifconfig.me/']

    def parse(self, response):
        self.log('IP : %s' % response.css('#ip_address').get())
Running the command scrapy crawl ifconfig twice should report two different IPs, indicating that everything works as intended. I hope you have found this post helpful.