Scrapy: a simple example of how to use TTProxy

If you have read the first two blog posts about Scrapy, "Scrapy: A Python web scraping framework" and "Web crawler development - Get Started with Scrapy", you should already have a general understanding of Scrapy.
The next thing you may want to know is how to use a proxy to crawl data in a Scrapy project.
In this post, we will walk through a simple example of using TTProxy proxy IPs in Scrapy.
Preparation
First, you need to purchase a proxy certificate in the TTProxy Management Console. If you are a newly registered user, TTProxy gives you a 100 MB free-traffic certificate for testing, so you can start crawling data with it right away.
Then you need a Scrapy project, either a new one or one you already have. You can create a new Scrapy project with the following command:
scrapy startproject tutorial
Add credential configuration
In settings.py, add two settings to store the TTProxy credentials (the license and secret that come with your certificate):
TTPROXY_LICENSE = "your license"
TTPROXY_SECRET = "your license's secret"
Then enable TutorialDownloaderMiddleware in the same file:
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
}
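Optionally, you can make the middleware fail fast when these credentials are missing. The following from_crawler check is our own addition, not part of the original tutorial (the full middleware later in this post uses from_crawler to connect a signal, so the check would be merged into it); raising NotConfigured tells Scrapy to disable the component:

from scrapy.exceptions import NotConfigured

@classmethod
def from_crawler(cls, crawler):
    # Disable the middleware when the TTProxy credentials are not set.
    if not crawler.settings.get("TTPROXY_LICENSE") or not crawler.settings.get("TTPROXY_SECRET"):
        raise NotConfigured("TTPROXY_LICENSE and TTPROXY_SECRET must be set in settings.py")
    return cls()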
Modify the downloader middleware
The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.
Before the proxy can be used, a proxy IP must be obtained; we can use the requests library for that. Define a get_proxy method in TutorialDownloaderMiddleware in middlewares.py to fetch a proxy from the TTProxy API. The code is as follows:
def get_proxy(self, license, secret):
    params = {
        "license": license,
        "time": int(time.time()),
        "cnt": 1,  # number of proxies to obtain
    }
    # The sign is the MD5 hex digest of license + timestamp + secret.
    params["sign"] = hashlib.md5(
        (params["license"] + str(params["time"]) + secret).encode('utf-8')
    ).hexdigest()
    try:
        response = requests.get(
            url="https://api.ttproxy.com/v1/obtain",
            params=params,
            headers={
                "Content-Type": "text/plain; charset=utf-8",
            },
            data="1"
        )
        res = json.loads(response.content)
        return res["data"]["proxies"][0]
    except requests.exceptions.RequestException:
        print('HTTP Request failed')
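To make the signing step concrete: the sign parameter is just the MD5 hex digest of the license, the Unix timestamp, and the secret concatenated as strings. Here is a standalone sketch you can run in a Python shell, with placeholder credentials:

import hashlib
import time

license_key = "your license"          # placeholder values, not real credentials
secret = "your license's secret"

ts = int(time.time())
sign = hashlib.md5((license_key + str(ts) + secret).encode("utf-8")).hexdigest()
print(sign)  # a 32-character hex string, which is what the API expects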
Then we can call get_proxy from the process_request method and attach the obtained proxy to the request; Scrapy reads the proxy key in request.meta and routes the request through that proxy:
request.meta["proxy"] = "http://" + self.get_proxy(
    spider.settings["TTPROXY_LICENSE"],
    spider.settings["TTPROXY_SECRET"],
)
The complete TutorialDownloaderMiddleware code is as follows:
from scrapy import signals
import hashlib
import requests
import time
import json

# ...

class TutorialDownloaderMiddleware(object):

    def get_proxy(self, license, secret):
        params = {
            "license": license,
            "time": int(time.time()),
            "cnt": 1,  # number of proxies to obtain
        }
        # The sign is the MD5 hex digest of license + timestamp + secret.
        params["sign"] = hashlib.md5(
            (params["license"] + str(params["time"]) + secret).encode('utf-8')
        ).hexdigest()
        try:
            response = requests.get(
                url="https://api.ttproxy.com/v1/obtain",
                params=params,
                headers={
                    "Content-Type": "text/plain; charset=utf-8",
                },
                data="1"
            )
            res = json.loads(response.content)
            return res["data"]["proxies"][0]
        except requests.exceptions.RequestException:
            print('HTTP Request failed')

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Fetch a proxy and attach it to the outgoing request.
        request.meta["proxy"] = "http://" + self.get_proxy(
            spider.settings["TTPROXY_LICENSE"],
            spider.settings["TTPROXY_SECRET"],
        )
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
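Note that the middleware above calls the TTProxy API once for every Scrapy request, and if get_proxy returns None after a failed API call, the string concatenation in process_request will raise a TypeError. Here is a minimal hardening sketch; the caching attribute and fallback behavior are our own additions, not part of the TTProxy tutorial:

def process_request(self, request, spider):
    proxy = self.get_proxy(
        spider.settings["TTPROXY_LICENSE"],
        spider.settings["TTPROXY_SECRET"],
    )
    if proxy:
        self._cached_proxy = proxy  # remember the last proxy we obtained
    cached = getattr(self, "_cached_proxy", None)
    if cached:
        # Fall back to the last known proxy if the API call failed.
        request.meta["proxy"] = "http://" + cached
    return None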
A simple crawler
Use the following command to create a simple spider whose only job is to verify that the proxy IP is set successfully:
scrapy genspider myip api.myip.com
Then modify myip.py so it requests http://api.myip.com/ and prints the IP address seen by the server:
# -*- coding: utf-8 -*-
import scrapy


class MyipSpider(scrapy.Spider):
    name = 'myip'
    allowed_domains = ['api.myip.com']
    start_urls = ['http://api.myip.com/']

    def parse(self, response):
        print(response.body)
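Since api.myip.com returns JSON (see the sample output below), you could also decode the body instead of printing raw bytes. A small variant of parse, assuming the response keeps the ip and country fields:

import json

def parse(self, response):
    data = json.loads(response.body)
    self.logger.info("Exit IP: %s (%s)", data["ip"], data["country"])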
Run
Now you can run the myip crawler using the following command:
scrapy crawl myip --nolog
Then you will see the IP address reported by http://api.myip.com/. By comparing it with your local public IP address, you can easily tell whether the TTProxy proxy was set up successfully.
{"ip":"158.140.163.14","country":"Indonesia","cc":"ID"}
Conclusion
Through this simple example, you should now know how to use TTProxy in a Scrapy project.