Scrapy: a simple example of how to use TTProxy

Azura Liu

If you have read the first two blog posts about Scrapy, <Scrapy: A Python web scraping framework> and <Web crawler development - Get Started with Scrapy>, you should already have a general understanding of Scrapy.

You may now be wondering how to use a proxy to crawl data in a Scrapy project.

In this post, let's walk through a simple example of how to use TTProxy proxy IPs in Scrapy.

Preparation

First, you need to purchase a proxy certificate in the TTProxy Management Console. If you are a newly registered user, TTProxy gives you a free 100 MB traffic certificate for testing, which you can use to crawl data right away.

You also need a Scrapy project, either a new one or one you already have. You can create a new Scrapy project with the following command:

scrapy startproject tutorial
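This scaffolds a project skeleton like the following; the two files we will edit below are settings.py and middlewares.py:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py
        middlewares.py    # downloader middleware lives here
        pipelines.py
        settings.py       # project settings, including our credentials
        spiders/          # directory where spiders are placed
            __init__.py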

Add credential configuration

In settings.py, add two settings to store the TTProxy credentials:

TTPROXY_LICENSE = "your license"
TTPROXY_SECRET = "your license's secret"

Then enable TutorialDownloaderMiddleware (543 is the priority used in Scrapy's project template; it determines the order in which downloader middlewares run):

DOWNLOADER_MIDDLEWARES = {
  'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
}

Modify the downloader middleware

The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.
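Concretely, Scrapy calls a middleware's hook methods around every download. A minimal no-op downloader middleware looks like this (the hook names are Scrapy's own; the class name is just for illustration):

class NoOpDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Called for each request before it is sent to the downloader;
        # returning None lets processing continue normally.
        return None

    def process_response(self, request, response, spider):
        # Called for each response before it reaches the spider.
        return response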

Before using a proxy, we must first obtain a proxy IP; we can use the requests library for this. Define a get_proxy method on TutorialDownloaderMiddleware in middlewares.py to obtain a proxy. The code is as follows:

def get_proxy(self, license, secret):
    params = {
        "license": license,
        "time": int(time.time()),
        "cnt": 1,  # number of proxies to obtain
    }
    # sign is the hex MD5 digest of license + time + secret
    params["sign"] = hashlib.md5(
        (params["license"] + str(params["time"]) + secret).encode("utf-8")
    ).hexdigest()
    try:
        response = requests.get(
            url="https://api.ttproxy.com/v1/obtain",
            params=params,
            headers={
                "Content-Type": "text/plain; charset=utf-8",
            },
            data="1",
        )
        res = json.loads(response.content)
        # Return the first proxy, e.g. "ip:port"
        return res["data"]["proxies"][0]
    except requests.exceptions.RequestException:
        print('HTTP Request failed')
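For reference, the parsing above assumes the obtain endpoint returns a JSON body shaped roughly like the following (a hypothetical example inferred from the res["data"]["proxies"][0] access; consult the TTProxy API documentation for the exact schema):

{
    "data": {
        "proxies": ["123.45.67.89:8080"]
    }
}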

Then we can call get_proxy in the process_request method and set the obtained proxy on the request:

request.meta["proxy"] = "http://" + self.get_proxy(spider.settings["TTPROXY_LICENSE"], spider.settings["TTPROXY_SECRET"])

The complete TutorialDownloaderMiddleware code is as follows:

from scrapy import signals

import hashlib
import json
import requests
import time

# ...

class TutorialDownloaderMiddleware(object):

    def get_proxy(self, license, secret):
        params = {
            "license": license,
            "time": int(time.time()),
            "cnt": 1,  # number of proxies to obtain
        }
        # sign is the hex MD5 digest of license + time + secret
        params["sign"] = hashlib.md5(
            (params["license"] + str(params["time"]) + secret).encode("utf-8")
        ).hexdigest()
        try:
            response = requests.get(
                url="https://api.ttproxy.com/v1/obtain",
                params=params,
                headers={
                    "Content-Type": "text/plain; charset=utf-8",
                },
                data="1",
            )
            res = json.loads(response.content)
            # Return the first proxy, e.g. "ip:port"
            return res["data"]["proxies"][0]
        except requests.exceptions.RequestException:
            print('HTTP Request failed')
            
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Obtain a fresh proxy and attach it to the outgoing request
        request.meta["proxy"] = "http://" + self.get_proxy(spider.settings["TTPROXY_LICENSE"], spider.settings["TTPROXY_SECRET"])
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
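One caveat: get_proxy returns None when the HTTP request fails, so the string concatenation in process_request would raise a TypeError. A slightly more defensive process_request (a sketch, not part of the original tutorial) could skip the proxy instead of crashing:

    def process_request(self, request, spider):
        # Sketch: fall back to a direct connection when no proxy
        # could be obtained, instead of raising a TypeError.
        proxy = self.get_proxy(
            spider.settings["TTPROXY_LICENSE"],
            spider.settings["TTPROXY_SECRET"],
        )
        if proxy:
            request.meta["proxy"] = "http://" + proxy
        else:
            spider.logger.warning("Could not obtain a proxy; sending request directly")
        return None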

A simple crawler

Use the following command to create a simple spider that verifies whether the proxy IP is set successfully:

scrapy genspider myip api.myip.com

Then modify myip.py to fetch the current IP address by requesting http://api.myip.com/:

# -*- coding: utf-8 -*-
import scrapy


class MyipSpider(scrapy.Spider):
    name = 'myip'
    allowed_domains = ['api.myip.com']
    start_urls = ['http://api.myip.com/']

    def parse(self, response):
        # response.body is raw bytes; printing it shows the JSON
        # returned by api.myip.com
        print(response.body)

Run

Now you can run the myip spider using the following command:

scrapy crawl myip --nolog

Then you will see the IP address reported by http://api.myip.com/. By comparing it with your local public IP, you can easily tell whether the TTProxy proxy was set up successfully.

 {"ip":"158.140.163.14","country":"Indonesia","cc":"ID"}

Conclusion

Through this simple example, you should now have a good idea of how to use TTProxy in Scrapy.