Requests – the Python HTTP library for web crawlers

Azura Liu

Requests is a very useful Python HTTP client library, often used when writing crawlers and testing server responses. It is fair to say that requests covers almost everything today's HTTP work requires.

Installation

Install with pip:

$ pip install requests

Or use easy_install:

$ easy_install requests

Either of the two commands above will complete the installation (pip is the preferred tool these days).
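
To confirm that the installation worked, you can import the library and print its version:

import requests

print(requests.__version__)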

Introduction

First, let's look at a small example to get a feel for the library.

import requests

r = requests.get('http://cuiqingcai.com')
print(type(r))
print(r.status_code)
print(r.encoding)
# print(r.text)
print(r.cookies)

In the code above, we request a URL and then print the type of the response object, the status code, the encoding, the cookies, and so on.

The results are as follows:

<class 'requests.models.Response'>
200
UTF-8
<RequestsCookieJar[]>

Isn't it convenient? Don't worry, it gets even more convenient below.

Basic request

The requests library provides all the basic HTTP request methods. For example:

r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

Basic GET request

The most basic GET request can be made directly using the GET method.

r = requests.get("http://httpbin.org/get")

If you want to add query parameters to the URL, you can use the params parameter:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)

Run results:

http://httpbin.org/get?key2=value2&key1=value1

If the resource you request is JSON, you can parse the response with the json() method.

For example, write a JSON file named a.json. The content is as follows:

["foo", "bar", {
  "foo": "bar"
}]

Then serve a.json from a local web server (for example, run python -m SimpleHTTPServer 8000 on Python 2 or python -m http.server 8000 on Python 3 in the directory that contains it) and use the following program to request and parse it:

import requests

r = requests.get("http://localhost:8000/a.json")
print(r.text)
print(r.json())

The results are as follows: the first output is the raw text printed directly, the second is the structure parsed by the json() method. Notice the difference.

["foo", "bar", {
 "foo": "bar"
 }]
[u'foo', u'bar', {u'foo': u'bar'}]

If you want to get the raw socket response from the server, you can access r.raw. However, you need to set stream=True in the initial request.

>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In this way, we obtain the raw, undecoded bytes straight from the socket.
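
In practice, if you just want to save a large response to disk, it is usually easier to iterate over the decoded content in chunks with iter_content() than to read r.raw yourself. A minimal sketch (the output filename is just an example):

import requests

r = requests.get('https://github.com/timeline.json', stream=True)
with open('timeline.json', 'wb') as f:       # example filename
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:                            # skip keep-alive chunks
            f.write(chunk)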

If you want to add headers, you can pass them with the headers parameter:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}
r = requests.get("http://httpbin.org/get", params=payload, headers=headers)
print(r.url)

Any custom headers you need can be added to the request through the headers parameter.
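
For crawlers, the header you will most often want to customize is User-Agent, since some sites reject the default python-requests identifier. A small sketch (the User-Agent string below is just an example):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example UA string
r = requests.get('http://httpbin.org/get', headers=headers)
print(r.request.headers['User-Agent'])   # the header that was actually sent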

Basic POST request

For POST requests we usually need to send some parameters along. The most basic way to pass them is through the data parameter.

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

Run results:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "json": null, 
  "url": "http://httpbin.org/post"
}

You can see that the parameters were passed successfully, and the server echoes back the data we sent.

Sometimes the data we need to send is not a form at all; we need to send it as JSON instead. In that case we can serialize the dict with the json.dumps() method.

import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)

Run results:

{
  "args": {}, 
  "data": "{\"some\": \"data\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "16", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "json": {
    "some": "data"
  },  
  "url": "http://httpbin.org/post"
}

Through the above method, we can POST data in JSON format.
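
As a convenience, requests 2.4.2 and later (including the 2.9.1 shown in the output above) can do the serialization for you: pass the dict through the json parameter and the Content-Type header is set to application/json automatically.

import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, json=payload)   # requests serializes the dict itself
print(r.json()['json'])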

If you want to upload a file, just use the files parameter directly.

Create a file named test.txt containing Hello World!, then:

import requests

url = 'http://httpbin.org/post'
files = {'file': open('test.txt', 'rb')}
r = requests.post(url, files=files)
print(r.text)

You can see the results as follows:

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "Hello World!"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "156", 
    "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }, 
  "json": null, 
  "url": "http://httpbin.org/post"
}

In this way, we successfully completed the upload of a file.
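
If you need to control the filename or content type the server sees, files also accepts a (filename, file object, content type) tuple instead of a bare file object; a short sketch reusing the test.txt file from above:

import requests

url = 'http://httpbin.org/post'
files = {'file': ('report.txt', open('test.txt', 'rb'), 'text/plain')}
r = requests.post(url, files=files)
print(r.text)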

requests supports streaming uploads, which let you send large data streams or files without reading them into memory first. To stream an upload, simply provide a file-like object as your request body:

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

This is a very practical and convenient feature.

Cookies

If a response contains cookies, we can read them from the cookies attribute:

import requests

url = 'http://example.com'
r = requests.get(url)
print(r.cookies)
print(r.cookies['example_cookie_name'])

The above program is just a sample; in the same way you can use the cookies attribute to read whatever cookies a site returns.

In addition, you can use the cookies parameter to send your own cookies to the server:

import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)

Run results:

'{"cookies": {"cookies_are": "working"}}'

Cookies can be successfully sent to the server.
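
If you need finer control, for example a cookie scoped to a specific domain and path, you can build a RequestsCookieJar instead of a plain dict:

import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
r = requests.get('http://httpbin.org/cookies', cookies=jar)
print(r.text)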

Timeout settings

You can use the timeout parameter to limit how long a request may wait for a response:

requests.get('http://github.com', timeout=0.001)

Note: timeout only applies while waiting for the server to respond; it has nothing to do with downloading the response body.

In other words, it limits only how long the request waits for a response to begin. Even if the returned body is large and takes a long time to download, the timeout will not interrupt that.
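
When the timeout is exceeded, requests raises an exception rather than returning, so real code usually wraps the call; a minimal sketch:

import requests

try:
    r = requests.get('http://github.com', timeout=0.001)
except requests.exceptions.Timeout:
    print('The request timed out')
except requests.exceptions.RequestException as e:
    print('Request failed: %s' % e)

In requests 2.4.0 and later you can also pass a (connect, read) tuple, e.g. timeout=(3.05, 27), to limit the two phases separately.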

Session object

In the examples above, each call is effectively a brand-new request, as if every request were made from a separately opened browser: no session is shared between them, even when they target the same URL. For example:

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)

The result is:

{
  "cookies": {}
}

Clearly the two requests do not share a session, so the cookie set by the first is invisible to the second. What do we do when we need to maintain a persistent session with a site, the way a browser stays logged in to taobao as you jump between tabs, which is in effect one long-lived session?

The solution is as follows:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)

Here we hit the cookies endpoint twice: once to set a cookie and once to read it back.

Run results:

{
  "cookies": {
    "sessioncookie": "123456789"
  }
}

The cookie is returned successfully this time, because both requests went through the same session.

Since the session object carries shared state across requests, we can also use it for global configuration such as default headers.

import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
print(r.text)

A default header is set on the session with the s.headers.update method, and another header is passed to the individual request. So what happens?

Quite simply, both headers are sent.

Run results:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1", 
    "X-Test": "true", 
    "X-Test2": "true"
  }
}

What if the header passed to the GET call is also x-test?

r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})

Well, it will override the global configuration:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1", 
    "X-Test": "true"
  }
}

What if you want to drop one of the session-level headers for a single request? Easy: just set it to None in that request's headers.

r = s.get('http://httpbin.org/headers', headers={'x-test': None})

Run results:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.9.1"
  }
}

Well, that's the basic usage of Session objects.
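
One more detail: a Session can be used as a context manager, which makes sure its underlying connections are released when you are done with it:

import requests

with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')
    print(r.text)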

SSL Certificate Verification

HTTPS URLs are everywhere these days, and requests can verify SSL certificates for HTTPS requests, just like a web browser. Certificate verification is controlled with the verify parameter.

At the time of writing, the 12306 certificate could not be verified. Let's test it:

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=True)
print(r.text)

Result:

requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

Sure enough, it fails. Now let's try GitHub:

import requests

r = requests.get('https://github.com', verify=True)
print(r.text)

This time it is a normal, successful request; the content is not printed here.

If we want to skip the certificate validation for 12306, just set verify to False.

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=False)
print(r.text)

Now the request goes through normally. verify defaults to True, so you only need to set it yourself when you want to change that behaviour.
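
Two related options are worth knowing: verify can also point to a CA bundle file when a site uses a private certificate authority, and when you do set verify=False you can silence the InsecureRequestWarning that the bundled urllib3 prints. A sketch (the bundle path is only a placeholder):

import requests

# trust a private CA bundle instead of the system store (placeholder path)
r = requests.get('https://kyfw.12306.cn/otn/', verify='/path/to/ca-bundle.crt')

# or skip verification entirely and suppress the warning
requests.packages.urllib3.disable_warnings()
r = requests.get('https://kyfw.12306.cn/otn/', verify=False)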

Proxies

If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:

import requests

proxies = {
  "http": "http://41.118.132.69:4433"
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)

You can also configure proxies by setting the environment variables HTTP_PROXY and HTTPS_PROXY.

export HTTP_PROXY="http://10.10.1.10:3128"
export HTTPS_PROXY="http://10.10.1.10:1080"
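
If the proxy requires HTTP Basic authentication, the credentials can be embedded directly in the proxy URL (the user, password, host, and port below are placeholders):

import requests

proxies = {
  "http": "http://user:password@10.10.1.10:3128"
}
r = requests.get("http://httpbin.org/get", proxies=proxies)
print(r.text)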

Through the above examples, you can easily set up a proxy for your requests.

More APIs

The above covers the most commonly used features of requests. If you need more, please refer to the official API documentation.

Conclusion

The above summarizes the basic usage of requests. If you have a basic knowledge of crawlers, you will be able to put it to use quickly.