Python Web Crawler – Analysis of Robots Protocol
1. Robots Protocol
The robots protocol, also known as the crawler protocol or the robot protocol, is formally called the Robots Exclusion Protocol. It is used to tell crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of a website.
When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls according to the scope defined there. If the file is not found, the crawler visits all directly accessible pages.
Let's take a look at a sample of robots.txt:
User-agent: *
Disallow: /
Allow: /public/
This implements the rule that all search crawlers may crawl only the public directory. Save the above content as a robots.txt file and place it in the root directory of the website, alongside the site's entry files (such as index.php, index.html, index.jsp, etc.).
The User-agent field above names the search crawler the rules apply to; setting it to * means the rules are valid for every crawler. For example, we can set:

User-agent: BaiduSpider

This means that the rules we set apply to Baidu's crawler. If there are multiple User-agent records, multiple crawlers are subject to the crawling restrictions, but at least one record must be specified.
Disallow specifies the directories that are not allowed to be crawled. Setting it to / as in the example above means that no page may be crawled at all.
Allow is generally used together with Disallow rather than alone, to carve out exceptions to a restriction. Setting it to /public/ here means that although nothing else may be crawled, the public directory is allowed.
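These sample rules can be checked with the standard library's robotparser, which we will meet again later. This is a small sketch: example.com is a placeholder, and note that urllib.robotparser applies rules in the order they appear in the file, so the Allow line is placed before the broad Disallow here so that it can take effect.

```python
from urllib.robotparser import RobotFileParser

# The sample rules, with Allow first so it is matched before the
# catch-all Disallow (urllib.robotparser uses first-match ordering).
rules = [
    'User-agent: *',
    'Allow: /public/',
    'Disallow: /',
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of robots.txt lines

print(rp.can_fetch('*', 'http://example.com/public/index.html'))   # True
print(rp.can_fetch('*', 'http://example.com/private/index.html'))  # False
```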
Let's look at a few more examples. The code that prohibits all crawlers from accessing any directory is as follows:
User-agent: *
Disallow: /
The code that allows all crawlers to access any directory is as follows:
User-agent: *
Disallow:
In addition, it is also possible to leave the robots.txt file blank.
The code that prohibits all crawlers from accessing certain directories on the website is as follows:
User-agent: *
Disallow: /private/
Disallow: /tmp/
The code that allows only a certain crawler to access the site is as follows:
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
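We can verify that these rules behave as described, again with the standard library's robotparser. This is only a sketch; example.com and the crawler name OtherBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Only the crawler named "WebCrawler" may fetch;
# every other User-agent falls through to the catch-all ban.
rules = [
    'User-agent: WebCrawler',
    'Disallow:',
    '',
    'User-agent: *',
    'Disallow: /',
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('WebCrawler', 'http://example.com/index.html'))  # True
print(rp.can_fetch('OtherBot', 'http://example.com/index.html'))    # False
```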
These are some common ways of writing robots.txt.
2. Crawler name
You may be wondering where crawler names come from and why they are called what they are. In fact, each crawler has a fixed name; Baidu's, for example, is BaiduSpider. The following table lists the names of some common search crawlers and their corresponding websites.
Crawler name    Name          Website
BaiduSpider     Baidu         www.baidu.com
Googlebot       Google        www.google.com
360Spider       360 Search    www.so.com
YodaoBot        Youdao        www.youdao.com
ia_archiver     Alexa         www.alexa.cn
Scooter         AltaVista     www.altavista.com
3. robotparser
After understanding the robots protocol, we can use the robotparser module to parse robots.txt. This module provides a class, RobotFileParser, which can determine whether a crawler has permission to crawl a webpage based on a website's robots.txt file.
This class is very simple to use: you need only pass the link to robots.txt into its constructor. First look at its declaration:

urllib.robotparser.RobotFileParser(url='')

Of course, it can also be declared without passing a URL; the default is an empty string, and you can set the link later with the set_url() method.
The following are a few common methods for this class.
set_url(): sets the link to the robots.txt file. If a link was already passed in when the RobotFileParser object was created, this method is not needed.
read(): fetches robots.txt and analyzes it. Note that this method performs the actual fetch-and-parse operation; if it is never called, every subsequent judgment returns False, so remember to call it. It returns nothing, but the read is performed.
parse(): parses robots.txt content. The argument passed in is a list of lines from a robots.txt file, which it analyzes according to the syntax rules of robots.txt.
can_fetch(): takes two parameters, the first a User-agent and the second the URL to be fetched, and returns True or False depending on whether that crawler may fetch the URL.
mtime(): returns the time at which robots.txt was last fetched and parsed. This is useful for long-running search crawlers, which may need to check regularly for the latest robots.txt.
modified(): sets the current time as the time robots.txt was last fetched and parsed. It is likewise useful for long-running search crawlers.
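As a small illustration of mtime() and modified() (the one-hour refresh interval in the comment is an arbitrary choice, not part of the API):

```python
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
print(rp.mtime())                    # 0 – robots.txt has never been fetched

rp.modified()                        # record "now" as the last check time
print(time.time() - rp.mtime() < 5)  # True – the timestamp was just set

# A long-running crawler might refresh its cached copy periodically:
#     if time.time() - rp.mtime() > 3600:   # older than an hour? (arbitrary)
#         rp.read()        # re-fetch and re-parse robots.txt
#         rp.modified()    # note when we did so
```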
Let's take a look at an example below:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
Here we take Jianshu as an example. First create a RobotFileParser object, then set the link to robots.txt with the set_url() method. Of course, instead of using that method, you can pass the link directly when declaring the object:
rp = RobotFileParser('http://www.jianshu.com/robots.txt')
Then we use the can_fetch() method to determine whether the web page can be crawled.
The operation results are as follows:

True
False
You can also use the parse() method to perform the reading and analysis. An example follows:
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
The running result is the same:
This blog post has introduced the basic usage of the robotparser module through examples. With it, we can easily determine which pages our crawler is allowed to fetch and which it is not.
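Putting the pieces together, here is a sketch of a "polite" fetch helper that consults robots.txt before every request. The function name polite_fetch and the user agent string MyCrawler are made up for illustration, and example.com is a placeholder.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

def polite_fetch(rp, url, user_agent='MyCrawler'):
    """Fetch url only if the parsed robots.txt allows it; else return None."""
    if not rp.can_fetch(user_agent, url):
        return None          # crawling this URL is disallowed
    return urlopen(url).read()

# Demonstrate with locally supplied rules (no network access needed).
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(polite_fetch(rp, 'http://example.com/private/secret.html'))  # None
```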