I have been using Python for more than a year. The scenarios where I use Python most are rapid web development, crawling, and automated operations: I have written simple websites, automatic posting scripts, email sending and receiving scripts, and simple CAPTCHA recognition scripts.
A lot of code gets reused while developing crawlers, so here is a summary of it; it may save some work in the future.
1. Basic web page fetching
GET method

import urllib2

url = "http://"
response = urllib2.urlopen(url)
print response.read()
POST method

import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2. Use proxy IP
While developing crawlers, your IP often gets blocked; this is when you need proxy IPs.
There is a ProxyHandler class in the urllib2 package, through which you can set up a proxy to access web pages, as shown in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://')
print response.read()
3. Cookies processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookie objects to outgoing HTTP requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage collected; none of this needs to be handled manually.
Manually add cookies
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="request.add_header("Cookie", cookie)
4. Disguise as a browser
Some websites object to crawler visits and reject all requests from crawlers, so accessing them directly with urllib2 often results in HTTP Error 403: Forbidden.
Pay special attention to certain headers, since the server side checks them:
User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content of the HTTP body.
This can be dealt with by modifying the headers in the HTTP request, as in the following code snippet:
import urllib2

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
5. Page parsing
The most powerful tool for page parsing is, of course, the regular expression. Regexes differ for different users on different websites, so there is no need to explain them in detail; two useful URLs are attached, and a small sketch follows the links below:
Getting started with regular expressions: http://
Regular expression online test: http://tool.oschina.net/regex/
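As a minimal sketch of regex-based extraction (the URL is a placeholder and the pattern is only illustrative, not a robust HTML parser), pulling all href values out of a page might look like this:

import re
import urllib2

# Fetch the page (placeholder URL) and pull out everything that looks like an href value.
# The pattern is deliberately naive; real pages may need a more careful expression.
html = urllib2.urlopen('http://abcde.com').read()
for link in re.findall(r'href="(.*?)"', html):
    print link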
The second option is a parsing library; the two commonly used ones are lxml and BeautifulSoup. Two good guides on using them:
lxml: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup: http://cuiqingcai.com/1319.html
My evaluation of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is pure Python and therefore less efficient, but it has practical features; for example, you can get the source of a given HTML node from a search result. lxml is written in C, is efficient, and supports XPath.
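For comparison, here is a minimal BeautifulSoup sketch (assuming the bs4 package is installed; the URL is a placeholder) that extracts the same links without a hand-written regex:

import urllib2
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL) and let BeautifulSoup build a parse tree
html = urllib2.urlopen('http://abcde.com').read()
soup = BeautifulSoup(html)

# Walk every <a> tag and print its href attribute
for a in soup.find_all('a'):
    print a.get('href')

The equivalent with lxml would build a tree with lxml.html.fromstring(html) and query it with an XPath expression such as '//a/@href'.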
6. Handling verification codes (CAPTCHAs)
Some simple verification codes can be recognized with simple techniques, and simple recognition is all I have ever done. Some inhuman CAPTCHAs, such as 12306's, can be solved manually through a human captcha-solving platform, which of course costs a fee. A rough sketch of simple recognition follows.
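As a rough sketch only (not what the original scripts used): a clean image CAPTCHA can sometimes be read with an OCR library such as pytesseract, assuming Pillow and pytesseract are installed and 'captcha.png' stands in for the downloaded image:

from PIL import Image
import pytesseract

# Load the downloaded CAPTCHA image (placeholder file name) and convert to grayscale
img = Image.open('captcha.png').convert('L')
# Run OCR over the image; this only works for clean, simple CAPTCHAs
print pytesseract.image_to_string(img)

Anything with heavy noise or distortion usually needs extra preprocessing (thresholding, segmentation) or a trained model, which is beyond a simple sketch.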
7. gzip compression
Have you ever run into web pages that remain garbled no matter how you transcode them? Haha, that means you don't yet know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses very well.
However, a server generally will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
8. Multi-threaded concurrent crawling
If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. The program simply prints 0-9, but you can see that it runs concurrently.
Although Python's multithreading is rather half-baked because of the GIL, it can still improve efficiency to some extent for crawlers, which spend most of their time waiting on the network.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many tasks there are
q = Queue()
NUM = 2
JOBS = 10

# Specific processing function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# This is the worker, which keeps fetching data from the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Put JOBS into the queue
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to complete
q.join()