How can you fully master the usage techniques and examples of the urllib module in Python crawler development?

2026-05-29 04:39

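Almost everything a crawler needs can be found in urllib. Learning this standard library first makes it easier to understand the more convenient requests library later on.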

First, the Python 2.x names and their Python 3.x counterparts:

Python 2.x uses import urllib2; Python 3.x uses import urllib.request, urllib.error

Python 2.x uses import urllib; Python 3.x uses import urllib.request, urllib.error, urllib.parse

Python 2.x uses import urlparse; Python 3.x uses import urllib.parse

Python 2.x uses urllib2.urlopen; Python 3.x uses urllib.request.urlopen

Python 2.x uses urllib.urlencode; Python 3.x uses urllib.parse.urlencode

Python 2.x uses urllib.quote; Python 3.x uses urllib.parse.quote

Python 2.x uses cookielib.CookieJar; Python 3.x uses http.cookiejar.CookieJar

1. urllib.request.urlopen

url: the address to open, in the form scheme://hostname[:port]/path

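data: an optional request body. It must be bytes (a str can be converted with bytes() or str.encode()). When this argument is passed, the request is sent as POST instead of GET.

timeout: the timeout, in seconds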
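How the data argument switches the request method can be checked without touching the network. The sketch below builds two urllib.request.Request objects (urlopen wraps a plain URL string in such an object internally) and inspects them. The URL http://example.com/search and the form fields are placeholders chosen for illustration, not part of the original article.

```python
import urllib.parse
import urllib.request

url = 'http://example.com/search'  # placeholder URL, never actually contacted

# Without data the request will be sent as GET.
get_req = urllib.request.Request(url)

# data must be bytes, so urlencode the fields and encode the resulting str.
body = urllib.parse.urlencode({'q': 'urllib', 'page': 1}).encode('utf-8')
post_req = urllib.request.Request(url, data=body)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(body)                   # b'q=urllib&page=1'
```

A real call would then look like urllib.request.urlopen(post_req, timeout=10), where the 10 seconds is an arbitrary choice.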

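GET request

    import urllib.request

    # urlopen needs a full URL, including the scheme
    r = urllib.request.urlopen('http://www.jb51.net/')
    first_line = r.readline()   # read the first line of the HTML page
    data = r.read()             # read everything that is left
    with open('./1.html', 'wb') as f:   # save the page locally
        f.write(first_line + data)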
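The reading methods used in the GET example can also be tried without a network connection by opening a data: URL, which urllib.request handles alongside http: and https:. This is only a sketch for experimenting; the two-line payload is made up for illustration.

```python
import urllib.request

# A data: URL carries its own content; %20 is a space and %0A a newline.
with urllib.request.urlopen('data:text/plain,first%20line%0Asecond%20line') as r:
    first_line = r.readline()                 # up to and including the newline
    rest = r.read()                           # whatever has not been consumed yet
    content_type = r.info()['Content-Type']   # headers behave like a mapping

print(first_line)
print(rest)
print(content_type)
```

Because readline() consumes the first line, the following read() only returns the remainder of the body.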

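The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object

info(): returns an http.client.HTTPMessage object holding the response headers

POST request

    import urllib.request
    import urllib.parse

    # urlopen needs a full URL, including the scheme
    url = 'http://passport.jb51.net/user/signin?'
    post = {
        'username': 'xxx',
        'password': 'xxxx'
    }
    postdata = urllib.parse.urlencode(post).encode('utf-8')
    req = urllib.request.Request(url, postdata)
    r = urllib.request.urlopen(req)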

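When we register, log in, and so on, the information is sent through a POST form.

In that case we analyse the page structure, build the form data post, and encode it with urlencode(), which returns a str. We then call encode('utf-8') on that str, because POST data can only be bytes or a file object. Finally we pass postdata to a Request object and send the request with urlopen().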

2. urllib.request.Request

urlopen() can issue the most basic request, but its few simple parameters are not enough to build a complete one. If the request needs headers (for example, to imitate a browser), we can build it with the more powerful Request class.

    import urllib.request
    import urllib.parse

    # urlopen needs a full URL, including the scheme
    url = 'http://passport.jb51.net/user/signin?'
    post = {
        'username': 'xxx',
        'password': 'xxxx'
    }
    postdata = urllib.parse.urlencode(post).encode('utf-8')
    req = urllib.request.Request(url, postdata)
    r = urllib.request.urlopen(req)
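The Request example above does not actually set any headers, so here is a minimal sketch of that step. It only builds the request and inspects it, so nothing is sent. The URL and the User-Agent string are placeholders, not values from the original article.

```python
import urllib.parse
import urllib.request

url = 'http://example.com/login'  # placeholder URL, never actually contacted
headers = {'User-Agent': 'Mozilla/5.0 (compatible; demo-crawler)'}  # illustrative value
postdata = urllib.parse.urlencode({'username': 'xxx', 'password': 'xxxx'}).encode('utf-8')

# Headers can be passed at construction time or added afterwards.
req = urllib.request.Request(url, data=postdata, headers=headers, method='POST')
req.add_header('Referer', 'http://example.com/')

# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
print(req.get_header('Referer'))
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen() would send the request with these headers attached.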

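3. urllib.request.BaseHandler

In the steps above we can construct a Request, but how do we set up more advanced features such as cookie handling or proxies?

This is where a more powerful tool, the Handler, comes in. The basic urlopen() function does not support authentication, cookies, proxies, or other advanced HTTP features. To get them, you have to create your own opener object with build_opener().

First, urllib.request.BaseHandler: it is the parent class of all other Handlers and provides the most basic Handler methods.

HTTPDefaultErrorHandler handles HTTP error responses; every error is raised as an HTTPError exception.

HTTPRedirectHandler handles redirects.

HTTPCookieProcessor handles cookies.

ProxyHandler sets up proxies. When no proxies are passed to it, it picks up the proxy settings from the environment.

HTTPPasswordMgr manages passwords; it maintains a table of usernames and passwords.

HTTPBasicAuthHandler manages authentication. If opening a link requires authentication, this handler can take care of it.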
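As a minimal sketch of how these handlers are combined, the snippet below builds an opener with cookie support and lists the handlers it ends up with. No request is sent; the variable names are chosen for illustration.

```python
import http.cookiejar
import urllib.request

# The CookieJar stores cookies; HTTPCookieProcessor connects it to the opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

# build_opener() adds the default handlers next to the ones passed in.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
print(len(cookie_jar))  # 0 until a response sets a cookie
```

Every page fetched with opener.open(url) would then send and store cookies through cookie_jar.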

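Proxy server settings

    def use_proxy(proxy_addr, url):
        import urllib.request
        # build the proxy handler
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        # build an opener that uses it and fetch the page through the proxy
        opener = urllib.request.build_opener(proxy)
        return opener.open(url).read().decode('utf-8')

Exception handling

    from urllib import request, error

    try:
        response = request.urlopen('http://blog.jb51.net')
    except error.URLError as e:   # also catches HTTPError, its subclass
        if hasattr(e, 'code'):
            print('the server couldn\'t fulfill the request')
            print('Error code:', e.code)
        elif hasattr(e, 'reason'):
            print('we failed to reach a server')
            print('Reason:', e.reason)
    else:
        print('no exception was raised')  # everything is ok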
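The two error cases in the fragment above (an HTTP error code versus a server that cannot be reached) can be separated with isinstance checks instead of hasattr. The sketch below is one way to do it, not the original author's code; describe_error is a hypothetical helper, and the exceptions are built by hand so the example runs offline.

```python
import urllib.error

def describe_error(exc):
    """Return a short message for an exception raised by urlopen()."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'server returned HTTP %d' % exc.code
    if isinstance(exc, urllib.error.URLError):
        return 'failed to reach the server: %s' % exc.reason
    return 'unexpected error: %r' % (exc,)

# Built by hand for the demonstration; normally urlopen() raises these.
not_found = urllib.error.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = urllib.error.URLError('connection refused')

print(describe_error(not_found))
print(describe_error(unreachable))
```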

Below are several representative examples of the urllib module.

1. Importing the urllib module

    import urllib.request

    response = urllib.request.urlopen('http://jb51.net/')
    html = response.read()

2. Using Request

    import urllib.request

    req = urllib.request.Request('http://jb51.net/')
    response = urllib.request.urlopen(req)
    the_page = response.read()

3. Sending data

    #! /usr/bin/env python3
    import urllib.parse
    import urllib.request

    url = 'http://localhost/login.php'
    values = {
        'act': 'login',
        'login[email]': 'yzhang@i9i8.com',
        'login[password]': '123456'
    }
    # POST data must be bytes, so encode the urlencoded string
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data)
    req.add_header('Referer', 'http://www.jb51.net/')
    response = urllib.request.urlopen(req)
    the_page = response.read()
    print(the_page.decode("utf8"))

4. Sending data and headers

    #! /usr/bin/env python3
    import urllib.parse
    import urllib.request

    url = 'http://localhost/login.php'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {
        'act': 'login',
        'login[email]': 'yzhang@i9i8.com',
        'login[password]': '123456'
    }
    headers = {'User-Agent': user_agent}
    # POST data must be bytes, so encode the urlencoded string
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    print(the_page.decode("utf8"))

5. HTTP authentication

    import urllib.request

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # add the username and password; None means "use for any realm"
    top_level_url = "http://www.jb51.net/"
    password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)
    # use the opener to fetch a URL
    a_url = "http://www.jb51.net/"
    x = opener.open(a_url)
    print(x.read())
    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)
    a = urllib.request.urlopen(a_url).read().decode('utf8')
    print(a)
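How the password manager decides which credentials apply can be checked offline. This sketch assumes the manager is an HTTPPasswordMgrWithDefaultRealm, the class that accepts None as a catch-all realm; the realm name and the second URL are placeholders.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://www.jb51.net/', 'rekfan', 'xxxxxx')

# Any URL below the registered one matches, whatever realm the server announces.
match = password_mgr.find_user_password('Some Realm', 'http://www.jb51.net/user/profile')
# A different host has no stored credentials.
no_match = password_mgr.find_user_password('Some Realm', 'http://example.com/')

print(match)
print(no_match)
```

HTTPBasicAuthHandler performs the same lookup when a server answers with a 401 response.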

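9. Using a proxy

urllib has no built-in SOCKS support, so a key such as 'sock5' has no effect on http requests. The keys of the dictionary are the URL schemes to be proxied, and the address must point to an HTTP proxy.

    #! /usr/bin/env python3
    import urllib.request

    proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    a = urllib.request.urlopen("http://www.jb51.net").read().decode("utf8")
    print(a)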
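Which schemes a ProxyHandler will intercept follows directly from the keys of the dictionary it receives. The following sketch inspects a handler without sending anything. The proxy address is a placeholder and assumes an HTTP proxy listening on localhost:1080.

```python
import urllib.request

# Placeholder proxy address; one entry per URL scheme that should be proxied.
proxy_support = urllib.request.ProxyHandler({
    'http': 'http://localhost:1080',
    'https': 'http://localhost:1080',
})
opener = urllib.request.build_opener(proxy_support)

print(proxy_support.proxies)
# ProxyHandler defines one <scheme>_open method per configured scheme.
print(hasattr(proxy_support, 'http_open'), hasattr(proxy_support, 'ftp_open'))
```

Calling opener.open(url) uses the proxy for that call only, while urllib.request.install_opener(opener) makes it the default for every later urlopen() call.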

10. Timeout

    #! /usr/bin/env python3
    import socket
    import urllib.request

    # timeout in seconds
    timeout = 2
    socket.setdefaulttimeout(timeout)
    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.jb51.net/')
    a = urllib.request.urlopen(req).read()
    print(a)
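Instead of changing the global socket default, the timeout can also be passed per call. The sketch below demonstrates this against a deliberately slow local server, so it runs without internet access. SlowHandler, the one-second delay, and the 0.2-second timeout are assumptions made for the demonstration.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers only after a one-second delay."""
    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'late')
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# An empty ProxyHandler keeps the request away from any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
try:
    opener.open(url, timeout=0.2)  # per-call timeout, in seconds
    timed_out = False
except (urllib.error.URLError, socket.timeout):
    timed_out = True
finally:
    server.shutdown()

print(timed_out)
```

Both exception types are caught because a timeout while connecting is wrapped in URLError, whereas a timeout while waiting for the response surfaces as socket.timeout.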

11. Creating your own opener with build_opener

    import urllib.request

    header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
    # create the opener object
    opener = urllib.request.build_opener()
    opener.addheaders = header
    # make this opener the global one used by urlopen()
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen('http://www.jb51.net/')
    buff = response.read()
    html = buff.decode("utf8")
    response.close()
    print(html)

12. Remote download with urllib.request.urlretrieve

    import urllib.request

    header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
    # create the opener object
    opener = urllib.request.build_opener()
    opener.addheaders = header
    # make this opener the global one used by urlopen()
    urllib.request.install_opener(opener)
    # download the file into the current folder
    urllib.request.urlretrieve('http://www.jb51.net/', 'baidu.html')
    # clean up the cache left behind by urlretrieve
    urllib.request.urlcleanup()
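urlretrieve also accepts a reporthook callback for progress reporting. The sketch below shows it while copying a local file through a file:// URL, which stands in for a remote address so the example runs offline. The file names and the report function are made up for illustration.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / 'source.html'
    source.write_bytes(b'<html>demo</html>')
    target = pathlib.Path(tmp) / 'copy.html'

    progress = []

    def report(block_num, block_size, total_size):
        # Called once when the transfer starts and again after each block read.
        progress.append((block_num, block_size, total_size))

    # A file:// URL stands in for a remote address so no network is needed.
    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), reporthook=report)
    copied = target.read_bytes()
    urllib.request.urlcleanup()

print(copied)
print(progress[0][0], progress[0][2])  # first block number and total size in bytes
```

With a real download the same callback could print a percentage computed from block_num * block_size and total_size.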

13. POST request

    import urllib.request
    import urllib.parse

    url = 'http://www.jb51.net/mypost/'
    # urlencode the data, then use encode() to turn it into utf-8 bytes
    postdata = urllib.parse.urlencode({'name': '测试名', 'pass': "123456"}).encode('utf-8')
    # urllib.parse.quote() accepts a string;
    # urllib.parse.urlencode() accepts a dict or a list of 2-tuples [(a, b), (c, d)]
    # and joins the key/value pairs with '&'
    req = urllib.request.Request(url, postdata)
    # urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
    #                        unverifiable=False, method=None)
    # url: a string containing the URL.
    # data: the request body as bytes, or None for no body.

Timing a request

    import time
    from urllib import request

    startime = time.time()
    data = request.urlopen('http://www.ipip.net/').read().decode('utf-8')
    # data = gzip.decompress(data).decode('utf-8', 'ignore')
    endtime = time.time()
    delay = endtime - startime
    print(data)
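The difference between quote() and urlencode() mentioned in the comments above is easiest to see side by side. A small sketch using the same field values as the POST example (the variable names are illustrative):

```python
import urllib.parse

# quote() percent-encodes a single string.
quoted = urllib.parse.quote('测试名')
print(quoted)

# urlencode() takes a dict or a sequence of pairs and joins them with '&'.
fields = [('name', '测试名'), ('pass', '123456')]
query = urllib.parse.urlencode(fields)
print(query)

# POST data must be bytes, so the str is encoded as the last step.
postdata = query.encode('utf-8')

# parse_qs() reverses the encoding, which is handy for checking the result.
print(urllib.parse.parse_qs(query))
```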

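Sometimes calling decode('utf-8') directly on the data returned by urlopen fails, and the page can only be read with gzip.decompress(data).decode('utf-8', 'ignore'). This happens when the server sends a gzip-compressed body. My guess is that it depends on the request headers; switching to a different set of headers sometimes fixes it.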
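One way to handle both cases is to decompress only when the body is actually gzip data. The helper below is a sketch under that assumption; decode_body is a hypothetical name, and the sample page is generated locally so the example runs offline.

```python
import gzip

def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Decode a response body, decompressing it first when it is gzip data."""
    # Trust the Content-Encoding header if given; otherwise fall back to the
    # gzip magic bytes 1f 8b at the start of the data.
    is_gzip = (content_encoding or '').lower() == 'gzip' or raw[:2] == b'\x1f\x8b'
    if is_gzip:
        raw = gzip.decompress(raw)
    return raw.decode(charset, 'ignore')

page = '<html>你好</html>'.encode('utf-8')
print(decode_body(gzip.compress(page), 'gzip'))  # compressed body
print(decode_body(page))                         # plain body
```

With a real response r, the call would be decode_body(r.read(), r.info().get('Content-Encoding')).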

This article walked through the urllib module for Python crawlers in detail, covering both how to use it and a full set of examples.
