How to Use Python Proxies for Efficient Website Data Scraping?

2026-06-09 08:46

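This post visits a website through proxy IPs. The proxies I use speak the HTTP protocol; the target page can be served over http or https, depending on what you need. The page's visit count does go up, but the result is not as good as I expected, and I still need to find time to look into why.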
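As a minimal sketch of the idea above: `requests` takes a `proxies` mapping keyed by the scheme of the target URL. The helper names `build_proxies` and `fetch_via_proxy` are hypothetical and not from the original post, and the address is a documentation-only placeholder. Pointing the `https` key at an `http://` proxy URL assumes the proxy supports tunnelling HTTPS traffic.

```python
def build_proxies(proxy, proxy_scheme="http"):
    """Return the scheme-keyed mapping that requests accepts as `proxies`.

    `proxy` is an 'ip:port' string. Both target schemes are routed to the
    same proxy URL (assumes the proxy can also tunnel HTTPS pages).
    """
    proxy_url = "%s://%s" % (proxy_scheme, proxy)
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxy, timeout=6):
    """Fetch `url` through `proxy` and return the response body as text."""
    import requests  # third-party; imported lazily so build_proxies works without it

    response = requests.get(url, proxies=build_proxies(proxy), timeout=timeout)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address, used here as a placeholder.
    print(build_proxies("203.0.113.10:8080"))
```

Keeping the mapping in its own function makes it easy to check what will be sent before any network call is made.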

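Proxy IPs are taken from www.kuaidaili.com/free/; the page I use is www.kuaidaili.com/free/inha/16/.

import random
import re
import time

import requests

# Assumed: the article does not show user_agent_list; these are placeholder entries.
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

count = 0  # number of successful requests, updated by Proxy_read

def Get_proxy_ip():
    # Fetch one page of the free proxy list and return its entries as 'ip:port' strings.
    # Assumed: the headers and the URL scheme are not shown in the article.
    headers = {'User-Agent': random.choice(user_agent_list)}
    req = requests.get('https://www.kuaidaili.com/free/inha/16/', headers=headers, timeout=10)
    html = req.text
    proxy_list = []
    IP_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td data-title="PORT">\d+</td>', html)

    # zip stops at the shorter list, so a count mismatch cannot raise IndexError
    for ip, port_cell in zip(IP_list, port_list):
        port = re.sub(r'<td data-title="PORT">|</td>', '', port_cell)
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list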

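def Proxy_read(proxy_list, user_agent_list, i):
    # Request the target page once through proxy_list[i] with a random User-Agent.
    global count
    proxy_ip = proxy_list[i]
    print('Current proxy IP: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current User-Agent: %s' % user_agent)
    sleep_time = random.randint(1, 5)
    print('Waiting: %s s' % sleep_time)
    time.sleep(sleep_time)
    print('Starting request')
    headers = {
        'User-Agent': user_agent
    }

    # The proxies are HTTP proxies, so both schemes go through http://ip:port.
    # Assumed: the dict entries and the URL scheme are not shown in the article.
    proxies = {
        'http': 'http://%s' % proxy_ip,
        'https': 'http://%s' % proxy_ip,
    }
    url = 'https://www.baidu.com'  # blog address

    try:
        # verify=False skips TLS certificate checks, as in the original call
        req = requests.get(url, headers=headers, proxies=proxies, timeout=6, verify=False)
        html = req.text
        print(html)
    except Exception as e:
        print(e)
        print('****** Failed to open! ******')
    else:
        count += 1
        print('OK! %s successful requests in total!' % count)

if __name__ == '__main__':

    proxy_list = Get_proxy_ip()
    if not proxy_list:
        raise SystemExit('No proxies were found on the list page.')

    for i in range(100):
        # wrap around so the index stays valid when fewer than 100 proxies were found
        Proxy_read(proxy_list, user_agent_list, i % len(proxy_list))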
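The extraction step in Get_proxy_ip collects IPs and ports with two separate regular expressions, so a stray IP elsewhere on the page could shift the pairing. A possible alternative, sketched here under assumptions, matches each IP together with the port from the same table row. Only the `data-title="PORT"` attribute appears in the script; the `data-title="IP"` attribute, the `parse_proxies` name, and the sample markup are guesses for illustration and may not match the real page.

```python
import re

# Assumption: each table row holds an IP cell followed directly by a PORT cell.
# data-title="IP" is a guess; only data-title="PORT" appears in the article.
ROW_PATTERN = re.compile(
    r'<td data-title="IP">\s*(\d+\.\d+\.\d+\.\d+)\s*</td>\s*'
    r'<td data-title="PORT">\s*(\d+)\s*</td>'
)


def parse_proxies(html):
    """Return 'ip:port' strings, taking each IP and port from the same row."""
    return ["%s:%s" % (ip, port) for ip, port in ROW_PATTERN.findall(html)]


# Made-up sample markup using documentation-only addresses.
SAMPLE_HTML = """
<tr>
  <td data-title="IP">203.0.113.10</td>
  <td data-title="PORT">8080</td>
</tr>
<tr>
  <td data-title="IP">203.0.113.11</td>
  <td data-title="PORT">3128</td>
</tr>
"""

if __name__ == "__main__":
    print(parse_proxies(SAMPLE_HTML))
```

Because the parser takes the HTML as a string, it can be checked against a saved copy of the page without making any request.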

Tags: Python, proxy, scraping website data
