How do you take the first step in a web-crawler project?

2026-05-28 18:50, SEO News



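Development environment: Anaconda3 with Python 3.6.4. Crawler: HTTP GET and POST requests go through a request library (the code calls request.Request and request.urlopen, which belong to the standard library's urllib.request rather than the third-party Requests package), and Beautiful Soup parses the HTML tags and extracts the information. Target site: an academic encyclopedia website; this exercise is one part of my academic encyclopedia processing system.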
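As a small illustration of the GET and POST handling mentioned above, the sketch below builds both kinds of request with urllib.request without sending anything over the network. The URL, the form field, and the shortened User-Agent are hypothetical placeholders, not values from the article.

```python
from urllib import parse, request

# Hypothetical values, used only for illustration.
URL = "http://www.example.com/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# A Request without a body is sent as GET.
get_req = request.Request(url=URL, headers=HEADERS)

# Supplying a bytes body turns the same kind of Request into a POST.
form = parse.urlencode({"q": "baby"}).encode("utf-8")
post_req = request.Request(url=URL, data=form, headers=HEADERS)

print(get_req.get_method(), post_req.get_method())  # GET POST
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(get_req.get_header("User-agent"))  # Mozilla/5.0
```

Passing either object to request.urlopen would send it; that step is left out here so the sketch runs offline.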


1. Development environment:

Anaconda3;

Python 3.6.4;

Crawler section

The crawler sends its requests with urllib.request and parses the returned HTML with Beautiful Soup. The start of the listing did not survive, so the imports, the URL scheme, the counter start values, and the page-loop header are assumptions and are marked as such in the comments.

from urllib import request

from bs4 import BeautifulSoup

# Assumption: the scheme is missing from the URLs in the text; urlopen needs one.
BASE_URL = 'http://www.yaoyanbaike.com'
# Browser-like request header, so the site serves the normal page
HEADERS = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}


def fetch_html(url):
    """Download a page and decode it as UTF-8, ignoring undecodable bytes."""
    req = request.Request(url=url, headers=HEADERS)
    with request.urlopen(req) as response:
        return response.read().decode('UTF-8', 'ignore')


# Assumptions: start values and page count are not given in the text.
text_file_number = 1  # number used for the next saved file
number = 1            # index of the category page being crawled
last_page = 1         # raise this to crawl more category pages

while number <= last_page:
    if number == 1:  # assumed condition: the first index page has no numeric suffix
        get_url = BASE_URL + '/category/baby.html'
    else:
        # Later index pages are baby_<number>.html, where number is the page index
        get_url = BASE_URL + '/category/baby_' + str(number) + '.html'

    # Parse the tags and content of the category page
    soup_index = BeautifulSoup(fetch_html(get_url), 'lxml')

    # href=True skips <a> tags without an href, which would otherwise return None
    for link in soup_index.find_all('a', href=True):
        href = link['href']
        print(str(text_file_number) + ' ' + str(number) + ' ' + href)  # print the link for testing

        if '/a/' not in href:
            print('invalid URL')  # only links containing "/a/" point to articles
            continue

        download_url = BASE_URL + href  # the article links are site-relative
        print(download_url)
        soup_page = BeautifulSoup(fetch_html(download_url), 'lxml')
        articles = soup_page.find_all('article')
        # get_text() drops the tags and keeps only the text
        text = '\n'.join(article.get_text() for article in articles)

        # Store the text locally, one file per article
        with open('F:\\test\\' + str(text_file_number) + '.txt', 'w', encoding='UTF-8') as f1:
            f1.write(text)
        text_file_number = text_file_number + 1

    number = number + 1
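The two decisions the crawler makes, which index page to request and which links count as articles, can be tried offline. The following is a minimal sketch under several assumptions: it swaps Beautiful Soup for the standard library's HTMLParser so it runs without lxml, it assumes an http scheme, and index_url, LinkCollector, article_urls, and the sample HTML are hypothetical names and data introduced here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.yaoyanbaike.com"  # assumption: the scheme is not given in the text


def index_url(number):
    """Return the category index URL for a 1-based page number."""
    if number == 1:
        return urljoin(BASE_URL, "/category/baby.html")
    return urljoin(BASE_URL, "/category/baby_{}.html".format(number))


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def article_urls(html, page_url):
    """Return absolute URLs of links whose href contains "/a/", without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolves relative and absolute hrefs alike
        if "/a/" in href and absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical index-page fragment, not taken from the real site.
sample = (
    '<a href="/a/101.html">one</a>'
    '<a>no href</a>'
    '<a href="/category/baby_2.html">next</a>'
    '<a href="/a/101.html">one again</a>'
)
print(article_urls(sample, index_url(1)))  # ['http://www.yaoyanbaike.com/a/101.html']
```

Using urljoin instead of string concatenation keeps the result valid whether the page uses site-relative or absolute links, and dropping repeated URLs avoids saving the same article twice.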

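Summary:

This is a fairly simple crawler exercise, but it lays out the crawling workflow clearly: page through the category index, keep only the article links, download each article, strip the markup, and save the text to a local file.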
