How do you take the first step in web crawler practice?
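Overview: the development environment is Anaconda3 with Python 3.6.4. The crawler is described as using Requests for HTTP requests (GET and POST), although the listing below actually calls the standard library's urllib.request; Beautiful Soup parses the HTML tags and extracts the information. The target is an encyclopedia-style academic website (www.yaoyanbaike.com), and this exercise is one part of my processing system for that encyclopedia content.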
1. Development environment:
Anaconda3;
Python 3.6.4.
Crawler:
Pages are requested with urllib.request and parsed with Beautiful Soup:

import codecs
import re
from urllib import request

from bs4 import BeautifulSoup

# The start of the original listing is missing. The 'http://' scheme, the
# initial counter values, LAST_PAGE, and the first-page test (number == 1)
# are assumptions added so that the code can run.
BASE_URL = 'http://www.yaoyanbaike.com'
LAST_PAGE = 3  # hypothetical upper bound for the listing pages

# Mimic a browser by sending a custom User-Agent request header
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'

text_file_number = 0  # index used to name the saved text files
number = 1  # index of the category listing page

while number <= LAST_PAGE:
    if number == 1:
        get_url = BASE_URL + '/category/baby.html'
    else:
        # Later listing pages are named baby_<number>.html
        get_url = BASE_URL + '/category/baby_' + str(number) + '.html'
    download_req_get = request.Request(url=get_url, headers=head)  # build the Request
    download_response_get = request.urlopen(download_req_get)  # fetch the whole page
    # Decode the page as UTF-8, ignoring bytes that cannot be decoded
    download_html_get = download_response_get.read().decode('UTF-8', 'ignore')
    soup_texts = BeautifulSoup(download_html_get, 'lxml')  # parse the HTML
    for link in soup_texts.find_all('a'):
        s = link.get('href')  # None when the <a> tag has no href attribute
        print(str(text_file_number) + ' ' + str(number) + ' ' + str(s))  # debug output
        if s is None or s.find('/a/') == -1:
            print('invalid URL')  # only links containing "/a/" point to articles
        else:
            article_url = BASE_URL + s
            print(article_url)
            download_req = request.Request(url=article_url, headers=head)
            download_response = request.urlopen(download_req)
            download_html = download_response.read().decode('UTF-8', 'ignore')
            article_soup = BeautifulSoup(download_html, 'lxml')
            texts = article_soup.find_all('article')
            soup_text = BeautifulSoup(str(texts), 'lxml')
            p = re.compile('<[^>]+>')
            text = p.sub('', str(soup_text))  # strip the HTML tags
            # Save the extracted text locally
            f1 = codecs.open('F:\\test\\' + str(text_file_number) + '.txt', 'w', 'UTF-8')
            f1.write(text)
            f1.close()
            text_file_number = text_file_number + 1
    number = number + 1
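The URL handling in the listing can be pulled out into small helpers that are easy to check without touching the network. The sketch below is one possible arrangement, not the author's code: the names `listing_url` and `article_urls` are hypothetical, and both the `http://` scheme in `BASE_URL` and the rule that page 1 has no numeric suffix are assumptions.

```python
from urllib.parse import urljoin

# Assumption: the article does not show the scheme; "http://" is a guess.
BASE_URL = "http://www.yaoyanbaike.com"


def listing_url(number):
    """Return the category listing URL for a 1-based page index.

    Assumption: page 1 is baby.html and later pages are baby_<number>.html.
    """
    if number == 1:
        return urljoin(BASE_URL, "/category/baby.html")
    return urljoin(BASE_URL, "/category/baby_" + str(number) + ".html")


def article_urls(hrefs):
    """Keep only hrefs that contain "/a/" and turn them into absolute URLs.

    None entries (anchors without an href attribute) are skipped.
    """
    return [urljoin(BASE_URL, h) for h in hrefs if h and "/a/" in h]


if __name__ == "__main__":
    print(listing_url(3))
    print(article_urls(["/a/123.html", None, "/category/baby.html"]))
```

Using `urljoin` instead of string concatenation also copes with hrefs that are already absolute, which plain concatenation would turn into a malformed address.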
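Summary:
This is a fairly simple crawler exercise, but it shows a clear crawling workflow: build the listing URL, fetch the page with a browser-like request header, collect the links, keep only those containing "/a/", fetch each article, strip the tags, and save the text to a local file. Worth bookmarking!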
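The extract-and-save step of the listing can also be tried offline. The following is a minimal sketch under stated assumptions: `strip_tags` and `save_text` are hypothetical helper names, the entity decoding with `html.unescape` is an addition that the original code does not perform, and a temporary directory stands in for the `F:\test\` folder.

```python
import html
import re
import tempfile
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")  # same pattern as in the listing


def strip_tags(markup):
    """Remove anything that looks like an HTML tag, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", markup)).strip()


def save_text(text, directory, index):
    """Write text to <directory>/<index>.txt as UTF-8 and return the path."""
    target = Path(directory) / (str(index) + ".txt")
    target.parent.mkdir(parents=True, exist_ok=True)
    # The with-block closes the file even if write() raises.
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(text)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cleaned = strip_tags("<article><p>Tom &amp; Jerry</p></article>")
        path = save_text(cleaned, tmp, 0)
        print(path.read_text(encoding="utf-8"))
```

Opening the file in a `with` block replaces the explicit `f1.close()` call, so the file handle is released even when writing fails partway through.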

