How Does the Scrapy Framework Achieve Efficient Data Scraping?
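Scrapy framework overview:
- spiders: write the code that sends requests and parses responses
- engine: coordinates the scheduler and the Downloader
- data processing: response, spiders, item, pipeline
Create a project (scrapy startproject xxx): creates a crawler project in which you specify the target URLs and the content to extract.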
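Scrapy data flow:
spiders issue requests ==> engine ==> scheduler ==> Downloader fetches the response ==> engine hands the response back to spiders ==> spiders parse the data into items ==> pipeline.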
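The flow above can be pictured with a toy model. The sketch below is not Scrapy's implementation (Scrapy runs asynchronously on Twisted); it is a synchronous, standard-library stand-in in which every name (`toy_downloader`, `toy_parse`, `run_engine`) and the example URL are hypothetical, chosen only to show how requests loop through the scheduler while items go to the pipeline.

```python
from collections import deque


def toy_downloader(url):
    """Stand-in for the Downloader: builds a fake response, no network I/O."""
    return {"url": url, "body": f"<html>content of {url}</html>"}


def toy_parse(response):
    """Stand-in for a spider callback: yields one item and, on page 1, a follow-up request."""
    yield {"type": "item", "data": {"source": response["url"], "length": len(response["body"])}}
    if response["url"].endswith("/page/1"):
        yield {"type": "request", "url": response["url"][:-1] + "2"}


def run_engine(start_urls, parse, download, pipeline):
    """Stand-in for the engine: moves requests and items between the components."""
    scheduler = deque(start_urls)   # queue of pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if result["type"] == "request":
                if result["url"] not in seen:
                    seen.add(result["url"])
                    scheduler.append(result["url"])
            else:
                pipeline(result["data"])


stored = []  # the "pipeline" here simply collects items in a list
run_engine(["http://example.com/page/1"], toy_parse, toy_downloader, stored.append)
print([item["source"] for item in stored])
```

The point to take away is the split of results: anything that is a request goes back to the scheduler, anything that is an item goes on to the pipeline.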
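Create a project (scrapy startproject xxx): create a new crawler project
Define the targets (write items.py): declare the fields you want to scrape
Build the spider (spiders/xxspider.py): write the spider that crawls and parses the data
Store the results (pipelines.py): design a pipeline that stores the scraped items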
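Running the crawler project:
From the command line (in the project root, where scrapy.cfg lives): scrapy crawl myspider
From PyCharm, run a small script placed in the project root:
    from scrapy import cmdline
    cmdline.execute("scrapy crawl myspider".split())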
Pipelines:
First, enable the pipeline in settings.py (pipelines with lower numbers run first):
    ITEM_PIPELINES = {
        # 'mySpider.pipelines.mySpiderPipelines': 100,
        'mySpider.pipelines.MyspiderPipeline': 300,
    }
Then, in pipelines.py:
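    import json

    class MyspiderPipeline(object):
        def __init__(self):
            # Open the output file once, when the pipeline is created
            self.file = open('teacher.json', 'w', encoding='utf8')

        # Process each item: write it as one JSON object per line
        def process_item(self, item, spider):
            jsontxt = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(jsontxt)
            # Return the item so that any later pipeline still receives it
            return item

        # Called when the spider finishes
        def close_spider(self, spider):
            self.file.close()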
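A pipeline is an ordinary class, so its logic can be checked without starting a crawl. The sketch below is a hypothetical variant (`JsonLinesPipeline`, not the article's class) that opens the file in `open_spider`, a hook Scrapy calls when the spider starts, and is then driven by hand with a plain dict. The item fields `name` and `title` and the temporary output path are assumptions made for the demonstration.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of MyspiderPipeline that opens its file in open_spider."""

    def __init__(self, path="teacher.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider is opened
        self.file = open(self.path, "w", encoding="utf8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider is closed
        self.file.close()


# Drive the pipeline by hand, the way Scrapy would during a crawl
out_path = os.path.join(tempfile.mkdtemp(), "teacher.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item({"name": "张三", "title": "讲师"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file in `open_spider` rather than `__init__` keeps file handling tied to the crawl's lifetime, which is a design preference rather than a requirement.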
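Following the next page with a callback (in myspider.py, placed after the for loop in parse, not inside it):
    # Send a new request back to the scheduler's queue; the Downloader will fetch it
    yield scrapy.Request(self.url + str(self.offset), callback=self.parse)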
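The snippet above does not show how the offset advances or when paging stops. One possible way to handle both, sketched as a plain function so it can be run without Scrapy: `next_page_url`, the step of 10, the upper bound of 30, and the example URL are all assumptions, not values from the original spider.

```python
def next_page_url(base_url, offset, step=10, max_offset=30):
    """Return (url, new_offset) for the next page, or None once past the last page."""
    new_offset = offset + step
    if new_offset > max_offset:
        return None
    return base_url + str(new_offset), new_offset


# Walk the pages the way repeated parse() calls would
base_url = "http://example.com/list?start="  # hypothetical listing URL
offset = 0
visited = []
while True:
    result = next_page_url(base_url, offset)
    if result is None:
        break  # no more pages: parse() would simply stop yielding requests
    url, offset = result
    # Inside a spider this is where one would write:
    #     yield scrapy.Request(url, callback=self.parse)
    visited.append(url)
print(visited)
```

Without a stop condition of this kind, a spider that always yields the next offset would keep requesting pages indefinitely.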
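Setting request headers (settings.py):
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        # 'Accept-Language': 'en',
    }
Setting a download delay (settings.py; the generated file ships with this line commented out, so uncomment it to wait 3 seconds between requests):
    DOWNLOAD_DELAY = 3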
Pipeline for images:
    import os

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines import images
    from scrapy.utils.project import get_project_settings

    class ImagesPipeline(images.ImagesPipeline):
        # Read the storage directory configured in settings.py
        IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

        def get_media_requests(self, item, info):
            # Request the image whose URL the spider stored on the item
            image_url = item["imagelink"]
            yield scrapy.Request(image_url)

        def item_completed(self, result, item, info):
            # Keep only the paths of images that downloaded successfully
            image_path = [x["path"] for ok, x in result if ok]
            if not image_path:
                raise DropItem("image download failed")
            # Rename the downloaded file after the item's nickname
            new_path = os.path.join(self.IMAGES_STORE, item["nickname"] + ".jpg")
            os.rename(os.path.join(self.IMAGES_STORE, image_path[0]), new_path)
            item["imagePath"] = new_path
            return item
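The image pipeline reads `IMAGES_STORE` from the project settings and, like any pipeline, only runs once it is registered in `ITEM_PIPELINES`. A possible settings.py fragment is sketched below; the directory path and the priority 400 are placeholders, and the module path assumes the class lives in `mySpider/pipelines.py` under the name `ImagesPipeline`. Scrapy's built-in image pipeline also needs the Pillow package installed.

```python
# settings.py (fragment)

# Placeholder directory where downloaded images are saved; adjust to your machine
IMAGES_STORE = "/path/to/images"

ITEM_PIPELINES = {
    "mySpider.pipelines.MyspiderPipeline": 300,
    # Assumed priority: a higher number means this runs after the JSON pipeline
    "mySpider.pipelines.ImagesPipeline": 400,
}
```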

