How is the formdata parameter of Scrapy's FormRequest used?

2026-05-26 20:43



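1. Background

When crawling web pages, you sometimes need to submit data to the target site (a form submission) with scrapy.FormRequest. Following the official Scrapy documentation, the standard form is: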

    # Header information
    unicornHeader = {
        'Host': 'www.example.com',
        'Referer': 'https://www.example.com/',
    }

    # Data to submit with the form
    myFormData = {'name': 'John Doe', 'age': '27'}

    # Custom data, passed along to the response handler
    customerData = {'key1': 'value1', 'key2': 'value2'}

    yield scrapy.FormRequest(
        url = "https://www.example.com/post/action",   # the URL must include a scheme
        headers = unicornHeader,
        method = 'POST',              # GET or POST
        formdata = myFormData,        # data submitted with the form
        meta = customerData,          # custom data, available as response.meta
        callback = self.after_post,
        errback = self.error_handle,
        # Required when the same URL is submitted more than once; otherwise
        # the repeated request is dropped as a duplicate
        dont_filter = True
    )
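To get a feel for what a flat formdata dict turns into on the wire, the sketch below uses the standard library's `urlencode` as a stand-in. This is an approximation for illustration only: Scrapy does its own encoding internally and details may differ. The variable names are made up for this example.

```python
from urllib.parse import urlencode

# Same flat dict of strings as in the example above.
my_form_data = {'name': 'John Doe', 'age': '27'}

# Each key/value pair becomes key=value; pairs are joined with '&'.
# Spaces are encoded as '+' in form encoding.
body = urlencode(my_form_data)
print(body)  # name=John+Doe&age=27
```

With a POST request this string is what travels in the request body; with a GET request the same pairs end up in the URL query string.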

But how should the request be written when the form data myFormData contains a dictionary nested inside a dictionary?

2. Case study: a dictionary as a parameter value

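While crawling Amazon, I opened a seller's storefront to scrape its product list and found that the list is loaded through an AJAX request that returns JSON.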

[Screenshots: request information and response information]

As the request screenshot showed, the Form Data contains a dictionary:

    marketplaceID:ATVPDKIKX0DER
    seller:A2FE6D62A4WM6Q
    productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"A2FE6D62A4WM6Q","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":"1"}

    # formdata must be constructed as follows:
    myFormData = {
        'marketplaceID': 'ATVPDKIKX0DER',
        'seller': 'A2FE6D62A4WM6Q',
        # Note the next line: the inner dictionary is passed as a single string
        'productSearchRequestData': '{"marketplace":"ATVPDKIKX0DER","seller":"A2FE6D62A4WM6Q","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":"1"}'
    }
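Writing the inner string by hand is error-prone, so one option is to build the inner dictionary normally and serialize it with `json.dumps`. The following is a minimal sketch under that assumption; `build_product_formdata` is a hypothetical helper name, not something from Scrapy or from the original project.

```python
import json


def build_product_formdata(marketplace_id, seller_id, page_number):
    """Hypothetical helper: return a flat dict of strings usable as formdata."""
    inner = {
        "marketplace": marketplace_id,
        "seller": seller_id,
        "url": "/sp/ajax/products",
        "pageSize": 12,
        "searchKeyword": "",
        "extraRestrictions": {},
        "pageNumber": str(page_number),
    }
    return {
        'marketplaceID': marketplace_id,
        'seller': seller_id,
        # separators=(',', ':') drops the spaces json.dumps inserts by default,
        # so the result matches the compact string seen in the browser.
        'productSearchRequestData': json.dumps(inner, separators=(',', ':')),
    }


form = build_product_formdata('ATVPDKIKX0DER', 'A2FE6D62A4WM6Q', 1)
print(form['productSearchRequestData'])
```

Every value in the returned dict is a string, which is what the hand-written version above achieves as well.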

The construction actually used against Amazon is as follows:

    def sendRequestForProducts(self, response):
        ajaxParam = response.meta
        for pageIdx in range(1, ajaxParam['totalPageNum']+1):
            ajaxParam['isFirstAjax'] = False
            ajaxParam['pageNumber'] = pageIdx
            unicornHeader = {
                'Host': 'www.amazon.com',
                'Origin': 'https://www.amazon.com',
                'Referer': ajaxParam['referUrl'],
            }
            '''
            marketplaceID:ATVPDKIKX0DER
            seller:AYZQAQRQKEXRP
            productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"AYZQAQRQKEXRP","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":1}
            '''
            # The inner dictionary is assembled as one string
            productSearchRequestData = '{"marketplace": "ATVPDKIKX0DER", "seller": "' + f'{ajaxParam["sellerID"]}' + '","url": "/sp/ajax/products", "pageSize": 12, "searchKeyword": "","extraRestrictions": {}, "pageNumber": "' + str(pageIdx) + '"}'
            formdataProduct = {
                'marketplaceID': ajaxParam['marketplaceID'],
                'seller': ajaxParam['sellerID'],
                'productSearchRequestData': productSearchRequestData
            }
            productAjaxMeta = ajaxParam
            # Request the storefront's product list
            yield scrapy.FormRequest(
                url = 'https://www.amazon.com/sp/ajax/products',
                headers = unicornHeader,
                formdata = formdataProduct,
                method = 'POST',
                meta = productAjaxMeta,
                callback = self.solderProductAjax,
                errback = self.error,    # handles [...]

[source text truncated here]

    [...] www.example.com/sp/ajax',
        headers = unicornHeader,
        formdata = {
            'Field': '{"pageIdx":99, "size":"10"}',
            'func': 'nextPage',
        },
        method = 'POST',
        callback = self.handleFunc,
    )
    # The request data is: Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage

The second way: sending the request as follows gives the wrong result, and the correct data cannot be retrieved:

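    yield scrapy.FormRequest(
        url = 'https://www.example.com/sp/ajax',
        headers = unicornHeader,
        formdata = {
            # Wrong: the inner dictionary is passed as a dict, not as a string
            'Field': {"pageIdx":99, "size":"10"},
            'func': 'nextPage',
        },
        method = 'POST',
        callback = self.handleFunc,
    )
    # After the incorrect encoding, the request sent is: Field=size&Field=pageIdx&func=nextPage
    # (the order of the two Field entries follows the dict's iteration order)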
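The difference between the two requests can be reproduced without Scrapy. The sketch below uses the standard library's `urlencode` with `doseq=True` as an analogy: it treats a non-string value as a sequence and iterates it, and iterating a dict yields only its keys. This is an illustration of the failure mode, not Scrapy's own code, and the inner string is written without the space after the comma so that it matches the request data quoted in the article.

```python
from urllib.parse import urlencode

# Inner dict already serialized to a string: one Field value is sent intact.
good = urlencode({'Field': '{"pageIdx":99,"size":"10"}', 'func': 'nextPage'})

# Inner dict passed as-is: it is iterated like a sequence, so only its keys
# survive and the values 99 and "10" are lost.
bad = urlencode({'Field': {"pageIdx": 99, "size": "10"}, 'func': 'nextPage'},
                doseq=True)

print(good)  # Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
print(bad)   # Field=pageIdx&Field=size&func=nextPage
```

On Python 3.7 and later the keys appear in insertion order, which is why this output lists pageIdx before size.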

Let's trace through the Scrapy source code:

    # E:/Miniconda/Lib/site-packages/scrapy/[...]

[source text truncated here]

    [...] www.amztracker.com/unicorn.php',
        headers = unicornHeader,
        # formdata values must be strings
        formdata={'rank': 10, 'category': productDetailInfo['topCategory']},
        method = 'GET',
        meta = {'productDetailInfo': productDetailInfo},
        callback = self.amztrackerSale,
        errback = self.error,    # most errback calls in this project are triggered here
        dont_filter = True,      # strictly speaking this parameter should not be needed
    )

    # This produces the following ERROR:
    Traceback (most recent call last):
      File "E:\Miniconda\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
      File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
        for x in result:
      File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\PyCharmCode\categorySelectorAmazon1\categorySelectorAmazon1\spiders\categorySelectorAmazon1Clawer.py", line 224, in parseProductDetail
        dont_filter = True,
      File "E:\Miniconda\lib\site-packages\scrapy\[...]

[source text truncated here]

    [...] www.1688.com/',
    }

    # In Python 3 all strings are Unicode
    # "动漫周边" as percent-encoded GBK bytes: %B6%AF%C2%FE%D6%DC%B1%DF
    formatStr = "动漫周边".encode('gbk')
    print(f"formatStr = {formatStr}")

    yield FormRequest(
        url = 'https://s.1688.com/selloffer/offer_search.htm',
        headers = unicornHeaders,
        formdata = {'keywords': formatStr, 'n': 'y', 'spm': 'a260k.635.1998096057.d1'},
        method = 'GET',
        meta={},
        callback = self.parseCategoryPage,
        errback = self.error,    # most errback calls in this project are triggered here
        dont_filter = True,      # strictly speaking this parameter should not be needed
    )

    # The log output is:
    formatStr = b'\xb6\xaf\xc2\xfe\xd6\xdc\xb1\xdf'
    2017-11-16 15:11:02 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET sec.1688.com/query.htm?smApp=searchweb2&smPolicy=searchweb2-selloffer-anti_Spider-seo-html-checklogin&smCharset=GBK&smTag=MTE1LjIxNi4xNjAuNDYsLDU5OWQ1NWIyZTk0NDQ1Y2E5ZDAzODRlOGM1MDI2OTZj&smReturn=s.1688.com/selloffer/offer_search.htm?keywords=%B6%AF%C2%FE%D6%DC%B1%DF&n=y&spm=a260k.635.1998096057.d1>
    # s.1688.com/selloffer/offer_search.htm?keywords=%B6%AF%C2%FE%D6%DC%B1%DF&n=y&spm=a260k.635.1998096057.d1
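Both points from the examples above, that formdata values should be strings and that a keyword can be sent in GBK, can be checked with the standard library alone. This is a hedged sketch: `stringify_formdata` is a hypothetical helper, `'Toys'` is a placeholder category value, and `urlencode` only approximates what Scrapy does internally.

```python
from urllib.parse import quote, urlencode


def stringify_formdata(data):
    """Hypothetical helper: convert every value that is not str/bytes to str."""
    return {key: value if isinstance(value, (str, bytes)) else str(value)
            for key, value in data.items()}


# An int such as rank=10 is turned into '10' before it reaches formdata.
safe_formdata = stringify_formdata({'rank': 10, 'category': 'Toys'})
print(safe_formdata)

# Encoding the keyword to GBK bytes first gives the percent-encoded form
# that appears in the redirect log above.
keyword_gbk = "动漫周边".encode('gbk')
quoted = quote(keyword_gbk)
print(quoted)

# urlencode can also apply the GBK encoding itself via its encoding argument.
query = urlencode({'keywords': "动漫周边", 'n': 'y'}, encoding='gbk')
print(query)
```

Converting values up front keeps the request construction itself unchanged; only the dict handed to formdata differs.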

