How can a Doc88 (道客巴巴) crawler scrape content efficiently?

2026-05-28 18:20

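Use the xpathhelp plugin (likely the XPath Helper browser extension) to find the XPath expressions for the links, then import the requests, re, json, pandas (aliased as pd), and time modules, and import webdriver from selenium.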
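Before running the full script, the XPath expression it relies on can be tried on a small hand-written snippet. The sketch below is only an illustration: the sample markup and the helper name `extract_hrefs` are assumptions, not taken from doc88.com, and it uses the standard library's `xml.etree.ElementTree` so it runs without lxml. ElementTree cannot select attributes with `/@href`, so the attribute is read with `.get()` instead; lxml, as used in the script, supports `/@href` directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a stand-in for one list page, not copied from doc88.com.
SAMPLE_PAGE = """
<div>
  <h3 class="sd-type-title"><a href="/p-9139147359378.html">First document</a></h3>
  <h3 class="sd-type-title"><a href="/p-7367816610215.html">Second document</a></h3>
  <h3 class="other"><a href="/list-8308-0-2.html">Next page</a></h3>
</div>
"""


def extract_hrefs(page_source):
    """Return the href of every link inside an h3 with class 'sd-type-title'."""
    root = ET.fromstring(page_source)
    links = root.findall(".//h3[@class='sd-type-title']/a")
    return [link.get("href") for link in links]


hrefs = extract_hrefs(SAMPLE_PAGE)
print(hrefs)
```

Only the links under the matching `h3` elements are returned; the pagination link under the other heading is skipped.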



import os, re, json, time
import requests
import pandas as pd
from selenium import webdriver  # selenium 2.48.0 supports PhantomJS
from lxml import etree

# List page:     www.doc88.com/list-8308-0-1.html
# Document page: www.doc88.com/p-9139147359378.html
driver = webdriver.PhantomJS(executable_path=r'C:\Users\wang\Desktop\phantomjs-2.1.1-windows (1)\bin\phantomjs.exe')
file_urls_list = []
for i in range(1, 30):
    time.sleep(3)  # pause between page loads
    # driver.get needs a full URL with a scheme (https:// is assumed here);
    # the page number i fills the last slot of the list-page pattern above
    url = "https://www.doc88.com/list-8308-0-" + str(i) + ".html"
    driver.get(url)
    tree = etree.HTML(driver.page_source)
    # Collect the href of every document link on this list page
    file_urls = tree.xpath(".//h3[@class='sd-type-title']/a/@href")
    file_urls = ["www.doc88.com/" + str(href) for href in file_urls]
    file_urls_list.extend(file_urls)
    print(file_urls)
driver.quit()

# Save the links from all pages, keeping only those as long as a document URL
with open("url.txt", "w", encoding="utf-8") as f:
    for file_url in file_urls_list:
        if len(file_url) == len("www.doc88.com//p-7367816610215.html"):
            f.write(file_url)
            f.write("\n")
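The script joins the host and each href by string concatenation (which is where the double slash in `www.doc88.com//p-...` comes from) and picks out document links by comparing string lengths. A possible alternative, shown here only as a sketch, is to build URLs with `urllib.parse.urljoin` and filter with a regular expression. The names `BASE_URL`, `DOC_PATH`, `list_page_url`, and `document_urls` are hypothetical, the `https://` scheme is an assumption, and the pattern for document links is inferred from the two example URLs in the script's comments.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.doc88.com/"  # assumed scheme; the source only gives the host
DOC_PATH = re.compile(r"^/p-\d+\.html$")  # assumed shape of a document link


def list_page_url(page):
    """Build the URL of one list page from its page number."""
    return urljoin(BASE_URL, "list-8308-0-{}.html".format(page))


def document_urls(hrefs):
    """Keep hrefs that look like document pages and return them as
    absolute URLs, without duplicates, in their original order."""
    seen = set()
    result = []
    for href in hrefs:
        if DOC_PATH.match(href) and href not in seen:
            seen.add(href)
            result.append(urljoin(BASE_URL, href))
    return result


sample_hrefs = [
    "/p-9139147359378.html",
    "/list-8308-0-2.html",     # pagination link, filtered out
    "/p-9139147359378.html",   # duplicate, filtered out
    "/p-7367816610215.html",
]
print(list_page_url(3))
print(document_urls(sample_hrefs))
```

Matching on the path shape rather than on string length keeps working if document IDs ever have a different number of digits, and `urljoin` avoids the doubled slash.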
