How do you scrape web pages with Python multithreading and save the results to an Excel file?

2026-05-21 20:17, SEO Basics

The original article is 408 characters long, with an estimated reading time of 2 minutes.

#!/usr/bin/env python
# coding: utf-8

# In[1]:

import pandas as pd
import threading
import requests
from bs4 import BeautifulSoup
from time import sleep
from datetime import datetime

# In[2]:

df = pd.read_excel("网站对应名字.xlsx")

# In[16]:

sites = df.URL
data_count = len(sites)
thread_count = 16
threads = []
n_loops = range(thread_count)

# In[17]:

names = [None]*data_count

# In[18]:

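def get_url_title(site):
    """Fetch a page and return the text of its <title> tag."""
    try:
        # A timeout keeps one unresponsive site from blocking its thread forever.
        html = requests.get(site, timeout=10)
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(html.content, "html.parser")
        return soup.find("title").text
    except Exception:
        # Any failure (bad URL, network error, missing <title>) yields this marker.
        return "网址有误"  # "invalid URL"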

# In[19]:

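# Start from this index
def write_title(start):
    # Each thread handles rows start, start + thread_count, start + 2 * thread_count, ...
    # The module-level data_count, thread_count, sites and names are only read
    # or item-assigned here, so no `global` declaration is needed.
    for i in range(start, data_count, thread_count):
        names[i] = get_url_title(sites[i])
        print(i, names[i])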
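
The strided loop in write_title is what splits the rows among the threads. As a rough illustration (the helper names strided_indices and partition are hypothetical and not part of the original script), the following sketch lists which row indices each thread would handle:

```python
def strided_indices(start, data_count, thread_count):
    """Row indices handled by the thread that begins at `start`."""
    return list(range(start, data_count, thread_count))


def partition(data_count, thread_count):
    """Map each thread number to the row indices it processes."""
    return {
        worker: strided_indices(worker, data_count, thread_count)
        for worker in range(thread_count)
    }
```

With 10 rows and 4 threads, partition(10, 4) gives {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}. Every index appears exactly once, so no two threads write to the same slot of names.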

# In[20]:

def main():
    # Create one thread per starting offset
    for i in n_loops:
        t = threading.Thread(target=write_title, args=(i,))
        threads.append(t)
    # Start all threads
    for i in n_loops:
        threads[i].start()
    # Wait for all threads to finish
    for i in n_loops:
        threads[i].join()

# In[21]:

if __name__ == '__main__':
    main()
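
The manual create/start/join pattern can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor. This is an alternative sketch, not the approach used in the script: fetch_all is a hypothetical helper, and fake_fetch is a stand-in for get_url_title so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_workers=16):
    """Apply `fetch` to every URL with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `urls`,
        # regardless of which request finishes first.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for get_url_title, used only for illustration.
    return "title of " + url
```

In the script this would be called as names = fetch_all(get_url_title, sites). The pool hands out URLs one at a time, so a slow site delays only the worker that picked it up, and the with block waits for all workers before returning.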

# In[22]:

names

# In[11]:

len(names)

# In[12]:

df.info()

# In[23]:

import multiprocessing
print(multiprocessing.cpu_count())
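
The cells above collect the titles in names but do not show the step that writes them to Excel. A minimal sketch of that step follows, under several assumptions: the column name "Title", the output file name titles_output.xlsx, and the helpers attach_titles and save_titles are all hypothetical, and writing .xlsx files with pandas needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def attach_titles(df, names, column="Title"):
    """Return a copy of df with the scraped titles added as a new column."""
    if len(names) != len(df):
        raise ValueError("names must have one entry per row of df")
    out = df.copy()
    out[column] = list(names)
    return out


def save_titles(df, names, path="titles_output.xlsx"):
    """Write the URLs plus titles to an Excel file (hypothetical file name)."""
    # index=False keeps the pandas row index out of the spreadsheet.
    attach_titles(df, names).to_excel(path, index=False)
```

Calling save_titles(df, names) after main() has returned would write one row per URL. Rows where the fetch failed carry the "网址有误" marker in the new column.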

Tags: Python, multithreading, web scraping
