How do you parse HTML efficiently with the lxml module in a Python crawler (part 5)?
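The lxml module parses an HTML string into a tree, and XPath expressions then match and extract content from that tree. The short example below installs the lxml parsing library and uses XPath to pull the title, the paragraphs, and the list items out of a small page.
Install lxml: pip install lxml
Code example:
from lxml import etree
# HTML content (the tags are assumed from the XPath queries below; the <h1> wrapper in particular is a guess)
_content = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Hello, lxml!</h1>
<p>This is a test paragraph.</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
</body>
</html>
"""
# Parse the HTML
tree = etree.HTML(_content)
# Match the title with XPath
title = tree.xpath('//title/text()')[0]
print(f'Title: {title}')
# Match the paragraph text with XPath
paragraphs = tree.xpath('//p/text()')
print('Paragraphs:')
for paragraph in paragraphs:
    print(paragraph)
# Match the list items with XPath
items = tree.xpath('//li/text()')
print('List Items:')
for item in items:
    print(item)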
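Indexing an XPath result with [0], as the title lookup does, raises IndexError when the page has no matching node. A minimal sketch of a guard follows; the helper name first_or_default and the sample markup are assumptions made for illustration, not part of lxml.

```python
from lxml import etree


def first_or_default(tree, expr, default=None):
    """Return the first match of a node-selecting XPath, or `default` if there is none."""
    matches = tree.xpath(expr)
    return matches[0] if matches else default


# Hypothetical page with a paragraph but no <title> element.
doc = etree.HTML("<html><body><p>Only a paragraph here.</p></body></html>")

paragraph = first_or_default(doc, '//p/text()')
title = first_or_default(doc, '//title/text()', default='')

print(repr(paragraph))
print(repr(title))  # falls back to the default instead of raising IndexError
```

The helper is only meant for expressions that select nodes, text, or attributes, since those are the ones that come back as a list.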
Study notes
The lxml module
- About lxml
The lxml parsing module uses XPath expressions to match content in an HTML string.
- Installing the lxml parsing library
Open a command prompt (cmd) and run the following command to install it:
pip install lxml
- Syntax
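# Create the parse object
parse_html = etree.HTML(html)
# html = requests.get(url, headers = headers).content.decode('utf-8')
# Call xpath on the parse object
r_list = parse_html.xpath('xpath expression')
# An XPath expression that selects nodes, text, or attributes always returns a list (empty when nothing matches);
# XPath functions such as count() or string() return a float or a string instead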
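To make the return types concrete, the sketch below runs several kinds of XPath expression against one small fragment. The markup and variable names are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical fragment used only for this sketch.
doc = etree.HTML("<ul><li class='a'>one</li><li class='b'>two</li></ul>")

nodes = doc.xpath('//li')                        # elements -> list of Element objects
texts = doc.xpath('//li/text()')                 # text nodes -> list of strings
classes = doc.xpath('//li/@class')               # attributes -> list of strings
missing = doc.xpath('//table')                   # no match -> empty list
count = doc.xpath('count(//li)')                 # XPath number -> float
first = doc.xpath('string(//li)')                # XPath string -> str (text of the first li)
has_b = doc.xpath('boolean(//li[@class="b"])')   # XPath boolean -> bool

for label, value in [('nodes', nodes), ('texts', texts), ('classes', classes),
                     ('missing', missing), ('count', count), ('first', first),
                     ('has_b', has_b)]:
    print(label, type(value).__name__, value)
```

In a crawler this mostly matters for the empty-list case: a selector that matches nothing does not raise, so checking the length of the result before indexing is a reasonable habit.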
- An example
For the HTML document below, we use XPath to get all li node objects, the class attribute value of every name node, and the text content of every food node:
<ol><li class="Ra01">
<name class = 'Bunny01'>小黄</name>
<age>8</age>
<food>胡萝卜</food>
</li>
<li class="Ra01">
<name class = 'Bunny02'>大白</name>
<age>9</age>
<food>白菜</food>
</li>
<li class="Ra02">
<name class = 'Bunny03'>奥尼尔</name>
<age>20</age>
<food>提草</food>
</li>
<li class="Ra03">
<name class = 'Bunny03'>王子</name>
<age>30</age>
<food>进口提草</food>
</li>
</ol>
Code:
# -*- coding: utf-8 -*-
from lxml import etree
html = \
"""
<ol>
<li class="Ra01">
<name class = 'Bunny01'>小黄</name>
<age>8</age>
<food>胡萝卜</food>
</li>
<li class="Ra01">
<name class = 'Bunny02'>大白</name>
<age>9</age>
<food>白菜</food>
</li>
<li class="Ra02">
<name class = 'Bunny03'>奥尼尔</name>
<age>20</age>
<food>提草</food>
</li>
<li class="Ra03">
<name class = 'Bunny03'>王子</name>
<age>30</age>
<food>进口提草</food>
</li>
</ol>
"""
parse_html = etree.HTML(html)
# Get all li node objects
li_list = parse_html.xpath('//ol/li')
print(li_list)
print('-'*20)
# Get the class attribute value of every name node
name_list = parse_html.xpath('//ol/li/name/@class')
print(name_list)
print('-'*20)
# Get the text content of every food node
food_list = parse_html.xpath('//ol/li/food/text()')
print(food_list)
Console output:
[<Element li at 0xad2d7371c8>, <Element li at 0xad2d737448>, <Element li at 0xad2d737288>, <Element li at 0xad2d737488>]
--------------------
['Bunny01', 'Bunny02', 'Bunny03', 'Bunny03']
--------------------
['胡萝卜', '白菜', '提草', '进口提草']
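The three queries above return separate flat lists. When each li needs to stay together as one record, one option is to loop over the li elements and run relative XPath expressions on each. The sketch below assumes a shortened copy of the sample document (two of the four entries); the names records and ra01_names are hypothetical.

```python
from lxml import etree

# Shortened copy of the sample document: two of the four entries.
html = """
<ol>
  <li class="Ra01"><name class="Bunny01">小黄</name><age>8</age><food>胡萝卜</food></li>
  <li class="Ra02"><name class="Bunny03">奥尼尔</name><age>20</age><food>提草</food></li>
</ol>
"""

parse_html = etree.HTML(html)

records = []
for li in parse_html.xpath('//ol/li'):
    # A leading "./" makes the expression relative to the current <li>.
    records.append({
        'name': li.xpath('./name/text()')[0],
        'age': int(li.xpath('./age/text()')[0]),
        'food': li.xpath('./food/text()')[0],
    })

# A predicate restricts the match to li nodes with a given class attribute.
ra01_names = parse_html.xpath('//li[@class="Ra01"]/name/text()')

print(records)
print(ra01_names)
```

Keeping the lookups relative to each li avoids pairing a name with the wrong food when one entry is missing a field, which can happen if the flat lists are zipped together.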

