Using BeautifulSoup the Way You Use jQuery

Written: 2017-04-23

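What is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and for navigating and manipulating the resulting nodes. If you can work with DOM nodes the jQuery way, why would you still reach for XPath?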

Installing BeautifulSoup and a parser

Official site 🔗

Install BeautifulSoup (any one of the following)

sudo apt-get install python-bs4
easy_install beautifulsoup4
pip install beautifulsoup4
Or pick a release, download and unpack the source 🔗, change into its directory, and run python setup.py install

Install the lxml parser

sudo apt-get install python-lxml
easy_install lxml
pip install lxml

Install the html5lib parser

sudo apt-get install python-html5lib
easy_install html5lib
pip install html5lib

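Importing and basic usage

The second argument to BeautifulSoup() selects the parser.
Note: the examples that follow omit the imports and use BeautifulSoup directly. The interactive sessions were recorded under Python 2.7, which is why strings print with a u prefix.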

from bs4 import BeautifulSoup
import requests
cont = requests.get('https://www.baidu.com').content
soup = BeautifulSoup(cont, 'lxml')
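
The snippet above needs network access and lxml. As a minimal offline sketch (Python 3 syntax), the same constructor also accepts Python's built-in 'html.parser', so only bs4 itself has to be installed. The markup string here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<p class="greeting">hello <b>world</b></p>'

# The second argument names the parser; 'html.parser' ships with Python.
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p
print(first_p['class'])     # class is multi-valued, so parsing yields a list
print(first_p.b.string)     # the text inside the nested <b>
```

Swapping 'html.parser' for 'lxml' or 'html5lib' changes only which parser builds the tree; the objects you get back are used the same way.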

BeautifulSoup's object types

BeautifulSoup defines four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

As the name suggests, a Tag corresponds to a tag in the HTML source.

>>> soup = BeautifulSoup('<a id="taga">haha</a>', 'lxml')
>>> soup.a # get the first <a> tag
<a id="taga">haha</a>
>>> type(soup.a)
<class 'bs4.element.Tag'>
>>> soup.a.name # get the tag name
'a'
>>> soup.a['id'] # get the value of one attribute
'taga'
>>> soup.a.attrs # get all attributes
{'id': 'taga'}
>>> soup.a['id'] = 'heheda' # modify an attribute
>>> soup.a
<a id="heheda">haha</a>
>>> soup.a['class'] = 'nihao' # add an attribute
>>> soup.a
<a class="nihao" id="heheda">haha</a>
>>> del soup.a['id'] # delete an attribute
>>> soup.a
<a class="nihao">haha</a>

NavigableString

A NavigableString is the text inside a tag. It subclasses Python's unicode type (str in Python 3) and adds BeautifulSoup's own navigation and search methods on top.

>>> soup = BeautifulSoup('<a id="taga">haha</a>', 'lxml')
>>> soup.a.string # get the text through an attribute
u'haha'
>>> type(soup.a.string)
<class 'bs4.element.NavigableString'>
>>> soup.a.get_text() # get the text through a method
u'haha'
>>> type(soup.a.get_text())
<type 'unicode'>
>>> soup.a.string = 'heheda' # replace the text; returns nothing
>>> soup.a
<a id="taga">heheda</a>
>>> soup.a.string.replace_with('hahahah') # replace the text and return the old text
u'heheda'
>>> soup.a
<a id="taga">hahahah</a>

BeautifulSoup

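The BeautifulSoup object represents the whole document, much like the document object in JavaScript.

Comment

A Comment holds the text of an HTML comment. It is a subclass of NavigableString.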
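
To make the last two object types concrete, here is a small sketch in Python 3 syntax. It uses the built-in 'html.parser' and a made-up two-paragraph document, and imports Comment and NavigableString from bs4.element so they can be checked with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one paragraph holds a comment, the other plain text.
soup = BeautifulSoup('<p><!-- a note --></p><p>plain text</p>', 'html.parser')

comment = soup.p.string                # only child of the first <p>
text = soup.find_all('p')[1].string    # ordinary text in the second <p>

print(soup.name)                       # the document object's special name
print(type(comment).__name__, type(text).__name__)
print(isinstance(comment, NavigableString))   # Comment is a NavigableString
print(isinstance(text, Comment))              # plain text is not a Comment
```

When you want to tell comments apart from real text while walking a tree, an isinstance check against Comment is the usual way to do it.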

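Working with nodes

The title promises BeautifulSoup used like jQuery. Anyone who has used jQuery knows how pleasant and convenient it is, and many BeautifulSoup methods look a lot like their jQuery counterparts, so let's work with BeautifulSoup nodes the way jQuery works with DOM elements.
Besides the Baidu home page, you will probably want a small sample HTML snippet. This is the one I use, extended whenever a particular case needs more:

soup = BeautifulSoup('''
<a>
	<b id='tagb'>
		i am tag b
	</b>
	hello
</a>
<b></b>
<c></c>
<d></d>
''', 'lxml')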

Children and descendants

. (attribute access, e.g. soup.a) and find(): return only the first matching descendant tag.
find_all(): returns all matching descendant tags.
.contents: a list of the direct children.
.children: an iterator over the direct children.
.descendants: an iterator over all descendants.
.strings: an iterator over every string among the descendants.
.stripped_strings: like .strings, but with surrounding whitespace stripped and whitespace-only strings skipped.

>>> cont = requests.get('http://www.baidu.com').content
>>> soup = BeautifulSoup(cont, 'lxml')
>>> print soup.title
<title>百度一下，你就知道</title>
>>> soup.body.a # the first <a> tag
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">\u65b0\u95fb</a>
>>> soup.find('a') # the first <a> tag
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">\u65b0\u95fb</a>
>>> soup.find_all('a') # all <a> tags
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">\u65b0\u95fb</a>, ...more... , <a class="cp-feedback" href="http://jianyi.baidu.com/">\u610f\u89c1\u53cd\u9988</a>]
>>> soup.head.contents # list of direct children
[<meta...>, <link...>, <title...>]
>>> soup.head.children # iterator over direct children; soup.head.descendants iterates over all descendants
<listiterator object at 0x7fd32e093950>
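
The Baidu page changes over time, so here is an offline sketch of the same properties in Python 3 syntax. It assumes the built-in 'html.parser' and a compact variant of my sample snippet, written without line breaks so that no whitespace-only text nodes get in the way.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact, hypothetical variant of the sample snippet.
html = '<a><b id="tagb">i am tag b</b> hello <c>bye</c></a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

# .contents is a list of direct children: tags and text nodes alike.
print(len(a.contents))                  # 3

# .children iterates lazily over the same direct children.
child_tags = [child.name for child in a.children if isinstance(child, Tag)]
print(child_tags)                       # ['b', 'c']

# .descendants walks the whole subtree, so nested text nodes count too.
print(len(list(a.descendants)))         # 5

# .strings yields every text node; .stripped_strings trims whitespace
# and skips strings that contain nothing else.
print(list(a.strings))                  # ['i am tag b', ' hello ', 'bye']
print(list(a.stripped_strings))         # ['i am tag b', 'hello', 'bye']
```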

Parents and ancestors

.parent: the direct parent.
.parents: an iterator over the parent and all further ancestors.
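
A short sketch of both properties (Python 3 syntax, built-in 'html.parser', made-up markup). Note that the ancestor chain ends at the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical three-level nesting.
soup = BeautifulSoup('<div><p><b>deep</b></p></div>', 'html.parser')
b = soup.b

# .parent is the single enclosing node.
print(b.parent.name)        # p

# .parents climbs all the way up to the document object.
ancestor_names = [ancestor.name for ancestor in b.parents]
print(ancestor_names)       # ['p', 'div', '[document]']
```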

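Siblings and neighbouring parsed objects

.next_sibling: the next sibling.
.next_siblings: an iterator over all following siblings.
.previous_sibling: the previous sibling.
.previous_siblings: an iterator over all preceding siblings.
.next_element: the object that was parsed immediately after this one.
.next_elements: an iterator over everything parsed after this one.
.previous_element: the object that was parsed immediately before this one.
.previous_elements: an iterator over everything parsed before this one.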
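
The difference between siblings and parsed elements is easy to mix up, so here is a small sketch (Python 3 syntax, built-in 'html.parser', made-up markup). Siblings share a parent, while next_element and previous_element follow parse order and therefore step into and out of tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling tags inside one parent.
soup = BeautifulSoup('<a><b>one</b><c>two</c><d>three</d></a>', 'html.parser')
c = soup.c

# Siblings stay on the same level of the tree.
print(c.next_sibling)                            # <d>three</d>
print(c.previous_sibling)                        # <b>one</b>
following = [tag.name for tag in soup.b.next_siblings]
print(following)                                 # ['c', 'd']

# Elements follow parse order: right after <c> opens comes its own text,
# and right before it came the text inside <b>.
print(repr(c.next_element))
print(repr(c.previous_element))
```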

Advanced use of find() and find_all()

.find_all('a'): tags named a.
.find_all(re.compile('^a')): tags whose name starts with a.
.find_all(['a', 'b']): tags named a or b.
.find_all(True): every tag.

>>> soup1 = BeautifulSoup('<a><b>haha</b><p>hello</p></a>', 'lxml')
>>> soup1.find_all(True)
[<html><body><a><b>haha</b><p>hello</p></a></body></html>, <body><a><b>haha</b><p>hello</p></a></body>, <a><b>haha</b><p>hello</p></a>, <b>haha</b>, <p>hello</p>]

.find_all(func): pass a function that takes a tag and returns True for the tags to keep.

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
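
To compare the filter types side by side, here is a sketch that runs each of them against one small document (Python 3 syntax, built-in 'html.parser'). The markup and the names() helper are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with four sibling tags.
html = '<a class="x">1</a><abbr id="y" class="x">2</abbr><b>3</b><p>4</p>'
soup = BeautifulSoup(html, 'html.parser')

def names(tags):
    # Helper for this example only: reduce a result list to tag names.
    return [tag.name for tag in tags]

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

by_string = names(soup.find_all('a'))                 # exact tag name
by_regex = names(soup.find_all(re.compile('^a')))     # name starts with a
by_list = names(soup.find_all(['a', 'b']))            # a or b
by_true = names(soup.find_all(True))                  # every tag
by_func = names(soup.find_all(has_class_but_no_id))   # custom predicate

for result in (by_string, by_regex, by_list, by_true, by_func):
    print(result)
```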

A look at the find_all() source. find() works the same way, except that it returns only the first match.


# locate the module

>>> import bs4
>>> bs4.__file__
'/usr/lib/python2.7/dist-packages/bs4/__init__.pyc'

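Inspection shows that find_all() lives in /usr/lib/python2.7/dist-packages/bs4/element.py, with the signature def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs): The parameters are as follows (values may also be compiled regular expressions, e.g. re.compile('')):

name: the tag name to match.
attrs: a dict of attribute names and values to match.
recursive: when True (the default), all descendants are searched; pass recursive=False to search only the direct children.
text: matches NavigableString text nodes instead of tags; a plain string must match the node's text exactly. Newer releases also accept this argument under the name string.
limit: the maximum number of results to return.

soup.find_all('div', {'class': 'haha'}) # all div tags whose class is haha
soup.find_all(text='ahaha') # text nodes equal to 'ahaha': a list of the matching strings, or an empty list if none match

A class_ passed through **kwargs takes precedence over a class entry in attrs, but otherwise it filters exactly as the attrs entry would.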
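
Here is a sketch of those parameters in action (Python 3 syntax, built-in 'html.parser', made-up markup). One assumption to flag: it passes string= rather than text=, which is the name newer BeautifulSoup releases use for the same argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div nested inside another div, plus a third one.
html = ('<div class="haha"><p>top</p>'
        '<div class="haha"><p>nested</p></div></div>'
        '<div class="other">ahaha</div>')
soup = BeautifulSoup(html, 'html.parser')

# attrs dict and the class_ keyword select the same tags.
by_attrs = soup.find_all('div', {'class': 'haha'})
by_kwarg = soup.find_all('div', class_='haha')
print(len(by_attrs), len(by_kwarg))          # 2 2

# limit stops the search after the given number of matches.
limited = soup.find_all('div', limit=1)
print(len(limited))                          # 1

# recursive=False looks only at direct children of the starting node.
outer = soup.div
direct_p = outer.find_all('p', recursive=False)
all_p = outer.find_all('p')
print(len(direct_p), len(all_p))             # 1 2

# string= (text= in older releases) matches text nodes, not tags.
exact = soup.find_all(string='ahaha')
print(exact)                                 # ['ahaha']
```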

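Other search methods are implemented much like find() and find_all(); they differ in where they look and in what they return.

In each pair, the first method returns the first match and the second returns a list of all matches.

.find_parent() and .find_parents()
.find_next_sibling() and .find_next_siblings()
.find_previous_sibling() and .find_previous_siblings()
.find_next() and .find_all_next()
.find_previous() and .find_all_previous()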
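
A sketch of how these pairs behave, starting from one node in the middle of a list (Python 3 syntax, built-in 'html.parser'). The markup and the id values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list of three items followed by a paragraph.
html = ('<div id="outer"><ul><li>1</li><li id="mid">2</li><li>3</li></ul>'
        '<p>after</p></div>')
soup = BeautifulSoup(html, 'html.parser')
mid = soup.find('li', id='mid')

# Upwards: the first matching ancestor, or all matching ancestors.
print(mid.find_parent('div')['id'])                        # outer
parent_names = [t.name for t in mid.find_parents(['ul', 'div'])]
print(parent_names)                                        # ['ul', 'div']

# Sideways: siblings after and before the starting node.
print(mid.find_next_sibling('li').string)                  # 3
print(mid.find_previous_sibling('li').string)              # 1
later_items = [li.string for li in soup.li.find_next_siblings('li')]
print(later_items)                                         # ['2', '3']

# In parse order: anything later or earlier in the document.
print(mid.find_next('p').string)                           # after
earlier_items = [li.string for li in mid.find_all_previous('li')]
print(earlier_items)                                       # ['1']
```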

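CSS selectors

Still not satisfied with the search methods? Then here is something more exciting. Anyone who writes CSS leans heavily on CSS selectors, and jQuery's $("div#id") syntax is the same idea, which makes this very convenient. Typical usage: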

.select('div')
.select('div a')
.select('div > a')
.select('div.haha')
.select('div#haha')
.select('div:nth-of-type(2)')
.select('div + a')
.select('div > a[href$="baidu.com"]')
...and far too many more to list.
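
To see what a few of these selectors actually match, here is a sketch in Python 3 syntax with the built-in 'html.parser'. It assumes a BeautifulSoup release whose select() supports these selectors (recent versions delegate to the soupsieve package). The markup and the texts() helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the link targets are placeholders.
html = ('<body>'
        '<div id="haha" class="haha">'
        '<a href="http://www.baidu.com">direct</a>'
        '<p><a href="http://example.com/page">nested</a></p>'
        '</div>'
        '<a href="http://news.baidu.com">sibling</a>'
        '<div>second</div>'
        '</body>')
soup = BeautifulSoup(html, 'html.parser')

def texts(selector):
    # Helper for this example only: text of every matched element.
    return [tag.get_text() for tag in soup.select(selector)]

print(texts('div a'))                    # any <a> inside a <div>
print(texts('div > a'))                  # <a> that is a direct child of a <div>
print(texts('div + a'))                  # <a> immediately following a <div>
print(texts('div#haha.haha > a[href$="baidu.com"]'))   # id, class, attribute suffix
print(texts('div:nth-of-type(2)'))       # the second <div> among its siblings
```

select() always returns a list; when only the first match is wanted, select_one() plays the role that find() plays for find_all().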

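Modifying the tree

soup = BeautifulSoup('<a id="haha">i am taga</a>', 'lxml'). To avoid ambiguity and keep things readable, assume each item below starts from a fresh soup created by this statement; the statement that prints the result after each change is omitted.

Rename a tag: soup.a.name = 'b'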

>>> soup.a.name = 'b'
<b id="haha">i am taga</b>

Modify or add an attribute: soup.a['id'] = 'hehe', soup.a['class'] = 'haha'

>>> soup.a['id'] = 'hehe'
>>> soup.a['class'] = 'haha'
<a id="hehe" class="haha">i am taga</a>

Delete an attribute and check whether one exists: del soup.a['id'], .has_attr()

>>> soup.a.has_attr('id')
True
>>> del soup.a['id']
<a>i am taga</a>
>>> soup.a.has_attr('id')
False

Replace the text: soup.a.string = 'taga is me'

>>> soup.a.string = 'taga is me'
<a id="haha">taga is me</a>

Append text: soup.a.append(' ahaha') or soup.a.append(soup.new_string(' ahaha'))

>>> soup.a.append(' ahaha')
<a id="haha">i am taga ahaha</a>

Append a new node inside a node: soup.a.append(soup.new_tag('b', id='hehe'))

>>> tagb = soup.new_tag('b', id='hehe')
>>> tagb.string = 'i am tagb'
>>> soup.a.append(tagb)
<a id="haha">i am taga<b id="hehe">i am tagb</b></a>

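Let's see what happens if we pass the markup as a plain string instead, with `soup.a.append('<b id="hehe">i am tagb</b>')`. The string is inserted as text, so the markup comes out escaped.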

>>> soup.a.append('<b id="hehe">i am tagb</b>')
<a id="haha">i am taga&lt;b id="hehe"&gt;i am tagb&lt;/b&gt;</a>

Insert at a given position: soup.a.insert(0, soup.new_tag('b', id='hehe'))

>>> tagb = soup.new_tag('b', id='hehe')
>>> tagb.string = 'i am tagb'
>>> soup.a.insert(0, tagb)
<a id="haha"><b id="hehe">i am tagb</b>i am taga</a>

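Insert a new node before the current node (used like .append()): soup.a.insert_before(soup.new_tag('b', id='hehe'))
Insert a new node after the current node (used like .append()): soup.a.insert_after(soup.new_tag('b', id='hehe'))
Remove a node's contents, returning nothing: .clear()
Remove the current node from the tree and return it: .extract()
Destroy the current node: .decompose()
Replace the current node: .replace_with()
Wrap the current node in a new tag: .wrap()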

>>> soup.a.string.wrap(soup.new_tag('b'))
<b>i am taga</b>
<a id="haha"><b>i am taga</b></a>

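Remove the current node's tag but keep its contents: .unwrap()
Pretty-print the tree (pass an encoding name to choose the output encoding): .prettify()
Get all the text inside a node: .get_text(). To set a separator and strip surrounding whitespace: .get_text(',', strip=True)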
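
Several of the methods above are listed without a session, so here is a sketch that runs them one by one (Python 3 syntax, built-in 'html.parser'). The fresh() helper and the extra p wrapper are my own additions for illustration; the wrapper just makes it visible what is left behind once the a tag is gone.

```python
from bs4 import BeautifulSoup

def fresh():
    # Helper for this example only: start every step from the same document.
    return BeautifulSoup('<p><a id="haha">i am taga</a></p>', 'html.parser')

# insert_before() / insert_after() add siblings around the current node.
soup = fresh()
before, after = soup.new_tag('b'), soup.new_tag('i')
before.string, after.string = 'B', 'I'
soup.a.insert_before(before)
soup.a.insert_after(after)
inserted = str(soup)

# clear() empties the node and returns None.
soup = fresh()
soup.a.clear()
cleared = str(soup)

# extract() detaches the node itself and hands it back.
soup = fresh()
removed = soup.a.extract()
extracted = (str(removed), str(soup))

# replace_with() swaps the node for something else, here a plain string.
soup = fresh()
soup.a.replace_with('plain text')
replaced = str(soup)

# unwrap() drops the tag but keeps what was inside it.
soup = fresh()
soup.a.unwrap()
unwrapped = str(soup)

# get_text() with a separator and strip=True.
soup = BeautifulSoup('<p> one <b> two </b></p>', 'html.parser')
joined = soup.get_text(',', strip=True)

for result in (inserted, cleared, extracted, replaced, unwrapped, joined):
    print(result)
```

decompose() is used like extract() but returns nothing and destroys the node, so it is the one to pick when the removed content is not needed afterwards.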

A note on encodings

By default, BeautifulSoup converts an HTML string to Unicode when parsing it and encodes to UTF-8 on output.
Reference 🔗

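Finally

After working through BeautifulSoup end to end, manipulating DOM nodes with it feels comfortable, and its methods are much the same as jQuery's. Think about it a little and the parallel is a nice one: jQuery smooths over the differences between browsers in DOM manipulation, and BeautifulSoup smooths over the differences between parsers in how the parsed tree is manipulated. Both are innovations at the level of syntactic sugar, meant to free up the programmer and make development easier. Next I plan to build a small project with BeautifulSoup, so stay tuned.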
