The find_all() and select() Lookup Methods in BeautifulSoup

铁松溜达py · Published 2024-02-24 19:08:37

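# The find_all() method in the Beautiful Soup library finds every element in an HTML document that matches the given conditions. It returns a list-like ResultSet containing all matching elements.

# The arguments accepted by find_all() fall into the following kinds:
# Tag name: a string that selects elements with that tag name, for example 'p' or 'a'.
# Attribute name: a string that selects elements carrying that attribute, for example 'class' or 'id'.
# Attribute value: a string or a regular expression that selects elements whose attribute has that value. It can be passed as a dictionary: {'class': 'intro'} selects elements whose class attribute is 'intro'. A plain string is compared literally, so a regular expression must be a compiled pattern: {'href': re.compile('^https://')} selects elements whose href starts with https://.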
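
The difference between a plain string and a compiled pattern as an attribute value can be checked with a small sketch. The three links below are made up for illustration and are not part of the example document used later in this post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only to compare the two kinds of attribute value.
html = """
<a href="https://example.com/a">secure</a>
<a href="http://example.com/b">plain</a>
<a href="/relative">relative</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string is compared for equality, so '^https://' matches no href here.
exact = soup.find_all("a", attrs={"href": "^https://"})

# A compiled pattern is searched against the value, so the ^ anchor takes effect.
by_regex = soup.find_all("a", attrs={"href": re.compile(r"^https://")})

print(len(exact))                     # 0
print([a["href"] for a in by_regex])  # ['https://example.com/a']
```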

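# The basic syntax of find_all() is:
# find_all(name, attrs, recursive, string, limit, **kwargs)
# Parameters:
# name: optional; the tag name to look for.
# attrs: optional; a dictionary of attribute names and values to match, for example attrs={'class': 'intro'}. The same filter can also be written as a keyword argument, class_='intro'.
# recursive: optional; whether to search recursively. Defaults to True, which searches all descendants; False searches direct children only. Example: soup.find_all('p', recursive=False)
# string: optional; matches elements by their text content. Example: soup.find_all('p', string='This is a regular paragraph.')
# limit: optional; caps the number of results returned. Example: soup.find_all('p', class_='intro', limit=2)
# **kwargs: attribute filters written as keyword arguments, for example id='my-id'.
# Any keyword argument that find_all() does not recognize as one of its own parameters is treated as an attribute name and value to match.
# For example, to find all elements whose id attribute is "content":
# elements = soup.find_all(id='content')
# Likewise, to find all elements whose data-custom attribute is "123", pass a dictionary, because data-custom contains a hyphen and cannot be written as a Python keyword argument:
# elements = soup.find_all(attrs={'data-custom': '123'})
# Tip: the parameter is spelled class_ with a trailing underscore because class is a reserved keyword in Python.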
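
As a minimal sketch of the keyword-argument, attrs, and limit forms described above, the snippet below runs each of them against a tiny document. The markup (three paragraphs inside a div) is an assumption made for this illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div carrying id and data-custom, three intro paragraphs.
html = """
<div id="content" data-custom="123">
  <p class="intro">one</p>
  <p class="intro">two</p>
  <p class="intro">three</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="content")                      # keyword argument filter
by_class = soup.find_all("p", class_="intro")            # class_ avoids the reserved word
by_data = soup.find_all(attrs={"data-custom": "123"})    # hyphenated name needs attrs
first_two = soup.find_all("p", class_="intro", limit=2)  # stop after two matches

print([tag.name for tag in by_id])        # ['div']
print(len(by_class))                      # 3
print(by_data[0]["id"])                   # content
print([p.get_text() for p in first_two])  # ['one', 'two']
```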

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
	<title>Find All Method Example</title>
</head>
<body>
	<p class="intro">This is the first paragraph.</p>
	<p class="intro">This is the second paragraph.</p>
	<a href="https://example.com">Link 1</a>
	<a href="https://example.com">Link 2</a>
	<div id="content" class="main-content">
		<h2>Title</h2>
		<p>This is a paragraph within the main content.</p>
	</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Select all p tags
p_elements = soup.find_all('p')
for element in p_elements:
    print(element)

print('-' * 20)

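# Non-recursive search: look for p tags among direct children only.
# The direct children of soup are the doctype and <html>, so start from <body> to get any matches.
p_elements_non_recursive = soup.body.find_all('p', recursive=False)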
for element in p_elements_non_recursive:
    print(element)

print('-' * 20)
# Find p tags whose text content is "This is the first paragraph."
p_element = soup.find_all('p', string='This is the first paragraph.')
print(p_element)

print('-' * 20)

# Select elements whose class attribute has the value "intro"
intro_elements = soup.find_all(attrs={'class': 'intro'})
for element in intro_elements:
    print(element)

print('-' * 20)

# Select elements that have an href attribute whose value starts with "https://"
link_elements = soup.find_all(href=True)
for element in link_elements:
    if element['href'].startswith('https://'):
        print(element)
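
The effect of recursive and string depends on where the search starts, which is easier to see on a smaller document. The following is a sketch under that assumption; the two-paragraph markup is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one p directly under body, one nested inside a div.
html = "<body><p>top</p><div><p>nested</p></div></body>"
soup = BeautifulSoup(html, "html.parser")

all_p = soup.body.find_all("p")                      # all descendants of body
direct_p = soup.body.find_all("p", recursive=False)  # direct children of body only
from_root = soup.find_all("p", recursive=False)      # soup's only direct child is body
by_text = soup.find_all("p", string="nested")        # exact match on the tag's text

print(len(all_p))                         # 2
print([p.get_text() for p in direct_p])   # ['top']
print(len(from_root))                     # 0
print([p.get_text() for p in by_text])    # ['nested']
```

Calling find_all() with recursive=False on the soup object itself usually returns nothing for a full page, because the tags of interest sit several levels below the root.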

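The select() method in the Beautiful Soup library selects elements with a CSS selector. It returns a list of all elements that match the selector.

The basic syntax of select() is:
select(selector)
Parameters:
selector: the CSS selector to use; it can target a tag name, class name, ID, attribute, and so on.
Kinds of selector accepted by select():
# Tag selector: a string such as 'p' or 'a'.
# Class selector: a string such as '.intro' or '.link'.
# ID selector: a string such as '#header' or '#footer'.
# Attribute selector: a string such as '[href]' or '[src^="https://"]'.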

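# select class selector: written with a leading dot; selects elements that have the given class name. For example:
# .intro: selects elements with the class "intro".
# .link: selects elements with the class "link".
# .active: selects elements with the class "active".

# select ID selector: written with a leading hash sign (#); selects the element with the given ID. For example:
# #header: selects the element whose ID is "header".
# #footer: selects the element whose ID is "footer".
# #nav: selects the element whose ID is "nav".

# select attribute selector: written with square brackets; selects elements that have the given attribute or attribute value. For example:
# [href]: selects elements that have an href attribute.
# [src]: selects elements that have a src attribute.
# [class="intro"]: selects elements whose class attribute is exactly "intro".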
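
The four kinds of selector listed above can be tried side by side. This is a minimal sketch; the markup, including the header and footer IDs and the image paths, is assumed for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering tag, class, ID and attribute selectors.
html = """
<div id="header"><a class="link" href="https://example.com">home</a></div>
<p class="intro">hello</p>
<img src="https://example.com/logo.png">
<img src="/local.png">
<div id="footer"><a href="/about">about</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")                # tag selector
intro = soup.select(".intro")                # class selector
header = soup.select("#header")              # ID selector
with_href = soup.select("[href]")            # attribute presence
https_src = soup.select('[src^="https://"]') # attribute value prefix

print(len(paragraphs))                    # 1
print(intro[0].get_text())                # hello
print(header[0].name)                     # div
print([a["href"] for a in with_href])     # ['https://example.com', '/about']
print([i["src"] for i in https_src])      # ['https://example.com/logo.png']
```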

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
	<title>CSS Class Selector Example</title>
</head>
<body>
	<div class="intro">
		<h1>Welcome to Beautiful Soup!</h1>
		<p class="intro">Beautiful Soup is a Python library for pulling data out of HTML and XML files.</p>
	</div>
	<a href="https://chat18.aichatos.xyz/#/chat/1708763899782" class="link">Click me!</a>
	<ul>
		<li class="active">Item 1</li>
		<li>Item 2</li>
		<li class="active">Item 3</li>
	</ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Class selector: written with a leading dot; selects elements that have the given class name

# Select elements whose class name is "intro"
intro_elements = soup.select('.intro')
for element in intro_elements:
    print(element)

# Select elements whose class attribute has the value "intro"
# intro_elements = soup.select('[class="intro"]')
# for element in intro_elements:
#     print(element)

# Select elements that have an href attribute
href_elements = soup.select('[href]')
for element in href_elements:
    print(element)

# # Select elements whose class name is "link"
# link_elements = soup.select('.link')
# for element in link_elements:
#     print(element)

# # Select elements whose class name is "active"
# active_elements = soup.select('.active')
# for element in active_elements:
#     print(element)
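
The class selector and the [class="intro"] attribute selector give the same result on the document above, but they can differ once an element carries more than one class. The sketch below illustrates this with an assumed second class name, "highlight", which does not appear in the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second paragraph has two classes.
html = """
<p class="intro">single class</p>
<p class="intro highlight">two classes</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".intro")                # any element whose classes include intro
by_exact_attr = soup.select('[class="intro"]')  # class attribute must equal "intro"
by_find_all = soup.find_all(class_="intro")     # find_all also matches any one class

print(len(by_class))       # 2
print(len(by_exact_attr))  # 1
print(len(by_find_all))    # 2
```

For matching by class name, '.intro' or class_='intro' is usually the safer choice; the attribute form is better suited to attributes that hold a single value.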