[Python Study] Web Scraping: Batch-Collecting Free Proxy Addresses

u014481728 · Published 2023-10-23 13:00:00

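1. How the Case Began

I recently needed a proxy pool, for reasons everyone understands. I went through plenty of sites and posts without finding anything satisfying: either the site required a paid membership or the addresses did not work. I only need proxies now and then, so paying for a membership seemed hard to justify, nobody would reimburse me, and worst of all people would surely look down on me for it. I was not going to give in, so I kept looking, and finally found a few addresses in the README of a project on GitHub. After testing, only a few of those sites still opened, but their IPs were usable. That was encouraging enough to build a small pool tonight, and the free proxy addresses are left at the end for everyone to study.
Here is the result first:
[Image]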

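2. Case Analysis

A simple proxy pool takes three steps:
First, scrape the proxy addresses. Here we use 89免费代理 (89ip.cn); a few dozen addresses are enough.
Second, check whether each scraped address actually works, using the Baidu homepage as the test target.
Third, save the addresses. Since the pool is used only occasionally, a plain text file will do.
We need the requests and bs4 modules. Let's get started.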

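3. Cracking the Case

Step 1: Import the modules
Import the requests module and the BeautifulSoup class from bs4. The code below also parses pages with the lxml parser, so lxml must be installed as well.

# Import the requests module
import requests
# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup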

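Step 2: Batch-collect the proxy addresses
This time we use 89免费代理, at https://www.89ip.cn/index_1.html. A manual spot check showed that its addresses work. Open the page in Chrome and use the element picker in the Inspect panel to identify the tags that hold the IP and the port, as shown below:
[Image]
Because we want all the data in the table, we first get the whole table by its class name layui-table, then walk the tr -> td tags to read each row and store the result in a list. For convenience later, every entry is saved in the fixed format "http://ip:port".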
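
The extraction described above can be tried offline before touching the real site. The following is a minimal sketch, not the site's actual markup: SAMPLE_HTML and parse_proxy_table are hypothetical names, the sample table only imitates the layui-table / tr / td structure, and the addresses are reserved documentation IPs rather than real proxies. It uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating the table structure described above;
# the addresses are reserved documentation IPs, not real proxies.
SAMPLE_HTML = """
<table class="layui-table">
  <thead><tr><th>IP</th><th>Port</th></tr></thead>
  <tbody>
    <tr><td> 192.0.2.10 </td><td> 8080 </td></tr>
    <tr><td> 198.51.100.7 </td><td> 3128 </td></tr>
  </tbody>
</table>
"""


def parse_proxy_table(html):
    """Return "http://ip:port" strings for each data row of the layui-table."""
    # html.parser ships with Python, so this sketch does not need lxml
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(class_="layui-table")
    if table is None:
        return []
    proxies = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # the sample header row uses <th>, so it has no <td> cells
        ip = tds[0].get_text(strip=True)
        port = tds[1].get_text(strip=True)
        proxies.append(f"http://{ip}:{port}")
    return proxies


print(parse_proxy_table(SAMPLE_HTML))
# ['http://192.0.2.10:8080', 'http://198.51.100.7:3128']
```

Skipping rows with fewer than two td cells, instead of indexing from the second row, is a design choice of this sketch: it also tolerates a page whose table is missing or has an unexpected row.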

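Below is the function that collects the proxy addresses. The pages parameter is the number of pages to scrape and ua is the User-Agent string; both values are set in the main block in Step 4. Tags are located with BeautifulSoup's find_all() method, and the function returns a list of proxy addresses.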

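# Define the function that collects proxy addresses
def get_proxy(pages, ua):
    # proxy_ips stores the collected proxy addresses
    proxy_ips = []
    # Set the request headers
    headers = {"User-Agent": ua}
    # Visit the pages in order, starting from page 1
    for page in range(1, pages + 1):
        print(f"Scraping page {page}!")
        # The f prefix is required so that {page} is replaced by the page number
        url = f"https://www.89ip.cn/index_{page}.html"
        # A timeout keeps the script from hanging if the site does not respond
        res = requests.get(url, headers=headers, timeout=10)
        # Get the page content from .text and assign it to html
        html = res.text
        # Parse html with BeautifulSoup and the lxml parser, assign to soup
        soup = BeautifulSoup(html, "lxml")
        # Find the tag whose class is layui-table
        table = soup.find_all(class_="layui-table")[0]
        # Find all tr tags
        trs = table.find_all("tr")
        # Each tr is one row; the first row is the table header, so skip it
        for tr in trs[1:]:
            # Find the td tags: the first holds the IP, the second the port
            tds = tr.find_all("td")
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            # Build the proxy address
            proxy_ip = f"http://{ip}:{port}"
            # Append the proxy address to the proxy_ips list
            proxy_ips.append(proxy_ip)
    # Return the proxy_ips list
    return proxy_ips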
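
One detail worth checking in the page loop is that the page number really ends up in the URL: a plain string literal keeps the text {page} verbatim, so every iteration would request the same address. Here is a small sketch of building the page URLs, assuming the index_N.html pattern from the address above; URL_TEMPLATE and page_urls are hypothetical names used only for illustration.

```python
# URL pattern taken from the article; {page} is filled in by str.format
URL_TEMPLATE = "https://www.89ip.cn/index_{page}.html"


def page_urls(pages):
    """Return the list of page URLs for pages 1..pages."""
    return [URL_TEMPLATE.format(page=page) for page in range(1, pages + 1)]


for url in page_urls(3):
    print(url)

# Without formatting, the placeholder is left untouched:
print("{page}" in URL_TEMPLATE)  # True
```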

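Step 3: Check that the addresses work
With the proxy_ips list in hand, we validate each address by requesting "https://www.baidu.com".
Below is the validation function. Here ip is an address collected in Step 2 and ua is the User-Agent. A timeout is set: a request that has not succeeded within 3 seconds counts as failed and returns 0, and that proxy address is discarded later. The function returns the status of the page request.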

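# Define the function that checks whether a proxy address works
def test_proxy(ip, ua):
    # Set the request headers
    headers = {"User-Agent": ua}
    url = "https://www.baidu.com"
    # Set the proxy. The URL is https, so the "https" key must be present;
    # with only an "http" key, requests would bypass the proxy for this URL
    proxies = {"http": ip, "https": ip}
    # Request the Baidu homepage to check whether the proxy works
    try:
        res = requests.get(url, headers=headers, proxies=proxies, timeout=3)
    except requests.exceptions.Timeout:
        # No response within 3 seconds: the request timed out
        print("Request timed out")
        result_code = 0
    except requests.exceptions.RequestException:
        # Any other failure, such as a refused or broken proxy connection
        print("Request failed")
        result_code = 0
    else:
        result_code = res.status_code
    # Return the request status
    return result_code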
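
When a proxies dictionary is passed to requests, the entry is chosen by the scheme of the target URL, so a check against an https address only exercises the proxy if the dictionary has an "https" key. The sketch below illustrates that lookup with the standard library only. It is a simplified model and not the library's own code (requests also accepts more specific scheme://host keys); build_proxies and proxy_for are hypothetical helper names, and the address is a reserved documentation IP.

```python
from urllib.parse import urlsplit


def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy address."""
    return {"http": proxy_url, "https": proxy_url}


def proxy_for(url, proxies):
    """Simplified model of the lookup: pick the entry matching the URL scheme."""
    return proxies.get(urlsplit(url).scheme)


proxy = "http://192.0.2.10:8080"  # reserved documentation address, not a real proxy

# Only an "http" entry: nothing matches an https URL, so no proxy would be used
print(proxy_for("https://www.baidu.com", {"http": proxy}))       # None
# Both entries: the https URL is matched to the proxy
print(proxy_for("https://www.baidu.com", build_proxies(proxy)))  # http://192.0.2.10:8080
```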

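Step 4: Main block and saving the addresses
Saving is simple, so it is written together with the main block.
We check whether the return value from Step 3 is 200 and store every proxy address that returned 200 in the good_ips list. Finally, with open() writes the contents of good_ips to the file "代理池.txt".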

# Main block
if __name__ == "__main__":
    # good_ips stores the proxy addresses that work
    good_ips = []
    # Define the User-Agent
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    # Number of pages to scrape
    pages = 5
    # Call get_proxy to collect the free proxies from the site
    proxy_list = get_proxy(pages, ua)
    # Print how many were collected
    print(f"Scraped {len(proxy_list)} proxy addresses!")

    # Validate the collected proxy addresses one by one
    for ip in proxy_list:
        # Call the validation function
        result = test_proxy(ip, ua)
        # Keep the address in good_ips only if the status is 200
        if result == 200:
            good_ips.append(ip)
    # Print the validation result
    print(f"{len(good_ips)} proxy addresses passed the check!")

    # Save the working proxy addresses to a file
    # Raw string (r"...") so the backslashes in the Windows path are not read as escapes
    save_path = r"D:\学习资料\Python\VSCode\yequbiancheng\工作业务\csdn自动化脚本\代理池.txt"
    with open(save_path, "a", encoding="utf-8") as fp:
        # Write the addresses to the txt file, one per line
        for i in good_ips:
            fp.write(f"{i}\n")
    print(f"Proxy addresses saved to {save_path}")
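
Checking the addresses one after another can be slow, because every dead proxy may use up the full 3-second timeout. One possible variation, not part of the original script, is to run the checks in a thread pool. The sketch below assumes a checker callable that returns a status code; filter_working and fake_check are hypothetical names, and fake_check merely stands in for the real network check so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxy_list, check, max_workers=10):
    """Run check(proxy) in worker threads and keep the proxies that return 200."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as proxy_list
        codes = list(pool.map(check, proxy_list))
    return [proxy for proxy, code in zip(proxy_list, codes) if code == 200]


# Stand-in checker for demonstration only; in the script one would pass
# something like lambda ip: test_proxy(ip, ua) instead.
def fake_check(proxy):
    return 200 if proxy.endswith(":8080") else 0


candidates = [
    "http://192.0.2.10:8080",
    "http://198.51.100.7:3128",
    "http://203.0.113.5:8080",
]
print(filter_working(candidates, fake_check))
# ['http://192.0.2.10:8080', 'http://203.0.113.5:8080']
```

Threads suit this job because the work is almost entirely waiting on the network; the pool size of 10 is an arbitrary starting point, not a tuned value.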

4. Case Summary

The combined code is as follows:

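# Import the requests module
import requests
# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Define the function that collects proxy addresses
def get_proxy(pages, ua):
    # proxy_ips stores the collected proxy addresses
    proxy_ips = []
    # Set the request headers
    headers = {"User-Agent": ua}
    # Visit the pages in order, starting from page 1
    for page in range(1, pages + 1):
        print(f"Scraping page {page}!")
        # The f prefix is required so that {page} is replaced by the page number
        url = f"https://www.89ip.cn/index_{page}.html"
        # A timeout keeps the script from hanging if the site does not respond
        res = requests.get(url, headers=headers, timeout=10)
        # Get the page content from .text and assign it to html
        html = res.text
        # Parse html with BeautifulSoup and the lxml parser, assign to soup
        soup = BeautifulSoup(html, "lxml")
        # Find the tag whose class is layui-table
        table = soup.find_all(class_="layui-table")[0]
        # Find all tr tags
        trs = table.find_all("tr")
        # Each tr is one row; the first row is the table header, so skip it
        for tr in trs[1:]:
            # Find the td tags: the first holds the IP, the second the port
            tds = tr.find_all("td")
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            # Build the proxy address
            proxy_ip = f"http://{ip}:{port}"
            # Append the proxy address to the proxy_ips list
            proxy_ips.append(proxy_ip)
    # Return the proxy_ips list
    return proxy_ips

# Define the function that checks whether a proxy address works
def test_proxy(ip, ua):
    # Set the request headers
    headers = {"User-Agent": ua}
    url = "https://www.baidu.com"
    # Set the proxy. The URL is https, so the "https" key must be present;
    # with only an "http" key, requests would bypass the proxy for this URL
    proxies = {"http": ip, "https": ip}
    # Request the Baidu homepage to check whether the proxy works
    try:
        res = requests.get(url, headers=headers, proxies=proxies, timeout=3)
    except requests.exceptions.Timeout:
        # No response within 3 seconds: the request timed out
        print("Request timed out")
        result_code = 0
    except requests.exceptions.RequestException:
        # Any other failure, such as a refused or broken proxy connection
        print("Request failed")
        result_code = 0
    else:
        result_code = res.status_code
    # Return the request status
    return result_code

# Main block
if __name__ == "__main__":
    # good_ips stores the proxy addresses that work
    good_ips = []
    # Define the User-Agent
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    # Number of pages to scrape
    pages = 5
    # Call get_proxy to collect the free proxies from the site
    proxy_list = get_proxy(pages, ua)
    # Print how many were collected
    print(f"Scraped {len(proxy_list)} proxy addresses!")

    # Validate the collected proxy addresses one by one
    for ip in proxy_list:
        # Call the validation function
        result = test_proxy(ip, ua)
        # Keep the address in good_ips only if the status is 200
        if result == 200:
            good_ips.append(ip)
    # Print the validation result
    print(f"{len(good_ips)} proxy addresses passed the check!")

    # Save the working proxy addresses to a file
    # Raw string (r"...") so the backslashes in the Windows path are not read as escapes
    save_path = r"D:\学习资料\Python\VSCode\yequbiancheng\工作业务\csdn自动化脚本\代理池.txt"
    with open(save_path, "a", encoding="utf-8") as fp:
        # Write the addresses to the txt file, one per line
        for i in good_ips:
            fp.write(f"{i}\n")
    print(f"Proxy addresses saved to {save_path}")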

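The result of running it is shown below:
[Image]
Using the proxy pool
Finally, here is how to use the pool: first read the file of saved proxy IPs into a list, then pick one IP from the list at random to use as the proxy address.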
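
When reading the file back, each line still carries its trailing newline, and a proxy address with a stray "\n" at the end is not a valid URL, so stripping matters. A minimal offline sketch of the read-then-pick step follows; it writes a throwaway file in a temporary directory instead of the real path, load_proxies is a hypothetical helper name, and the addresses are reserved documentation IPs.

```python
import os
import random
import tempfile


def load_proxies(path):
    """Read one proxy address per line, dropping newlines and blank lines."""
    with open(path, "r", encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]


# Write a throwaway pool file (with one blank line) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "proxies.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("http://192.0.2.10:8080\n\nhttp://198.51.100.7:3128\n")
    pool = load_proxies(path)

print(pool)                 # ['http://192.0.2.10:8080', 'http://198.51.100.7:3128']
print(random.choice(pool))  # one of the two addresses, chosen at random
```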

import requests
import random
import time

# Read the file of saved proxy IPs into the proxy_ips list,
# stripping the trailing newline from each line and skipping blank lines
with open(r"D:\学习资料\Python\VSCode\yequbiancheng\工作业务\csdn自动化脚本\代理池.txt", "r", encoding="utf-8") as fp:
    proxy_ips = [line.strip() for line in fp if line.strip()]

# Define the list of User-Agent strings
uas = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"]

# URL to visit
url = "https://www.jd.com"

for i in range(5):
    # Pick a proxy address from the list at random
    proxy_ip = random.choice(proxy_ips)
    # Set the proxy for both schemes, since the URL is https
    proxies = {"http": proxy_ip, "https": proxy_ip}
    # Pick a User-Agent from the uas list at random
    ua = random.choice(uas)
    # Set the request headers
    headers = {"User-Agent": ua}
    try:
        # A timeout keeps a dead proxy from blocking the loop
        res = requests.get(url, headers=headers, proxies=proxies, timeout=5)
    except requests.exceptions.RequestException:
        # The request failed
        print(f"Request failed {proxy_ip} {url}")
    else:
        # The request succeeded
        print(f"{res.status_code} {proxy_ip} {url}")
    finally:
        time.sleep(2)
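
Free proxies tend to stop working without notice, so a request through a randomly chosen one fails fairly often. One possible extension, offered only as a sketch, is to retry the same URL through a different proxy before giving up. The names fetch_with_rotation and fake_fetch are hypothetical; fetch is any caller-supplied function that takes (url, proxy) and raises on failure, and fake_fetch simulates one dead proxy so the example runs without network access.

```python
import random


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Try up to max_attempts different proxies; return (proxy, result) or None."""
    remaining = list(proxy_pool)
    random.shuffle(remaining)  # random order, each proxy tried at most once
    for proxy in remaining[:max_attempts]:
        try:
            return proxy, fetch(url, proxy)
        except Exception as exc:  # broad on purpose: fetch is caller-supplied
            print(f"Failed via {proxy}: {exc}")
    return None


# Stand-in for a real request: the proxy on port 3128 is treated as dead
def fake_fetch(url, proxy):
    if proxy.endswith(":3128"):
        raise ConnectionError("proxy refused the connection")
    return 200


sample_pool = ["http://192.0.2.10:8080", "http://198.51.100.7:3128"]
result = fetch_with_rotation("https://www.jd.com", sample_pool, fake_fetch)
print(result)  # ('http://192.0.2.10:8080', 200)
```

With the real script, fetch could wrap requests.get with the proxies dictionary and a timeout, returning the status code.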