Python Notes: Scraping Job Postings from Liepin

persistenthuang · published 2020-10-18 11:45:36

Scraping Liepin

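I scraped Liepin to complete an assignment for my requirements analysis course.
To get the assignment done, I picked up Python web scraping in five days by following a video course; my study notes from that are in the earlier post on scraping Douban.
This post records the whole process of scraping Liepin.
The process has three stages:
- collecting job links
- scraping job details
- visualizing the statistics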

Collecting job links

1. Building the URL:

https://www.liepin.com/zhaopin/?compkind=&dqs=010&pubTime=&pageSize=40&salary=&compTag=&sortFlag=&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98&siTag=LiAE77uh7ygbLjiB5afMYg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_fp&d_ckId=cd34a20d8742a36fa58243aee1ca77fe&d_curPage=0&d_pageSize=40&d_headId=cd34a20d8742a36fa58243aee1ca77fe

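https://www.liepin.com: the domain
/zhaopin/: the path of the search page
?: everything after the question mark is the query string
We want to analyze postings for the keyword "数据挖掘" (data mining) across several regions, so the parameters that matter are the keyword (key) and the region code (dqs), joined with &.
In the URL above, key=%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98 is not garbage: it is the percent-encoded (URL-encoded) UTF-8 form of the Chinese keyword.
Because the page is fetched with urllib, the keyword has to be percent-encoded by hand; requests does this automatically when the parameters are passed through params.
urllib.parse.quote("数据挖掘") does the encoding. UTF-8 is the default; note that the second positional argument of quote is safe, not the encoding, so pass encoding='utf-8' by keyword if you want to be explicit.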

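Reference code:

import urllib.parse


def main():
    job_list = []
    key = "数据挖掘"
    # Region codes (dqs) to search
    dqs = ["010", "020", "050020", "050090", "030", "060080", "040", "060020", "070020", "210040", "280020", "170020"]
    # Percent-encode the keyword (UTF-8 is the default encoding)
    new_key = urllib.parse.quote(key)
    for item in dqs:
        url = "https://www.liepin.com/zhaopin/?key=" + new_key + "&dqs=" + item
        print(url)
        # Fetch the first result page for this region
        job_html = get_job_html(url)
        if job_html is None:
            continue  # the request failed, skip this region
        # Parse the page (and the pages after it) to get the job links
        link_list = get_job_link(job_html)
        job_list.extend(link_list)
    # Save the job links to a spreadsheet
    save_link(job_list)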
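
As a small illustration of the encoding step, the sketch below builds the same kind of search URL with urllib.parse.urlencode, which percent-encodes every value and joins the pairs with &. The helper name build_search_url and the constant BASE_URL are made up for this example; only the key and dqs parameters come from the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.liepin.com/zhaopin/"  # search page from the URL above


def build_search_url(key: str, dqs: str) -> str:
    """Return a search URL with both parameters percent-encoded (hypothetical helper)."""
    # urlencode encodes each value as UTF-8 by default and joins the pairs with '&'
    return BASE_URL + "?" + urlencode({"key": key, "dqs": dqs})


# quote on its own encodes a single value
encoded_key = quote("数据挖掘")
search_url = build_search_url("数据挖掘", "010")
print(encoded_key)
print(search_url)
```

With urlencode there is no need to concatenate "&" and "=" by hand, which becomes more convenient as the number of parameters grows.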

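2. Fetching the page

Fetching uses one extra package (from fake_useragent import UserAgent).
Install it with pip: pip install fake_useragent
First build the request headers. Liepin's anti-scraping measures are not very strict and pages can be viewed without logging in, so a random browser identifier from UserAgent().random was enough to avoid being blocked.
If a site has stronger anti-scraping measures, add the fields it expects to the request headers, as in the figure below.

[Figure: example request headers]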
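Reference code:

import urllib.request

from fake_useragent import UserAgent


def get_job_html(url):
    print("-------fetching job page-------")
    # head imitates a browser's request headers;
    # "User-Agent" is the browser identifier
    head = {
        "User-Agent": UserAgent().random
    }
    request = urllib.request.Request(url=url, headers=head)
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except Exception as e:
        # Report the failure; callers treat None as "skip this page"
        print(e)
        return None
    return html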
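
For sites with stricter checks, the request can carry more headers than User-Agent. The sketch below is one possible way to do that, not what this project needed: build_request is a hypothetical helper, the USER_AGENTS list is a hard-coded stand-in for fake_useragent so the example runs with the standard library alone, and the Referer and Accept-Language values are illustrative assumptions.

```python
import random
import urllib.request

# Stand-in pool of browser identifiers (assumption: fake_useragent would supply these)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]


def build_request(url, extra_headers=None):
    """Build a Request with a random User-Agent plus any extra headers."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    headers.update(extra_headers or {})
    return urllib.request.Request(url=url, headers=headers)


# No network access happens here; the request object is only constructed
request = build_request(
    "https://www.liepin.com/zhaopin/?key=x&dqs=010",
    extra_headers={
        "Referer": "https://www.liepin.com/",
        "Accept-Language": "zh-CN,zh;q=0.9",
    },
)
print(request.header_items())
```

The resulting object can be passed to urllib.request.urlopen exactly like the request in get_job_html.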

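3. Parsing the page

Inspect the page elements to locate the data.
Each result page holds only 40 postings, so the crawler has to follow the "next page" link automatically: find the 下一页 (next page) element, read its link, and crawl the following pages recursively.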

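Reference code:

from bs4 import BeautifulSoup


def get_job_link(html):
    job_link = []
    # Parse the page to get the links
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all('h3'):
        if item.has_attr("title"):
            # Extract the link; only append when this <h3> really holds one
            link = item.find_all("a")[0]["href"]
            job_link.append(link)
            print(link)

    try:
        # The anchor at index 7 of the pager bar is the "next page" link
        find_next_link = soup.select(".pager > div.pagerbar > a")[7]['href']
        if find_next_link == "javascript:":
            return job_link  # last page reached
        # Prepend the domain. The parsed href sometimes contains '°'
        # (probably an "&deg..." sequence decoded as an HTML entity);
        # this workaround replaces it with '0'
        find_next_link = "https://www.liepin.com" + str(find_next_link).replace('°', '0')
        print(find_next_link)
        # Fetch the next page
        next_html = get_job_html(find_next_link)
        # Parse it recursively and collect its links
        if next_html is not None:
            job_link.extend(get_job_link(next_html))
    except Exception as e:
        print(e)
    return job_link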
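
The recursion above goes one level deeper for every result page. A loop is an alternative worth considering for long result lists; the sketch below is a minimal iterative version under stated assumptions: collect_links, fetch and parse are hypothetical names, parse is assumed to return the page's links together with the next page URL (or None), and the pages are faked in memory so the example runs without network access.

```python
def collect_links(start_url, fetch, parse, max_pages=100):
    """Follow "next page" links in a loop instead of recursing.

    fetch(url) returns the page HTML or None on failure;
    parse(html) returns (links_on_page, next_url_or_None).
    """
    links = []
    seen = set()  # guards against a pager that links back to an earlier page
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        if html is None:
            break
        page_links, url = parse(html)
        links.extend(page_links)
    return links


# Tiny in-memory "site": each page maps to (links, next page)
FAKE_PAGES = {
    "p1": (["/job/1", "/job/2"], "p2"),
    "p2": (["/job/3"], "p3"),
    "p3": (["/job/4"], None),
}


def fake_fetch(url):
    # The "HTML" of a fake page is simply its key
    return url if url in FAKE_PAGES else None


def fake_parse(html):
    return FAKE_PAGES[html]


result = collect_links("p1", fake_fetch, fake_parse)
print(result)
```

In the real crawler, fetch would be get_job_html and parse would wrap the BeautifulSoup selectors shown above.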

4. Saving the data to a spreadsheet

The spreadsheet code is simple, so I won't go into it.

import xlwt


def save_link(link_list):
    work_book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    work_sheet = work_book.add_sheet("job_link", cell_overwrite_ok=True)
    # Header row
    work_sheet.write(0, 0, "Link")
    for i, data in enumerate(link_list):
        print("row %d" % i)
        work_sheet.write(i + 1, 0, str(data))
    work_book.save("job_link.xls")  # write the file to disk

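Scraping job details

With the job links saved in the spreadsheet, the next step is to visit each link, scrape the details, and store them in a database.
Early on it became clear that the links come in two forms: complete URLs that can be opened directly, and paths without the domain, which need the domain prepended.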
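
One way to handle both kinds of link uniformly is urllib.parse.urljoin, which leaves an absolute URL untouched and resolves a relative path against a base. This is a sketch of an alternative to checking the first character by hand; normalize_link is a hypothetical helper and the two example paths are invented for illustration.

```python
from urllib.parse import urljoin

BASE = "https://www.liepin.com"


def normalize_link(link):
    """Return an absolute URL for either kind of link."""
    # Absolute URLs come back unchanged; relative paths are resolved against BASE
    return urljoin(BASE, link)


relative = normalize_link("/a/12345.shtml")  # hypothetical relative link
absolute = normalize_link("https://www.liepin.com/job/12345.shtml")  # hypothetical full link
print(relative)
print(absolute)
```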

1. Basic steps

- read the links from the spreadsheet
- fetch each page
- parse the page
- save the data

The basic skeleton:

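import getLink  # assumed to be the link-collection script above, which defines get_job_html


def main():
    # Read the links from the spreadsheet
    links = read_excel_get_link()
    for i in range(0, len(links)):
        # Links without a scheme are site-relative: prepend the domain
        if not links[i].startswith("http"):
            links[i] = "https://www.liepin.com" + links[i]
        print(links[i])
        # Fetch the page
        message_html = getLink.get_job_html(links[i])
        if message_html is None:
            continue
        # Parse the data
        message_data = get_message_data(message_html)
        # Save one record; report a failure and move on to the next link
        try:
            save_datas_sql(message_data)
        except Exception as e:
            print(e)
            continue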

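2. Reading the links from the spreadsheet

Again, nothing special about the spreadsheet code.

import xlrd


def read_excel_get_link():
    links = []
    # Open the workbook written by save_link
    work_book = xlrd.open_workbook("job_link.xls")
    # Get the first sheet
    sheet = work_book.sheet_by_index(0)
    # Row 0 is the header, so start from row 1
    for i in range(1, sheet.nrows):
        link = sheet.cell(i, 0).value
        links.append(link)
    return links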
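
The post stores the links in an .xls file with xlwt and xlrd. If a spreadsheet is not required, the same hand-off between the two scripts could be done with the standard csv module. The sketch below swaps in CSV purely as an illustration; save_links_csv, read_links_csv and the example links are hypothetical and not part of the original project.

```python
import csv
import os
import tempfile


def save_links_csv(links, path):
    """Write one link per row under a "Link" header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Link"])
        writer.writerows([link] for link in links)


def read_links_csv(path):
    """Read the links back, skipping the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        return [row[0] for row in reader if row]


# Round trip through a temporary file with invented example links
links = ["https://www.liepin.com/job/1.shtml", "/a/2.shtml"]
path = os.path.join(tempfile.mkdtemp(), "job_link.csv")
save_links_csv(links, path)
loaded = read_links_csv(path)
print(loaded)
```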

3. Fetching the job detail page

message_html = getLink.get_job_html(links[i])

This reuses get_job_html, the function written for collecting the job links.

4. Parsing the detail page

Parse the page to extract these elements:
- job title
- company
- salary
- job description
See the page elements:
[Figure: elements of the job detail page]

The elements are located with CSS selectors.
Watch out for escaped characters while scraping; they occasionally cause problems.
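
One escaping problem of this kind is probably behind the replace('°', '0') workaround in get_job_link: when an href contains a parameter whose name starts with "deg", the text "&deg" can be decoded as the HTML entity for the degree sign. The sketch below reproduces the effect with html.unescape from the standard library. The parameter name degradeFlag and the helper repair_href are assumptions made for illustration, and restoring "&deg" is only one possible repair.

```python
import html

# Hypothetical href as it might appear in the page source
raw_href = "/zhaopin/?key=x&degradeFlag=0"

# Entity decoding treats "&deg" as the degree sign, even without a semicolon
decoded = html.unescape(raw_href)
print(decoded)


def repair_href(href):
    """Undo the accidental entity decoding by restoring "&deg"."""
    return href.replace("°", "&deg")


repaired = repair_href(decoded)
print(repaired)
```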

Reference code:

import bs4
from bs4 import BeautifulSoup


def clean_text(text):
    # Drop whitespace and double quotes from the scraped text
    for ch in ('\n', '\t', '\r', ' ', '"'):
        text = text.replace(ch, '')
    return text


def get_message_data(html):
    data = []
    soup = BeautifulSoup(html, "html.parser")
    try:
        # Job title
        title = soup.select(".title-info > h1")[0]['title']
        data.append(title)

        # Company
        company = soup.select(".title-info > h3 > a")
        company = company[0]['title'] if company else " "
        data.append(company)

        # Salary
        salary = soup.select(".job-title-left > p")
        salary = salary[0].contents[0] if salary else " "
        data.append(clean_text(salary))

        # Description: join the text nodes directly under the element
        description = " "
        blocks = soup.select(".content.content-word")
        if blocks:
            for item in blocks[0].contents:
                # Exact type check, so tags and HTML comments are skipped
                if type(item) is bs4.element.NavigableString:
                    description = description + item
        data.append(clean_text(description))
    except Exception as e:
        # On a parse error, return whatever was collected so far
        print(e)
    print(data)
    return data

5. Saving the data to the database

SQLite (the sqlite3 module) stores the data conveniently and makes it easy to query.

Code to create the database:

import sqlite3


# Create the table
def init_job_sqlite():
    conn = sqlite3.connect("job_message.db")  # open the file, creating it if needed
    c = conn.cursor()  # get a cursor
    sql = '''
        create table if not exists job_message(
            id integer not null primary key autoincrement,
            title text not null,
            company text,
            salary text,
            description  text
        )
    '''
    c.execute(sql)  # run the statement
    conn.commit()  # commit
    conn.close()  # close the connection

Insert the data into the database to store it:

def save_datas_sql(data):
    init_job_sqlite()  # make sure the table exists
    conn = sqlite3.connect("job_message.db")  # open the file, creating it if needed
    try:
        # Placeholders let sqlite3 handle the quoting, so quotes in the
        # scraped text cannot break the statement
        sql = '''
            insert into job_message(title,company,salary,description)
            values(?,?,?,?)'''
        conn.execute(sql, data[:4])
        conn.commit()
    finally:
        conn.close()
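
To show why placeholders are worth using for scraped text, here is a minimal sketch that inserts a row containing both double and single quotes into an in-memory database with the same table layout. The helper save_job and the sample row are invented for this example.

```python
import sqlite3

# In-memory database so the example leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table if not exists job_message(
        id integer not null primary key autoincrement,
        title text not null,
        company text,
        salary text,
        description text
    )
""")


def save_job(conn, data):
    """Insert one [title, company, salary, description] record."""
    # "with conn" commits on success and rolls back if the insert fails
    with conn:
        conn.execute(
            "insert into job_message(title, company, salary, description) "
            "values (?, ?, ?, ?)",
            data,
        )


# Sample row (invented) with quotes that would break a hand-built SQL string
sample = ['Data "Mining" Engineer', "ACME", "15-30k·14薪", "It's a sample row"]
save_job(conn, sample)
row = conn.execute(
    "select title, company, salary, description from job_message"
).fetchone()
print(row)
```

Because the values never become part of the SQL text, no characters need to be stripped or escaped before saving.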

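Visualizing the job data

A simple site built with the Flask framework presents the data.
First, download any web page template from a template site.

1. Home page

This is just front-end presentation, so I won't go into it.
Here is my file structure:

[Figure: project file structure]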

2. Job list

This page is simply a database query:
the rows in the database are rendered on the web page.

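How database rows end up in a static page is covered in my previous post, the Douban scraping notes.
There is too much data to show at once, so only the first 100 rows are displayed.
The Python side is in app.py below.
The key front-end code: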

<table class="table table-hover table-light">
	<tr>
		<td>id</td>
		<td>Title</td>
		<td>Company</td>
		<td>Salary</td>
		<td>Description</td>
	</tr>
	{% for job in jobs %}
	<tr>
		<td>{{ job[0] }}</td>
		<td>{{ job[1] }}</td>
		<td>{{ job[2] }}</td>
		<td>{{ job[3] }}</td>
		<td>{{ job[4] }}</td>
	</tr>
	{% endfor %}
</table>
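
Showing only the first 100 rows is a special case of paging with LIMIT and OFFSET. The sketch below shows how a page number could be turned into a query; fetch_page is a hypothetical helper, and the table is filled with invented rows in an in-memory database so the example is self-contained.

```python
import sqlite3


def fetch_page(conn, page, page_size=100):
    """Return the rows of a 1-based page, ordered by id."""
    offset = (page - 1) * page_size
    cur = conn.execute(
        "select * from job_message order by id limit ? offset ?",
        (page_size, offset),
    )
    return cur.fetchall()


# In-memory table with 250 invented rows
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table job_message(
        id integer not null primary key autoincrement,
        title text not null,
        company text,
        salary text,
        description text
    )
""")
conn.executemany(
    "insert into job_message(title) values (?)",
    [("job %d" % i,) for i in range(1, 251)],
)

first_page = fetch_page(conn, 1)
last_page = fetch_page(conn, 3)
print(len(first_page), len(last_page))
```

A Flask route could read the page number from the request and pass the returned rows to the template as jobs.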

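3. Salary analysis

To see how salaries are distributed, read them from the database, compute an annual figure for each, and plot the distribution.
- The salaries have the form x-yk·z薪, meaning x to y thousand yuan per month for z months a year. The midpoint (x + y) / 2 is the monthly salary in thousands, multiplying by z gives the yearly total in thousands, and dividing by 10 converts it to units of 10,000 yuan (万): ave = (x + y) / 2 * z / 10. The results are then sorted.
- The chart is drawn with Baidu's ECharts.
- The page needs to include the ECharts JavaScript file; see the ECharts getting-started guide.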
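
As a worked example of the formula, the sketch below parses a salary string and computes the annual figure in units of 10,000 yuan. The function name annual_salary_wan and the sample strings are made up for illustration; the regular expression requires at least one digit in each group so that an empty match cannot reach int().

```python
import re

# x-yk·z薪: x to y thousand yuan per month, z months of pay per year
SALARY_PATTERN = re.compile(r"(\d+)-(\d+)k·(\d+)薪")


def annual_salary_wan(salary):
    """Return the average annual salary in units of 10,000 yuan, or None."""
    match = SALARY_PATTERN.search(salary)
    if match is None:
        return None  # e.g. negotiable salaries that carry no numbers
    x, y, z = (int(group) for group in match.groups())
    return (x + y) / 2 * z / 10


# 15-30k for 14 months: midpoint 22.5k * 14 = 315k = 31.5 万
example = annual_salary_wan("15-30k·14薪")
print(example)
```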

Key front-end code:

<div id="main" style="width: 100%;height:450px;margin: 0 auto;"></div>
<script type="text/javascript">
	// Initialize the ECharts instance on the prepared DOM element
	var myChart = echarts.init(document.getElementById('main'));
	// [annual salary, count] pairs passed in by the Flask view
	var data = {{ data | tojson }};
	var option = {
		xAxis: {
			type: 'value',
			splitLine: {
				lineStyle: {
					type: 'dashed'
				}
			},
			name: "Annual salary (10k CNY)",
			splitNumber: 10
		},
		yAxis: {
			type: 'value',
			name: "Count",
			splitLine: {
				lineStyle: {
					type: 'dashed'
				}
			}
		},
		series: [{
			symbolSize: 10,
			data: data,
			type: 'scatter'
		}]
	};

	// Render the chart with the options and data defined above
	myChart.setOption(option);
</script>

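4. Word cloud of job descriptions

The descriptions are split into words with the jieba tokenizer.
The word cloud itself is drawn with wordcloud; see its documentation.
The implementation is in app.py below.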
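
Before the tokens go into the word cloud, the stop words are filtered out. The sketch below isolates that step and counts the remaining words with collections.Counter. It assumes the text has already been tokenized (by jieba in the real code), so the token list here is invented, and top_words is a hypothetical helper.

```python
from collections import Counter

# A few of the stop words used in app.py
STOPWORDS = {"任职", "要求", "职位", "，"}


def top_words(tokens, stopwords=STOPWORDS, n=2):
    """Return the n most frequent tokens that are not stop words."""
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return Counter(kept).most_common(n)


# Invented token list standing in for jieba's output
tokens = ["任职", "要求", "，", "熟悉", "Python", "，", "熟悉", "SQL", "Python", "熟悉"]
ranking = top_words(tokens)
print(ranking)
```

Inspecting the most frequent words like this is a quick way to decide which terms to add to the stop word list.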

5. The app.py file

from flask import Flask, render_template
import sqlite3
import jieba  # word segmentation
from matplotlib import pyplot as plt  # plotting
from wordcloud import WordCloud  # word cloud
from PIL import Image  # image handling
import numpy as np  # array operations
import re

app = Flask(__name__)


@app.route('/')
def home():
    return render_template("home.html")


@app.route('/job')
def movie():
    # Show the first 100 rows of the job table
    connet = sqlite3.connect("./liepin/job_message.db")  # open the database file
    c = connet.cursor()  # get a cursor
    sql = "select * from job_message LIMIT 100"
    jobs = c.execute(sql).fetchall()
    c.close()
    connet.close()
    return render_template("job.html", jobs=jobs)


@app.route('/salary')
def score():
    data = []
    con = sqlite3.connect("./liepin/job_message.db")
    cur = con.cursor()
    sql = "select salary,count(salary) from job_message group by salary"
    datas = cur.execute(sql)
    for item in datas:
        # Salaries look like "15-30k·14薪": x to y thousand a month, z months a year
        s = re.search(r"(\d+)-(\d+)k·(\d+)薪", item[0] or "")
        if s is None:
            continue
        x = int(s.group(1))
        y = int(s.group(2))
        z = int(s.group(3))
        # Average annual salary in units of 10,000 yuan
        ave = (x + y) / 2 * z / 10
        data.append([ave, int(item[1])])

    # Sort the points by annual salary
    data.sort(key=lambda point: point[0])
    print(data)
    cur.close()
    con.close()
    return render_template("salary.html", data=data)


@app.route('/word')
def word():
    # Load the descriptions
    con = sqlite3.connect("./liepin/job_message.db")
    cur = con.cursor()
    sql = "select description from job_message"
    data = cur.execute(sql)
    # Concatenate them into one text (empty descriptions are skipped)
    text = ""
    for item in data:
        text = text + (item[0] or "")
    cur.close()
    con.close()
    # Stop words to leave out of the cloud
    stopwords = ['任职', '要求', '职位',
                 '描述', '优先', '，',
                 '相关', '专业', '熟练',
                 '使用', '工作', '职责']
    cut = jieba.cut(text, cut_all=False)
    final = [seg for seg in cut if seg not in stopwords]
    string = " ".join(final)
    # Mask image that gives the cloud its shape
    img = Image.open("./static/timg.jpg")
    img_array = np.array(img)  # convert the image to an array
    word_cloud = WordCloud(
        background_color="#FFFFFF",
        mask=img_array,
        font_path="STKAITI.TTF",  # path to the font file
    ).generate_from_text(string)
    # Draw the image and save it where the template can load it
    fig = plt.figure(1)
    plt.imshow(word_cloud)
    plt.axis('off')  # hide the axes
    # plt.show()  # display the generated word cloud
    plt.savefig("./static/word.jpg", dpi=300)
    plt.close(fig)  # release the figure so repeated requests do not pile up

    return render_template("word.html")


@app.route('/author')
def author():
    return render_template("author.html")


if __name__ == '__main__':
    app.run(debug=True)