Step06: Scraping Zhaopin job listings with Selenium + BeautifulSoup and saving them to Excel/CSV files
This post scrapes job listings from Zhaopin (智联招聘).
The development environment is Python 3.6.5 + PyCharm; the code below is for reference only.
Full code: available for download from my GitHub.
1. Fetching the target page's HTML source
Since Firefox is the browser used here, you need to download its driver, geckodriver.exe, and add the directory containing that exe to the Windows PATH environment variable (an alternative that passes the driver path explicitly is sketched after the function below).
from selenium import webdriver

def get_content(arcurl):
    # Launch Firefox via geckodriver (found on PATH), load the page,
    # grab the rendered HTML, then close the browser.
    browser = webdriver.Firefox()
    browser.get(arcurl)
    html = browser.page_source
    browser.close()
    return html
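If you would rather not edit PATH, Selenium 3 also accepts an explicit driver path. A minimal sketch (the path below is a placeholder; replace it with your own geckodriver.exe location):

def get_content(arcurl, driver_path=r'C:\tools\geckodriver.exe'):  # placeholder path
    # Point Selenium at geckodriver directly instead of relying on PATH.
    browser = webdriver.Firefox(executable_path=driver_path)
    browser.get(arcurl)
    html = browser.page_source
    browser.close()
    return html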
2. Parsing the page and extracting the data
Parsing is done with the BeautifulSoup library; its official documentation describes the API in detail. A small sketch for trying out the parser follows the function below.
from bs4 import BeautifulSoup

def parse_page_shezhao(html):
    soup = BeautifulSoup(html, "lxml")
    message = []       # rows as lists, consumed by the Excel writer
    message_dict = []  # rows as dicts, consumed by csv.DictWriter
    div_list = soup.select('#listContent > div')
    for div in div_list:
        messdict = {}
        # Default values in case a selector finds nothing for this listing.
        jobname = job_link = company_name = company_link = jobadr = ""
        div_infobox = div.select('div.listItemBox > div.infoBox')
        if len(div_infobox) > 0:
            # Job title and link to the job detail page
            nameBox = div_infobox[0].select('.nameBox > div.jobName')
            if len(nameBox) > 0:
                jobname = nameBox[0].get_text()
                job_link = nameBox[0].select('a')[0].attrs['href']
            # Company name and link to the company page
            companyBox = div_infobox[0].select('.nameBox > div.commpanyName')
            if len(companyBox) > 0:
                company_name = companyBox[0].get_text()
                company_link = companyBox[0].select('a')[0].attrs['href']
            # Location/salary line, plus the company description line
            jobDesc = div_infobox[0].select('.descBox > div.jobDesc')
            if len(jobDesc) > 0:
                jobadr = jobDesc[0].get_text()
            commpanyDesc = div_infobox[0].select('.descBox > div.commpanyDesc')
            if len(commpanyDesc) > 0:
                jobadr += " " + commpanyDesc[0].get_text()
            # Welfare tags joined into one description string
            job_welfare = div_infobox[0].select('div > div.job_welfare > div')
            desc = ""
            for xvar in job_welfare:
                desc += xvar.get_text() + "; "
            commpanyStatus = div_infobox[0].select('div > div.commpanyStatus')
            if len(commpanyStatus) > 0:
                desc += "【" + commpanyStatus[0].get_text() + "】"
            messdict['职位链接'] = job_link
            messdict['职位'] = jobname
            messdict['公司'] = company_name
            messdict['公司链接'] = company_link
            messdict['相关性质'] = jobadr
            messdict['职责描述'] = desc
            message.append([job_link, jobname, company_name, company_link, jobadr, desc])
            message_dict.append(messdict)
    return message, message_dict
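A quick way to sanity-check the parser (a sketch; the search URL is only an example and get_content is the function from section 1):

html = get_content("https://sou.zhaopin.com/?kw=java&jl=765")  # example search page
rows, row_dicts = parse_page_shezhao(html)
if rows:
    print(rows[0])       # list form: [job_link, jobname, company_name, company_link, jobadr, desc]
    print(row_dicts[0])  # dict form keyed by the Chinese column headers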
3. Saving the data to a file
First option: save to a CSV file. The csv module is part of the Python standard library, so no extra installation is needed.
import csv

def write_csv_headers(path, headers):
    '''Write the header row.'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()

def write_csv_rows(path, headers, rows):
    '''Write the data rows.'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writerows(rows)

def csv_write(csv_name, headers, html):
    # Write the header row, then append one row per parsed listing.
    write_csv_headers(csv_name, headers)
    items, others = parse_page_shezhao(html)
    write_csv_rows(csv_name, headers, others)
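Usage is a single call (a sketch; java_jobs.csv is a placeholder name and html is assumed to come from get_content). Note that the headers must match the dict keys produced by parse_page_shezhao, otherwise csv.DictWriter raises a ValueError:

headers = ['职位链接', '职位', '公司', '公司链接', '相关性质', '职责描述']
csv_write('java_jobs.csv', headers, html)  # header row plus one row per listing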
Second option: save to an .xls (Excel) file, which requires pip3 install xlwt.
import xlwt

def excel_write(filename, html):
    # Create the workbook with UTF-8 encoding
    wb = xlwt.Workbook(encoding='utf-8')
    # Create a worksheet
    ws = wb.add_sheet('sheet1')
    # Header row
    headData = ['LINK_URL', '职位', '公司', '公司链接', '相关性质', '职责描述']
    # Write the header row in bold
    for colnum in range(0, len(headData)):
        ws.write(0, colnum, headData[colnum], xlwt.easyxf('font: bold on'))
    # Data rows start at row 2 (index 1)
    index = 1
    items, others = parse_page_shezhao(html)
    for item in items:
        print(item)
        for i in range(0, len(headData)):
            # ws.write(row, column, value)
            ws.write(index, i, item[i])
        index += 1
    # Save the workbook
    wb.save(filename)
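Usage mirrors the CSV writer (a sketch; java_jobs.xls is a placeholder name and html is assumed to come from get_content). Keep in mind that the .xls format written by xlwt is limited to 65,536 rows per sheet:

excel_write('java_jobs.xls', html)  # one bold header row, then one row per parsed listing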
4. The main function
def main():
    url = input("请输入URL:\n")
    filename = input("请输入存储文件名:\n")
    html = get_content(url)
    # Save to CSV
    csv_name = filename + '.csv'
    headers = ['职位链接', '职位', '公司', '公司链接', '相关性质', '职责描述']
    csv_write(csv_name, headers, html)
    # Save to Excel
    excel_name = filename + ".xls"
    excel_write(excel_name, html)

if __name__ == '__main__':
    main()
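To scrape several search URLs in one run, a small variant of main can loop over them (a sketch; batch_main and its urls parameter are made up for illustration, and only the CSV output is shown):

def batch_main(urls, filename):
    headers = ['职位链接', '职位', '公司', '公司链接', '相关性质', '职责描述']
    csv_name = filename + '.csv'
    write_csv_headers(csv_name, headers)            # write the header row only once
    for url in urls:
        html = get_content(url)
        _, others = parse_page_shezhao(html)
        write_csv_rows(csv_name, headers, others)   # append the rows for each page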
5. Results
Run the script, then enter the URL and the output file name when prompted.
Test URL: https://sou.zhaopin.com/?pageSize=60&jl=765&sf=10001&st=15000&kw=java&kt=3&=10001
Test file name: java岗位信息表
Output: