团队-爬虫豆瓣top250项目-项目进度

正则表达式在线检测工具：http://tool.oschina.net/regex/进程：1.源代码HTML　　#将url转换为HTML源码defgetHtml(url):try: page = urllib.request.urlopen(url) html = page.read()except: print("fa...

weixin_33738555

80人浏览 · 2017-11-06 12:29:00

weixin_33738555 · 2017-11-06 12:29:00 发布

正则表达式在线检测工具：http://tool.oschina.net/regex/

进程：

1.源代码HTML

　　#将url转换为HTML源码
def getHtml(url):
    try:
        page = urllib.request.urlopen(url)
        html = page.read()
    except:
        print("failed to geturl")
        return ''
    else:
        return html

2.爬取书名

　　#通过正则表达式获取该网页下每本书的title（换行符没去掉）
def getTitle(html):
    nameList = re.findall(r'<a href="https.*?".*?target="_blank">(.*?)</a>',html,re.S)
    newNameList = [];
    global topnum
    for index,item in enumerate(nameList):
        if item.find("img") == -1:#通过检测img,只保留中文标题
            #item.replace('\n','')
            #item.strip()
            #item.splitlines()
            #re.sub('\r|\n', '', item)
            if topnum%26 !=0:
                #newNameList.append("Top " + str(topnum) + " " + item);
                newNameList.append(item);
            topnum += 1;
    return newNameList

3.爬取图片

　　#通过正则表达式获取该网页下每本书的图片链接
def getImg(html):
    imgList = re.findall(r'img.*?width=.*?src="(http.*?)"',html,re.S)
    newImgList = []
    for index,item in enumerate(imgList):
        if item.find("js") == -1 and item.find("css") == -1 and item.find("dale") == -1 and item.find("icon") == -1and item.find("png") == -1:
            newImgList.append(item);

    return newImgList;

4.翻页

　　#实现翻页,每页25个
for page in range(0,450,25):
    url = "https://www.douban.com/doulist/1264675/?start={}".format(page)
    html = getHtml(url).decode("UTF-8");
    if html == '':
        namesUrl.extend('none');
        imgsUrl.extend('none')
        scoresUrl.extend('none')
        commentsUrl.extend('none')
        introductionsUrl.extend('none')
    else:
        namesUrl.extend(getTitle(html))
        imgsUrl.extend(getImg(html))
        scoresUrl.extend(getScore(html))
        commentsUrl.extend(getComment(html))
        introductionsUrl.extend(getDetail(html))

暂时完成以上的模块

转载于:https://www.cnblogs.com/Zlxz/p/7792538.html

开放原子开发者工作坊

开放原子开发者工作坊旨在鼓励更多人参与开源活动，与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动，如meetup、训练营等，主打技术交流，干货满满，真诚地邀请各位开发者共同参与！

更多推荐

“小满”安全车控操作系统正式在AtomGit开源

10月24日，由中国汽车工业协会指导，普华基础软件股份有限公司主办的“小满”安全车控操作系统开源发布会暨共建计划说明会成功举行。普华基础软件宣布将安全车控操作系统“小满”（简称“小满”）V24.10源代码正式在开放原子开源基金会（简称“基金会”）旗下AtomGit开源协作平台开源，并在AtomGit平