Building a CSDN Data Center in Python: Visitor Count Visualization


99Kies · Published 2019-08-18 16:40:55 · 806 views

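CSDN Visitor Count Visualization

The CSDN web page does not show a precise visitor count, and the count shown in the mobile app does not match the total views of my public articles (I cannot tell what else it includes). So I add up the view counts of all my articles and use that total as the visitor data.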

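The plan:

1. Every day at noon, fetch the total view count with requests + pyquery and save it

2. Visualize the data with matplotlib and analyze the visitor traffic

3. Test the code

More features are still to be developed.

GitHub: https://github.com/99Kies/Visitor_Monitor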

The data collection program breaks down roughly into three modules:

Crawling module

Storage module

Update module

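Crawling module

1. Analyze the page (https://blog.csdn.net/qq_19381989)

Each post sits in its own div, a layout that is easy to extract with pyquery. Note that there is an extra div above and below the posts, so take care when locating or saving items. Finally, copy the CSS selector: #mainBox > main > div.article-list > div

2. Work out how pagination works

Turn the page and look at the URL: the page number is part of the path, so each page can be fetched with requests.get.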
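As a small sketch of the pagination idea, the list-page URLs can be generated up front. The URL pattern is the one used in the crawler code in this post; the `BASE_URL` constant and the `page_urls` helper are names made up for this illustration.

```python
# URL pattern taken from the crawler in this post; {} is the page number.
BASE_URL = 'https://blog.csdn.net/qq_19381989/article/list/{}'


def page_urls(pages):
    """Return the list-page URLs for pages 1..pages (inclusive)."""
    return [BASE_URL.format(i) for i in range(1, pages + 1)]


for url in page_urls(3):
    print(url)
```

Each of these URLs can then be passed to requests.get in turn.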

3. Write the code

import time

import requests
from pyquery import PyQuery as pq


def get_read_number(page):
    '''
    Sum the view counts of every article on the blog's list pages.
    :param page: number of list pages on your own CSDN blog
    :return: (total view count, date as year/month/day)
    '''
    all_read = 0
    # Running total of view counts
    count = 0
    # Number of articles located
    for i in range(1, page + 1):
        url = 'https://blog.csdn.net/qq_19381989/article/list/{}'.format(i)
        # print(url)
        r = requests.get(url)
        doc = pq(r.text)
        items = doc('#mainBox > main > div.article-list > div').items()
        for item in items:
            project = {
                'title': item.find('h4 > a').text(),
                'read': item.find('div.info-box.d-flex.align-content-center > p:nth-child(3) > span > span').text(),
                'talk': item.find('div.info-box.d-flex.align-content-center > p:nth-child(5) > span > span').text(),
            }
            # The first and last divs are empty placeholders; skip them.
            # To count only original posts, use this condition instead:
            # if project['read'] == '' or '原' not in project['title']:
            if project['read'] == '':
                continue
            all_read += int(project['read'])
            count += 1
    return all_read, time.strftime("%Y/%m/%d", time.localtime(time.time()))
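The skip-the-empty-div logic can be tried without any network access. The following is only a sketch: `parse_count` and `total_reads` are hypothetical helper names, and the sample dictionaries stand in for what the crawler would extract, with made-up values.

```python
def parse_count(text):
    """Return the count as an int, or None if the text is not a plain number."""
    text = text.strip()
    return int(text) if text.isdecimal() else None


def total_reads(items):
    """Sum the 'read' field, skipping entries that have no numeric count."""
    total = 0
    counted = 0
    for item in items:
        n = parse_count(item.get('read', ''))
        if n is None:
            continue  # placeholder div or malformed entry
        total += n
        counted += 1
    return total, counted


# Made-up sample: the first and last entries mimic the empty placeholder divs.
sample = [
    {'title': '', 'read': ''},
    {'title': 'Post A', 'read': '120'},
    {'title': 'Post B', 'read': '35'},
    {'title': '', 'read': ''},
]
print(total_reads(sample))
```

Returning None for anything that is not a plain number also guards against int() raising if the page layout changes.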

Use matplotlib to monitor it in real time (switch the time format to %H:%M:%S to try it out)

import matplotlib.pyplot as plt
import time
import numpy as np

# get_read_number() is the crawler function defined above

def plot_show_msg(msg):
    xtime = []
    yread = []
    for xy in msg:
        xtime.append(xy[1])
        yread.append(xy[0])
    print(msg)
    ax = np.array(xtime)
    ay = np.array(yread)
    plt.close()
    plt.plot(ax, ay)
    plt.xticks(rotation=70)
    plt.margins(0.08)
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel("Date")
    plt.ylabel("Visitors")
    plt.title("Visitor Data Visualization")
    # Non-blocking show, so the loop below can keep refreshing the chart
    plt.show(block=False)
    plt.pause(1)
    plt.close()
    print('----------')

if __name__ == '__main__':
    msg = []
    while True:
        time.sleep(1)
        all_read, now_time = get_read_number(3)
        msg.append((all_read, now_time))
        plot_show_msg(msg)

Storage module

Save the crawled data to a file

import csv
import os

def write_to_csvfile(all_read, date):
    msg_path = 'folder_name'
    filename = msg_path + os.path.sep + 'file_name'
    # Create the folder if needed (no error if it already exists)
    os.makedirs(msg_path, exist_ok=True)
    # Write the header row only when the file is being created
    need_header = not os.path.exists(filename)
    try:
        with open(filename, 'a', encoding='utf-8', newline='') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            if need_header:
                writer.writerow(['read', 'date'])
            writer.writerow((all_read, date))
    except Exception as e:
        print(e)
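To check that the stored file can be read back, here is a round-trip sketch in a temporary directory. `append_row` and `read_rows` are hypothetical helpers written for this example, the file name `demo.csv` is arbitrary, and the two rows are made-up values.

```python
import csv
import os
import tempfile


def append_row(filename, all_read, date):
    """Append one (read, date) row, writing the header only for a new file."""
    is_new = not os.path.exists(filename)
    with open(filename, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, dialect='unix')
        if is_new:
            writer.writerow(['read', 'date'])
        writer.writerow([all_read, date])


def read_rows(filename):
    """Read the file back as a list of (read, date) tuples with int counts."""
    with open(filename, 'r', encoding='utf-8', newline='') as f:
        return [(int(row['read']), row['date']) for row in csv.DictReader(f)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    append_row(path, 100, '2019/08/17')  # made-up sample values
    append_row(path, 130, '2019/08/18')
    rows = read_rows(path)
print(rows)
```

Reading through csv.DictReader keys each value by the header row, so the reader does not depend on column order.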

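Update module

Updating in an endless loop wastes too many resources. Much like Bilibili, I plan to do it once a day: update the data at noon and store it in a local file, so that a web app can read the data and render attractive tables.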
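For a long-running process, one way to aim at the noon update is to compute how long to sleep until the next 12:00. This is only a sketch of that idea, not the approach used in the repository; `seconds_until_next_noon` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def seconds_until_next_noon(now):
    """Return the number of seconds from `now` until the next 12:00:00."""
    noon = now.replace(hour=12, minute=0, second=0, microsecond=0)
    if now >= noon:
        # Today's noon has passed (or is right now), so aim for tomorrow's.
        noon += timedelta(days=1)
    return (noon - now).total_seconds()


print(seconds_until_next_noon(datetime.now()))
```

The result can be passed to time.sleep before running the crawl, which avoids waking up every second just to look at the clock.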

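There are two ways to store the data:

1. Run a bat file on a schedule on the local computer; data is stored in ./Read_msg/read_msg.csv

2. Run the program on a server, reusing the code above. On each run, check whether today's date is already in the save file: if not, update; if it is, skip the update. An update is also needed when the save file does not exist yet; data is stored in ./Test_msg/test_msg.csv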

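Method 1 (I use this for local testing)

After finishing the code, I found that someone had already written a CSDN visitor-data tool a few years ago: https://blog.csdn.net/s740556472/article/details/78239204

Back then CSDN still displayed the view count; now it no longer does. That author explains the bat approach in great detail. I learned a lot from it, thanks!

Write a day_to_day.bat file (csdn_read_save.py combines the crawler and storage modules).

With this method the data is stored in ./Read_msg/read_msg.csv.

Press Win+R and run compmgmt.msc to open Computer Management

Go to the Task Scheduler Library and create a task

Combined code for csdn_read_save.py: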

import csv
import os
import time

import requests
from pyquery import PyQuery as pq


def get_read_number(page):
    all_read = 0
    count = 0
    for i in range(1, page + 1):
        url = 'https://blog.csdn.net/qq_19381989/article/list/{}'.format(i)
        # print(url)
        r = requests.get(url)
        doc = pq(r.text)
        items = doc('#mainBox > main > div.article-list > div').items()
        for item in items:
            project = {
                'title': item.find('h4 > a').text(),
                'read': item.find('div.info-box.d-flex.align-content-center > p:nth-child(3) > span > span').text(),
                'talk': item.find('div.info-box.d-flex.align-content-center > p:nth-child(5) > span > span').text(),
            }
            # Skip the empty placeholder divs at the top and bottom.
            # To count only original posts, use this condition instead:
            # if project['read'] == '' or '原' not in project['title']:
            if project['read'] == '':
                continue
            all_read += int(project['read'])
            count += 1
    return str(all_read), time.strftime("%Y/%m/%d", time.localtime(time.time()))

def write_to_file(all_read, date):
    msg_path = 'Read_msg'
    filename = msg_path + os.path.sep + 'read_msg.csv'
    # Create the folder if needed (no error if it already exists)
    os.makedirs(msg_path, exist_ok=True)
    # Write the header row only when the file is being created
    need_header = not os.path.exists(filename)
    try:
        with open(filename, 'a', encoding='utf-8', newline='') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            if need_header:
                writer.writerow(['read', 'date'])
            writer.writerow((all_read, date))
    except Exception as e:
        print(e)

if __name__ == '__main__':
    all_read, date = get_read_number(3)
    print(all_read, date)
    write_to_file(all_read, date)

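Method 2 (I use this for the web side)

The code is roughly as follows. Before each run it checks whether the current date has already been recorded. With this method the data is saved in ./Test_msg/test_msg.csv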

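import os
import csv
import time

def is_yesterday_yn():
    '''
    Before each save, open the visitor data file and check whether today's
    date has already been recorded; if not, a crawl is needed.
    A crawl is also needed when the visitor data file does not exist yet.
    :return: True if a crawl is needed, False otherwise
    '''
    msg_path = 'Test_msg'
    today = time.strftime("%Y/%m/%d", time.localtime(time.time()))
    filename = msg_path + os.path.sep + 'test_msg.csv'
    # Check the file itself: the folder may exist while the file does not
    if not os.path.exists(filename):
        return True
    with open(filename, 'r', encoding='utf-8') as csvfile:
        reader = str(csvfile.readlines())
        print(reader)
    if today in reader:
        print('is Today')
        return False
    else:
        print('isn\'t today, you need update!')
        return True

def update_msg():
    # get_read_number() and write_to_csvfile() are the functions defined above;
    # write_to_csvfile() should point at Test_msg/test_msg.csv here
    if is_yesterday_yn():
        all_read, date = get_read_number(3)
        write_to_csvfile(all_read, date)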

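This avoids storing the same date twice when the code is run repeatedly, keeping to roughly one update per day. On the web side, the idea is that every page request triggers a check on whether the data needs updating, and after a check the next one happens 15 minutes later.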
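The "check at most once every 15 minutes" rule could be sketched as a small throttle object that a request handler consults before running the update check. The `Throttle` class is a hypothetical illustration, not code from the repository; the `now` argument exists only so the behavior can be shown with fixed timestamps.

```python
import time


class Throttle:
    """Allow an action at most once per `interval` seconds."""

    def __init__(self, interval=900):
        self.interval = interval
        self.last_run = None

    def ready(self, now=None):
        """Return True (and record the time) if the interval has elapsed."""
        now = time.monotonic() if now is None else now
        if self.last_run is None or now - self.last_run >= self.interval:
            self.last_run = now
            return True
        return False


check = Throttle(interval=900)
print(check.ready(now=0))    # first call: the check is allowed
print(check.ready(now=600))  # 10 minutes later: too soon
print(check.ready(now=900))  # 15 minutes after the first call: allowed again
```

A handler would call something like update_msg() only when ready() returns True, so frequent page requests do not reopen the CSV file each time.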

Next, add a rough matplotlib implementation:

import numpy as np
import matplotlib.pyplot as plt

def get_msg(filename):
    xtime = []
    yread = []
    try:
        with open(filename, 'r', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            # Skip the header row
            for row in list(reader)[1:]:
                xtime.append(row[1])
                # Convert to int so the y-axis is numeric, not categorical
                yread.append(int(row[0]))
    except (OSError, ValueError, IndexError):
        print('Read Error')
    # On error, whatever was read so far (possibly nothing) is returned
    return xtime, yread

def plot_show_msg(xtime, yread):
    ax = np.array(xtime)
    ay = np.array(yread)
    # plt.ion()
    plt.close()
    plt.plot(ax, ay)
    plt.xticks(rotation=70)
    plt.margins(0.08)
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel("Date")
    plt.ylabel("Visitors")
    # Chart title
    plt.title("Visitor Data Visualization")
    # Non-blocking show, so the loop below keeps running
    plt.show(block=False)
    plt.pause(1)
    plt.close()

if __name__ == '__main__':
    while True:
        update_msg()
        xtime, yread = get_msg('./Test_msg/test_msg.csv')
        plot_show_msg(xtime, yread)
        time.sleep(900)
        # Check again every 15 minutes
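Because each stored value is a running total of article views, the traffic for a given day is the difference between consecutive rows. A minimal sketch of that analysis step follows; `daily_increments` is a hypothetical helper and the sample values are made up.

```python
def daily_increments(dates, totals):
    """Turn cumulative totals into (date, increase since previous row) pairs."""
    pairs = []
    for i in range(1, len(totals)):
        pairs.append((dates[i], totals[i] - totals[i - 1]))
    return pairs


dates = ['2019/08/16', '2019/08/17', '2019/08/18']
totals = [700, 760, 806]  # made-up sample values
print(daily_increments(dates, totals))
```

The first date has no previous row to compare against, so it is left out of the result.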

GitHub: https://github.com/99Kies/Visitor_Monitor

A star would be much appreciated.