Keyword Clustering with word2vec

暴躁的猴子 · Published 2019-05-24 16:17:54

I. Training your own word vectors

Training usually takes the following three steps:

1. Corpus preparation: extract the text we need from the raw corpus.

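2. Word segmentation: this uses jieba, together with a custom dictionary and a stopword list. The stopword list is the HIT (Harbin Institute of Technology) stopword list from https://github.com/orangefly0214/stopwords. The custom dictionary depends on the topic of the vectors you are training, so you have to write it yourself. Its format follows the dictionary published by jieba, https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big: the first column is the word, the second is the word frequency, and the third is the part-of-speech tag. The second and third columns may be left out.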

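3. Load the segmented corpus and train a word2vec model with gensim. word2vec.Text8Corpus can be used to load the training corpus; you also need to set the window size, the number of iterations, the vector dimensionality, and so on.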
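The custom dictionary format from step 2 is easy to get wrong, so it can help to check a file before handing it to jieba. The following is a minimal sketch of such a check, not jieba's own parser: the helper name parse_user_dict_line and the sample entries are made up for illustration, and it assumes the fields are separated by single spaces in the order word, frequency, part-of-speech tag.

```python
import re

# Hypothetical helper (not part of jieba): splits one dictionary line
# into (word, freq, pos). Assumes single-space separators and the order
# word, frequency, part-of-speech tag, with the last two optional.
_LINE = re.compile(r'^(\S+)(?: (\d+))?(?: ([a-zA-Z]+))?$')


def parse_user_dict_line(line):
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError("malformed dictionary line: %r" % line)
    word, freq, pos = match.groups()
    return word, int(freq) if freq is not None else None, pos


if __name__ == '__main__':
    # Made-up finance terms standing in for a real user_dict.txt
    for sample in ["可转债 50 n", "科创板 30", "北向资金"]:
        print(parse_user_dict_line(sample))
```

Running the check over every line of user_dict.txt before calling jieba.load_userdict catches stray tabs or swapped columns early.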

The code is as follows:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import jieba
from gensim.models import word2vec
import gensim
import re
# Extract the corpus text from the raw data file
def extract(path,save):
    # Open the output file once instead of reopening it for every line
    with open(path,'r',encoding='utf-8') as fd, open(save,'a+',encoding='utf-8') as wd:
        for i,line in enumerate(fd):
            if i%5000==0:
                print("has extracted "+str(i)+" rows.")
            # Columns are tab-separated; the text starts at the third column
            fields=line.split('\t')
            sentence="".join(fields[2:])
            # Strip HTML tags and digits
            sentence=re.sub(r'<[^>]+>','',sentence)
            sentence=re.sub(r'[0-9]','',sentence)
            wd.write(sentence)
# Segment the corpus
# Load the stopword list
def getStopwords(stopwords_path):
    # Return a set so that membership tests during segmentation are fast
    with open(stopwords_path, "r", encoding='utf8') as f:
        return {line.strip() for line in f}

# Segment each line with jieba and drop stopwords
def segment(corpus,stopwords_path,cut_save):
    stopwords=getStopwords(stopwords_path)
    with open(corpus,'r', encoding='utf8') as f, open(cut_save, 'a', encoding='utf8') as segmentCorpus:
        for i,sentence in enumerate(f):
            if i%5000==0:
                print("has segmented "+str(i)+" rows.")
            words = jieba.cut(sentence.strip())
            # Skip whitespace-only tokens as well as stopwords
            sentence_segment = [word for word in words if word.strip() and word not in stopwords]
            # One segmented sentence per line, words separated by spaces
            segmentCorpus.write(" ".join(sentence_segment)+"\n")

def run(data_path,corpus_path,user_dict,stopwords_path,cutCorpus,output1,output2):
    # Extract the corpus
    extract(data_path,corpus_path)
    print("finished extracting corpus---")
    # Load the custom dictionary
    jieba.load_userdict(user_dict)
    segment(corpus_path,stopwords_path,cutCorpus)
    print('finished segment---')
    # Train the word vectors
    sentences = word2vec.Text8Corpus(cutCorpus)
    # Parameter names for gensim >= 4.0; gensim 3.x calls them size and iter
    model = word2vec.Word2Vec(sentences, vector_size=300, window=10, min_count=5, epochs=20)
    model.save(output1)
    model.wv.save_word2vec_format(output2, binary=False)
    # model = gensim.models.Word2Vec.load('vec.model')
    # Query through model.wv; model.most_similar was removed in gensim 4.0
    print(model.wv.most_similar('债券'))  # "bond"
    print()
    print(model.wv.most_similar('股票'))  # "stock"
    print()
    print(model.wv.most_similar('腾讯'))  # "Tencent"
    print()



if __name__ == '__main__':
    data_path='./finance_classes.txt'
    corpus_path='./corpus.txt'
    stopwords_path='./stopwords.dat'
    user_dict='./user_dict.txt'
    cutCorpus='./cutCorpus.txt'
    output1='./models/vec.model'
    output2='./models/word2vec_format'
    run(data_path,corpus_path,user_dict,stopwords_path,cutCorpus,output1,output2)
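The script above saves the vectors in the word2vec text format (output2) and then prints the nearest neighbours of a few query words. As a rough illustration of what that lookup amounts to, the sketch below reads the text format (a header line with the vocabulary size and dimensionality, then one word and its values per line) and ranks words by cosine similarity. It is a simplified stand-in, not gensim's implementation; the function names and the tiny demo vectors are made up.

```python
import io
import math


def load_text_vectors(handle):
    """Read vectors in the word2vec text format from an open text file."""
    _vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


if __name__ == '__main__':
    # Tiny made-up vectors standing in for ./models/word2vec_format
    demo = io.StringIO("3 2\n股票 1.0 0.0\n债券 0.9 0.1\n腾讯 0.0 1.0\n")
    print(most_similar(load_text_vectors(demo), '股票', topn=2))
```

To try it on the real output, pass open('./models/word2vec_format', encoding='utf8') instead of the StringIO object.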

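II. Keyword clustering with Google's official word2vec tools

1. With word vectors generated from our corpus as described above, we can now use the word2vec code released by Google to cluster keywords.

Download link for the official code: https://code.google.com/archive/p/word2vec/source

2. After downloading the source files, run make in the directory that contains them.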

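On the first run:

The first time make runs, it scans the Makefile for the target and its dependencies. If a dependency is itself a target, make scans the Makefile for that dependency's own dependencies and builds them first. Once the dependencies are built, it builds the main target (the one passed to the make command).

On later runs:

If you then modify a source file and run make again, it rebuilds only the targets that depend on that file, which saves a lot of time compared with rebuilding every executable.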
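The rebuild rule described above can be written down in a few lines. This is only a sketch of the idea, not how make is implemented: the function name, the dependency table, and the file names are all hypothetical, and it assumes every source file has an entry in the modification-time table.

```python
def needs_rebuild(target, deps, mtimes):
    """Decide whether `target` must be rebuilt.

    `deps` maps each buildable target to its dependencies and `mtimes`
    maps existing files to modification times (both hypothetical inputs).
    """
    if target not in deps:
        return False  # plain source file: nothing to build
    if target not in mtimes:
        return True  # never built before
    for dep in deps[target]:
        if needs_rebuild(dep, deps, mtimes):
            return True  # a dependency is itself out of date
        if mtimes[dep] > mtimes[target]:
            return True  # dependency changed after the last build
    return False


if __name__ == '__main__':
    # Made-up dependency chain: app <- app.o <- app.c
    deps = {"app": ["app.o"], "app.o": ["app.c"]}
    print(needs_rebuild("app", deps, {"app.c": 1}))                       # first run
    print(needs_rebuild("app", deps, {"app.c": 1, "app.o": 2, "app": 3}))  # up to date
    print(needs_rebuild("app", deps, {"app.c": 4, "app.o": 2, "app": 3}))  # source edited
```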

3. Train your own word vectors with Google's official word2vec.

./word2vec -train /data0/shixi_jiajuan/cutCorpus.txt -output vectors.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
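The command above selects the skip-gram model (-cbow 0) with a window of 10. As a rough illustration of what the window controls, the sketch below enumerates the (center, context) pairs that skip-gram would learn from for a segmented sentence. The function name is made up, and the sketch always uses the full window, whereas the real tool also shrinks the window at random for each position.

```python
def context_pairs(tokens, window):
    """Yield (center, context) for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


if __name__ == '__main__':
    # A made-up segmented sentence; real input comes from cutCorpus.txt
    sentence = ["央行", "下调", "存款", "准备金率"]
    for pair in context_pairs(sentence, window=1):
        print(pair)
```

A larger window pulls in more distant words as context, which tends to make the vectors reflect topic more than local syntax.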

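4. Explore and cluster the keywords with the official tools. The distance tool loads vectors.bin and interactively lists the words closest to a query word by cosine similarity:

./distance vectors.bin

Note that distance is a nearest-neighbour lookup, not K-means. The K-means clustering built into the toolkit is enabled by passing -classes followed by the number of clusters to ./word2vec, which then writes a cluster id for each word instead of the vectors.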
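The clustering step itself can be sketched in a few lines of Python. What follows is a minimal plain K-means over a word-to-vector mapping, not the toolkit's C implementation: the function names, the fixed iteration count, the random initialisation, and the tiny demo vectors are all assumptions made for illustration.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=10, seed=0):
    """Cluster a {word: vector} mapping into k groups with plain K-means."""
    words = list(vectors)
    rng = random.Random(seed)
    # Start from k distinct words chosen at random as initial centroids
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each word to its nearest centroid
        for word in words:
            assignment[word] = min(
                range(k),
                key=lambda c: squared_distance(vectors[word], centroids[c]))
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == '__main__':
    # Made-up 2-d vectors; real ones would be loaded from the trained model
    demo = {"股票": [1.0, 0.0], "债券": [0.9, 0.1],
            "腾讯": [0.0, 1.0], "阿里": [0.1, 0.9]}
    print(kmeans(demo, k=2))
```

The result maps each word to a cluster index; grouping words by that index gives the keyword clusters.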

Reference:

https://blog.csdn.net/zhaoxinfan/article/details/11069485
