Elasticsearch IK Chinese Analyzer and Exact-Match Queries
Building on the previous two articles, this post takes a look at IK.
IKAnalyzer is a free, open-source Java analyzer and currently one of the most popular Chinese tokenizers. It is simple and stable; to get particularly good results you need to maintain your own vocabulary, and custom dictionaries are supported.
I. Installing the IK analyzer plugin
Download from https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.4.2 and pick the release that matches your Elasticsearch version.
Then create an ik folder under Elasticsearch's plugins directory and unzip the downloaded archive into it.
Edit the plugin-descriptor.properties file inside the ik folder and set the version numbers (I am running Elasticsearch 6.2.2 with IK 6.3.0; according to the project page, releases within the same major version, 6 in this case, are compatible):
description=IK Analyzer for Elasticsearch
#
# 'version': plugin's version
version=6.3.0
#
# 'name': the plugin name
name=analysis-ik
#
# 'classname': the name of the class to load, fully-qualified.
classname=org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin
#
# 'java.version' version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=1.8
#
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=6.2.2
Then restart Elasticsearch: switch to the bin directory and run the elasticsearch command. The startup log should show that the node started successfully.
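To check that the plugin was actually loaded (assuming the default HTTP port 9200), the installed plugins can be listed:
http://localhost:9200/_cat/plugins GET
If everything went well, the output contains a line with analysis-ik and the version set above.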
II. Testing common ES APIs
1. Create an index
http://localhost:9200/es PUT
Response:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "es"
}
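The same request can also be sent from the command line with curl, for example:
curl -X PUT http://localhost:9200/es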
2. Create a mapping and configure the analyzers
http://localhost:9200/es/_mapping/doc POST
Header: Content-Type: application/json. Body:
{
"properties":{
"content":{
"type":"text",
"analyzer":"ik_max_word",
"search_analyzer":"ik_smart"
}
}
}
Response:
{
"acknowledged": true
}
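To double-check that the mapping was applied, it can be read back; the content field should show the two IK analyzers configured above:
http://localhost:9200/es/_mapping GET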
3. Add data: index four test documents
http://localhost:9200/es/doc/1 POST
Header: Content-Type: application/json
Body:
{
"content":"美国留给伊拉克的是个烂摊子吗"
}
http://localhost:9200/es/doc/2 POST
Header: Content-Type: application/json
{
"content":"公安部:各地校车将享最高路权"
}
http://localhost:9200/es/doc/3 POST
Header: Content-Type: application/json
{
"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
http://localhost:9200/es/doc/4 POST
Header: Content-Type: application/json
{
"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
Response (shown for the fourth document):
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 1,
"_primary_term": 1
}
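For larger amounts of test data, the _bulk API avoids one HTTP round trip per document. A minimal sketch (the body is newline-delimited JSON sent with Content-Type: application/x-ndjson and must end with a newline):
http://localhost:9200/es/doc/_bulk POST
{"index":{"_id":"1"}}
{"content":"美国留给伊拉克的是个烂摊子吗"}
{"index":{"_id":"2"}}
{"content":"公安部:各地校车将享最高路权"}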
4. Tokenized search: the following match query finds 2 documents. (Note: a match query targets a single field.)
http://localhost:9200/es/_search POST
Body:
{
"query": {
"match": {
"content": "中国"
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6489038,
"hits": [
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_score": 0.6489038,
"_source": {
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "3",
"_score": 0.2876821,
"_source": {
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
}
]
}
}
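For reference, the same search can be issued with curl; since Elasticsearch 6.x the Content-Type header is required whenever a body is sent:
curl -X POST http://localhost:9200/es/_search -H 'Content-Type: application/json' -d '{"query":{"match":{"content":"中国"}}}'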
5. Delete an index
http://localhost:9200/{index} DELETE
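For example, deleting the es index used in this article:
curl -X DELETE http://localhost:9200/es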
6. Inspect tokenization with _analyze
http://localhost:9200/_analyze POST
Body:
{
"analyzer":"ik_max_word",
"text":"中国人"
}
Response:
{
"tokens": [
{
"token": "中国人",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "中国",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "国人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
}
]
}
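For comparison, the same text can be run through ik_smart; being the coarse-grained analyzer, it should return fewer tokens (most likely just "中国人"):
http://localhost:9200/_analyze POST
{
"analyzer":"ik_smart",
"text":"中国人"
}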
III. Exact combined queries (and, or)
1. All data currently in the index
http://localhost:9200/es/_search returns all documents. (The test data has changed since section II: document 1 now contains "美国人111", and documents 5 and 6 have been added.)
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1,
"hits": [
{
"_index": "es",
"_type": "doc",
"_id": "5",
"_score": 1,
"_source": {
"content": "美国人特朗普垃圾"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_score": 1,
"_source": {
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "2",
"_score": 1,
"_source": {
"content": "公安部:各地校车将享最高路权"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "6",
"_score": 1,
"_source": {
"content": "美航空母舰国人"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"content": "美国人111"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
}
]
}
}
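A search request with no body behaves like a match_all query and returns at most 10 hits by default; to make that explicit, or to raise the limit, an equivalent body would be:
{
"query": { "match_all": {} },
"size": 20
}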
2. or query
http://localhost:9200/es/_search POST
Body:
{
"query": {
"match": {
"content": "美国人1"
}
}
}
From the result of this query we can see why: the analyzer splits "美国人111" into the two tokens "美国人" and "111", and because the match query's default operator is or, both matching documents are returned.
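The default can also be spelled out explicitly; the following body is equivalent to the query above:
{
"query": {
"match": {
"content": {
"query": "美国人111",
"operator": "or"
}
}
}
}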
3. Use an and query for exact matching
http://localhost:9200/es/_search?pretty=true POST
{
"query": {
"match": {
"content": {
"query": "美国人111",
"operator": "and"
}
}
}
}
From this result we can see that only content containing both "美国人" and "111" is matched; with the and operator, exactly one document is returned.
4. The exact tokens produced for "美国人111" can be verified with the _analyze API from section II.6 (see the sketch after the analyzer descriptions below). The two IK analyzers differ in granularity:
ik_max_word performs the finest-grained segmentation: "中华人民共和国国歌", for example, is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination; it suits term queries.
ik_smart performs the coarsest-grained segmentation: "中华人民共和国国歌" is split into "中华人民共和国, 国歌"; it suits phrase queries.
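To see the tokens for "美国人111" mentioned in point 4, the _analyze API can be used with either analyzer; with the ik_smart search analyzer the request looks like this (the response lists the resulting tokens):
http://localhost:9200/_analyze POST
{
"analyzer":"ik_smart",
"text":"美国人111"
}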
Reference: https://www.jianshu.com/p/362f85ebf383