Elasticsearch Analyzers


胡尚 · Published 2024-08-13 17:43:38

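Analyzers

Basic concepts

The official documentation refers to this component as an analyzer (text analysis).

Its basic job is to split the original document into smaller terms according to predefined rules. How fine-grained the terms are depends on the rules of the analyzer in use.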

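When analysis happens

Analysis takes place at two points: index time and search time.

Index time: when a document is written and the inverted index is built. The analyzer used is determined by the analyzer mapping parameter.
Search time: when a query runs. Analysis applies only to the search terms. By default the query uses the analyzer defined in the field mapping, which can be overridden with search_analyzer.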
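To make the two phases concrete, here is a minimal Python sketch. It is not how Elasticsearch is implemented: the analyze function (lowercase, split on whitespace) and the sample document are assumptions chosen only to show that the same analysis runs once over the document at index time and once over the query at search time, while the stored source stays unchanged.

```python
def analyze(text):
    # Hypothetical analyzer: lowercase the text and split on whitespace.
    return text.lower().split()

documents = {1: "Quick Brown Fox"}

# Index time: analyze each document and build term -> set of document IDs.
inverted = {}
for doc_id, source in documents.items():
    for term in analyze(source):
        inverted.setdefault(term, set()).add(doc_id)

def search(query):
    # Search time: analyze only the query text, then look the terms up.
    hits = set()
    for term in analyze(query):
        hits |= inverted.get(term, set())
    return hits

print(search("QUICK"))   # the query is lowercased before lookup
print(documents[1])      # the stored source is untouched
```

Because both sides go through the same analysis, the uppercase query still finds the document.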

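Components of an analyzer

Tokenizer: defines how the text is split into terms.
Token filter: processes the individual terms produced by the tokenizer.
Character filter: preprocesses the raw characters before the text reaches the tokenizer.

Note: analysis never modifies the source data. It only affects what goes into the inverted index and how search terms are interpreted.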
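The three components run in a fixed order: character filters first, then the tokenizer, then token filters. The Python sketch below illustrates that ordering only. The three stage functions are hypothetical stand-ins, not Elasticsearch's built-in implementations.

```python
import re

def char_filter(text):
    # Character filter stage: rewrite the raw text before it is split.
    return text.replace("&", " and ")

def tokenizer(text):
    # Tokenizer stage: split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text) if t]

def token_filter(tokens):
    # Token filter stage: normalize each token, here by lowercasing.
    return [t.lower() for t in tokens]

def analyze(text):
    # Character filter -> tokenizer -> token filter.
    return token_filter(tokenizer(char_filter(text)))

source = "Tom & Jerry"
print(analyze(source))
print(source)  # the source string itself is not modified
```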

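Tokenizer

The tokenizer splits the original text into fine-grained pieces. Each piece is called a term.

A tokenizer can be thought of as a predefined set of splitting rules. Elasticsearch ships with many built-in tokenizers, and the default is standard. After installing the IK plugin, IK analysis such as ik_max_word also becomes available.

Token filter

A token filter processes the terms after tokenization, for example by changing case, removing stop words, or applying synonyms.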

# Use the lowercase filter to convert uppercase letters to lowercase
GET /_analyze
{
  "filter": ["lowercase"],
  "text": ["WWW ELASTIC ORG CN"]
}

# Convert lowercase letters to uppercase
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["uppercase"],
  "text" : ["www.elastic.org.cn","www elastic org cn"]
}

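Stop words

Stop words are terms that are discarded after tokenization. The list of stop words can be customized.

The IK plugin we installed defines its stop words in its own dictionary files.

English stop words (english): a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.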

# With the stop filter, the term "are" no longer appears in the result
GET _analyze
{
  "tokenizer": "standard", 
  "filter": ["stop"],
  "text": ["What are you doing"]
}

### Custom filter
DELETE test_token_filter_stop
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {        # my_filter is the name of the custom filter
          "type": "stop",     # the type is stop (stop word filter)
          "stopwords": [      # the only stop word is www
            "www"
          ],
          "ignore_case": true
        }
      }
    }
  }
}

# Test: www does not appear in the result, in either upper or lower case
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_filter"], 
  "text": ["What www WWW are you doing"]
}
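The effect of ignore_case can be reproduced with a small Python sketch. This is an illustration of the filtering rule under simplified assumptions: the hypothetical stop_filter function takes tokens already split on whitespace and is not the Elasticsearch implementation.

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    # Drop every token that appears in the stop word list.
    if ignore_case:
        stop = {w.lower() for w in stopwords}
        return [t for t in tokens if t.lower() not in stop]
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]

tokens = "What www WWW are you doing".split()

# With ignore_case, both www and WWW are removed.
print(stop_filter(tokens, ["www"], ignore_case=True))

# Without it, only the exact match www is removed.
print(stop_filter(tokens, ["www"]))
```

Note that "are" survives in both cases, because the custom list replaces the default English stop words rather than extending them.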

Synonyms

Synonym rule formats

a, b, c => d: a, b, and c are each replaced by d.
a, b, c, d: a, b, c, and d are treated as equivalent.

PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {         # the custom filter is named my_synonym
          "type": "synonym",    # the type is synonym
          "synonyms": [ "good, nice => excellent" ]   # good and nice are replaced by excellent
        }
      }
    }
  }
}

# Test: analyzing good yields excellent, which is what gets stored in the inverted index
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_synonym"], 
  "text": ["good"]
}
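The two rule formats can be sketched in Python as follows. The functions parse_synonym_rules and synonym_filter are hypothetical helpers written for this illustration. They handle single-word synonyms only and ignore token positions, so they are a simplification of what the synonym filter does.

```python
def parse_synonym_rules(rules):
    # Build a mapping: source term -> list of terms emitted in its place.
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # "a, b => c": every term on the left is replaced by the right side.
            left, right = rule.split("=>")
            sources = [w.strip() for w in left.split(",")]
            targets = [w.strip() for w in right.split(",")]
        else:
            # "a, b, c": all terms are equivalent, so each expands to all of them.
            sources = targets = [w.strip() for w in rule.split(",")]
        for s in sources:
            bucket = mapping.setdefault(s, [])
            for t in targets:
                if t not in bucket:
                    bucket.append(t)
    return mapping

def synonym_filter(tokens, mapping):
    # Replace each token by its synonyms; unknown tokens pass through unchanged.
    out = []
    for t in tokens:
        out.extend(mapping.get(t, [t]))
    return out

replace_rules = parse_synonym_rules(["good, nice => excellent"])
print(synonym_filter(["good"], replace_rules))

equivalent_rules = parse_synonym_rules(["a, b, c, d"])
print(synonym_filter(["b"], equivalent_rules))
```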

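Character filter

Character filters preprocess the text before tokenization, for example to strip out unwanted characters.

Definition format:

PUT <index_name>
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {   # my_char_filter is a custom name
          "type": "<char_filter_type>"  # the character filter type: html_strip, mapping, or pattern_replace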
        }
      }
    }
  }
}

type: the character filter type to use. It accepts the following values:

html_strip
mapping
pattern_replace

HTML tag filter

Setting type to html_strip selects the HTML tag filter.

Parameter: escaped_tags, the HTML tags to keep.

# Define a custom character filter when creating the index
PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {      # name of the custom filter: my_char_filter
          "type": "html_strip",  # html_strip selects the HTML tag filter
          "escaped_tags": [      # keep only the a tag
            "a"
          ]
        }
      }
    }
  }
}

# Test: the <p> tags are stripped from the result and only the <a> tags remain
GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard", 
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}
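The idea behind escaped_tags can be approximated in Python. The sketch below is a rough illustration with a hypothetical html_strip function: it drops tags with a simple regular expression and decodes entities with the standard library. The real filter does more (for example, it treats block-level tags such as p as line breaks), which this sketch does not attempt.

```python
import html
import re

# Matches an opening or closing tag and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

def html_strip(text, escaped_tags=()):
    keep = {t.lower() for t in escaped_tags}

    def replace(match):
        # Keep the tag verbatim if its name is listed in escaped_tags.
        return match.group(0) if match.group(1).lower() in keep else ""

    # Remove unwanted tags, then decode entities such as &apos;.
    return html.unescape(TAG.sub(replace, text))

print(html_strip("<p>I&apos;m so <a>happy</a>!</p>", escaped_tags=["a"]))
```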

Character mapping filter

Setting type to mapping selects the character mapping filter.

# Define a custom character filter when creating the index
PUT test_mapping_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "hs_char_filter": {      # name of the custom filter: hs_char_filter
          "type": "mapping",     # mapping selects the character mapping filter
          "mappings": [          # each listed string is replaced by the string after =>
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      }
    }
  }
}

# Test: the listed characters are replaced by * in the result
GET test_mapping_char_filter/_analyze
{
  #"tokenizer": "standard", 
  "char_filter": ["hs_char_filter"],
  "text": "你就是个垃圾！滚"
}
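As a rough illustration of what the mapping rules do, the following Python sketch applies "key => value" rules to a string, preferring the longest key when several match at the same position. The function name mapping_char_filter and the rule parsing are assumptions made for this example. It is not the Elasticsearch implementation.

```python
def mapping_char_filter(text, mappings):
    # Parse "key => value" rules into a lookup table.
    table = {}
    for rule in mappings:
        src, dst = (part.strip() for part in rule.split("=>"))
        if src:  # ignore empty keys, which could never advance the scan
            table[src] = dst

    # Try longer keys first so the longest match at a position wins.
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No rule matched here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

rules = ["滚 => *", "垃 => *", "圾 => *"]
print(mapping_char_filter("你就是个垃圾！滚", rules))
```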

Pattern replace filter

Setting type to pattern_replace selects the regular expression replacement filter.

PUT text_pattern_replace_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",    # pattern_replace selects the regex replacement filter
          "pattern": """(\d{3})\d{4}(\d{4})""",    # regular expression
          "replacement": "$1****$2"     # replace the middle 4 digits with *
        }
      }
    }
  }
}

GET text_pattern_replace_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": "您的手机号是18868686688"
}
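The same masking can be tried out locally with Python's re module. This is a sketch, not the filter itself: the hypothetical pattern_replace function converts the $1-style group references used above into Python's \g<1> form, and the two regex engines may differ on more complex patterns.

```python
import re

def pattern_replace(text, pattern, replacement):
    # Convert Java-style group references ($1) to Python style (\g<1>).
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)

masked = pattern_replace(
    "您的手机号是18868686688",
    r"(\d{3})\d{4}(\d{4})",
    "$1****$2",
)
print(masked)
```

The first three and last four digits are captured as groups 1 and 2, and the four digits between them are replaced by asterisks.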

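Data structure of the inverted index

When data is written to ES, it passes through an analyzer, which splits it into terms. ES then maps each term to the documents that contain it. This mapping is called the inverted index.

Every field of a JSON document in Elasticsearch has its own inverted index.

Indexing can be turned off for individual fields (mapping parameter index: false):

Advantage: saves storage space
Disadvantage: the field cannot be searched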

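To make lookups faster, ES builds a term index, a compact index over the terms themselves that is keyed by term prefixes (Lucene implements it as an FST, which also shares common suffixes to save space).

When we search for a keyword, ES first uses the term index to quickly narrow down the part of the term dictionary that has to be examined, which greatly reduces the number of disk I/O operations.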
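The narrowing step can be illustrated with a much simpler structure than the one Lucene uses. In the Python sketch below, a plain dict keyed by a fixed-length prefix stands in for the term index, and a sorted list stands in for the term dictionary. Both are assumptions made for illustration, and the names build_term_index and lookup are hypothetical.

```python
from bisect import bisect_left

def build_term_index(sorted_terms, prefix_len=1):
    # Map each prefix to the offset of the first term that starts with it.
    index = {}
    for offset, term in enumerate(sorted_terms):
        index.setdefault(term[:prefix_len], offset)
    return index

def lookup(term, sorted_terms, index, prefix_len=1):
    start = index.get(term[:prefix_len])
    if start is None:
        # No term shares this prefix, so the dictionary is never touched.
        return -1
    # Binary search only from the first term with the matching prefix.
    pos = bisect_left(sorted_terms, term, lo=start)
    if pos < len(sorted_terms) and sorted_terms[pos] == term:
        return pos
    return -1

terms = sorted(["apple", "apply", "banana", "band", "cat"])
term_index = build_term_index(terms)
print(term_index)
print(lookup("band", terms, term_index))
print(lookup("dog", terms, term_index))
```

A lookup for a term whose prefix is unknown returns immediately, which corresponds to skipping the disk read altogether.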

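Term Dictionary

Records every term that occurs in the documents, along with the association between each term and its posting list.

Posting List

Records the set of documents that contain a given term. It is made up of postings.

Posting
- Document ID
- Term frequency (TF): the number of times the term occurs in the document, used for relevance scoring
- Position: the position of the term among the document's tokens, used for phrase queries such as match_phrase
- Offset: the start and end character offsets of the term, used for highlighting (highlight)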
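The fields of a posting can be made concrete with a small in-memory model. The Python sketch below is an illustration under simplifying assumptions (tokens are runs of word characters, lowercased). The function build_inverted_index and the sample documents are hypothetical and say nothing about how Lucene stores postings on disk.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    # term -> {doc_id: posting}, where a posting holds tf, positions, offsets.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            posting["tf"] += 1                                   # term frequency
            posting["positions"].append(position)                # token position
            posting["offsets"].append((match.start(), match.end()))  # char offsets
    return index

docs = {1: "Elasticsearch is fast", 2: "fast search, fast results"}
index = build_inverted_index(docs)

# Posting list for "fast": one posting per document that contains the term.
for doc_id, posting in index["fast"].items():
    print(doc_id, posting)
```

The positions are what a phrase query would compare across terms, and the offsets are what a highlighter would use to mark the matched text in the original string.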

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": ["中华人民共和国"]
}

# Result
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0        # position index 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,   # start offset
      "end_offset" : 4,     # end offset
      "type" : "CN_WORD",
      "position" : 1        # position index 1, incrementing from here on
    }
    ...
    ]
}