analyzer

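The values of analyzed string fields are passed through an analyzer, which converts the string into a stream of tokens, also called terms. For example, depending on which analyzer is used, the string "The quick Brown Foxes" may be analyzed into the tokens quick, brown, fox. These are the actual terms indexed for the field, and they make it possible to search efficiently for individual words within large blocks of text.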

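This analysis process happens not only at index time but also at query time: the query string must be passed through the same (or a similar) analyzer so that the terms it looks for are in the same format as the terms stored in the index.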
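The following is a minimal sketch of why the query has to go through the same analysis as the indexed text. It is not Elasticsearch code: toy_analyze, build_index, and search are hypothetical helpers, the stop list is a tiny illustrative set, and the suffix stripping is only a crude stand-in for real stemming.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop list, not Elasticsearch's


def toy_analyze(text):
    """Lowercase, split on non-letters, drop stop words, strip a plural suffix."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        # Crude stand-in for stemming: "foxes" -> "fox", "dogs" -> "dog".
        if token.endswith("es") and len(token) > 3:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in toy_analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyze_query=True):
    """Look up query terms; optionally skip analysis to show the mismatch."""
    terms = toy_analyze(query) if analyze_query else [query]
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


index = build_index({1: "The quick Brown Foxes"})
print(toy_analyze("The quick Brown Foxes"))         # ['quick', 'brown', 'fox']
print(search(index, "Foxes"))                       # {1}
print(search(index, "Foxes", analyze_query=False))  # set()
```

The unanalyzed query fails because the index only holds the term fox, never the raw string Foxes.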

Elasticsearch ships with a number of pre-defined analyzers, which can be used without further configuration. It also ships with many character filters, tokenizers, and token filters, which can be combined to configure custom analyzers for each index.
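As a rough illustration of how those three kinds of building blocks fit together, the sketch below chains character filters (string to string), a tokenizer (string to tokens), and token filters (tokens to tokens). All names here (compose_analyzer, strip_html, whitespace_tokenizer, and so on) are hypothetical Python stand-ins and not the Elasticsearch implementations.

```python
import re


def compose_analyzer(char_filters, tokenizer, token_filters):
    """Build an analyzer: char filters, then one tokenizer, then token filters."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return list(tokens)
    return analyze


def strip_html(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens, stop_words=frozenset({"a", "an", "the"})):
    """Token filter: drop tokens found in a small illustrative stop list."""
    return [token for token in tokens if token not in stop_words]


# Order matters: lowercase runs before stop-word removal so "The" is caught.
my_analyzer = compose_analyzer(
    [strip_html], whitespace_tokenizer, [lowercase, remove_stop_words]
)
print(my_analyzer("<b>The</b> quick Brown fox"))  # ['quick', 'brown', 'fox']
```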

Analyzers can be specified per query, per field, or per index. At index time, Elasticsearch looks for an analyzer in this order:

The analyzer defined in the field mapping.
An analyzer named default in the index settings.
The standard analyzer.

At query time, there are a few more layers:

The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
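The two lookup orders above can be sketched as a pair of fallback chains. This is an illustrative model only: resolve_index_analyzer and resolve_search_analyzer are hypothetical functions, and the plain dictionaries are a simplification in which default and default_search hold analyzer names rather than full analyzer definitions.

```python
def resolve_index_analyzer(field_mapping, index_settings):
    """Index time: field analyzer, then index default, then standard."""
    return (
        field_mapping.get("analyzer")
        or index_settings.get("default")
        or "standard"
    )


def resolve_search_analyzer(query, field_mapping, index_settings):
    """Query time: the same chain with three extra layers in front and between."""
    candidates = [
        query.get("analyzer"),                # analyzer set in the full-text query
        field_mapping.get("search_analyzer"),
        field_mapping.get("analyzer"),
        index_settings.get("default_search"),
        index_settings.get("default"),
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return "standard"


field = {"analyzer": "english"}
settings = {"default": "simple", "default_search": "whitespace"}

print(resolve_index_analyzer(field, settings))                          # english
print(resolve_index_analyzer({}, {}))                                   # standard
print(resolve_search_analyzer({}, field, settings))                     # english
print(resolve_search_analyzer({}, {}, settings))                        # whitespace
print(resolve_search_analyzer({"analyzer": "keyword"}, field, settings))  # keyword
```

Note that the field's own analyzer takes precedence over default_search, so the index-level search default only applies to fields with no analyzer of their own.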

The easiest way to specify an analyzer for a particular field is to define it in the field mapping, as follows:

curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": { # 1
          "type": "text",
          "fields": {
            "english": { # 2
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
'
curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d' # 3
{
  "field": "text",
  "text": "The quick Brown Foxes."
}
'
curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d' # 4
{
  "field": "text.english",
  "text": "The quick Brown Foxes."
}
'

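The # 1 to # 4 markers in the requests above are callouts for the following notes, not part of the request bodies; remove them before running the commands.

(1) The text field uses the default standard analyzer.

(2) The text.english multi-field uses the english analyzer, which removes stop words and applies stemming.

(3) Returns the tokens [the, quick, brown, foxes].

(4) Returns the tokens [quick, brown, fox].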

search_quote_analyzer

The search_quote_analyzer setting lets you specify an analyzer for phrases. This is particularly useful when you want to disable stop-word removal for phrase queries.

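Disabling stop-word removal for phrases requires three analyzer settings on the field:

An analyzer setting for indexing all terms, including stop words.
A search_analyzer setting for non-phrase queries, which removes stop words.
A search_quote_analyzer setting for phrase queries, which does not remove stop words.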

curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{ # 1
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase"
               ]
            },
            "my_stop_analyzer":{ # 2
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "english_stop"
               ]
            }
         },
         "filter":{
            "english_stop":{
               "type":"stop",
               "stopwords":"_english_"
            }
         }
      }
   },
   "mappings":{
      "my_type":{
         "properties":{
            "title": {
               "type":"text",
               "analyzer":"my_analyzer", # 3
               "search_analyzer":"my_stop_analyzer", # 4
               "search_quote_analyzer":"my_analyzer" # 5
            }
         }
      }
   }
}
'

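(1) my_analyzer, an analyzer that tokenizes all terms, including stop words.

(2) my_stop_analyzer, an analyzer that removes stop words.

(3) The analyzer setting points to my_analyzer, which is used at index time.

(4) The search_analyzer setting points to my_stop_analyzer, which removes stop words from non-phrase queries.

(5) The search_quote_analyzer setting points to my_analyzer, which ensures that stop words are not removed from phrase queries.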

PUT my_index/my_type/1
{
   "title":"The Quick Brown Fox"
}

PUT my_index/my_type/2
{
   "title":"A Quick Brown Fox"
}

GET my_index/my_type/_search
{
   "query":{
      "query_string":{
         "query":"\"the quick brown fox\"" # 1
      }
   }
}

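(1) Because the query is wrapped in quotes, it is detected as a phrase query, so the search_quote_analyzer kicks in and ensures that stop words are not removed from the query. The my_analyzer analyzer then returns the terms [the, quick, brown, fox], which match only one of the documents. Term queries, by contrast, are analyzed with my_stop_analyzer, which filters out stop words. An unquoted search for The quick brown fox or A quick brown fox therefore returns both documents, because both contain the tokens [quick, brown, fox]. Without search_quote_analyzer, exact matching for phrase queries would not be possible: the stop words would be removed from the phrase query, and both documents would match.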
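The behaviour described above can be imitated with a small in-memory model. This is a simplified sketch, not how Elasticsearch evaluates queries: the two analyzers are hypothetical Python stand-ins for my_analyzer and my_stop_analyzer, the stop list is a small illustrative subset, phrase matching is a plain consecutive-token comparison that ignores position gaps, and non-phrase terms are combined with OR.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # small illustrative subset of an English stop list


def my_analyzer(text):
    """Stand-in for my_analyzer: lowercase word tokens, stop words kept."""
    return re.findall(r"\w+", text.lower())


def my_stop_analyzer(text):
    """Stand-in for my_stop_analyzer: the same tokens with stop words removed."""
    return [token for token in my_analyzer(text) if token not in STOP_WORDS]


DOCS = {1: "The Quick Brown Fox", 2: "A Quick Brown Fox"}
# Index time uses the analyzer setting (my_analyzer), so stop words are indexed.
INDEXED = {doc_id: my_analyzer(title) for doc_id, title in DOCS.items()}


def contains_phrase(tokens, phrase):
    """True if phrase occurs as a run of consecutive tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(query, search_quote_analyzer=my_analyzer):
    """Quoted queries use search_quote_analyzer; others use my_stop_analyzer."""
    is_phrase = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    if is_phrase:
        phrase = search_quote_analyzer(query.strip('"'))
        return sorted(d for d, toks in INDEXED.items() if contains_phrase(toks, phrase))
    terms = my_stop_analyzer(query)
    return sorted(d for d, toks in INDEXED.items() if any(t in toks for t in terms))


print(search('"the quick brown fox"'))                    # [1]
print(search('the quick brown fox'))                      # [1, 2]
# Without a dedicated quote analyzer, the phrase loses "the" and matches both.
print(search('"the quick brown fox"', my_stop_analyzer))  # [1, 2]
```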


