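Pattern Replace Character Filter
The pattern_replace character filter uses a regular expression to match characters that should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.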

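Beware of pathological regular expressions

The pattern_replace character filter uses Java regular expressions.
A badly written ("pathological") regular expression can run very slowly or even throw a StackOverflowError, causing the node it is running on to exit suddenly.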
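
To get a feel for what a pathological pattern looks like, the sketch below uses Python's `re` module as a stand-in for Java's engine. This is an assumption made for illustration: both are backtracking engines, but they are not identical, and `StackOverflowError` is specific to the JVM. The pattern `(a+)+$` and the helper name `time_match` are illustrative and do not come from Elasticsearch.

```python
import re
import time

def time_match(pattern, text):
    """Return (matched, seconds) for a single re.match call."""
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

# Input that almost matches: a run of 'a' followed by a character that
# makes the overall match fail, so the engine has to exhaust its options.
text = "a" * 18 + "!"

# Nested quantifiers: the number of ways to split the run of 'a'
# roughly doubles with every extra character.
nested = time_match(r"(a+)+$", text)

# Same intent without nesting: fails after a linear amount of work.
flat = time_match(r"a+$", text)

print(f"nested: matched={nested[0]} in {nested[1]:.4f}s")
print(f"flat:   matched={flat[0]} in {flat[1]:.6f}s")
```

Neither pattern matches, but the nested one typically takes far longer to find that out, and the gap widens quickly as the input grows. Testing a pattern against near-miss inputs like this before putting it into an analyzer is one way to catch the problem early.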

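Configuration

The pattern_replace character filter accepts the following parameters:

pattern
    A Java regular expression. Required.
replacement
    The replacement string, which can reference capture groups using the $1..$9 syntax, as described in the Java documentation for Matcher.appendReplacement.
flags
    Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".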
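
As a rough illustration of how a pipe-separated flags string combines into a single set of regex options, here is a minimal Python sketch. The mapping from Java flag names to Python flags and the function name `flags_from_string` are assumptions made for this example; Elasticsearch itself resolves the names against the constants of `java.util.regex.Pattern`.

```python
import re

# Hypothetical mapping from Java flag names to the closest Python flags.
JAVA_TO_PYTHON_FLAGS = {
    "CASE_INSENSITIVE": re.IGNORECASE,
    "COMMENTS": re.VERBOSE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}

def flags_from_string(flags):
    """Combine a pipe-separated list of flag names into one bitmask."""
    combined = 0
    for name in flags.split("|"):
        if name not in JAVA_TO_PYTHON_FLAGS:
            raise ValueError(f"unknown regex flag: {name!r}")
        combined |= JAVA_TO_PYTHON_FLAGS[name]
    return combined

flags = flags_from_string("CASE_INSENSITIVE|COMMENTS")

# COMMENTS (re.VERBOSE) allows whitespace and comments inside the pattern;
# CASE_INSENSITIVE (re.IGNORECASE) lets "id" also match "ID".
pattern = re.compile(r"""
    id      # literal prefix, any case
    -(\d+)  # dash followed by the number to keep
""", flags)

print(pattern.sub(r"#\1", "see ID-42 and id-7"))  # → see #42 and #7
```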

In this example, we configure the pattern_replace character filter to replace any embedded dashes in numbers with underscores, i.e. 123-456-789 → 123_456_789:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }

The above example produces the following terms:

    [ My, credit, card, is, 123_456_789 ]
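
The substitution itself can be tried outside Elasticsearch. The following is a minimal sketch using Python's `re` module rather than Java's regex engine, so the replacement is spelled `\1_` instead of `$1_`; the function name `replace_dashes` is made up for this example.

```python
import re

PATTERN = r"(\d+)-(?=\d)"  # digits, then a dash that is followed by a digit
REPLACEMENT = r"\1_"       # Python spelling of the "$1_" replacement

def replace_dashes(text):
    """Replace dashes embedded between digits with underscores."""
    return re.sub(PATTERN, REPLACEMENT, text)

print(replace_dashes("My credit card is 123-456-789"))
# → My credit card is 123_456_789

# Dashes that are not between digits are left alone.
print(replace_dashes("well-known 12- 34"))
# → well-known 12- 34
```

The lookahead `(?=\d)` checks for a following digit without consuming it, which is why the digits after each dash are still available as the start of the next match.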
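
Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting, as can be seen in the following example.

This example inserts a space whenever it encounters a lower-case letter followed by an upper-case letter (i.e. fooBarBaz → foo Bar Baz), allowing camelCase words to be queried individually: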

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ],
              "filter": [
                "lowercase"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
              "replacement": " "
            }
          }
        }
      },
      "mappings": {
        "my_doc": {
          "properties": {
            "text": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "The fooBarBaz method"
    }

The above returns the following terms:

    [ the, foo, bar, baz, method ]
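
The analysis chain can be approximated in a few lines of Python. This is only a sketch under stated assumptions: Python's `re` has no `\p{Lower}` or `\p{Upper}` classes, so ASCII ranges are used instead, and `\w+` is a crude stand-in for the standard tokenizer. The function name `analyze` is illustrative.

```python
import re

# ASCII-only approximation of (?<=\p{Lower})(?=\p{Upper}).
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def analyze(text):
    """Mimic char filter -> tokenizer -> lowercase filter."""
    filtered = CAMEL_BOUNDARY.sub(" ", text)  # insert a space at each boundary
    tokens = re.findall(r"\w+", filtered)     # rough stand-in for the tokenizer
    return [token.lower() for token in tokens]

print(analyze("The fooBarBaz method"))
# → ['the', 'foo', 'bar', 'baz', 'method']
```

The pattern matches an empty position between two characters, so the replacement adds a character without removing any. That is what makes the filtered text longer than the original.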
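
Querying for bar will find the document correctly, but highlighting on the result will produce incorrect highlights, because our character filter changed the length of the original text: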

    PUT my_index/my_doc/1?refresh
    {
      "text": "The fooBarBaz method"
    }

    GET my_index/_search
    {
      "query": {
        "match": {
          "text": "bar"
        }
      },
      "highlight": {
        "fields": {
          "text": {}
        }
      }
    }

The output from the above is:

    {
      "timed_out": false,
      "took": $body.took,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2824934,
        "hits": [
          {
            "_index": "my_index",
            "_type": "my_doc",
            "_id": "1",
            "_score": 0.2824934,
            "_source": {
              "text": "The fooBarBaz method"
            },
            "highlight": {
              "text": [
                "The foo<em>Ba</em>rBaz method"
              ]
            }
          }
        ]
      }
    }
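
One way to see why the highlight lands on "Ba" rather than "Bar" is to track how offsets in the filtered text map back to the original. The sketch below is a simplified model and an assumption, not a transcription of Lucene's offset-correction code: it records where each space was inserted and shifts an offset back by the number of insertions at or before it. The function names are made up for this example, and the pattern is an ASCII approximation of the one used above.

```python
import re

CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")  # ASCII approximation

def filter_with_insertions(text):
    """Insert a space at each boundary; return the filtered text and the
    positions (in the filtered text) of the inserted spaces."""
    pieces, inserted_at, last, shift = [], [], 0, 0
    for match in CAMEL_BOUNDARY.finditer(text):
        pieces.append(text[last:match.start()])
        inserted_at.append(match.start() + shift)
        pieces.append(" ")
        shift += 1
        last = match.start()
    pieces.append(text[last:])
    return "".join(pieces), inserted_at

def correct_offset(offset, inserted_at):
    """Assumed rule: subtract every insertion at or before the offset."""
    return offset - sum(1 for pos in inserted_at if pos <= offset)

original = "The fooBarBaz method"
filtered, inserted_at = filter_with_insertions(original)

start = filtered.index("Bar")  # token offsets in the filtered text
end = start + len("Bar")
c_start = correct_offset(start, inserted_at)
c_end = correct_offset(end, inserted_at)

print(filtered, inserted_at)
# → The foo Bar Baz method [7, 11]
print((start, end), "->", (c_start, c_end), repr(original[c_start:c_end]))
# → (8, 11) -> (7, 9) 'Ba'
```

Under this model the start offset is shifted back by one inserted space, but the end offset coincides with the second inserted space and is shifted back by two, so the span covers only two characters of the original text.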