# Pattern Replace Character Filter(正则替换字符)

**Pattern Replace Character Filter**使用正则表达式来匹配应替换为指定替换字符串的字符。替换字符串可以引用正则表达式中的捕获组。

> ### 小心"病态"正则表达式
>
> **Pattern Replace Character Filter**使用[Java正则表达式](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)。
>
> 一个“病态”的正则表达式可能会运行得非常慢，甚至会抛出一个**StackOverflowError**，还会导致其运行的节点突然退出。
>
> 更多信息请查看[pathological regular expressions and how to avoid them](http://www.regular-expressions.info/catastrophic.html)[.](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)

## 配置

**Pattern Replace Character Filter**会接收以下参数：

| 参数名称          | 参数说明                                                                                                                                                                     |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `pattern`     | 一个[Java正则表达式](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). 必填                                                                                 |
| `replacement` | 替换字符串，可以使用$ 1 .. $ 9语法来引用捕获组，可以参考[这里](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-)。 |
| `flags`       | Java正则表达式[标志](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary)。标志应该用“\|”进行分割，例如“CASE\_INSENSITIVE \| COMMENTS”。                      |

在下面例子中，我们使用**Pattern Replace Character Filter**实现用下划线替代任何嵌入的破折号，即123-456-789→123\_456\_789：

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
```

上面案例将返回如下结果：

```
[ My, credit, card, is 123_456_789 ]
```

> 出于搜索目的使用替换字符串，会引起原始文本长度的更改，进而导致不正确的高亮显示。案例如下所示。

这个示例在遇到小写字母后跟大写字母（即fooBarBaz→foo Bar Baz）时插入空格，允许单独查询camelCase字词：

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}
```

示例返回的结果如下：

```
[ the, foo, bar, baz, method ]
```

查询**bar**可以正确找到文档，但突出显示结果将产生不正确的高光，因为我们的字符过滤器更改了原始文本的长度：

```
PUT my_index/my_doc/1?refresh
{
  "text": "The fooBarBaz method"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "bar"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
```

以上的输出结果是：

```
{
  "timed_out": false,
  "took": $body.took,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2824934,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_doc",
        "_id": "1",
        "_score": 0.2824934,
        "_source": {
          "text": "The fooBarBaz method"
        },
        "highlight": {
          "text": [
            "The foo<em>Ba</em>rBaz method"
          ]
        }
      }
    ]
  }
}
```
