Resource Hub

Created 2025-07-07 15:32

Status

Public

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape

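Unicode Character Class Regex Explained

[\pP\pS\pZ] is a regex character class built from three Unicode general-category escapes. It matches any single character that is punctuation, a symbol, or a separator. The spelling [\\pP\\pS\\pZ] doubles each backslash, which is how the pattern has to be written inside a string literal in languages such as Java.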

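What each part means

`\pP` - punctuation (general category P)

Matches every Unicode punctuation character
Includes the period (.), comma (,), semicolon (;), colon (:), question mark (?), and exclamation mark (!), as well as hyphens, brackets, and quotation marks
Examples (full-width forms): ！？。，；：

`\pS` - symbols (general category S)

Matches every Unicode symbol character
Includes math symbols (Sm), currency symbols (Sc), modifier symbols (Sk), and other symbols (So)
Examples: + = < > | ~ ^ $ € ¥ © (note that - @ # % & * are classified as punctuation, not symbols)

`\pZ` - separators (general category Z)

Matches every Unicode separator character
Includes space separators (Zs), the line separator U+2028 (Zl), and the paragraph separator U+2029 (Zp); tab and newline are control characters (Cc) and are not matched
Examples: space, no-break space (U+00A0), ideographic space (U+3000)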
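The category boundaries are easy to get wrong by intuition, so it can help to check individual characters. The following is a minimal Go sketch, assuming the standard `unicode` package's range tables `unicode.P`, `unicode.S`, and `unicode.Z`; the helper name `classify` is hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify reports which of the three general categories (P, S, Z) a rune
// belongs to, or "other" if it belongs to none of them.
func classify(r rune) string {
	switch {
	case unicode.Is(unicode.P, r):
		return "P"
	case unicode.Is(unicode.S, r):
		return "S"
	case unicode.Is(unicode.Z, r):
		return "Z"
	default:
		return "other"
	}
}

func main() {
	samples := []rune{'!', '，', '+', '€', '-', '@', ' ', '\u00a0', '\t', '\n', 'a'}
	for _, r := range samples {
		fmt.Printf("%q -> %s\n", r, classify(r))
	}
}
```

Running it shows, for instance, that the hyphen-minus and the at sign fall under P, while tab and newline fall under none of the three categories.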

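Use cases

This regex is typically used for:

Text cleanup: removing or replacing punctuation, symbols, and separators
Data preprocessing: filtering out most non-alphanumeric characters in natural language processing
Format validation: checking whether text contains special characters
Tokenization: treating punctuation and symbols as delimiters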
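The use cases above can be sketched in Go, whose `regexp` package accepts this pattern as written. This is an illustrative sketch rather than a recommended pipeline; the helper names `clean`, `tokens`, and `hasSpecial` and the sample text are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonWord matches runs of punctuation, symbols, and separators.
var nonWord = regexp.MustCompile(`[\pP\pS\pZ]+`)

// clean collapses every matched run into a single space and trims the ends.
func clean(s string) string {
	return strings.TrimSpace(nonWord.ReplaceAllString(s, " "))
}

// tokens splits s on matched runs and drops empty pieces.
func tokens(s string) []string {
	var out []string
	for _, part := range nonWord.Split(s, -1) {
		if part != "" {
			out = append(out, part)
		}
	}
	return out
}

// hasSpecial reports whether s contains at least one matched character.
func hasSpecial(s string) bool {
	return nonWord.MatchString(s)
}

func main() {
	text := "Price: $5 + tax, 你好，世界！"
	fmt.Println(clean(text))
	fmt.Printf("%q\n", tokens(text))
	fmt.Println(hasSpecial("abc123"), hasSpecial("a+b"))
}
```

Because tab and newline are not in category Z, a real cleanup step would probably add `\s` to the class as well.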

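Example matches

The pattern matches the following characters:

Chinese punctuation: 。，！？；：
English punctuation: . , ! ? ; : -
Math symbols: + × ÷ = ≠
Currency symbols: $ € ¥ £
Separators: space, no-break space, ideographic space (tab and newline are not matched; add \s to the class if they should be)

This combination of Unicode character classes covers most non-alphanumeric characters in multilingual text, which makes it a practical building block for cleanup and tokenization.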

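Using it in JavaScript

JavaScript requires the braced form \p{P}; the shorthand \pP is not supported
The u flag (or the newer v flag) is required to enable Unicode property escapes

const text = "Hello, world! 你好，世界！";
// JavaScript requires the Unicode property escape form \p{...}
const punctuation = text.match(/\p{P}/gu);
console.log(punctuation); // [',', '!', '，', '！']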

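Using it in Go

Matching Unicode punctuation in Go

Go's regexp package accepts both the one-letter shorthand \pP and the braced form \p{P} for matching Unicode punctuation [6]. The package implements RE2 syntax, whose Unicode support is limited to general categories and scripts [2].

Support overview

✅ Supported syntax

\pP - one-letter shorthand for the punctuation category
\p{P} - matches every Unicode punctuation character
\P{P} - matches any character that is not punctuation

❌ Unsupported or non-portable syntax

\p{Punct} - long category names are rejected by older Go releases; use \p{P} for portability
The full Unicode property database [3]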
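Since support differs between spellings and between Go releases, one way to settle a question about a particular form is to try compiling it. The sketch below assumes only `regexp.Compile` from the standard library; the helper `compiles` and the list of candidate patterns are hypothetical, and the results depend on the installed Go release.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the installed Go release accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Candidate spellings to probe; results depend on the Go release in use.
	candidates := []string{
		`\pP`,
		`\p{P}`,
		`\P{P}`,
		`[\pP\pS\pZ]`,
		`\p{Han}`,
		`\p{Punct}`,
		`\p{White_Space}`,
	}
	for _, p := range candidates {
		fmt.Printf("%-16s compiles: %v\n", p, compiles(p))
	}
}
```

Using `regexp.Compile` instead of `regexp.MustCompile` matters here, because the latter panics on a pattern the engine rejects.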

Practical examples

Basic matching

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Match punctuation characters
    re := regexp.MustCompile(`\p{P}`)

    text := "Hello, world! 你好，世界！"
    matches := re.FindAllString(text, -1)
    fmt.Println(matches) // [, ! ， ！]

    // Check whether a one-character string contains punctuation
    fmt.Println(re.MatchString("!"))  // true
    fmt.Println(re.MatchString("a"))  // false
    fmt.Println(re.MatchString("，")) // true
}

Text cleanup

package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := "Hello, world! 你好，世界！"

    // Remove all punctuation
    re := regexp.MustCompile(`\p{P}`)
    clean := re.ReplaceAllString(text, "")
    fmt.Println(clean) // "Hello world 你好世界"

    // Replace each punctuation character with a space
    spaced := re.ReplaceAllString(text, " ")
    fmt.Println(spaced) // "Hello  world  你好 世界 "
}

Using it inside a character class

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Inside a character class: match runs of punctuation or whitespace
    re := regexp.MustCompile(`[\p{P}\s]+`)

    text := "word1, word2! word3?"
    parts := re.Split(text, -1)
    // The trailing "?" leaves an empty final element
    fmt.Println(parts) // [word1 word2 word3 ]
}

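Limitations and caveats

1. Limited Unicode property support

Go's regexp package does not support the full Unicode property database [3]. It supports only:

Unicode general categories (\p{L}, \p{N}, \p{P}, and so on)
Unicode scripts (\p{Latin}, \p{Han}, and so on)
Binary Unicode properties such as \p{White_Space} and \p{ID_Start} are not supported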
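When a binary property is needed, one workaround is to leave the regex engine and use the range tables exported by the `unicode` package. The following sketch assumes `unicode.White_Space` and `strings.FieldsFunc` from the standard library; the helper names `isWhiteSpace` and `splitOnWhiteSpace` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isWhiteSpace reports whether r has the Unicode White_Space property,
// using the range table exported by the unicode package.
func isWhiteSpace(r rune) bool {
	return unicode.Is(unicode.White_Space, r)
}

// splitOnWhiteSpace splits s around runs of White_Space characters.
func splitOnWhiteSpace(s string) []string {
	return strings.FieldsFunc(s, isWhiteSpace)
}

func main() {
	// Tab, ideographic space (U+3000), and newline all act as delimiters.
	fmt.Printf("%q\n", splitOnWhiteSpace("a\tb\u3000c\nd"))
	fmt.Println(isWhiteSpace('\t'), isWhiteSpace('\u00a0'), isWhiteSpace('x'))
}
```

Unlike category Z, the White_Space property does include tab and newline, which is why it is often the better fit for splitting text.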

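2. Alternatives

To test individual runes without a regex, use the unicode package. unicode.IsPunct checks the same general category P that \p{P} matches:

package main

import (
    "fmt"
    "unicode"
)

func main() {
    chars := []rune{'!', 'a', '，', '?'}

    // Report for each rune whether it is in category P
    for _, r := range chars {
        fmt.Printf("'%c' is punct: %v\n", r, unicode.IsPunct(r))
    }
}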
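The same function can also replace the regex in the cleanup example. Here is a minimal sketch that assumes `strings.Map` and `unicode.IsPunct` from the standard library; the helper name `stripPunct` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripPunct removes every rune in Unicode category P from s.
func stripPunct(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1 // a negative return value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	fmt.Println(stripPunct("Hello, world! 你好，世界！"))
}
```

This avoids compiling a pattern at all, which may be preferable when the predicate is a single category check.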

3. Character encoding

Go's regexp package operates on UTF-8 text and handles Chinese punctuation correctly [2]:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`\p{P}`)

    // Chinese punctuation
    chinese := "你好，世界！《标题》"
    matches := re.FindAllString(chinese, -1)
    fmt.Println(matches) // [， ！ 《 》]
}
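One consequence of working on UTF-8 is that match positions are byte offsets, not character positions. The sketch below illustrates this under the assumption that `FindAllStringIndex` and `utf8.RuneCountInString` are used for the conversion; the helper name `punctOffsets` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

var punct = regexp.MustCompile(`\p{P}`)

// punctOffsets returns the [start, end) byte offsets of each punctuation match.
func punctOffsets(s string) [][]int {
	return punct.FindAllStringIndex(s, -1)
}

func main() {
	s := "你好，世界！"
	for _, loc := range punctOffsets(s) {
		// Convert the starting byte offset to a rune (character) position.
		runePos := utf8.RuneCountInString(s[:loc[0]])
		fmt.Printf("%q at bytes %d-%d, rune index %d\n", s[loc[0]:loc[1]], loc[0], loc[1], runePos)
	}
}
```

Each of these characters occupies three bytes in UTF-8, so the full-width comma starts at byte 6 even though it is the third character.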

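Summary

Go's regexp package accepts both \pP and \p{P} for matching Unicode punctuation [6]. For more complex Unicode handling, such as binary properties that the regex engine lacks, combine it with functions from the unicode package such as unicode.IsPunct [1].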

[1] https://www.tutorialspoint.com/check-if-the-rune-is-a-unicode-punctuation-character-or-not-in-golang
[2] https://pkg.go.dev/regexp
[3] https://github.com/golang/go/issues/10851
[4] https://stackoverflow.com/questions/71202611/golang-how-to-replace-string-in-regex-group
[5] https://www.reddit.com/r/golang/comments/d38vbv/better_regex_support_for_go/
[6] https://pkg.go.dev/regexp/syntax
[7] https://cs.lmu.edu/~ray/notes/regex/
[8] https://www.freecodecamp.org/news/what-is-punct-in-regex-how-to-match-all-punctuation-marks-in-regular-expressions/
[9] https://www.regular-expressions.info/unicode.html
[10] https://groups.google.com/g/golang-nuts/c/kJ2Bkp2hilY
[11] https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
[12] https://www.youtube.com/watch?v=sa-TUpSx1JA
[13] https://www.reddit.com/r/golang/comments/da9o55/unicode_regexp_errors/
[14] https://gobyexample.com/regular-expressions
[15] https://docs.pexip.com/admin/regex_reference.htm
[16] https://stackoverflow.com/questions/72334719/matching-multiple-unicode-characters-in-golang-regexp
[17] https://groups.google.com/g/golang-nuts/c/M3lmSUptExQ
[18] https://github.com/golang/go/issues/55884
[19] https://yourbasic.org/golang/regexp-cheat-sheet/
[20] https://www.honeybadger.io/blog/a-definitive-guide-to-regular-expressions-in-go/