Created on 2025-09-24 16:00


Pitfalls and Safe Escaping When Using `new RegExp` in JavaScript

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/escape

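The RegExp.escape() polyfill implementation linked from this page is worth a look.

In day-to-day development we often use regular expressions for string matching. In most cases a regex literal /.../ is enough. But when the pattern has to be built dynamically, you need the new RegExp() constructor, and it hides a number of pitfalls, especially around escaping and security. This article summarizes them systematically.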

1. Basic usage

// Literal form
const r1 = /\d+/;

// Constructor form
const r2 = new RegExp("\\d+");

The two look almost identical, but the constructor's first argument here is a string, so escaping needs extra care.

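2. The double-escaping problem

In a regex literal, \d means a digit. In a string literal, however, "\d" is just "d": the backslash is consumed by the string parser, so the regex engine never sees \d. You therefore have to write "\\d" so that a real backslash reaches the pattern.

A few common examples:

// Match a dot .
/\./
new RegExp("\\.")

// Match a backslash \
/\\/
new RegExp("\\\\")

As you can see, anything involving \ becomes very verbose.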
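One way to reduce the number of backslashes, shown here as a minimal sketch, is String.raw, which keeps backslashes in a template literal exactly as written. The variable names are illustrative only.

```typescript
// In an ordinary string literal, "\d" collapses to "d" before RegExp sees it.
const collapsed = new RegExp("\d+");
// Doubling the backslash lets a real \d reach the pattern.
const doubled = new RegExp("\\d+");
// String.raw leaves backslashes untouched, so the pattern reads like a literal.
const viaRaw = new RegExp(String.raw`\d+`);

console.log(collapsed.source); // d+
console.log(doubled.source);   // \d+
console.log(viaRaw.source);    // \d+
console.log(collapsed.test("123"), doubled.test("123"), viaRaw.test("123")); // false true true
```

String.raw only helps with the string-level escaping; metacharacters in user-supplied text still need regex-level escaping.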

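3. Flags must be passed separately

With literals we are used to writing /\d+/gi; with the constructor, the flags must go in the second argument:

const r = new RegExp("\\d+", "gi");

Do not try to put them inside the pattern string: new RegExp("/\\d+/gi") treats the slashes and the letters g and i as literal characters to match, not as delimiters and flags.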
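A short sketch of the difference, with illustrative variable names: the flags end up on the RegExp only when they are passed as the second argument.

```typescript
// Flags passed separately are applied to the regex.
const withFlags = new RegExp("\\d+", "gi");
console.log(withFlags.flags); // gi

// Slashes and "gi" inside the pattern string are matched literally.
const literalSlashes = new RegExp("/\\d+/gi");
console.log(literalSlashes.flags);           // (empty string)
console.log(literalSlashes.test("abc 123")); // false
console.log(literalSlashes.test("/123/gi")); // true
```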

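4. The risk of dynamic concatenation

The main value of new RegExp() is that the pattern can be assembled from strings at runtime:

const keyword = "hello";
const r = new RegExp(keyword, "i");

This lets you build a regex from user input, but it also introduces a risk:

User input may contain regex metacharacters (such as .* or [a-z]) that are parsed as regex syntax, so the matching logic can be "injected".

For example:

const input = ".*";
const r = new RegExp(input); // effectively matches any string

This is a security hazard: the user, not your code, decides what matches, and malformed input such as "(" makes the constructor throw a SyntaxError.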
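The following sketch illustrates both failure modes. naiveSearch is a hypothetical helper written for this example, not something from a library.

```typescript
// Hypothetical search helper that trusts user input (do not do this).
function naiveSearch(items: string[], userInput: string): string[] {
  const re = new RegExp(userInput, "i");
  return items.filter(item => re.test(item));
}

const files = ["a.b", "axb", "other"];

// The dot is read as "any character", so "axb" is matched too.
console.log(naiveSearch(files, "a.b")); // [ 'a.b', 'axb' ]

// Unbalanced syntax makes the constructor throw.
let threw = false;
try {
  naiveSearch(files, "(");
} catch (err) {
  threw = err instanceof SyntaxError;
}
console.log(threw); // true
```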

5. How to safely escape user input

The best approach is to regex-escape user input so that every special character is treated as an ordinary character:

function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

Usage:

const keyword = "a+b";
const safeKeyword = escapeRegExp(keyword);
const r = new RegExp(safeKeyword, "i");

console.log(r.test("a+b")); // true

This prevents regex injection and makes new RegExp() safer and more reliable.
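Runtimes that already implement RegExp.escape() (the method documented at the MDN link above) can use it directly. The sketch below feature-detects it and falls back to the hand-written version; escapeForRegExp is a hypothetical name, and the cast is there only because older TypeScript lib typings may not declare the method. The native output can differ in form from the fallback (for example, a leading letter may come back as a \x escape), so compare behavior rather than strings.

```typescript
// Hypothetical wrapper: prefer the native RegExp.escape when it exists.
function escapeForRegExp(text: string): string {
  const native = (RegExp as unknown as { escape?: (s: string) => string }).escape;
  if (typeof native === "function") {
    return native(text);
  }
  // Fallback: the same character class used by escapeRegExp above.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const safe = new RegExp(escapeForRegExp("a+b"), "i");
console.log(safe.test("A+B")); // true
console.log(safe.test("aab")); // false
```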

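6. Summary

Prefer literals: for fixed patterns use /.../; it is more readable and needs no double escaping.
Watch out for double escaping: inside a new RegExp() string, every \ has to be written as \\.
Pass flags separately: the second argument takes g, i, m, s, u, y (newer engines also accept d and v).
Escape user input: write an escapeRegExp function to avoid security problems.

In one sentence: 👉 the strength of new RegExp() is dynamic construction; the cost is tedious escaping and the need for extra care about security.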

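1) Basics: safe escaping (escape all metacharacters)

/**
 * Turn arbitrary text into a "literal" regex fragment.
 * Escaped set: . ^ $ * + ? ( ) [ ] { } | and the backslash \
 */
export function escapeRegExp(input: string): string {
  // Inside the [] character class, \ and ] must themselves be escaped
  return input.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

Note: '\\$&' means "put a backslash in front of the whole matched substring (the metacharacter)".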
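As a quick illustration of what $& does in this replacement, here is the same character class applied to a sample string (the sample text is made up for the example):

```typescript
// "$&" stands for the matched character, so "\\$&" emits a backslash followed by it.
const escapedSample = "1+1=2?".replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
console.log(escapedSample); // 1\+1=2\?

// The escaped text now matches only itself.
const sampleRe = new RegExp("^" + escapedSample + "$");
console.log(sampleRe.test("1+1=2?")); // true
console.log(sampleRe.test("11=2"));   // false
```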

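2) Advanced: building a RegExp safely (with options)

export interface SafeRegexOptions {
  /** Whole-word matching (Unicode-friendly). Default: false */
  wholeWord?: boolean;
  /** Require the match to start at the beginning of the string. Default: false */
  startsWith?: boolean;
  /** Require the match to end at the end of the string. Default: false */
  endsWith?: boolean;

  /** Regex flags. Default: 'u' (always including 'u' is recommended for Unicode support) */
  flags?: string;

  /**
   * Whether to deduplicate when several keywords are given. Default: true
   * Only has an effect when inputs is an array
   */
  dedupe?: boolean;

  /**
   * Behavior when the input is an empty string:
   * - 'match-nothing' (default): return /(?!)/, which never matches
   * - 'match-any': return a regex whose source is `.*` with the given flags
   *   (prefixed with ^ when startsWith is set, suffixed with $ when endsWith is set)
   */
  onEmpty?: 'match-nothing' | 'match-any';
}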

/**
 * Build a regex safely:
 * - accepts a single string or several keywords (array)
 * - escapes automatically
 * - optional whole-word / start / end anchoring
 * - Unicode-safe
 */
export function makeSafeRegex(
  inputs: string | string[],
  {
    wholeWord = false,
    startsWith = false,
    endsWith = false,
    flags = 'u',
    dedupe = true,
    onEmpty = 'match-nothing',
  }: SafeRegexOptions = {}
): RegExp {
  const arr = (Array.isArray(inputs) ? inputs : [inputs]).map(String);

  // Deduplicate (optional) and drop empty strings
  const normalized = dedupe ? Array.from(new Set(arr)) : arr;
  const nonEmpty = normalized.filter(s => s.length > 0);

  // Handle empty input
  if (nonEmpty.length === 0) {
    if (onEmpty === 'match-any') {
      const body = `${startsWith ? '^' : ''}.*${endsWith ? '$' : ''}`;
      return new RegExp(body, flags);
    }
    // Empty negative lookahead: never matches.
    // ('(?!.)' would still match wherever no character follows, e.g. at the end of the string.)
    return new RegExp('(?!)');
  }

  // Escape each keyword
  const escaped = nonEmpty.map(escapeRegExp);

  // Several keywords: join with alternation (|) inside a non-capturing group
  let core =
    escaped.length === 1 ? escaped[0] : `(?:${escaped.join('|')})`;

  // Whole-word matching (Unicode-friendly). \b is unreliable for non-ASCII text, so build our own "word boundary":
  // word characters are \p{L}\p{N}_ (letters, digits, underscore), which requires the 'u' flag.
  if (wholeWord) {
    const WB = '[\\p{L}\\p{N}_]';
    core = `(?<!${WB})${core}(?!${WB})`;
    // Make sure \p{...} is understood; 'u' and 'v' cannot be combined, so leave 'v' alone
    if (!flags.includes('u') && !flags.includes('v')) flags += 'u';
  }

  if (startsWith) core = `^${core}`;
  if (endsWith) core = `${core}$`;

  return new RegExp(core, flags);
}

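Why not `\b`?

The "word boundary" of \b is defined in terms of \w, which is ASCII-based ([A-Za-z0-9_]), so it is unreliable for Chinese, Japanese, and similar text.
The implementation above defines its own boundary with (?<![\p{L}\p{N}_]) ... (?![\p{L}\p{N}_]), which is more general (it requires the u flag).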
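A small side-by-side sketch of the two boundary styles. The patterns are built with the constructor, and the variable names are illustrative.

```typescript
// \b needs a \w character on one side, and CJK characters are not \w.
const withB = new RegExp("\\b阿里\\b", "u");
console.log(withB.test("阿里 是一个词")); // false

// Custom boundary: no letter, digit, or underscore directly before or after.
const WORD_CHAR = "[\\p{L}\\p{N}_]";
const custom = new RegExp(`(?<!${WORD_CHAR})阿里(?!${WORD_CHAR})`, "u");
console.log(custom.test("阿里 是一个词")); // true
console.log(custom.test("阿里云"));        // false
```

Lookbehind and Unicode property escapes need an ES2018-capable engine.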

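3) Replacements need "escaping" too: keep `$1` and friends from being treated as group references

/**
 * Turn a replacement string into a literal replacement:
 * in String.prototype.replace, $ has special meaning ($&, $1, ...),
 * so every $ has to become $$ to avoid being treated as a group reference.
 */
export function escapeReplacement(repl: string): string {
  // The second argument is itself a replacement pattern in which '$$' yields one '$',
  // so four dollar signs are needed to emit the two-character sequence '$$'.
  return repl.replace(/\$/g, '$$$$');
}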
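The sketch below shows why the dollar signs have to be doubled, using a hypothetical helper named literalReplacement, and also shows a replacer function as an alternative that sidesteps $ handling altogether.

```typescript
// Hypothetical helper: double every "$" so it survives as a literal in a replacement string.
function literalReplacement(repl: string): string {
  // "$$$$" is itself a replacement pattern: each "$$" pair emits one "$".
  return repl.replace(/\$/g, "$$$$");
}

const userText = "cost: $&";

// Unescaped: "$&" expands to the matched text.
console.log("x".replace(/x/, userText));                     // cost: x
// Escaped: the "$&" is kept verbatim.
console.log("x".replace(/x/, literalReplacement(userText))); // cost: $&
// A replacer function bypasses "$" handling entirely.
console.log("x".replace(/x/, () => userText));               // cost: $&
```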

4) Practical examples

4.1 Search box: treat user input as a "literal" search (case-insensitive)

const input = 'a+b(c)'; // from the user
const re = makeSafeRegex(input, { flags: 'iu' }); // i: ignore case, u: Unicode support
console.log(re); // /a\+b\(c\)/iu
console.log(re.test('xx A+B(C) yy')); // true

4.2 Keyword highlighting (multiple keywords, all occurrences)

const keywords = ['C++', '[draft]', '中文'];
const re = makeSafeRegex(keywords, { flags: 'igu' }); // global, so every occurrence is highlighted

const text = '学 C++ 的人看过 [draft] 吗？中文也要高亮';
const highlighted = text.replace(re, m => `<mark>${m}</mark>`);
console.log(highlighted);

4.3 Whole-word matching (Unicode-friendly)

const reWord = makeSafeRegex('阿里', { wholeWord: true, flags: 'gu' });
console.log('阿里云'.match(reWord));     // null (not a whole word)
console.log('阿里 是一个词'.match(reWord)); // matches "阿里"

4.4 Match only at the start / end

const reStart = makeSafeRegex('hello', { startsWith: true, flags: 'u' });
console.log(reStart.test('hello world')); // true
console.log(reStart.test('a hello'));     // false

const reEnd = makeSafeRegex('.mp4', { endsWith: true, flags: 'u' });
console.log(reEnd.test('video.mp4'));     // true
console.log(reEnd.test('video.mp4.bak')); // false

4.5 Filtering file names in bulk (several suffixes, OR)

const reExt = makeSafeRegex(['.png', '.jpg', '.jpeg', '.webp'], { endsWith: true, flags: 'iu' });
console.log(['a.PNG', 'b.txt', 'c.webp?x=1'].filter(n => reExt.test(n)));
// => ['a.PNG']; note that endsWith does not match the '?x=1' case

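To filter URL suffixes that may carry a query string, do not pass the extra syntax through makeSafeRegex, because it would be escaped like any other input. Instead, escape each suffix with escapeRegExp, join the results with |, and append the optional query part (?:\?.*)?$ yourself, or assemble a more complex pattern as needed.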
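A minimal sketch of such a hand-assembled pattern follows. Only the suffixes are escaped; the optional query part is appended as real regex syntax. The names exts, escapeSuffix, and reExtWithQuery are illustrative, and URL fragments (#...) are deliberately not handled.

```typescript
// Same escaping idea as escapeRegExp in this article, inlined to keep the sketch self-contained.
const escapeSuffix = (s: string): string => s.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");

const exts = [".png", ".jpg", ".jpeg", ".webp"];

// (?:\.png|\.jpg|...) followed by an optional "?query", anchored at the end.
const reExtWithQuery = new RegExp(
  `(?:${exts.map(escapeSuffix).join("|")})(?:\\?.*)?$`,
  "iu"
);

const names = ["a.PNG", "b.txt", "c.webp?x=1", "d.png.bak"];
console.log(names.filter(n => reExtWithQuery.test(n))); // [ 'a.PNG', 'c.webp?x=1' ]
```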

4.6 Safe replacement (keep the user's text verbatim)

const rawFind = '(price)$';
const rawRepl = '$99'; // we want the literal text "$99"

const re = makeSafeRegex(rawFind, { flags: 'gu' });
const replaced = 'final (price)$ here'.replace(re, escapeReplacement(rawRepl));
console.log(replaced); // final $99 here

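4.7 Strategy for empty input

// Default: empty input matches nothing
makeSafeRegex('', {});           // a regex that never matches

// Or choose "match anything" (e.g. show all results after the user clears the search box)
makeSafeRegex('', { onEmpty: 'match-any' }); // /.*/u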
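To see why an explicit strategy matters, the sketch below compares an empty pattern, a never-matching pattern, and a look-alike that still matches at the end of the string. Variable names are illustrative.

```typescript
// An empty pattern is valid and matches the empty string at every position.
const emptyPattern = new RegExp("");
console.log(emptyPattern.source);                     // (?:)
console.log("abc".replace(new RegExp("", "g"), "-")); // -a-b-c-

// An empty negative lookahead can never succeed, so this never matches.
const never = new RegExp("(?!)");
console.log(never.test(""), never.test("abc"));       // false false

// "(?!.)" looks similar but still matches where no character follows, e.g. at the end.
const atEnd = new RegExp("(?!.)");
console.log(atEnd.test("abc"));                       // true
```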

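4.8 Relationship to the double escaping of `new RegExp()`

The helper functions above return a RegExp instance directly, so you do not have to worry about double escaping such as new RegExp("\\d+").
If you only need the escaped pattern string (for example to store in a database), use escapeRegExp(); later, read it back and call new RegExp(escaped, flags).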

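5) Quick reference of common pitfalls

\b is unreliable: it can fail on non-ASCII languages (Chinese, etc.); use the wholeWord option above (requires u).
Case-insensitive i + Unicode: case folding for some characters depends on u, so adding u by default is recommended.
Multiple keywords: use (?:a|b|c), and remember to escape each keyword before joining.
$ in replacements: "$1" and "$&" both have special meaning; for literal replacement use escapeReplacement.
Empty input: choose an explicit strategy (match nothing / match anything); do not let new RegExp('') silently match at every position and cause performance or logic problems.
Performance: for large keyword lists, consider:
- bucketing keywords by first letter and building several regexes;
- or, on Node / the backend, using Aho-Corasick or a trie for multi-pattern matching;
- or, for very long texts, a two-stage "coarse first, then fine" filter.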
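The two-stage idea from the last bullet can be sketched as follows. This is an illustration under assumptions, not a benchmarked recipe: twoStageFilter and escapeKeyword are hypothetical names, matching is case-sensitive so that the cheap substring check is a true superset of the regex matches, and whether the coarse pass actually saves time depends on the data and should be measured.

```typescript
// Same escaping idea as escapeRegExp in this article, inlined to keep the sketch self-contained.
const escapeKeyword = (s: string): string => s.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");

function twoStageFilter(lines: string[], keywords: string[]): string[] {
  const words = keywords.filter(k => k.length > 0);
  if (words.length === 0) return [];

  const WB = "[\\p{L}\\p{N}_]";
  const fine = new RegExp(
    `(?<!${WB})(?:${words.map(escapeKeyword).join("|")})(?!${WB})`,
    "u"
  );

  return lines.filter(line =>
    // Stage 1 (coarse): plain substring check; every regex match implies this is true.
    words.some(k => line.includes(k)) &&
    // Stage 2 (fine): the stricter whole-word regex runs only on surviving lines.
    fine.test(line)
  );
}

const sampleLines = ["learn C++ today", "C++11 notes", "plain text"];
console.log(twoStageFilter(sampleLines, ["C++"])); // [ 'learn C++ today' ]
```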
