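Regular Expressions: The Swiss Army Knife of Text Processing

A complete guide to regular expressions in Go: syntax, matching, capture groups, performance tuning, and practical examples

Leeting Yan 2021-02-27 8 min read 3614 characters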

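Whether you need to validate an email address, pull timestamps out of logs, redact sensitive data, or parse a complex configuration file, regular expressions (regex for short) are your Swiss Army knife.

A regular expression is a powerful tool for matching text patterns, and nearly every mainstream programming language supports it. Go's regexp package implements the RE2 syntax, a safe and efficient variant that leaves out a few features found in other engines.

This article walks through Go's regular expressions from basic syntax to more advanced techniques.

What Is a Regular Expression?

A regular expression is a string that describes a text pattern. For example:

\d{3}-\d{4} matches phone numbers such as "123-4567"
[a-z]+@[a-z]+\.[a-z]+ matches email addresses such as "user@example.com"
^\d{4}-\d{2}-\d{2}$ matches dates such as "2021-02-27"

Regular expressions can look cryptic at first, but once you know the syntax rules they turn out to be quite logical.

Your First Regular Expression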

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Compile the regular expression
	re := regexp.MustCompile(`\d+`)

	// Check whether the input contains a match
	fmt.Println(re.MatchString("Hello 123 World"))   // true
	fmt.Println(re.MatchString("Hello World"))        // false

	// Find the first match
	fmt.Println(re.FindString("Hello 123 World 456")) // "123"

	// Find all matches
	fmt.Println(re.FindAllString("Hello 123 World 456", -1))
	// [123 456]
}

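⚠️ Note: the patterns in this article are written as raw string literals wrapped in backticks (`), so backslashes need no escaping. Inside a double-quoted string you would have to write "\\d+" instead.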
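To make the escaping rule concrete, here is a minimal sketch showing that a raw string literal and a double-quoted literal with a doubled backslash describe the same pattern. The helper name samePattern is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// samePattern reports whether two pattern strings compile to expressions
// with the same source text. Hypothetical helper, used only for this example.
func samePattern(a, b string) bool {
	return regexp.MustCompile(a).String() == regexp.MustCompile(b).String()
}

func main() {
	raw := `\d+`     // raw string literal: the backslash is kept as written
	quoted := "\\d+" // interpreted literal: the backslash must be doubled

	fmt.Println(raw == quoted)            // true
	fmt.Println(samePattern(raw, quoted)) // true
}
```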

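Compiled vs. One-Off Matching

Go offers two ways to run a regular expression:

1. Precompile (recommended)

re := regexp.MustCompile(`pattern`)  // panics if the pattern is invalid
// or
re, err := regexp.Compile(`pattern`)  // returns an error instead

A compiled regular expression can be reused across calls, so the cost of parsing the pattern is paid only once.

2. One-off use

matched, _ := regexp.MatchString(`pattern`, "text")

This form compiles the pattern on every call, so it only suits patterns that are used once.

💡 Best practice: precompile frequently used patterns into package-level variables: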

var emailRE = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

func IsValidEmail(email string) bool {
	return emailRE.MatchString(email)
}
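MustCompile suits patterns written in the source code, where an invalid pattern is a programming error. For patterns that arrive at run time (from a config file or a user, say), Compile lets you handle the error instead of panicking. A minimal sketch, assuming a hypothetical helper named matchUserPattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchUserPattern compiles a pattern received at run time and reports
// whether text contains a match. The name and signature are illustrative.
func matchUserPattern(pattern, text string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re.MatchString(text), nil
}

func main() {
	ok, err := matchUserPattern(`\d+`, "order 42")
	fmt.Println(ok, err) // true <nil>

	// The unclosed parenthesis makes this pattern invalid.
	_, err = matchUserPattern(`(\d+`, "order 42")
	fmt.Println(err != nil) // true
}
```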

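Regular Expression Syntax Cheat Sheet

Basic Matching

Syntax	Description	Example
`abc`	Matches the literal string	"abc"
`.`	Matches any character except a newline	`a.c` matches "abc", "axc"
`\d`	Matches a digit [0-9]	`\d{3}` matches "123"
`\D`	Matches a non-digit	`\D+` matches "abc"
`\w`	Matches a word character [a-zA-Z0-9_]	`\w+` matches "user_1"
`\W`	Matches a non-word character	`\W+` matches "@#$"
`\s`	Matches a whitespace character	`\s+` matches " \t\n"
`\S`	Matches a non-whitespace character	`\S+` matches "text"

Quantifiers

Syntax	Description	Example
`*`	Zero or more times	`a*` matches "", "a", "aaa"
`+`	One or more times	`a+` matches "a", "aaa"
`?`	Zero or one time	`a?` matches "", "a"
`{n}`	Exactly n times	`\d{4}` matches "2021"
`{n,}`	At least n times	`\d{2,}` matches "12", "1234"
`{n,m}`	Between n and m times	`\d{2,4}` matches "12", "123", "1234"

Character Classes

Syntax	Description	Example
`[abc]`	Matches a, b, or c	`[aeiou]` matches a vowel
`[^abc]`	Matches any character other than a, b, or c	`[^0-9]` matches a non-digit
`[a-z]`	Matches any character from a to z	`[A-Za-z]` matches a letter

Boundaries

Syntax	Description	Example
`^`	Start of the text (start of each line with the `(?m)` flag)	`^Hello` matches "Hello world"
`$`	End of the text (end of each line with the `(?m)` flag)	`world$` matches "Hello world"
`\b`	ASCII word boundary	`\bword\b` matches a standalone "word"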

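Grouping and References

Syntax	Description	Example
`(abc)`	Capture group	`(\d+)` captures the digits
`(?:abc)`	Non-capturing group	`(?:abc)+` groups without capturing
`\1`	Backreference to the first capture group (PCRE only; Go's regexp rejects it)	`(\w+)\s+\1` matches "hello hello" in PCRE
`(?P<name>abc)`	Named capture group	`(?P<year>\d{4})`

Alternation

Syntax	Description	Example
`abc|def`	Matches abc or def	`cat|dog`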
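Alternation has the lowest precedence of all operators, which matters as soon as anchors are involved. The following sketch (helper names isPetLoose and isPetStrict are hypothetical) shows why the alternatives usually need to be wrapped in a group:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Parsed as (^cat) or (dog$): each anchor applies to one branch only.
	looseRE = regexp.MustCompile(`^cat|dog$`)
	// The non-capturing group makes both anchors apply to both branches.
	strictRE = regexp.MustCompile(`^(?:cat|dog)$`)
)

func isPetLoose(s string) bool  { return looseRE.MatchString(s) }
func isPetStrict(s string) bool { return strictRE.MatchString(s) }

func main() {
	for _, s := range []string{"cat", "dog", "catalog", "hotdog"} {
		fmt.Println(s, isPetLoose(s), isPetStrict(s))
	}
	// cat true true
	// dog true true
	// catalog true false
	// hotdog true false
}
```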

Common Operations

Match: Check for a Match

re := regexp.MustCompile(`^\d+$`)

fmt.Println(re.MatchString("12345"))  // true
fmt.Println(re.MatchString("123a5"))  // false

// Byte slices work too
fmt.Println(re.Match([]byte("12345")))  // true

Find: Locate Matches

re := regexp.MustCompile(`\d+`)

// Find the first match
fmt.Println(re.FindString("abc 123 def 456"))  // "123"

// Find all matches (-1 means no limit)
fmt.Println(re.FindAllString("abc 123 def 456", -1))
// [123 456]

// Limit the number of matches
fmt.Println(re.FindAllString("abc 123 def 456", 1))
// [123]

// Find the index range of the match
fmt.Println(re.FindStringIndex("abc 123"))
// [4 7] (start index and end index of the match; the end is exclusive)

Replace: Substitute Matches

re := regexp.MustCompile(`\d+`)

// Replace with a fixed string
fmt.Println(re.ReplaceAllString("abc 123 def 456", "X"))
// "abc X def X"

// Replace using a function (this snippet also needs the strconv package)
result := re.ReplaceAllStringFunc("abc 123 def 456", func(match string) string {
	n, _ := strconv.Atoi(match)
	return strconv.Itoa(n * 2)
})
fmt.Println(result)  // "abc 246 def 912"
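ReplaceAllString can also refer to capture groups in the replacement text, by number or by name. The sketch below is illustrative (toSlashDate and addUnit are invented names) and highlights one detail worth knowing: write `${1}` rather than `$1` when the reference is followed by letters or digits.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	dateRE   = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)
	numberRE = regexp.MustCompile(`(\d+)`)
)

// toSlashDate rewrites YYYY-MM-DD dates as DD/MM/YYYY using named groups.
func toSlashDate(text string) string {
	return dateRE.ReplaceAllString(text, "${day}/${month}/${year}")
}

// addUnit appends "px" to every number. The braces matter: "$1px" would be
// read as a reference to a group named "1px", which does not exist.
func addUnit(text string) string {
	return numberRE.ReplaceAllString(text, "${1}px")
}

func main() {
	fmt.Println(toSlashDate("released 2021-02-27")) // released 27/02/2021
	fmt.Println(addUnit("width 120"))               // width 120px

	// Without braces the reference expands to an empty string.
	fmt.Println(numberRE.ReplaceAllString("width 120", "$1px") == "width ") // true
}
```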

Split: Split on a Pattern

re := regexp.MustCompile(`[\s,]+`)

fmt.Println(re.Split("a,b, c  d", -1))
// [a b c d]

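Capture Groups

Capture groups let you extract specific parts of a match.

Basic Capture Groups

re := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)

// FindStringSubmatch returns the whole match followed by each capture group
// (it returns nil when nothing matches, so check before indexing)
match := re.FindStringSubmatch("Today's date is 2021-02-27.")
fmt.Println(match)
// [2021-02-27 2021 02 27]

// match[0] is the whole match
// match[1] is the first capture group (year)
// match[2] is the second capture group (month)
// match[3] is the third capture group (day)

year, month, day := match[1], match[2], match[3]
fmt.Printf("Year: %s, Month: %s, Day: %s\n", year, month, day)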

Named Capture Groups

Named capture groups make the code easier to read:

re := regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)

match := re.FindStringSubmatch("2021-02-27")

// Pair each group name with the text it captured
// (index 0 is the whole match; unnamed groups have an empty name)
names := re.SubexpNames()
result := make(map[string]string)

for i, name := range names {
	if i != 0 && name != "" {
		result[name] = match[i]
	}
}

fmt.Println(result)
// map[day:27 month:02 year:2021]

fmt.Println("Year:", result["year"])
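When only one or two named groups are needed, building a map is optional. Go 1.15 and later provide SubexpIndex, which returns the index of a named group directly. A minimal sketch, assuming a hypothetical helper called yearOf:

```go
package main

import (
	"fmt"
	"regexp"
)

var dateRE = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)

// yearOf returns the year of the first date found in s, or "" if s
// contains no date. Illustrative helper; requires Go 1.15 or later.
func yearOf(s string) string {
	match := dateRE.FindStringSubmatch(s)
	if match == nil {
		return "" // no match: indexing match would panic
	}
	return match[dateRE.SubexpIndex("year")]
}

func main() {
	fmt.Println(yearOf("published on 2021-02-27")) // 2021
	fmt.Println(yearOf("no date here") == "")      // true
}
```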

Finding All Capture Groups

re := regexp.MustCompile(`(\w+)=(\d+)`)

text := "a=1 b=2 c=3"
matches := re.FindAllStringSubmatch(text, -1)

for _, match := range matches {
	fmt.Printf("%s = %s\n", match[1], match[2])
}
// a = 1
// b = 2
// c = 3

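Performance Optimization

1. Precompile Regular Expressions

// ❌ Bad: the pattern is recompiled on every call
func IsValidEmail(email string) bool {
	re := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
	return re.MatchString(email)
}

// ✅ Good: the pattern is compiled only once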
var emailRE = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

func IsValidEmail(email string) bool {
	return emailRE.MatchString(email)
}
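One way to see the difference on your own machine is a quick measurement with testing.Benchmark, which runs a benchmark function outside of go test. This is only a rough sketch: the function names are invented, and the absolute numbers depend on hardware and Go version, so no output is shown.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

const emailPattern = `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`

var emailRE = regexp.MustCompile(emailPattern)

// isValidCompiledEachCall recompiles the pattern on every call.
func isValidCompiledEachCall(email string) bool {
	return regexp.MustCompile(emailPattern).MatchString(email)
}

// isValidPrecompiled reuses the package-level compiled pattern.
func isValidPrecompiled(email string) bool {
	return emailRE.MatchString(email)
}

func main() {
	slow := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			isValidCompiledEachCall("user@example.com")
		}
	})
	fast := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			isValidPrecompiled("user@example.com")
		}
	})
	fmt.Println("compile on each call:", slow)
	fmt.Println("precompiled:         ", fast)
}
```

Both functions return the same answers; only the time per call differs.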

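2. Use Find Instead of Match Followed by Find

If you need the matched text or its position, call Find directly instead of calling Match first and Find second:

// ❌ Bad: the input is scanned twice
if re.MatchString(s) {
	result := re.FindString(s)
}

// ✅ Good: the input is scanned once
result := re.FindString(s)
if result != "" {
	// Found (this check assumes the pattern cannot match the empty string)
}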
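FindString returns "" both when nothing matches and when the pattern matches an empty string. If a pattern can match empty text, FindStringIndex removes the ambiguity because it returns nil only when there is no match. A sketch, with firstMatch as an invented helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	optionalDigitsRE = regexp.MustCompile(`\d*`) // can match the empty string
	digitsRE         = regexp.MustCompile(`\d+`) // needs at least one digit
)

// firstMatch returns the first match of re in s and whether re matched at all.
func firstMatch(re *regexp.Regexp, s string) (string, bool) {
	loc := re.FindStringIndex(s)
	if loc == nil {
		return "", false
	}
	return s[loc[0]:loc[1]], true
}

func main() {
	m, ok := firstMatch(optionalDigitsRE, "abc")
	fmt.Printf("%q %v\n", m, ok) // "" true: an empty match at position 0

	m, ok = firstMatch(digitsRE, "abc")
	fmt.Printf("%q %v\n", m, ok) // "" false: no match at all
}
```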

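3. Don't Overuse Regular Expressions

Not every text-processing task needs a regular expression:

// ❌ Checking a prefix with a regular expression
re := regexp.MustCompile(`^Hello`)
re.MatchString(s)

// ✅ Use strings.HasPrefix directly
strings.HasPrefix(s, "Hello")

For fixed strings, the functions in the strings package are much faster than regular expressions, so prefer strings whenever it can do the job.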

In Practice: A Text-Processing Toolkit

Let's use regular expressions to build a few practical text-processing helpers:

1. Email Validation

var emailRE = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

func IsValidEmail(email string) bool {
	return emailRE.MatchString(email)
}

// Test
fmt.Println(IsValidEmail("user@example.com"))     // true
fmt.Println(IsValidEmail("invalid@.com"))         // false
fmt.Println(IsValidEmail("@example.com"))         // false

2. Extracting URLs

var urlRE = regexp.MustCompile(`https?://[^\s"'>]+`)

func ExtractURLs(text string) []string {
	return urlRE.FindAllString(text, -1)
}

text := `Visit our site https://example.com or our blog http://blog.example.com`
fmt.Println(ExtractURLs(text))
// [https://example.com http://blog.example.com]

3. Parsing Logs

var logRE = regexp.MustCompile(`\[(?P<level>\w+)\]\s+(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<message>.+)`)

type LogEntry struct {
	Level   string
	Time    string
	Message string
}

func ParseLog(line string) (*LogEntry, error) {
	match := logRE.FindStringSubmatch(line)
	if match == nil {
		return nil, fmt.Errorf("invalid log format")
	}

	result := make(map[string]string)
	for i, name := range logRE.SubexpNames() {
		if i != 0 && name != "" {
			result[name] = match[i]
		}
	}

	return &LogEntry{
		Level:   result["level"],
		Time:    result["time"],
		Message: result["message"],
	}, nil
}

// Test
entry, _ := ParseLog("[INFO] 2021-02-27 16:05:00 Server started")
fmt.Printf("%+v\n", entry)
// &{Level:INFO Time:2021-02-27 16:05:00 Message:Server started}

4. Masking Sensitive Information

var (
	// \b stops the phone pattern from matching the first 11 digits of a
	// longer digit run, such as an 18-digit ID number
	phoneRE = regexp.MustCompile(`\b(\d{3})\d{4}(\d{4})\b`)
	emailRE = regexp.MustCompile(`([a-zA-Z0-9._%+-]{2})[a-zA-Z0-9._%+-]*(@[a-zA-Z0-9.-]+)`)
	idRE    = regexp.MustCompile(`\b(\d{6})\d{8}(\d{4})\b`)
)

func MaskSensitiveInfo(text string) string {
	// Phone number: 13812345678 → 138****5678
	text = phoneRE.ReplaceAllString(text, "${1}****${2}")

	// Email: zhangsan@example.com → zh***@example.com
	text = emailRE.ReplaceAllString(text, "${1}***${2}")

	// ID number: 110101199001011234 → 110101********1234
	text = idRE.ReplaceAllString(text, "${1}********${2}")

	return text
}

text := "Contact: Zhang San, phone: 13812345678, email: zhangsan@example.com, ID number: 110101199001011234"
fmt.Println(MaskSensitiveInfo(text))
// Contact: Zhang San, phone: 138****5678, email: zh***@example.com, ID number: 110101********1234

5. Template Variable Substitution

var templateVarRE = regexp.MustCompile(`\{\{\s*(\w+)\s*\}\}`)

func RenderTemplate(template string, vars map[string]string) string {
	return templateVarRE.ReplaceAllStringFunc(template, func(match string) string {
		// Extract the variable name
		submatch := templateVarRE.FindStringSubmatch(match)
		name := submatch[1]

		if value, ok := vars[name]; ok {
			return value
		}
		return match  // keep the placeholder as-is if the variable is unknown
	})
}

template := "Hello, {{ name }}! Your order {{ order_id }} has shipped."
vars := map[string]string{
	"name":     "Zhang San",
	"order_id": "ORD-12345",
}

fmt.Println(RenderTemplate(template, vars))
// Hello, Zhang San! Your order ORD-12345 has shipped.

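RE2 vs PCRE

Go's regexp package follows the RE2 design and syntax, which differs from traditional PCRE (Perl Compatible Regular Expressions) in several ways:

Features RE2 Does Not Support

Backreferences: \1, \2, and so on
Lookaround: (?=...), (?!...), (?<=...), (?<!...)
Conditionals: (?(...)...)
Backtracking control: (*PRUNE), (*SKIP), and so on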

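Why RE2?

RE2 is designed to guarantee matching time that is linear in the length of the input: for a given pattern, matching takes O(n) time no matter what the input looks like. Backtracking engines such as PCRE can take exponential time on certain combinations of pattern and input.

This means regular expressions in Go are not exposed to the catastrophic backtracking behind ReDoS (regular expression denial of service) attacks, an important safety property in production. Very large inputs or very large patterns still cost time and memory in proportion to their size, so limiting the size of untrusted input remains worthwhile.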

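Common Pitfalls

1. Greedy vs. Non-Greedy

Quantifiers are greedy by default and match as much as possible:

re := regexp.MustCompile(`<.*>`)
fmt.Println(re.FindString("<a> <b> <c>"))
// <a> <b> <c> (greedy: matches from the first < to the last >)

// Add ? to make the quantifier non-greedy
re = regexp.MustCompile(`<.*?>`)
fmt.Println(re.FindAllString("<a> <b> <c>", -1))
// [<a> <b> <c>] (non-greedy: each tag is matched separately)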

2. `.` Does Not Match Newlines

re := regexp.MustCompile(`a.b`)
fmt.Println(re.MatchString("a\nb"))  // false

// Use the (?s) flag to let . match newlines
re = regexp.MustCompile(`(?s)a.b`)
fmt.Println(re.MatchString("a\nb"))  // true

3. Character Class Ranges

// ❌ Not what you might expect
re := regexp.MustCompile(`[A-z]+`)  // also includes the characters [ \ ] ^ _ and the backtick!

// ✅ Correct
re = regexp.MustCompile(`[A-Za-z]+`)
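The reason is that a range follows character codes: in ASCII, six punctuation characters sit between Z (90) and a (97). A short sketch that makes the difference visible (the variable names are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	sloppyRE  = regexp.MustCompile(`^[A-z]+$`)    // spans codes 65 to 122, punctuation included
	lettersRE = regexp.MustCompile(`^[A-Za-z]+$`) // letters only
)

func main() {
	fmt.Println(sloppyRE.MatchString("snake_case"))  // true: _ (95) lies between Z and a
	fmt.Println(lettersRE.MatchString("snake_case")) // false
	fmt.Println(lettersRE.MatchString("CamelCase"))  // true
}
```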

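Summary

This article covered Go's regular expressions end to end:

Basic syntax: literals, character classes, quantifiers, boundaries
Compilation: MustCompile, Compile, MatchString
Common operations: Match, Find, Replace, Split
Capture groups: basic and named capture groups
Performance: precompiling, not overusing regular expressions
RE2 characteristics: linear time, no backreferences or lookaround
Common pitfalls: greedy vs. non-greedy, . not matching newlines, character class ranges

Regular expressions are a powerful tool, but use them with care. The strings package is enough for simple problems; save regular expressions for complex pattern matching.

Practice Time

Phone number validation: write a function that validates mainland China mobile numbers (11 digits, starting with 1)
URL parsing: extract the scheme, host, path, and query parameters from a URL
Markdown processing: convert Markdown links of the form [text](url) to HTML
CSV parsing: parse CSV data with regular expressions (watch out for quote escaping)
IPv4 validation: check whether an IPv4 address is valid (each part must be between 0 and 255)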

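Series Wrap-Up

Congratulations! You have now finished all 10 articles in the intermediate Go primer series! 🎉

Here is a recap of what the series covered:

Goroutine: Go's lightweight unit of concurrency
Channel: the communication pipe between goroutines
sync package: traditional synchronization tools (locks, wait groups, and so on)
Context: timeouts, cancellation, and value passing
File I/O: reading and writing files and working with directories
JSON handling: serialization and deserialization
HTTP programming: building web clients and servers
Testing: unit tests, benchmarks, and test coverage
Go Modules: modern dependency management
Regular expressions: powerful text pattern matching

These topics cover the most practical parts of Go. With them under your belt, you can build real Go applications.

Go's appeal lies in its simplicity: solving the most problems with the fewest language features. I hope this series helps you understand and use Go better.

Keep learning, keep practicing, and the future of Go is yours! 🚀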
