Learning Java Regular Expressions (3)

2012-09-02


8. Capturing Groups

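A capturing group is a way to treat multiple characters as a single unit. A group is created by placing the characters to be grouped inside a pair of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g". The portion of the input string that matches the capturing group is saved in memory for later recall via backreferences.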

8.1 Numbering

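As described in the Pattern API, capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups: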

1. ((A)(B(C)))

2. (A)

3. (B(C))

4. (C)

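To find out how many groups are present in an expression, call the groupCount method on a Matcher object. The groupCount method returns an int giving the number of capturing groups in the matcher's pattern. For the expression above, groupCount returns 4, indicating that the pattern contains four capturing groups.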

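There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure non-capturing groups: they capture no text and do not count towards the group total. (Named groups of the form (?<name>X), available since Java 7, are the exception: they do capture and are counted.)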
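As a minimal sketch of the counting rules described above, the following example checks the number of capturing groups in two patterns. The class name GroupCountDemo and the helper countGroups are hypothetical names chosen for illustration; only Pattern.compile, Pattern.matcher, and Matcher.groupCount come from the standard library.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups in the regex; group 0 is not counted.
    static int countGroups(String regex) {
        // groupCount() depends only on the pattern, so an empty input suffices.
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("((A)(B(C)))")); // prints 4
        System.out.println(countGroups("(?:A)(B)"));    // prints 1: (?:A) is non-capturing
    }
}
```

Note that no match has to be attempted before calling groupCount, since the count is a property of the pattern rather than of any particular input.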

Several Matcher methods accept an int argument specifying a particular group number, so it is important to understand how groups are numbered.

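public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.

public int end(int group): Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.

public String group(int group): Returns the input subsequence captured by the given group during the previous match operation.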
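The three methods above can be tried together on the ((A)(B(C))) pattern from section 8.1. This is only an illustrative sketch: the class GroupOffsetsDemo, the helper describeGroups, and its "number:text[start,end)" output format are assumptions made for this example, not part of the Matcher API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOffsetsDemo {
    // For the first match, lists each group as "number:text[start,end)".
    static List<String> describeGroups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            // Group 0 is the whole match; groups 1..groupCount() are the captures.
            for (int g = 0; g <= m.groupCount(); g++) {
                out.add(g + ":" + m.group(g) + "[" + m.start(g) + "," + m.end(g) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints five lines, from 0:ABC[0,3) for group 0 to 4:C[2,3) for group 4.
        for (String line : describeGroups("((A)(B(C)))", "ABC")) {
            System.out.println(line);
        }
    }
}
```

The end index is exclusive, which is why group 3, covering "BC", is reported as [1,3).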

8.2 Backreferences

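The portion of the input string that matches a capturing group is saved in memory for later recall via backreferences. In a regular expression, a backreference is written as a backslash (\) followed by the number of the group to be recalled. For example, the expression (\d\d) defines one capturing group matching two digits in a row, which can be recalled later in the expression via the backreference \1.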

To match any two digits followed by the exact same two digits, use (\d\d)\1 as the regular expression:

Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.

If you change the last two digits, the match fails:

Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.

For nested capturing groups, backreferences work in exactly the same way: specify a backslash followed by the number of the group to be recalled.
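The backreference behavior described in this section can be reproduced in code. The sketch below is illustrative only: the class BackreferenceDemo, the constant REPEATED_PAIR, and the helper hasRepeatedPair are hypothetical names, and the nested pattern (a(b))\2 is an example chosen here, not one from the sessions above. Remember that each backslash must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // (\d\d)\1 : two digits followed by the same two digits again.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // True if the input contains a two-digit pair immediately repeated.
    static boolean hasRepeatedPair(String input) {
        return REPEATED_PAIR.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedPair("1212")); // prints true
        System.out.println(hasRepeatedPair("1234")); // prints false

        // Nested groups: group 1 is (a(b)), group 2 is the inner (b).
        System.out.println(Pattern.matches("(a(b))\\2", "abb"));  // prints true
        System.out.println(Pattern.matches("(a(b))\\1", "abab")); // prints true
    }
}
```

In the nested case the group numbers follow the opening-parenthesis rule from section 8.1, so \2 recalls "b" while \1 recalls "ab".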

9. Boundary Matchers

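You can make pattern matches more precise by specifying where in the input a match must occur, using boundary matchers. For example, you may be interested in a particular word only when it appears at the beginning or the end of a line. Or you may want to know whether a match falls on a word boundary, or at the end of the previous match.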

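The following table lists the boundary matchers and their meanings.

Boundary construct    Description
^                     The beginning of a line
$                     The end of a line
\b                    A word boundary
\B                    A non-word boundary
\A                    The beginning of the input
\G                    The end of the previous match
\Z                    The end of the input but for the final terminator, if any
\z                    The end of the input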

Enter your regex: ^dog$
Enter input string to search: dog
I found the text "dog" starting at index 0 and ending at index 3.

Enter your regex: ^dog$
Enter input string to search:       dog
No match found.

Enter your regex: \s*dog$
Enter input string to search:                           dog
I found the text "                          dog" starting at index 0 and ending at index 29.

Enter your regex: ^dog\w*
Enter input string to search: dogblahblah
I found the text "dogblahblah" starting at index 0 and ending at index 11.

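The first example matches because the pattern spans the entire input string. The second fails because the input string contains extra whitespace at the beginning. The third specifies an expression that allows any amount of whitespace, followed by dog at the end of the line. The fourth requires dog at the beginning of a line, followed by any number of word characters.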
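The four cases above can be checked programmatically. This is a minimal sketch under stated assumptions: the class AnchorDemo and the helper firstMatch are hypothetical names, the "start-end" return format is chosen for brevity, and the inputs use three leading spaces rather than the longer runs of whitespace in the sessions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Returns "start-end" for the first match, or "no match" if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("^dog$", "dog"));            // prints 0-3
        System.out.println(firstMatch("^dog$", "   dog"));         // prints no match
        System.out.println(firstMatch("\\s*dog$", "   dog"));      // prints 0-6
        System.out.println(firstMatch("^dog\\w*", "dogblahblah")); // prints 0-11
    }
}
```

By default ^ and $ anchor to the start and end of the whole input; they match at individual line boundaries only when the pattern is compiled with Pattern.MULTILINE.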

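To check that a pattern begins and ends on a word boundary (as opposed to being a substring within a longer word), use \b on either side, for example \bdog\b.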

Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.

To match the expression on a non-word boundary, use \B instead:

Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
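The difference between \b and \B can also be seen by collecting the start index of every match. The sketch below is illustrative: the class WordBoundaryDemo and the helper matchStarts are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Start indices of all matches of the regex in the input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";
        System.out.println(matchStarts("\\bdog\\b", dog));    // prints [4]
        System.out.println(matchStarts("\\bdog\\b", doggie)); // prints []
        System.out.println(matchStarts("\\bdog\\B", dog));    // prints []
        System.out.println(matchStarts("\\bdog\\B", doggie)); // prints [4]
    }
}
```

In "doggie" the position after "dog" sits between two word characters, so it is a non-word boundary: \B succeeds there and \b fails.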

To require a match to occur only at the end of the previous match, use \G:

Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.

Here the second example finds only one match, because the second occurrence of "dog" does not start at the end of the previous match.
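The effect of \G is easiest to see when repeated find() calls are compared with and without it. This is a sketch under stated assumptions: the class PreviousMatchDemo and the helper matchStarts are hypothetical names, and the input "dogdog" is an extra case added here to show when a second match does succeed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreviousMatchDemo {
    // Start indices of all matches found by successive find() calls.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("dog", "dog dog"));    // prints [0, 4]
        // The space at index 3 breaks the chain: the second "dog" starts at 4, not 3.
        System.out.println(matchStarts("\\Gdog", "dog dog")); // prints [0]
        // With no gap, each match begins exactly where the previous one ended.
        System.out.println(matchStarts("\\Gdog", "dogdog"));  // prints [0, 3]
    }
}
```

On the first find() call there is no previous match, so \G matches at the start of the input.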
