The Thinking Logic of Computer Programs (89) - Regular Expressions (Part 2)

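The previous section covered regular expression syntax; this section covers the related Java API.
The regular expression classes live in the package java.util.regex. There are two main classes, Pattern and Matcher. Pattern represents a regular expression object and is independent of any particular string to be processed. Matcher represents a match: it applies a regular expression to one specific string, and the string is processed through it.
The String class also matters here. We covered String in section 29 and noted that some of its methods take a regular expression rather than an ordinary string as a parameter. In addition, a regular expression in Java must first be written as a string.
Below, we first look at how to represent a regular expression, then at how to use it for common text-processing tasks: splitting, validation, searching, and replacement.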
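Representing regular expressions
The escape character '\'
A regular expression consists of metacharacters and ordinary characters. The character '\' is a metacharacter, so to represent '\' itself in a regular expression you escape it with another one, that is, '\\'.
Java has no special syntax for writing a regular expression directly; it has to be written as a string. In a string literal, '\' is also an escape character, so representing a regular-expression '\' in a string takes two of them, '\\', and matching a literal '\' takes four, '\\\\'. For example, the expression:
    <(\w+)>(.*)</\1>
is written as the string:
    "<(\\w+)>(.*)</\\1>"
A simple rule: every '\' in the regular expression becomes two '\' in the string.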
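As a quick check of the doubling rule, the sketch below compiles the string form shown above and also matches a single literal backslash. The class name EscapeDemo, the helper names, and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the two regex strings come from the text above.
final class EscapeDemo {
    // Regex <(\w+)>(.*)</\1> as a Java string literal: each '\' is doubled.
    static final String TAG_REGEX = "<(\\w+)>(.*)</\\1>";

    // Regex \\ (one literal backslash) needs four backslashes in Java source.
    static final String BACKSLASH_REGEX = "\\\\";

    static boolean isPairedTag(String text) {
        return Pattern.matches(TAG_REGEX, text);
    }

    static boolean isSingleBackslash(String text) {
        return Pattern.matches(BACKSLASH_REGEX, text);
    }

    public static void main(String[] args) {
        System.out.println(isPairedTag("<b>bold</b>")); // tag names agree
        System.out.println(isPairedTag("<b>bold</i>")); // tag names differ, so \1 fails
        System.out.println(isSingleBackslash("\\"));    // the one-character string \
    }
}
```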
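The Pattern object
A regular expression in string form can be compiled into a Pattern object, for example:
    String regex = "<(\\w+)>(.*)</\\1>";
    Pattern pattern = Pattern.compile(regex);
Pattern is the object-oriented representation of a regular expression. Compiling, put simply, converts the string into an internal structure; that structure is a finite automaton. The theory of finite automata runs deep, so we will not go into it here.
Compiling has a cost, and a Pattern object depends only on the regular expression, not on the text being processed. It can be shared safely between threads. So when the same regular expression is applied to many texts, reuse one Pattern object where possible and avoid recompiling.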
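One common way to follow this advice is to keep the compiled Pattern in a static final field and create a fresh Matcher per input. The sketch below assumes a hypothetical class SharedPatternDemo and a made-up regex for digit-only strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: compile once, reuse for every text.
final class SharedPatternDemo {
    // Compiled a single time when the class is loaded.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Each call to matcher() creates a new Matcher bound to one text;
    // the Pattern itself is shared and never recompiled.
    static long countNumeric(List<String> texts) {
        long count = 0;
        for (String text : texts) {
            if (DIGITS.matcher(text).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumeric(Arrays.asList("123", "abc", "4567", "12a")));
    }
}
```

The Matcher objects are the per-text state, so they are created inside the loop rather than stored in a shared field.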
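Match modes
Pattern's compile method takes an additional parameter that specifies match modes:
    public static Pattern compile(String regex, int flags)
The previous section introduced three modes: single-line mode (dot-matches-all), multiline mode, and case-insensitive mode. Their constants are Pattern.DOTALL, Pattern.MULTILINE, and Pattern.CASE_INSENSITIVE. Several modes can be combined with '|', as in:
    Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
There is one more mode, Pattern.LITERAL. In this mode the metacharacters in the regular expression string lose their special meaning and are treated as ordinary characters. Pattern also has a static method:
    public static String quote(String s)
quote() serves a similar purpose: it treats every character in s as an ordinary character. The previous section introduced \Q and \E; characters between \Q and \E are treated as ordinary characters. quote() essentially adds \Q before s and \E after it. For example, if s is "\\d{6}", quote() returns "\\Q\\d{6}\\E".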
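To make the modes concrete, here is a small sketch that combines two flags and compares Pattern.LITERAL with Pattern.quote(). The class FlagsDemo, the regex hello.*world, and the sample inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for match modes and literal matching.
final class FlagsDemo {
    static boolean matchesIgnoringCaseAcrossLines(String text) {
        // CASE_INSENSITIVE lets "hello" match "HELLO"; DOTALL lets '.' match '\n'.
        Pattern p = Pattern.compile("hello.*world",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(text).matches();
    }

    static boolean matchesLiterally(String literal, String text) {
        // Both forms treat every character of 'literal' as an ordinary character.
        boolean viaFlag = Pattern.compile(literal, Pattern.LITERAL).matcher(text).matches();
        boolean viaQuote = Pattern.compile(Pattern.quote(literal)).matcher(text).matches();
        return viaFlag && viaQuote;
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCaseAcrossLines("HELLO\nWorld"));
        System.out.println(matchesLiterally("\\d{6}", "\\d{6}")); // the text \d{6} itself
        System.out.println(matchesLiterally("\\d{6}", "123456")); // digits no longer match
        System.out.println(Pattern.quote("\\d{6}"));
    }
}
```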
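Splitting
The simple case
A common text-processing need is splitting a string on a separator, for example splitting each field on commas when processing a CSV file. This sounds easy to satisfy, because String has the following method:
    public String[] split(String regex)
For example:
    String str = "abc,def,hello";
    String[] fields = str.split(",");
    System.out.println("field num: " + fields.length);
    System.out.println(Arrays.toString(fields));
The output is:
    field num: 3
    [abc, def, hello]
There are, however, some important details to watch for.
Escaping metacharacters
split treats the parameter regex as a regular expression, not as ordinary characters. If the separator is a metacharacter, such as . $ | ( ) [ { ^ ? * + \, it must be escaped. To split on the dot '.', for example, write:
    String[] fields = str.split("\\.");
If the separator is supplied by the user and not known in advance, Pattern.quote() can be used to treat it as an ordinary string.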
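A minimal sketch of the Pattern.quote() approach follows. The helper splitLiteral and the class QuoteSplitDemo are hypothetical names; the separator stands in for whatever the user supplies.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: split on a separator that may contain metacharacters.
final class QuoteSplitDemo {
    static String[] splitLiteral(String text, String separator) {
        // quote() neutralizes metacharacters such as '.' or '|' in the separator.
        return text.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|")));
        // Unquoted, '.' matches every character, so only empty strings remain
        // and all of them are dropped as trailing empties.
        System.out.println("a.b.c".split(".").length);
    }
}
```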
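Using several characters as the separator
Because the separator is a regular expression, it need not be a single character. For example, one or more whitespace characters or dots can serve as the separator:
    String str = "abc  def      hello.\n   world";
    String[] fields = str.split("[\\s.]+");
The contents of fields are:
    [abc, def, hello, world]
Empty strings
Note that trailing empty strings are not included in the returned array, but leading and interior empty strings are. For example:
    String str = ",abc,,def,,";
    String[] fields = str.split(",");
    System.out.println("field num: " + fields.length);
    System.out.println(Arrays.toString(fields));
The output is:
    field num: 4
    [, abc, , def]
No separator found
If nothing in the string matches regex, the returned array has length 1 and its only element is the original string.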
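Limiting the number of splits
split takes an additional parameter, limit, that caps the number of pieces:
    public String[] split(String regex, int limit)
split without limit behaves as if limit were 0. An example shows what limit means. For the string a:b:c: and the separator :, the returned array for different values of limit is:
    limit = 2     [a, b:c:]
    limit = 5     [a, b, c, ]
    limit = 0     [a, b, c]
    limit = -1    [a, b, c, ]
A positive limit yields at most limit elements, with the last one holding the unsplit remainder. A limit of 0 splits as many times as possible and drops trailing empty strings. A negative limit splits as many times as possible and keeps them.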
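Pattern's split methods
Pattern also has two split methods, defined like the String ones:
    public String[] split(CharSequence input)
    public String[] split(CharSequence input, int limit)
They differ from the String methods as follows:
Pattern takes a CharSequence, which is more general; String, StringBuilder, StringBuffer, CharBuffer, and others all implement that interface.
If regex is longer than one character or contains a metacharacter, String's split first compiles regex into a Pattern object and then calls Pattern's split. In that case, prefer Pattern's method so that the regex is not compiled repeatedly.
If regex is a single character that is not a metacharacter, String's split uses a simpler, faster implementation, so in that case prefer String's split.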
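The first two points can be sketched as follows: a precompiled Pattern splits a StringBuilder directly, without converting it to a String and without recompiling. PatternSplitDemo and the field names are hypothetical; the regex and sample text reuse the earlier multi-character separator example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: reuse one compiled separator for any CharSequence.
final class PatternSplitDemo {
    // One or more whitespace characters or dots, compiled once.
    private static final Pattern SEPARATOR = Pattern.compile("[\\s.]+");

    static String[] fields(CharSequence input) {
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("abc  def      hello.\n   world");
        System.out.println(Arrays.toString(fields(sb))); // [abc, def, hello, world]
    }
}
```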
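Validation
Validation checks whether the input text matches a predefined regular expression in full. It is often used to check whether user input is legal.
String has the following method:
    public boolean matches(String regex)
For example:
    String regex = "\\d{8}";
    String str = "12345678";
    System.out.println(str.matches(regex));
This checks whether the input is exactly 8 digits; the output is true.
String's matches actually calls the following Pattern method:
    public static boolean matches(String regex, CharSequence input)
This is a static method, and its code is:
    public static boolean matches(String regex, CharSequence input) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(input);
        return m.matches();
    }
It first calls compile to turn regex into a Pattern object, then calls Pattern's matcher method to create a Matcher, and Matcher's matches() reports whether the whole input matches.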
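Since the static matches() compiles the regex on every call, a validator that runs often may prefer a precompiled Pattern. The sketch below, with the hypothetical class EightDigitValidator, also contrasts matches() (whole input) with find() (any substring) to show what "matches in full" means.

```java
import java.util.regex.Pattern;

// Hypothetical validator class; the regex is the 8-digit example from the text.
final class EightDigitValidator {
    private static final Pattern EIGHT_DIGITS = Pattern.compile("\\d{8}");

    // True only when the entire input is exactly 8 digits.
    static boolean isValid(CharSequence input) {
        return EIGHT_DIGITS.matcher(input).matches();
    }

    // True when 8 consecutive digits appear anywhere in the input.
    static boolean containsEightDigits(CharSequence input) {
        return EIGHT_DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("12345678"));                 // true
        System.out.println(isValid("id 12345678."));             // false: extra characters
        System.out.println(containsEightDigits("id 12345678.")); // true
    }
}
```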
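Searching
Searching means looking in a text for substrings that match a regular expression. Here is an example:
    public static void find() {
        String regex = "\\d{4}-\\d{2}-\\d{2}";
        Pattern pattern = Pattern.compile(regex);
        String str = "today is 2017-06-02, yesterday is 2017-06-01";
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println("find " + matcher.group()
                    + " position: " + matcher.start() + "-" + matcher.end());
        }
    }
The code looks for every date in the format 2017-06-02. The output is:
    find 2017-06-02 position: 9-19
    find 2017-06-01 position: 34-44
Matcher keeps an internal position that starts at 0. find() searches from that position for a substring that matches the regular expression; when it finds one it returns true and updates the internal position. Information about the matched substring is available through these methods:
    // the complete matched substring
    public String group()
    // start index of the substring within the whole string
    public int start()
    // end index of the substring within the whole string, plus 1
    public int end()
group() actually calls group(0), which returns the content of group 0. The previous section introduced capturing groups; group 0 is a special group that stands for the whole matched substring. Besides group 0, Matcher has the following methods for getting more information about groups:
    // number of groups
    public int groupCount()
    // content of the group numbered group
    public String group(int group)
    // content of the group named name
    public String group(String name)
    // start index of the group numbered group
    public int start(int group)
    // end index of the group numbered group, plus 1
    public int end(int group)
For example:
    public static void findGroup() {
        String regex = "(\\d{4})-(\\d{2})-(\\d{2})";
        Pattern pattern = Pattern.compile(regex);
        String str = "today is 2017-06-02, yesterday is 2017-06-01";
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println("year:" + matcher.group(1)
                    + ",month:" + matcher.group(2) + ",day:" + matcher.group(3));
        }
    }
The output is:
    year:2017,month:06,day:02
    year:2017,month:06,day:01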
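The example above uses numbered groups only. As a complement, the sketch below uses group(String name) with named groups and start(int group) for a group's position. The class NamedGroupDemo, the group names year, month, and day, and the output format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: same date regex, with named capturing groups.
final class NamedGroupDemo {
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Returns one entry per date: year|month|day@<start index of the month group>.
    static List<String> describeDates(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            result.add(m.group("year") + "|" + m.group("month") + "|" + m.group("day")
                    + "@" + m.start(2)); // group 2 is the month group
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "today is 2017-06-02, yesterday is 2017-06-01";
        System.out.println(describeDates(str)); // [2017|06|02@14, 2017|06|01@39]
    }
}
```

Named groups still have numbers, so group(2) and group("month") return the same content here.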
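Replacement
replaceAll and replaceFirst
Once a substring has been found, a common next step is to replace it. String has several replacement methods:
    public String replace(char oldChar, char newChar)
    public String replace(CharSequence target, CharSequence replacement)
    public String replaceAll(String regex, String replacement)
    public String replaceFirst(String regex, String replacement)
The first replace method works on single characters and the second on CharSequence; both treat their parameters as ordinary characters. replaceAll and replaceFirst treat the parameter regex as a regular expression. They differ in that replaceAll replaces every substring found, while replaceFirst replaces only the first. A simple example that collapses runs of whitespace into a single space:
    String regex = "\\s+";
    String str = "hello    world       good";
    System.out.println(str.replaceAll(regex, " "));
The output is:
    hello world good
In replaceAll and replaceFirst, the parameter replacement is not treated as an ordinary string either. A dollar sign followed by a number, such as $1, refers to a capturing group. For example:
    String regex = "(\\d{4})-(\\d{2})-(\\d{2})";
    String str = "today is 2017-06-02.";
    System.out.println(str.replaceFirst(regex, "$1/$2/$3"));
The output is:
    today is 2017/06/02.
This example converts the format of the date that was found. So '$' is a metacharacter in replacement, and to replace with a literal '$' it must be escaped. For example:
    String regex = "#";
    String str = "#this is a test";
    System.out.println(str.replaceAll(regex, "\\$"));
If the replacement string comes from the user, the following static method of Matcher can be used to treat it as an ordinary string, so that metacharacters do not interfere:
    public static String quoteReplacement(String s)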
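A short sketch of quoteReplacement() in use follows. The class QuoteReplacementDemo and the helper replaceHashWith are hypothetical; the replacement string stands in for user input that happens to contain '$' or '\'.

```java
import java.util.regex.Matcher;

// Hypothetical demo class: replace '#' with text that must be taken literally.
final class QuoteReplacementDemo {
    static String replaceHashWith(String text, String userReplacement) {
        // quoteReplacement escapes '$' and '\' so they are not read as
        // group references or escapes.
        return text.replaceAll("#", Matcher.quoteReplacement(userReplacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceHashWith("price: #", "$1"));
        System.out.println(Matcher.quoteReplacement("$1")); // the escaped form
        // Without quoting, "$1" would be read as a reference to group 1, which the
        // regex "#" does not have, and replaceAll would throw an exception.
    }
}
```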
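String's replaceAll and replaceFirst actually call methods of Pattern and Matcher. The code of replaceAll, for example, is:
    public String replaceAll(String regex, String replacement) {
        return Pattern.compile(regex).matcher(this).replaceAll(replacement);
    }
Replacing while searching
replaceAll and replaceFirst are both defined in Matcher. Besides these one-shot replacements, Matcher also defines methods for replacing while searching:
    public Matcher appendReplacement(StringBuffer sb, String replacement)
    public StringBuffer appendTail(StringBuffer sb)
These two methods are meant to be used together with find(). Here is an example first:
    public static void replaceCat() {
        Pattern p = Pattern.compile("cat");
        Matcher m = p.matcher("one cat, two cat, three cat");
        StringBuffer sb = new StringBuffer();
        int foundNum = 0;
        while (m.find()) {
            m.appendReplacement(sb, "dog");
            foundNum++;
            if (foundNum == 2) {
                break;
            }
        }
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
This example replaces the first two occurrences of cat with dog and leaves the rest unchanged. The output is:
    one dog, two dog, three cat
The StringBuffer variable sb holds the final result. Besides its search position, Matcher keeps an append position that starts at 0. After a match is found, appendReplacement() does three things:
It appends the substring from the append position up to the start of the current match to sb. In the first iteration this is "one ", in the second ", two ".
It appends the replacement string to sb.
It updates the append position to the position just after the current match.
appendTail appends all characters after the append position to sb.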
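The three steps above also allow the replacement to be computed from each match, which a one-shot replaceAll with a fixed string cannot do. The following is a minimal sketch under made-up assumptions (the class AppendReplacementDemo and the task of doubling every number); it is not taken from the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compute each replacement from the matched text.
final class AppendReplacementDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String doubleNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("2 apples and 15 pears")); // 4 apples and 30 pears
    }
}
```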
A template engine
With these Matcher methods we can implement a simple template engine. A template is a string with some variables in it, written as {name}, as in this example:
    String template = "Hi {name}, your code is {code}.";
The template string here has two variables, name and code. The actual values are supplied through a Map whose keys are the variable names. The template engine takes the template and the Map as parameters and returns the string with the variables replaced. A sample implementation:
    private static Pattern templatePattern = Pattern.compile("\\{(\\w+)\\}");

    public static String templateEngine(String template, Map<String, Object> params) {
        StringBuffer sb = new StringBuffer();
        Matcher matcher = templatePattern.matcher(template);
        while (matcher.find()) {
            String key = matcher.group(1);
            Object value = params.get(key);
            matcher.appendReplacement(sb, value != null ?
                    Matcher.quoteReplacement(value.toString()) : "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
The code looks for every template variable with the regular expression:
    \{(\w+)\}
'{' is a metacharacter, so it must be escaped. \w+ matches the variable name; it is wrapped in parentheses so that the name can be read through group 1.
Sample code that uses the template engine:
    public static void templateDemo() {
        String template = "Hi {name}, your code is {code}.";
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("name", "老马");
        params.put("code", 6789);
        System.out.println(templateEngine(template, params));
    }
The output is:
    Hi 老马, your code is 6789.
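Summary
This section introduced the main Java APIs for regular expressions. It discussed how to represent a regular expression in Java and how to use it to split, validate, search, and replace text; for replacement, we demonstrated a simple template engine.
In the next section we continue with regular expressions and analyze some commonly used ones.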