A simple Perl regex substitution: replacing multiple blank lines
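Last edited by dbxmcf on 2014-01-08 01:10:53. I have recently been reading Jeffrey Friedl's Mastering Regular Expressions. Page 69 gives an example that replaces a run of consecutive blank lines with a single <p>: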
$text =~s/^\s*$/<p>/mg;
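The goal is to replace each run of blank lines in a plain-text file (the lines may contain spaces and tabs) with a single HTML paragraph tag, <p>.
However, I don't quite understand how this command works: in enhanced multi-line mode (/m) I always get two tags, <p><p>.
For example, here is my file txt (the line numbers are not part of the file):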
1 a
2
3
4
5 b
6
7
8
9
10 c
I run it through the following Perl script, t2h.pl:
#!/usr/bin/perl
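undef $/;                  # slurp mode: read the whole file as one string
$text=<>;
$text=~ s/^\s*$/<p>/mg;    # /g replaces every run, as the output below shows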
print "$text";
The output is:
>./t2h.pl txt
a
<p><p>
b
<p><p>
c
Could someone explain why two <p><p> appear, and how to fix it? Thanks!
[Solution]
Have a look at this passage from the Perl regular expression documentation (perlre):
By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this option was removed in perl 5.10.)
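To see what that rule means for the pattern in question, here is a small sketch that lists where /^\s*$/mg matches. The helper name match_spans and the sample string are my own assumptions: the string stands in for the first part of the asker's file, assuming its blank lines are truly empty and the file ends with a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the [start, end]
# offsets of every match of /^\s*$/mg in the given string.
sub match_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ /^\s*$/mg) {
        # $-[0] and $+[0] hold the start and end offset of the last match
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

# Assumed sample: "a", three empty lines, then "b".
my $sample = "a\n\n\n\nb\n";
printf "match from offset %d to %d\n", @$_ for match_spans($sample);
```

For this sample the loop reports two matches: one covering offsets 2 to 4 (the first two empty lines) and an empty one at offset 4. The greedy \s* first swallows all three newlines, but $ cannot match before "b", so it gives one newline back. With /m, ^ then matches again right after that newline and $ right before the next one, so a second, zero-length match succeeds at the same spot. Each match becomes one <p>, which is where <p><p> comes from.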
[Solution]
zhu@ubuntu-1204:~$ cat txt -n
1
2a
3
4
5
6b
7
8
9
10
11c
12
13
14
zhu@ubuntu-1204:~$ cat txt -An
1^I$
2a$
3^I$
4^I$
5^I$
6b$
7^I$
8^I$
9^I$
10^I$
11c$
12^I$
13^I$
14^I$
zhu@ubuntu-1204:~$ perl t2h.pl txt
<p>
a
<p>
b
<p>
c
<p>zhu@ubuntu-1204:~$ cat t2h.pl
#!/usr/bin/perl
undef $/;
$text=<>;
$text=~ s/^\s*$/<p>/mg;
print "$text";
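As far as I can tell, the session above produces a single <p> only because every blank line in that file holds a tab (cat -A shows ^I$). The match then ends just after a tab, where ^ cannot match, so no second empty match follows. With truly empty lines the doubled <p> comes back. One possible fix, sketched below, is to consume whole blank lines including their newlines. The helper name blank_runs_to_p and the sample input are hypothetical, and restricting the whitespace to spaces and tabs is my assumption about what counts as a blank line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse each run of blank
# lines, which may hold spaces or tabs, into a single "<p>" line.
sub blank_runs_to_p {
    my ($text) = @_;
    # (?:[ \t]*\n)+ eats complete blank lines together with their newlines,
    # so nothing is left over for a second, empty match.
    $text =~ s/^(?:[ \t]*\n)+/<p>\n/mg;
    return $text;
}

# Assumed sample: empty lines between a and b, whitespace-only lines
# between b and c.
print blank_runs_to_p("a\n\n\n\nb\n\t\n \nc\n");
```

This prints a, <p>, b, <p>, c on five separate lines. A trailing blank line that has no terminating newline is left unchanged by this pattern.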