团队
我编写了一个Perl程序来验证姓氏、名字和年份的格式(标点符号等)的准确性。如果某个特定条目没有遵循指定的模式,则该条目将被高亮显示为固定。
例如,我的输入文件有类似文本的行:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>我的程序运行得很好,也就是说,如果任何条目不遵循模式,脚本就会生成一个错误。上面的输入文本不会产生任何错误。但是下面这个例子是一个错误的例子,因为Rose A. J.在Rose之后缺少一个逗号。
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>从我的regex搜索模式中,是否有可能捕获所有的姓氏和年份,这样我就可以生成一个以每行为前缀的文本,如下所示。
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>我的regex搜索脚本如下:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}谢谢你的帮忙,
前置
发布于 2013-03-07 19:33:41
改变这个位
# pattern for surname----<END>
,?\s这现在意味着一个可选的,后面是空白。如果这个人的姓是"Bunga Bunga“,那就行不通了
发布于 2013-03-08 16:09:12
您的所有子模式都是非捕获组,从(?:开始。这减少了许多因素的编译时间,其中一个因素是没有捕获子模式。
要捕获模式,只需在需要捕获的部分周围插入括号即可。因此,您可以删除未捕获的断言?:或将parens ()放置在需要它们的位置。http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
我不确定,但是,从您的代码中,我认为您可能尝试使用前瞻性断言,例如,您用空格测试姓氏,如果没有,则使用连字符测试姓氏。这不是每次都从相同的点开始,它要么匹配第一个例子,然后继续用第二个姓模式测试下一个位置,然后regex是否会测试第一个子模式的第二个名称,这是我不确定的。http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;对于捕获整个模式,第三个示例的第一个模式没有意义,因为这告诉正则表达式不要捕获模式组,同时也要捕获模式组。这在第二个模式中是有用的,它是一个细粒度的模式捕获,因为所捕获的模式是一个非捕获组的一部分。
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe一个小的
id=".*?" 可能会更好
id="\w*?"id名称需要是_alphanumeric iirc。
https://stackoverflow.com/questions/15280001
复制相似问题