文章/答案/技术大牛

发布

社区首页 >问答首页 >如何从Regex模式中只获取姓氏？

问如何从Regex模式中只获取姓氏？
EN

Stack Overflow用户

提问于 2013-03-07 19:27:26

回答 2查看 204关注 0票数 0

团队

我编写了一个Perl程序来验证姓氏、名字和年份的格式(标点符号等)的准确性。如果某个特定条目没有遵循指定的模式，则该条目将被高亮显示为固定。

例如，我的输入文件有类似文本的行：

<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

我的程序运行得很好，也就是说，如果任何条目不遵循模式，脚本就会生成一个错误。上面的输入文本不会产生任何错误。但是下面这个例子是一个错误的例子，因为Rose A. J.在Rose之后缺少一个逗号。

NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., &amp; Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey &amp; D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>

从我的regex搜索模式中，是否有可能捕获所有的姓氏和年份，这样我就可以生成一个以每行为前缀的文本，如下所示。

<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

我的regex搜索脚本如下：

while(<$INPUT_REF_XML_FH>){
    $line_count += 1;
    chomp;
    if(/

    # bibliomixed XML ID tag and attribute----<START>
    <bibliomixed
    \s+
    id=".*?">
    # bibliomixed XML ID tag and attribute----<END>

    # --------2 OR MORE AUTHOR GROUP--------<START>
    (?:
    (?:
    # pattern for surname----<START>
    (?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}|\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}|\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>
    ,\s)+
    #---------------FINAL AUTHOR GROUP SEPATOR----<START>
    &amp;\s
    #---------------FINAL AUTHOR GROUP SEPATOR----<END>

    # --------2 OR MORE AUTHOR GROUP--------<END>
    )? 

    # --------LAST AUTHOR GROUP--------<START>

    # pattern for surname----<START>
    (?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}|\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}|\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>

    (?: # pattern for editor notation----<START>
    \s\(Ed(?:s)?\.\)\.
    )? # pattern for editor notation----<END>

    # --------LAST AUTHOR GROUP--------<END>
    \s
    \(
    # pattern for a year----<START>
    (?:[A-Za-z]+,\s)? # July, 1999
    (?:[A-Za-z]+\s)? # July 1999
    (?:[0-9]{4}\/)? # 1999\/2000
    (?:\w+\s\d+,\s)?# August 18, 2003
    (?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
    (?:[A-Za-z])? # 1999a
    (?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
    (?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
    (?:,\s[A-Za-z]+)? # 1999, Spring
    (?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
    (?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
    (?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
    # pattern for a year----<END>
    \)\.
    /six){
        print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
        $found_count += 1;
    } else{
        print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
        $not_found_count += 1;
    }

谢谢你的帮忙,

前置

regex

perl

回答 2

Stack Overflow用户

发布于 2013-03-07 19:33:41

改变这个位

# pattern for surname----<END>
    ,?\s

这现在意味着一个可选的，后面是空白。如果这个人的姓是"Bunga Bunga“，那就行不通了

票数 0

Stack Overflow用户

发布于 2013-03-08 16:09:12

您的所有子模式都是非捕获组，从(?:开始。这减少了许多因素的编译时间，其中一个因素是没有捕获子模式。

要捕获模式，只需在需要捕获的部分周围插入括号即可。因此，您可以删除未捕获的断言?:或将parens ()放置在需要它们的位置。http://perldoc.perl.org/perlretut.html#Non-capturing-groupings

我不确定，但是，从您的代码中，我认为您可能尝试使用前瞻性断言，例如，您用空格测试姓氏，如果没有，则使用连字符测试姓氏。这不是每次都从相同的点开始，它要么匹配第一个例子，然后继续用第二个姓模式测试下一个位置，然后regex是否会测试第一个子模式的第二个名称，这是我不确定的。http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind

#!usr/bin/perl

use warnings;
use strict;


my $line = '123 456 7antelope89';

$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;

my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);


$line = '123 456 7bealzelope89';

$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;

($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);


$line = '123 456 7canteloupe89';

$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;

($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);

exit 0;

对于捕获整个模式，第三个示例的第一个模式没有意义，因为这告诉正则表达式不要捕获模式组，同时也要捕获模式组。这在第二个模式中是有用的，它是一个细粒度的模式捕获，因为所捕获的模式是一个非捕获组的一部分。

a: 123 456 b: 7antelope89
a: nocapture b: nocapture 
a: 123 456 b: canteloupe

一个小的

  id=".*?"

可能会更好

  id="\w*?"

id名称需要是_alphanumeric iirc。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/15280001

复制

相似问题

问如何从Regex模式中只获取姓氏？
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从Regex模式中只获取姓氏？EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从Regex模式中只获取姓氏？
EN