问正则表达式模式与字符串不匹配？
EN

Stack Overflow用户

提问于 2018-01-09 01:14:49

回答 2查看 0关注 0票数 0

试着匹配<input>使用此模式键入“隐藏”字段：

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

这是样本表格数据：

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

但不确定type，，，name，和value属性总是以相同的顺序出现。如果type属性，匹配将失败

如何更改我的模式，使其匹配，而不管属性在<input>标签？

回答 2

Stack Overflow用户

发布于 2018-01-09 09:41:32

下面是一些Perl(伪)代码：

my $html = readLargeInputFile();

my @input_tags = $html =~ m/
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        \/>                         # ...except for the />, which is grabbed here
    )/xgm;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Put $hash_ref in a list or something, or otherwise process it
}

这里的基本原则是，不要试图对一个正则表达式做太多事情。正如您注意到的，正则表达式强制执行一定数量的顺序。因此，您需要做的是首先匹配要提取的内容的上下文，然后对所需的数据进行子匹配。

票数 0

Stack Overflow用户

发布于 2018-01-09 10:27:09

基于通用Regex的HTML解析解决方案

首先，将展示解析正则表达式的HTML，解析器的核心是：

for (;;) {
  given ($html) {
    last                    when (pos || 0) >= length;
    printf "\@%d=",              (pos || 0);
    print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
    print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
    print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
    print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
    print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
    print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
    print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
    print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
    print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
    print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
    print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
    default {
      die "UNCLASSIFIED: " .
        substr($_, pos || 0, (length > 65) ? 65 : length);
    }
  }
}

使用Regexes解决OP任务的演示

html_input_rx下面所包含的程序产生了以下输出，这样您就可以看到使用regexes解析HTML对您想要做的工作很好：

% html_input_rx Amazon.com-_Online_Shopping_for_Electronics,_Apparel,_Computers,_Books,_DVDs_\&_more.htm 
input tag #1 at character 9955:
       class => "searchSelect"
          id => "twotabsearchtextbox"
        name => "field-keywords"
        size => "50"
       style => "width:100%; background-color: #FFF;"
       title => "Search for"
        type => "text"
       value => ""

input tag #2 at character 10335:
         alt => "Go"
         src => "http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
        type => "image"

解析输入标签，请参阅无恶意输入

这是产生上述输出的程序的源代码。

#!/usr/bin/env perl
#
# html_input_rx - pull out all <input> tags from (X)HTML src
#                  via simple regex processing
#
# Tom Christiansen <tchrist@perl.com>
# Sat Nov 20 10:17:31 MST 2010
#
################################################################

use 5.012;

use strict;
use autodie;
use warnings FATAL => "all";    
use subs qw{
    see_no_evil
    parse_input_tags
    input descape dequote
    load_patterns
};    
use open        ":std",
          IN => ":bytes",
         OUT => ":utf8";    
use Encode qw< encode decode >;

    ###########################################################

                        parse_input_tags 
                           see_no_evil 
                              input  

    ###########################################################

until eof(); sub parse_input_tags {
    my $_ = shift();
    our($Input_Tag_Rx, $Pull_Attr_Rx);
    my $count = 0;
    while (/$Input_Tag_Rx/pig) {
        my $input_tag = $+{TAG};
        my $place     = pos() - length ${^MATCH};
        printf "input tag #%d at character %d:\n", ++$count, $place;
        my %attr = ();
        while ($input_tag =~ /$Pull_Attr_Rx/g) {
            my ($name, $value) = @+{ qw< NAME VALUE > };
            $value = dequote($value);
            if (exists $attr{$name}) {
                printf "Discarding dup attr value '%s' on %s attr\n",
                    $attr{$name} // "<undef>", $name;
            } 
            $attr{$name} = $value;
        } 
        for my $name (sort keys %attr) {
            printf "  %10s => ", $name;
            my $value = descape $attr{$name};
            my  @Q; given ($value) {
                @Q = qw[  " "  ]  when !/'/ && !/"/;
                @Q = qw[  " "  ]  when  /'/ && !/"/;
                @Q = qw[  ' '  ]  when !/'/ &&  /"/;
                @Q = qw[ q( )  ]  when  /'/ &&  /"/;
                default { die "NOTREACHED" }
            } 
            say $Q[0], $value, $Q[1];
        } 
        print "\n";
    } 

}

sub dequote {
    my $_ = $_[0];
    s{
        (?<quote>   ["']      )
        (?<BODY>    
          (?s: (?! \k<quote> ) . ) * 
        )
        \k<quote> 
    }{$+{BODY}}six;
    return $_;
} 

sub descape {
    my $string = $_[0];
    for my $_ ($string) {
        s{
            (?<! % )
            % ( \p{Hex_Digit} {2} )
        }{
            chr hex $1;
        }gsex;
        s{
            & \043 
            ( [0-9]+ )
            (?: ; 
              | (?= [^0-9] )
            )
        }{
            chr     $1;
        }gsex;
        s{
            & \043 x
            ( \p{ASCII_HexDigit} + )
            (?: ; 
              | (?= \P{ASCII_HexDigit} )
            )
        }{
            chr hex $1;
        }gsex;

    }
    return $string;
} 

sub input { 
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <> };  
    my $encoding = "iso-8859-1";  # web default; wish we had the HTTP headers :(
    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};
        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv ) 
            (?&name) 
            (?&equals) 
            (?= (?&quote)? content-type )
            (?&value)    
        }six;
        next unless $meta =~ m{             $RX_SUBS
            (?= content ) (?&name) 
                          (?&equals) 
            (?<CONTENT>   (?&value)    )
        }six;
        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset ) (?&name) 
                          (?&equals) 
            (?<CHARSET>   (?&value)    )
        }six;
        if (lc $encoding ne lc $+{CHARSET}) {
            say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    } 
    return decode($encoding, $_);
}

sub see_no_evil {
    my $_ = shift();

    s{ <!    DOCTYPE  .*?         > }{}sx; 
    s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

    s{ <script> .*?  </script> }{}gsix; 
    s{ <!--     .*?        --> }{}gsx;

    return $_;
}

sub load_patterns { 

    our $RX_SUBS = qr{ (?(DEFINE)
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<start_tag>  < (?&might_white) )
        (?<end_tag>          
            (?&might_white)
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
             )
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    ) }six; 

    our $Meta_Tag_Rx = qr{                          $RX_SUBS 
        (?<META> 
            (?&start_tag) meta \b
            (?:
                (?&might_white) (?&nv_pair) 
            ) +
            (?&end_tag)
        )
    }six;

    our $Pull_Attr_Rx = qr{                         $RX_SUBS
        (?<NAME>  (?&name)      )
                  (?&equals) 
        (?<VALUE> (?&value)     )
    }six;

    our $Input_Tag_Rx = qr{                         $RX_SUBS 

        (?<TAG> (?&input_tag) )

        (?(DEFINE)

            (?<input_tag>
                (?&start_tag)
                input
                (?&might_white) 
                (?&attributes) 
                (?&might_white) 
                (?&end_tag)
            )

            (?<attributes>
                (?: 
                    (?&might_white) 
                    (?&one_attribute) 
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white) 
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute> 
                (?: (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
            # for LEGAL parse only, comment out next line 
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute>  (?&name) )

            (?<required_attribute> (?#no required attributes) )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                  accept
                | alt
                | bottom
                | check box
                | checked
                | disabled
                | file
                | hidden
                | image
                | max length
                | middle
                | name
                | password
                | radio
                | read only
                | reset
                | right
                | size
                | src
                | submit
                | text
                | top
                | type
                | value
            )

            (?<deprecated_attribute>
                  align
            )

            (?<standard_attribute>
                  access key
                | class
                | dir
                | ltr
                | id
                | lang
                | style
                | tab index
                | title
                | xml:lang
            )

            (?<event_attribute>
                  on blur
                | on change
                | on click
                | on dbl   click
                | on focus
                | on mouse down
                | on mouse move
                | on mouse out
                | on mouse over
                | on mouse up
                | on key   down
                | on key   press
                | on key   up
                | on select
            )
        )
    }six;

}

UNITCHECK {
    load_patterns();
} 

END {
    close(STDOUT) 
        || die "can't close stdout: $!";
}

}

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/-100005211

复制

相似问题

问正则表达式模式与字符串不匹配？
EN

回答 2

Stack Overflow用户

Stack Overflow用户

基于通用Regex的HTML解析解决方案

使用Regexes解决OP任务的演示

解析输入标签，请参阅无恶意输入

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问正则表达式模式与字符串不匹配？EN

回答 2

Stack Overflow用户

Stack Overflow用户

基于通用Regex的HTML解析解决方案

使用Regexes解决OP任务的演示

解析输入标签，请参阅无恶意输入

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问正则表达式模式与字符串不匹配？
EN