首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >正则表达式模式与字符串不匹配?

正则表达式模式与字符串不匹配?
EN

Stack Overflow用户
提问于 2018-01-09 01:14:49
回答 2查看 0关注 0票数 0

试着匹配<input>使用此模式键入“隐藏”字段:

代码语言:javascript
复制
/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

这是样本表格数据:

代码语言:javascript
复制
<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

但不确定type,,,name,和value属性总是以相同的顺序出现。如果type属性,匹配将失败

如何更改我的模式,使其匹配,而不管属性在<input>标签?

EN

回答 2

Stack Overflow用户

发布于 2018-01-09 09:41:32

下面是一些Perl(伪)代码:

代码语言:javascript
复制
my $html = readLargeInputFile();

my @input_tags = $html =~ m/
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        \/>                         # ...except for the />, which is grabbed here
    )/xgm;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Put $hash_ref in a list or something, or otherwise process it
}

这里的基本原则是,不要试图对一个正则表达式做太多事情。正如您注意到的,正则表达式强制执行一定数量的顺序。因此,您需要做的是首先匹配要提取的内容的上下文,然后对所需的数据进行子匹配。

票数 0
EN

Stack Overflow用户

发布于 2018-01-09 10:27:09

基于通用Regex的HTML解析解决方案

首先,将展示解析正则表达式的HTML,解析器的核心是:

代码语言:javascript
复制
for (;;) {
  given ($html) {
    last                    when (pos || 0) >= length;
    printf "\@%d=",              (pos || 0);
    print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
    print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
    print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
    print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
    print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
    print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
    print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
    print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
    print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
    print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
    print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
    default {
      die "UNCLASSIFIED: " .
        substr($_, pos || 0, (length > 65) ? 65 : length);
    }
  }
}

使用Regexes解决OP任务的演示

html_input_rx下面所包含的程序产生了以下输出,这样您就可以看到使用regexes解析HTML对您想要做的工作很好:

代码语言:javascript
复制
% html_input_rx Amazon.com-_Online_Shopping_for_Electronics,_Apparel,_Computers,_Books,_DVDs_\&_more.htm 
input tag #1 at character 9955:
       class => "searchSelect"
          id => "twotabsearchtextbox"
        name => "field-keywords"
        size => "50"
       style => "width:100%; background-color: #FFF;"
       title => "Search for"
        type => "text"
       value => ""

input tag #2 at character 10335:
         alt => "Go"
         src => "http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
        type => "image"

解析输入标签,请参阅无恶意输入

这是产生上述输出的程序的源代码。

代码语言:javascript
复制
#!/usr/bin/env perl
#
# html_input_rx - pull out all <input> tags from (X)HTML src
#                  via simple regex processing
#
# Tom Christiansen <tchrist@perl.com>
# Sat Nov 20 10:17:31 MST 2010
#
################################################################

use 5.012;

use strict;
use autodie;
use warnings FATAL => "all";    
use subs qw{
    see_no_evil
    parse_input_tags
    input descape dequote
    load_patterns
};    
use open        ":std",
          IN => ":bytes",
         OUT => ":utf8";    
use Encode qw< encode decode >;

    ###########################################################

                        parse_input_tags 
                           see_no_evil 
                              input  

    ###########################################################

until eof(); sub parse_input_tags {
    my $_ = shift();
    our($Input_Tag_Rx, $Pull_Attr_Rx);
    my $count = 0;
    while (/$Input_Tag_Rx/pig) {
        my $input_tag = $+{TAG};
        my $place     = pos() - length ${^MATCH};
        printf "input tag #%d at character %d:\n", ++$count, $place;
        my %attr = ();
        while ($input_tag =~ /$Pull_Attr_Rx/g) {
            my ($name, $value) = @+{ qw< NAME VALUE > };
            $value = dequote($value);
            if (exists $attr{$name}) {
                printf "Discarding dup attr value '%s' on %s attr\n",
                    $attr{$name} // "<undef>", $name;
            } 
            $attr{$name} = $value;
        } 
        for my $name (sort keys %attr) {
            printf "  %10s => ", $name;
            my $value = descape $attr{$name};
            my  @Q; given ($value) {
                @Q = qw[  " "  ]  when !/'/ && !/"/;
                @Q = qw[  " "  ]  when  /'/ && !/"/;
                @Q = qw[  ' '  ]  when !/'/ &&  /"/;
                @Q = qw[ q( )  ]  when  /'/ &&  /"/;
                default { die "NOTREACHED" }
            } 
            say $Q[0], $value, $Q[1];
        } 
        print "\n";
    } 

}

sub dequote {
    my $_ = $_[0];
    s{
        (?<quote>   ["']      )
        (?<BODY>    
          (?s: (?! \k<quote> ) . ) * 
        )
        \k<quote> 
    }{$+{BODY}}six;
    return $_;
} 

sub descape {
    my $string = $_[0];
    for my $_ ($string) {
        s{
            (?<! % )
            % ( \p{Hex_Digit} {2} )
        }{
            chr hex $1;
        }gsex;
        s{
            & \043 
            ( [0-9]+ )
            (?: ; 
              | (?= [^0-9] )
            )
        }{
            chr     $1;
        }gsex;
        s{
            & \043 x
            ( \p{ASCII_HexDigit} + )
            (?: ; 
              | (?= \P{ASCII_HexDigit} )
            )
        }{
            chr hex $1;
        }gsex;

    }
    return $string;
} 

sub input { 
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <> };  
    my $encoding = "iso-8859-1";  # web default; wish we had the HTTP headers :(
    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};
        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv ) 
            (?&name) 
            (?&equals) 
            (?= (?&quote)? content-type )
            (?&value)    
        }six;
        next unless $meta =~ m{             $RX_SUBS
            (?= content ) (?&name) 
                          (?&equals) 
            (?<CONTENT>   (?&value)    )
        }six;
        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset ) (?&name) 
                          (?&equals) 
            (?<CHARSET>   (?&value)    )
        }six;
        if (lc $encoding ne lc $+{CHARSET}) {
            say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    } 
    return decode($encoding, $_);
}

sub see_no_evil {
    my $_ = shift();

    s{ <!    DOCTYPE  .*?         > }{}sx; 
    s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

    s{ <script> .*?  </script> }{}gsix; 
    s{ <!--     .*?        --> }{}gsx;

    return $_;
}

sub load_patterns { 

    our $RX_SUBS = qr{ (?(DEFINE)
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<start_tag>  < (?&might_white) )
        (?<end_tag>          
            (?&might_white)
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
             )
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    ) }six; 

    our $Meta_Tag_Rx = qr{                          $RX_SUBS 
        (?<META> 
            (?&start_tag) meta \b
            (?:
                (?&might_white) (?&nv_pair) 
            ) +
            (?&end_tag)
        )
    }six;

    our $Pull_Attr_Rx = qr{                         $RX_SUBS
        (?<NAME>  (?&name)      )
                  (?&equals) 
        (?<VALUE> (?&value)     )
    }six;

    our $Input_Tag_Rx = qr{                         $RX_SUBS 

        (?<TAG> (?&input_tag) )

        (?(DEFINE)

            (?<input_tag>
                (?&start_tag)
                input
                (?&might_white) 
                (?&attributes) 
                (?&might_white) 
                (?&end_tag)
            )

            (?<attributes>
                (?: 
                    (?&might_white) 
                    (?&one_attribute) 
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white) 
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute> 
                (?: (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
            # for LEGAL parse only, comment out next line 
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute>  (?&name) )

            (?<required_attribute> (?#no required attributes) )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                  accept
                | alt
                | bottom
                | check box
                | checked
                | disabled
                | file
                | hidden
                | image
                | max length
                | middle
                | name
                | password
                | radio
                | read only
                | reset
                | right
                | size
                | src
                | submit
                | text
                | top
                | type
                | value
            )

            (?<deprecated_attribute>
                  align
            )

            (?<standard_attribute>
                  access key
                | class
                | dir
                | ltr
                | id
                | lang
                | style
                | tab index
                | title
                | xml:lang
            )

            (?<event_attribute>
                  on blur
                | on change
                | on click
                | on dbl   click
                | on focus
                | on mouse down
                | on mouse move
                | on mouse out
                | on mouse over
                | on mouse up
                | on key   down
                | on key   press
                | on key   up
                | on select
            )
        )
    }six;

}

UNITCHECK {
    load_patterns();
} 

END {
    close(STDOUT) 
        || die "can't close stdout: $!";
} 

}

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/-100005211

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档