试着匹配<input>
使用此模式键入“隐藏”字段:
/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/
这是样本表格数据:
<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />
但不确定type
,,,name
,和value
属性总是以相同的顺序出现。如果type
属性,匹配将失败
如何更改我的模式,使其匹配,而不管属性在<input>
标签?
发布于 2018-01-09 09:41:32
下面是一些Perl(伪)代码:
my $html = readLargeInputFile();
my @input_tags = $html =~ m/
(
<input # Starts with "<input"
(?=[^>]*?type="hidden") # Use lookahead to make sure that type="hidden"
[^>]+ # Grab the rest of the tag...
\/> # ...except for the />, which is grabbed here
)/xgm;
# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />
foreach my $input_tag (@input_tags)
{
my $hash_ref = {};
# Now extract each of the fields one at a time.
($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;
# Put $hash_ref in a list or something, or otherwise process it
}
这里的基本原则是,不要试图对一个正则表达式做太多事情。正如您注意到的,正则表达式强制执行一定数量的顺序。因此,您需要做的是首先匹配要提取的内容的上下文,然后对所需的数据进行子匹配。
发布于 2018-01-09 10:27:09
首先,将展示解析正则表达式的HTML,解析器的核心是:
for (;;) {
given ($html) {
last when (pos || 0) >= length;
printf "\@%d=", (pos || 0);
print "doctype " when / \G (?&doctype) $RX_SUBS /xgc;
print "cdata " when / \G (?&cdata) $RX_SUBS /xgc;
print "xml " when / \G (?&xml) $RX_SUBS /xgc;
print "xhook " when / \G (?&xhook) $RX_SUBS /xgc;
print "script " when / \G (?&script) $RX_SUBS /xgc;
print "style " when / \G (?&style) $RX_SUBS /xgc;
print "comment " when / \G (?&comment) $RX_SUBS /xgc;
print "tag " when / \G (?&tag) $RX_SUBS /xgc;
print "untag " when / \G (?&untag) $RX_SUBS /xgc;
print "nasty " when / \G (?&nasty) $RX_SUBS /xgc;
print "text " when / \G (?&nontag) $RX_SUBS /xgc;
default {
die "UNCLASSIFIED: " .
substr($_, pos || 0, (length > 65) ? 65 : length);
}
}
}
html_input_rx
下面所包含的程序产生了以下输出,这样您就可以看到使用regexes解析HTML对您想要做的工作很好:
% html_input_rx Amazon.com-_Online_Shopping_for_Electronics,_Apparel,_Computers,_Books,_DVDs_\&_more.htm
input tag #1 at character 9955:
class => "searchSelect"
id => "twotabsearchtextbox"
name => "field-keywords"
size => "50"
style => "width:100%; background-color: #FFF;"
title => "Search for"
type => "text"
value => ""
input tag #2 at character 10335:
alt => "Go"
src => "http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
type => "image"
这是产生上述输出的程序的源代码。
#!/usr/bin/env perl
#
# html_input_rx - pull out all <input> tags from (X)HTML src
# via simple regex processing
#
# Tom Christiansen <tchrist@perl.com>
# Sat Nov 20 10:17:31 MST 2010
#
################################################################
use 5.012;
use strict;
use autodie;
use warnings FATAL => "all";
use subs qw{
see_no_evil
parse_input_tags
input descape dequote
load_patterns
};
use open ":std",
IN => ":bytes",
OUT => ":utf8";
use Encode qw< encode decode >;
###########################################################
parse_input_tags
see_no_evil
input
###########################################################
until eof(); sub parse_input_tags {
my $_ = shift();
our($Input_Tag_Rx, $Pull_Attr_Rx);
my $count = 0;
while (/$Input_Tag_Rx/pig) {
my $input_tag = $+{TAG};
my $place = pos() - length ${^MATCH};
printf "input tag #%d at character %d:\n", ++$count, $place;
my %attr = ();
while ($input_tag =~ /$Pull_Attr_Rx/g) {
my ($name, $value) = @+{ qw< NAME VALUE > };
$value = dequote($value);
if (exists $attr{$name}) {
printf "Discarding dup attr value '%s' on %s attr\n",
$attr{$name} // "<undef>", $name;
}
$attr{$name} = $value;
}
for my $name (sort keys %attr) {
printf " %10s => ", $name;
my $value = descape $attr{$name};
my @Q; given ($value) {
@Q = qw[ " " ] when !/'/ && !/"/;
@Q = qw[ " " ] when /'/ && !/"/;
@Q = qw[ ' ' ] when !/'/ && /"/;
@Q = qw[ q( ) ] when /'/ && /"/;
default { die "NOTREACHED" }
}
say $Q[0], $value, $Q[1];
}
print "\n";
}
}
sub dequote {
my $_ = $_[0];
s{
(?<quote> ["'] )
(?<BODY>
(?s: (?! \k<quote> ) . ) *
)
\k<quote>
}{$+{BODY}}six;
return $_;
}
sub descape {
my $string = $_[0];
for my $_ ($string) {
s{
(?<! % )
% ( \p{Hex_Digit} {2} )
}{
chr hex $1;
}gsex;
s{
& \043
( [0-9]+ )
(?: ;
| (?= [^0-9] )
)
}{
chr $1;
}gsex;
s{
& \043 x
( \p{ASCII_HexDigit} + )
(?: ;
| (?= \P{ASCII_HexDigit} )
)
}{
chr hex $1;
}gsex;
}
return $string;
}
sub input {
our ($RX_SUBS, $Meta_Tag_Rx);
my $_ = do { local $/; <> };
my $encoding = "iso-8859-1"; # web default; wish we had the HTTP headers :(
while (/$Meta_Tag_Rx/gi) {
my $meta = $+{META};
next unless $meta =~ m{ $RX_SUBS
(?= http-equiv )
(?&name)
(?&equals)
(?= (?"e)? content-type )
(?&value)
}six;
next unless $meta =~ m{ $RX_SUBS
(?= content ) (?&name)
(?&equals)
(?<CONTENT> (?&value) )
}six;
next unless $+{CONTENT} =~ m{ $RX_SUBS
(?= charset ) (?&name)
(?&equals)
(?<CHARSET> (?&value) )
}six;
if (lc $encoding ne lc $+{CHARSET}) {
say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
$encoding = $+{CHARSET};
}
}
return decode($encoding, $_);
}
sub see_no_evil {
my $_ = shift();
s{ <! DOCTYPE .*? > }{}sx;
s{ <! \[ CDATA \[ .*? \]\] > }{}gsx;
s{ <script> .*? </script> }{}gsix;
s{ <!-- .*? --> }{}gsx;
return $_;
}
sub load_patterns {
our $RX_SUBS = qr{ (?(DEFINE)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?"ed_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
) }six;
our $Meta_Tag_Rx = qr{ $RX_SUBS
(?<META>
(?&start_tag) meta \b
(?:
(?&might_white) (?&nv_pair)
) +
(?&end_tag)
)
}six;
our $Pull_Attr_Rx = qr{ $RX_SUBS
(?<NAME> (?&name) )
(?&equals)
(?<VALUE> (?&value) )
}six;
our $Input_Tag_Rx = qr{ $RX_SUBS
(?<TAG> (?&input_tag) )
(?(DEFINE)
(?<input_tag>
(?&start_tag)
input
(?&might_white)
(?&attributes)
(?&might_white)
(?&end_tag)
)
(?<attributes>
(?:
(?&might_white)
(?&one_attribute)
) *
)
(?<one_attribute>
\b
(?&legal_attribute)
(?&might_white) = (?&might_white)
(?:
(?"ed_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?: (?&optional_attribute)
| (?&standard_attribute)
| (?&event_attribute)
# for LEGAL parse only, comment out next line
| (?&illegal_attribute)
)
)
(?<illegal_attribute> (?&name) )
(?<required_attribute> (?#no required attributes) )
(?<optional_attribute>
(?&permitted_attribute)
| (?&deprecated_attribute)
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<permitted_attribute>
accept
| alt
| bottom
| check box
| checked
| disabled
| file
| hidden
| image
| max length
| middle
| name
| password
| radio
| read only
| reset
| right
| size
| src
| submit
| text
| top
| type
| value
)
(?<deprecated_attribute>
align
)
(?<standard_attribute>
access key
| class
| dir
| ltr
| id
| lang
| style
| tab index
| title
| xml:lang
)
(?<event_attribute>
on blur
| on change
| on click
| on dbl click
| on focus
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
| on select
)
)
}six;
}
UNITCHECK {
load_patterns();
}
END {
close(STDOUT)
|| die "can't close stdout: $!";
}
}
https://stackoverflow.com/questions/-100005211
复制相似问题