Regex Details
Character Classes
Character classes represent a set of possible characters (alphabetic,
numeric, punctuation, etc.) that can match at one position, not a
sequence. In other words, the character class "[123]" is
the same as "[132]": either one matches a
single character whose value is
"1", "2" or "3".
For example:
$answer =~ /[yY]/; # match either a 'y' or 'Y'
$answer =~ /y/i; # same as above
$start_alpha =~ /[a-zA-Z]\d+/; # a substring that has one alpha
# char followed by one or more digits
$q_not_u =~ /q[^u]/; # match a 'q' that is not followed by a 'u'.
# why doesn't this match 'Qantas' or 'Iraq'?
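The question in the last comment can be checked directly. The following is a minimal sketch, assuming a hypothetical word list that is not part of the original examples. It suggests the answer: "[^u]" must still consume one character, so a trailing 'q' cannot match, and without the /i modifier an uppercase 'Q' is not matched by 'q'.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only to illustrate the point.
my @words = ('Qantas', 'Iraq', 'qantas', 'quit', 'Iraqi');
my %matched;

for my $word (@words) {
    # [^u] has to match a real character after the 'q'.
    $matched{$word} = ($word =~ /q[^u]/) ? 1 : 0;
    print "$word: ", ($matched{$word} ? 'match' : 'no match'), "\n";
}
```

Only 'qantas' and 'Iraqi' match here: each has a lowercase 'q' followed by some character other than 'u'.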
Alternation, and You
Alternation matches one of several subexpressions. It is represented by
the vertical bar, "|", and gives the regex two or more
alternatives, such as:
$version =~ /(version|revision)/; # match either way of writing version
$name =~ /bob|robert/i; # nickname or fullname
# match a number of the form x., x, x.y or .z, where x, y, and z can
# each be many digits
$floater =~ /^(\d+\.?\d*|\.\d+)$/;
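To see which forms the last regex accepts, it can be run against a few candidates. This is only a sketch: the candidate strings and the names $float_re and %result are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as above, precompiled with qr// so it can be reused.
my $float_re = qr/^(\d+\.?\d*|\.\d+)$/;

my %result;
for my $candidate ('12.', '12', '12.5', '.5', '.', 'abc', '1.2.3') {
    $result{$candidate} = ($candidate =~ $float_re) ? 1 : 0;
    printf "%-6s %s\n", $candidate, $result{$candidate} ? 'number' : 'rejected';
}
```

A lone "." is rejected because the first alternative needs at least one digit before the dot and the second needs at least one digit after it.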
Parenthesis Metacharacter
The parenthesis metacharacters are used for two things:
- Grouping parts of a regex together
- Capturing text for backreferences
Grouping is useful together with qualifiers or
alternation, as in:
$time =~ /(am|pm)$/; # find either 'am' or 'pm' at the end of the string
$num =~ /0x([a-fA-F0-9]+) /; # '0x', one or more hex digits, then a space
The really interesting part is using parentheses for backreferences,
where whatever matched inside the parentheses is remembered
for later use. Perl saves these remembered matches in variables
with limited lifespans, named $1 for the first parenthesized
expression, $2 for the second, and so on. Real-life
examples can demonstrate this:
$header =~ /^Date: (.*)/; # take everything after 'Date: ' and
# store in $1
$date = $1;
$line =~ /version:\s+(\d+)\.(\d+)/i;
$major_version = $1;
$minor_version = $2;
# get the root dir and the next subdir from the current working dir
$ENV{PWD} =~ m"/(\w+)/(\w+)"; # note delimiters
$root = $1;
$next = $2;
print "root path = $root, next dir = $next\n";
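Because $1, $2 and so on are only meaningful after a successful match, it is usually safer to test the match before reading them. A minimal sketch, assuming two made-up input lines and the hypothetical array @parsed:

```perl
use strict;
use warnings;

# Hypothetical input: one line that matches and one that does not.
my @lines = ('Version: 5.004', 'no version here');
my @parsed;

for my $line (@lines) {
    if ($line =~ /version:\s+(\d+)\.(\d+)/i) {
        # The match succeeded, so $1 and $2 belong to this line.
        push @parsed, "$1.$2";
    } else {
        # The match failed; do not read $1 or $2 here.
        push @parsed, 'unparsed';
    }
}
print join(', ', @parsed), "\n";
```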
Backtracking
Backtracking is a mechanism that NFA regex engines use: the engine places a
bookmark at each location in the string where an optional match can
take place, so that it can return to that location if a later part of
the regex fails. For example, say we want to match the string
"ac" with the regex /ab?c/. After the first
character has been tried, the match looks like:
String    regex
a c       a b? c
^         ^
(with spaces between characters to show where we are in the match, i.e.,
we have just finished the first character match test). At
this point, since the "b" is optional, the NFA
engine remembers where it is, in case the attempt to match it fails.
Moving to the next character, we try to
match the "c" in the string against the "b" in the regex and
fail. Because of the bookmark it saved, the NFA knows where the last
successful match was: it returns there, skips the optional "b", and
tries the last element of the regex. The "c" matches, which
completes the total match.
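The walk-through above can be imitated with a toy matcher. This is a sketch for illustration only, not how Perl's engine is implemented: match_here and its pattern representation are hypothetical, it handles only literal characters and "?", and it only tries to match from the start of the string.

```perl
use strict;
use warnings;

# Toy matcher: the pattern is a list of [character, is_optional] pairs.
sub match_here {
    my ($str, $pos, $pat, $idx, $log) = @_;
    return $pos if $idx == @$pat;    # whole pattern used up: success

    my ($char, $optional) = @{ $pat->[$idx] };

    # First choice: consume the character if it is there.
    if ($pos < length($str) && substr($str, $pos, 1) eq $char) {
        my $end = match_here($str, $pos + 1, $pat, $idx + 1, $log);
        return $end if defined $end;
    }

    # Second choice: return to the bookmark and skip the optional element.
    if ($optional) {
        push @$log, "skip optional '$char' at offset $pos";
        return match_here($str, $pos, $pat, $idx + 1, $log);
    }
    return;    # no choices left: this path fails
}

my @pattern = (['a', 0], ['b', 1], ['c', 0]);    # stands for /ab?c/

my (@log_ac, @log_abc);
my $end_ac  = match_here('ac',  0, \@pattern, 0, \@log_ac);
my $end_abc = match_here('abc', 0, \@pattern, 0, \@log_abc);

print "ac:  matched up to offset $end_ac (@log_ac)\n";
print "abc: matched up to offset $end_abc (@log_abc)\n";
```

For "ac" the log records one fallback to the saved position; for "abc" the optional "b" is consumed and no fallback is needed.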
Greediness of Qualifiers
Greediness refers to the tendency of the regex engine to consume as
many characters as possible that match what a qualifier is qualifying,
before checking the rest of the regex (if there is
any). The qualifiers (more commonly called quantifiers) "?",
"*" and "+" are greedy,
while "|" is not. (This is true for a traditional
NFA but not for a POSIX NFA.)
For example, the regex "/.*[0-9]/" operating on
the string:
"eating more than 2 or 3 mangoes a day requires lots of flossing"
will include the following sequence of events:
- The dot metacharacter followed by the star qualifier tells the
regex engine to read the entire line
- The engine then checks the last character to see if it is a
digit
- After that fails, it backs up to the second-to-last character and
tests whether it is a digit
- The engine keeps backing up one character at a time in search of a digit
- When the "3" is reached, the search ends
- The matching string is "eating more than 2 or 3".
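The result of these steps can be confirmed by capturing the whole match. The sketch below also shows Perl 5's minimal form "*?", which is not covered in the text above and is included only for contrast. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'eating more than 2 or 3 mangoes a day requires lots of flossing';

# Greedy: .* takes everything, then backs up to the last digit.
my ($greedy) = $text =~ /(.*[0-9])/;

# Minimal: .*? takes as little as possible, stopping at the first digit.
my ($minimal) = $text =~ /(.*?[0-9])/;

print "greedy:  '$greedy'\n";
print "minimal: '$minimal'\n";
```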
Pitfalls - Near Death Regex Experiences
Study these examples and see if you can tell why they go south.
We get the wrong match here...we want the area code:
$phone = 'Phone: 602-555-1212';
$phone =~ m/.*([0-9][0-9][0-9])/;
$area_code = $1; # wrong! $area_code holds '212', NOT '602'
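One possible repair, offered as a sketch rather than the only fix, is to drop the leading ".*" and describe the whole number so the first group of three digits is the one captured. It assumes the phone number always has the form ddd-ddd-dddd. The name $greedy_capture is hypothetical.

```perl
use strict;
use warnings;

my $phone = 'Phone: 602-555-1212';

# The original pattern: the greedy .* leaves only the last three digits.
my ($greedy_capture) = $phone =~ m/.*([0-9][0-9][0-9])/;

# Describe the full number instead, and capture its first group of digits.
my ($area_code) = $phone =~ m/([0-9]{3})-[0-9]{3}-[0-9]{4}/;

print "greedy capture: $greedy_capture, area code: $area_code\n";
```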
Both of these strings match the regex "/[0-9]*/"
when we may not have wanted them to:
$a = 'an integer 1234 here';
$b = 'no numbers here';
print 'y' if ($a =~ /[0-9]*/); # prints 'y'
print 'y' if ($b =~ /[0-9]*/); # also, prints 'y'...doh!
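The reason is that "*" allows zero occurrences, so the regex succeeds by matching the empty string at the very start of either string. A sketch of the usual remedy, using "+" to require at least one digit (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $with_digits    = 'an integer 1234 here';
my $without_digits = 'no numbers here';

# With '*', the match succeeds at offset 0 and captures nothing at all.
my ($star_capture) = $with_digits =~ /([0-9]*)/;

# With '+', at least one digit must be present for the match to succeed.
my $plus_with    = ($with_digits    =~ /[0-9]+/) ? 1 : 0;
my $plus_without = ($without_digits =~ /[0-9]+/) ? 1 : 0;

print "'*' captured '$star_capture'\n";
print "'+' results: $plus_with, $plus_without\n";
```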
One version of a regex to match a quoted string with possible embedded
escape codes is:
/"(\\.|[^"\\])*"/
Consider using this regex on the string:
"Error 143\nCall your system operator"
Since this string has more non-escaped characters than escaped ones, there would
be much less backtracking if the order of the alternation were swapped,
such as:
/"([^"\\]|\\.)*"/
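Swapping the alternatives changes only the order in which they are tried, not what is matched. The sketch below checks that both versions find the same quoted string. The surrounding text in $text and the variable names are hypothetical, and no timing is measured.

```perl
use strict;
use warnings;

# Single quotes keep \n as two characters: a backslash and an 'n'.
my $text = 'msg = "Error 143\nCall your system operator" (logged)';

my $escaped_first = qr/"(\\.|[^"\\])*"/;    # escape codes tried first
my $plain_first   = qr/"([^"\\]|\\.)*"/;    # ordinary characters tried first

# The outer parentheses capture the whole quoted string.
my ($match_a) = $text =~ /($escaped_first)/;
my ($match_b) = $text =~ /($plain_first)/;

print "escaped first: $match_a\n";
print "plain first:   $match_b\n";
```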
Bottom Line
For small pattern matching examples where time is not an issue, don't
worry. For wading through megabytes of data, pay attention to the
construction of your regex (and read the section in
"Programming Perl" on efficiency).
Regex Design Guidelines
- Study the context in which the regex will be used (it will tailor the regex)
- Determine exactly what you want to match and make the match happen
as quickly as possible (i.e., narrow the "search")
- Watch out for greedy qualifiers
Last Modified: $Date: 1997/09/18 08:54:21 $