Regex Details



Character Classes

Character classes represent a set of possible characters (alphabetic, numeric, punctuation, etc.) that can match at a single position, not a sequence. In other words, the character class "[123]" is the same as "[132]": either one matches a single character whose value is "1", "2" or "3". For example:

    $answer =~ /[yY]/;              # match either a 'y' or 'Y'
    $answer =~ /y/i;                # same as above
    $start_alpha =~ /[a-zA-Z]\d+/;  # a substring that has one alpha
                                    # char followed by one or more digits
    $q_not_u =~ /q[^u]/;            # match a 'q' followed by a character
                                    # that is not a 'u'.
                                    # why doesn't this match 'Qantas' or 'Iraq'?
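The character class examples can be checked with a short script. This is only a sketch: the helper name matches() and the sample words are made up for illustration and are not part of the original examples.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $regex, 0 otherwise.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

# The order of characters inside a class does not matter.
for my $digit (qw(1 2 3)) {
    print "$digit: ", matches($digit, qr/[123]/), " ", matches($digit, qr/[132]/), "\n";
}

# /q[^u]/ needs a lowercase 'q' AND one more character that is not 'u'.
for my $word (qw(qat quit Qantas Iraq)) {
    print "$word: ", matches($word, qr/q[^u]/), "\n";
}
```

'Qantas' fails because its "Q" is uppercase and the match is case sensitive. 'Iraq' fails because "[^u]" must still match some character, and nothing follows the "q".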

Alternation, and You

Alternation matches one of several subexpressions. It is represented by the vertical bar, "|", and gives the regex two or more alternatives, such as:

    $version =~ /(version|revision)/;  # match either way of writing version
    $name =~ /bob|robert/i;            # nickname or full name

    # match number of the form x., x, x.y or .z, where x, y, and z can be
    # many digits
    $floater =~ /^(\d+\.?\d*|\.\d+)$/;
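To see which inputs the floating-point pattern accepts, it can be run against a few candidates. This is a minimal sketch; the helper name is_float() and the test values are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# The floating-point pattern from the text, precompiled with qr//.
my $float_re = qr/^(\d+\.?\d*|\.\d+)$/;

# Hypothetical helper: returns 1 if $text has the form x., x, x.y or .z
sub is_float {
    my ($text) = @_;
    return $text =~ $float_re ? 1 : 0;
}

for my $candidate ('42', '42.', '3.14', '.5', '.', 'abc', '1.2.3') {
    printf "%-6s %s\n", $candidate, is_float($candidate) ? 'match' : 'no match';
}
```

The parentheses around the alternation matter here: they make "^" and "$" apply to both alternatives. Without them, "^" would belong only to the first alternative and "$" only to the second.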

Parenthesis Metacharacter

Parentheses are used for two things:
  1. Grouping parts of a regex together and
  2. Creating backreferences.
We can see simple uses of grouping with qualifiers or alternation, as in:

    $time =~ /(am|pm)$/;          # find either 'am' or 'pm' at the end of the string
    $num =~ /0x([a-fA-F0-9]+) /;  # one or more hex digits after '0x',
                                  # followed by a space

The really interesting part is using parentheses for backreferences, where what was matched inside the parentheses is remembered for later use. Perl saves these remembered matches in variables, with limited lifespans, named $1 for the first parenthesized expression, $2 for the second, and so on. Real-life examples can demonstrate this:

    $header =~ /^Date: (.*)/;     # take everything after 'Date: ' and
                                  # store in $1
    $date = $1;

    $line =~ /version:\s+(\d+)\.(\d+)/i;
    $major_version = $1;
    $minor_version = $2;

    # get the root dir and the next subdir from the current working dir
    $ENV{PWD} =~ m"/(\w+)/(\w+)"; # note delimiters
    $root = $1;
    $next = $2;
    print "root path = $root, next dir = $next\n";
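Because $1 and $2 keep whatever the last successful match stored, it is safer to test the match before reading them. The sketch below shows one way to do that; the helper name parse_version() and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (major, minor) on success, an empty list on failure.
sub parse_version {
    my ($line) = @_;
    if ($line =~ /version:\s+(\d+)\.(\d+)/i) {
        return ($1, $2);    # only read $1 and $2 after a successful match
    }
    return;
}

my ($major_version, $minor_version) = parse_version('Version: 5.004');
print "major = $major_version, minor = $minor_version\n";

# In list context a match returns its captured groups directly,
# so the $1 and $2 variables are not needed at all.
my ($root, $next) = '/home/bob/src' =~ m"/(\w+)/(\w+)";
print "root path = $root, next dir = $next\n";
```

With these sample inputs the script reports major 5, minor 004, root "home" and next dir "bob".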

Backtracking

Backtracking is a mechanism that NFA regex engines use: wherever the regex offers a choice, such as an optional item, the engine places a bookmark at that location in the string so that it can return there if the path it chose fails. For example, say we want to match the string "ac" with the regex /ab?c/. After the first character has been tried, the match looks like:

    String    Regex
    a c       a b? c
     ^         ^

(with spaces between characters to show where we are in the match, i.e., we have just finished the first character match test). At this point, since the "b" is optional, the NFA engine remembers where it is, in the event of a non-match.

Now shifting gears, we move to the next character and try to match the "b" in the regex against the "c" in the string. This fails, but the NFA knows where the last successful match ended because of the bookmark it saved. It returns there, skips the optional "b", tries the "c" in the regex against the "c" in the string, finds a match and completes the total match.
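The bookmark idea can be imitated with a toy matcher. The code below is an illustration only, not how Perl's engine is implemented: the function match_from(), its pattern representation and its log messages are all made up for this sketch, and it only handles literal characters with an optional "?".

```perl
use strict;
use warnings;

# Toy backtracking matcher (an illustration, not Perl's real engine).
# A pattern is a list of [character, is_optional] pairs, so /ab?c/ becomes
# (['a', 0], ['b', 1], ['c', 0]). Matching is anchored at string position 0.
sub match_from {
    my ($elements, $e, $string, $s, $log) = @_;
    return 1 if $e == @$elements;    # every pattern element has matched

    my ($char, $optional) = @{ $elements->[$e] };
    push @$log, "bookmark at position $s before optional '$char'" if $optional;

    # First choice: consume one character of the string.
    if ($s < length($string) && substr($string, $s, 1) eq $char) {
        return 1 if match_from($elements, $e + 1, $string, $s + 1, $log);
    }

    # That path failed. An optional element may be skipped from the bookmark.
    if ($optional) {
        push @$log, "back to bookmark $s, skipping '$char'";
        return match_from($elements, $e + 1, $string, $s, $log);
    }
    return 0;
}

my @pattern = (['a', 0], ['b', 1], ['c', 0]);
my @log;
my $matched = match_from(\@pattern, 0, 'ac', 0, \@log);
print "matched: $matched\n";
print "  $_\n" for @log;

# Perl's own engine agrees on the outcome.
print "real engine: ", ('ac' =~ /ab?c/ ? 1 : 0), "\n";
```

For the string "ac" the log records one bookmark and one return to it, which mirrors the walk-through above. To watch the real engine at work, Perl's re pragma offers a debug mode (use re 'debug';) that prints a trace of the match to standard error.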

Greediness of Qualifiers

Greediness refers to the tendency of the qualifiers to suck up as many characters as possible that match what they are qualifying, before the engine checks the rest of the regex (if there is any). The qualifiers "?", "*" and "+" (more commonly called quantifiers) are greedy, while alternation, "|", is not: the engine takes the first alternative that leads to a match, not the longest one. (This is true for traditional NFA but not for POSIX NFA.)

For example, the regex "/.*[0-9]/" operating on the string:

"eating more than 2 or 3 mangoes a day requires lots of flossing"

will include the following sequence of events:

  1. The dot metacharacter followed by the star qualifier tells the regex engine to read the entire line
  2. With nothing left for "[0-9]" to match, the engine backs up one character and tests whether the last character is a digit
  3. After that fails, it backs up to the second-to-last character and tests whether it is a digit
  4. The engine keeps backing up one character at a time in search of a digit
  5. When the "3" is reached, the search ends
  6. The matching string is "eating more than 2 or 3".
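The sequence above can be confirmed by capturing what the regex matched. The sketch below also shows the non-greedy form ".*?" available in Perl 5, which is not covered in the text above and is included only for contrast; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $sentence = 'eating more than 2 or 3 mangoes a day requires lots of flossing';

# Wrap the whole regex in parentheses to capture exactly what it matched.
my ($greedy) = $sentence =~ /(.*[0-9])/;
print "greedy:     '$greedy'\n";

# A '?' after the star makes it match as few characters as possible.
my ($minimal) = $sentence =~ /(.*?[0-9])/;
print "non-greedy: '$minimal'\n";
```

The greedy version stops at the last digit in the line and the non-greedy version stops at the first one.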

Pitfalls - Near Death Regex Experiences

Study these examples and see if you can tell why they go south.

We get the wrong match here...we want the area code:

    $phone = 'Phone: 602-555-1212';
    $phone =~ m/.*([0-9][0-9][0-9])/;
    $area_code = $1;    # wrong! $area_code holds '212', NOT '602'

Both of these strings match the regex "/[0-9]*/" when we may not have wanted them to:

    $a = 'an integer 1234 here';
    $b = 'no numbers here';
    print 'y' if ($a =~ /[0-9]*/);  # prints 'y'
    print 'y' if ($b =~ /[0-9]*/);  # also prints 'y'...doh!

One version of a regex to match a quoted string with possible embedded escape codes is:

/"(\\.|[^"\\])*"/

Consider using this regex on the string:

"Error 143\nCall your system operator"

Since there are more non-escaped characters than escaped ones, there would be much less backtracking if the order of the alternation were swapped, such as:

/"([^"\\]|\\.)*"/
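The three pitfalls above can be explored with a short script. This is a sketch under stated assumptions: the corrected patterns are one possible fix each, not the only one, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Pitfall 1: the greedy .* swallows everything it can, so only the
# last three digits are left for the capture group.
my $phone = 'Phone: 602-555-1212';
my ($wrong)     = $phone =~ /.*([0-9][0-9][0-9])/;
my ($area_code) = $phone =~ /([0-9][0-9][0-9])/;   # no .*: first three digits
print "with .*: $wrong, without: $area_code\n";

# Pitfall 2: '*' is satisfied by zero digits, so it matches any string.
# '+' demands at least one digit.
my $no_numbers   = 'no numbers here';
my $star_matches = $no_numbers =~ /[0-9]*/ ? 1 : 0;
my $plus_matches = $no_numbers =~ /[0-9]+/ ? 1 : 0;
print "star: $star_matches, plus: $plus_matches\n";

# Pitfall 3: both orderings of the alternation accept the same text;
# only the amount of work differs. Single quotes keep the backslash
# and the 'n' as two literal characters.
my $message = '"Error 143\nCall your system operator"';
my ($escaped_first) = $message =~ /("(\\.|[^"\\])*")/;
my ($plain_first)   = $message =~ /("([^"\\]|\\.)*")/;
print "same match\n" if $escaped_first eq $plain_first;
```

To measure how much the ordering matters on real data, the standard Benchmark module can time both versions of the quoted-string regex.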

Bottom Line

For small pattern matching examples where time is not an issue, don't worry. For wading through megabytes of data, pay attention to the construction of your regex (and read the section in "Programming Perl" on efficiency).


Last Modified: $Date: 1997/09/18 08:54:21 $