Regular Expressions (Regex)



Where Did They Come From?*

Two neurophysiologists, W. McCulloch and W. Pitts, developed models of how they believed the nervous system worked at the neuron level. A mathematician, Stephen Kleene, took the ball and ran with it, producing an algebraic representation called regular sets. It wasn't until 1968 that the first reference to their use in computer science was made, in the article "Regular Expression Search Algorithm" by the top dog hacker, Ken Thompson.

He built regular expressions into his version of qed, an editor that was the basis for the UNIX editor ed...not to be confused with Mr. Ed.

Perl's Regex Flavor

For the record, should someone quiz you on the street: Perl uses a traditional Nondeterministic Finite Automaton (NFA) regex match engine. It differs somewhat from a POSIX NFA and differs greatly (in underlying mechanics) from a Deterministic Finite Automaton (DFA).
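One place the engine type shows through is alternation. A traditional NFA tries alternatives left to right and keeps the first one that lets the whole match succeed, whereas a DFA or POSIX NFA would report the longest match. A minimal sketch of that behavior (the string and variable names are only illustrative):

```perl
use strict;
use warnings;

my $text = "tonight";

# Alternatives are tried left to right; the first one that allows an
# overall match wins, even though a later alternative is longer.
my ($first) = $text =~ /(to|tonight)/;

# Listing the longer alternative first changes the result.
my ($second) = $text =~ /(tonight|to)/;

print "first:  $first\n";
print "second: $second\n";
```

So with Perl, the order in which alternatives are written can matter.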

Note: See the recommended reading for more details on regex flavors and usage.

What Does A Regex Do?

A regex matches patterns in text. The patterns are written in a small language in which certain characters, called metacharacters, have special meaning.

What Are These Metacharacters?

A list of metacharacters (not including the ones in character classes, to be covered later) is given below:

    \     Quote the next metacharacter (backslash)
    ^     Match the beginning of the line (caret)
    .     Match any character, except newline (period)
    $     Match the end of the line (dollar)
    |     Alternation (vertical bar)
    ()    Grouping or backreference (parentheses)
    []    Character class (square brackets)

    \w    Match a "word" character (alphanumeric plus "_", same as [a-zA-Z0-9_])
    \W    Match a non-word character
    \s    Match a whitespace character
    \S    Match a non-whitespace character
    \d    Match a digit character
    \D    Match a non-digit character
    \b    Match a word boundary
    \B    Match a non-(word boundary)
    \A    Match only at beginning of string
    \Z    Match only at end of string (or just before a trailing newline)
    \G    Match only where previous m//g left off
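A few of these in action. This is a small sketch, not an exhaustive tour; the sample string and variable names are made up for illustration, and the + used below means "one or more".

```perl
use strict;
use warnings;

my $line = "Order 66 shipped to Mr. Ed";

# ^ anchors at the beginning, \d matches a digit
my $starts_with_order = ($line =~ /^Order \d+/) ? 1 : 0;

# \b is a word boundary, $ anchors at the end
my $ends_with_ed = ($line =~ /\bEd$/) ? 1 : 0;

# A backslash quotes the next metacharacter: \. is a literal period
my $has_mr_dot = ($line =~ /Mr\./) ? 1 : 0;

# | is alternation; () groups and captures the matched text into $1
my $verb = ($line =~ /(shipped|returned)/) ? $1 : "";

# In list context a match returns what the parentheses captured
my ($number) = $line =~ /(\d\d)/;

print "starts with Order: $starts_with_order\n";
print "ends with Ed:      $ends_with_ed\n";
print "has 'Mr.':         $has_mr_dot\n";
print "verb:              $verb\n";
print "number:            $number\n";
```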

How Perl Uses These Metacharacters

Perl uses these metacharacters with the following constructs:

    $var =~ m/pattern/;      # true if "pattern" matches somewhere in $var, false otherwise
    $var =~ /pattern/;       # same as above
    $string =~ s/look_for_this_pattern/replace_with_this/;   # replace string
    $name =~ tr/A-Z/a-z/;    # make lowercase (tr takes character ranges, not a regex, so no brackets)

m// and s/// have the general form:

    m/pattern/modifier;               # pattern match operation
    s/pattern/replacement/modifier;   # substitution operation

where modifier is one or more of the following values:

    i     Do case-insensitive pattern matching.
    m     Treat string as multiple lines.
    s     Treat string as single line.
    x     Use extended regular expressions.
    g     Match globally, that is, find all occurrences.
    o     Only compile pattern once.
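The constructs and modifiers above can be tried out with a short script. This is only a sketch; the strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $var = "Hello World";

# m// in a boolean test; /i ignores case
my $found = ($var =~ m/hello/i) ? "yes" : "no";

# s/// replaces only the first match ...
(my $once = "a-b-c") =~ s/-/+/;

# ... and every match when /g is added
(my $all = "a-b-c") =~ s/-/+/g;

# tr/// translates characters one for one
(my $lower = $var) =~ tr/A-Z/a-z/;

# /m lets ^ and $ match at line breaks inside the string
my $multi  = "one\ntwo";
my $m_flag = ($multi =~ /^two$/m) ? 1 : 0;

# /s lets . match a newline as well
my $s_flag = ($multi =~ /one.two/s) ? 1 : 0;

# /x allows whitespace and comments inside the pattern
my ($hour, $min) = "07:17" =~ /
    (\d\d)   # hours
    :        # literal colon
    (\d\d)   # minutes
/x;

print "found=$found once=$once all=$all lower=$lower\n";
print "m_flag=$m_flag s_flag=$s_flag hour=$hour min=$min\n";
```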

Character Classes

A character class, denoted by "[stuff_inside]", matches any one of the characters listed inside the brackets.

Character Class Metacharacters

Within character classes, certain characters have special meaning, such as:

    -     range thingy (defines a range of digits or alpha chars, such as 0-9 or a-z; if it's the first or last char in the class, take it literally)
    ^     negates the class (NOT these characters), but only when it's the first char
    .     just a dot (no special meaning)
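A quick sketch of character classes at work (sample strings and variable names are illustrative only; the + means "one or more"):

```perl
use strict;
use warnings;

# [aeiou] matches any single vowel; /g collects every match
my @vowels = "regex" =~ /[aeiou]/g;

# - between two characters defines a range
my ($hex) = "0xff00aa" =~ /0x([0-9a-f]+)/;

# ^ as the first character negates the class
my ($non_digits) = "abc123" =~ /^([^0-9]+)/;

# . inside a class is just a dot, and a leading - is a literal hyphen
my $has_sep = ("3.14" =~ /[-.]/) ? 1 : 0;

print "vowels:     @vowels\n";
print "hex:        $hex\n";
print "non-digits: $non_digits\n";
print "has sep:    $has_sep\n";
```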

Metacharacter Quantifiers

The following standard quantifiers are recognized:

    *        Match 0 or more times
    +        Match 1 or more times
    ?        Match 1 or 0 times
    {n}      Match exactly n times
    {n,}     Match at least n times
    {n,m}    Match at least n but not more than m times

They come directly after whatever they are meant to qualify, such as:

    /\.\.?/;         # match one or two dots
    $sp =~ /^ +/;    # match string beginning with one or more spaces

* Courtesy of "Mastering Regular Expressions" by Jeffrey Friedl.
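To see the quantifiers side by side, here is a small sketch. The test strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# {n}: exactly n times (a yyyy/mm/dd shaped date)
my $is_date = ("1997/05/02" =~ /^\d{4}\/\d{2}\/\d{2}$/) ? 1 : 0;

# ?: the preceding item is optional
my @spellings = grep { /^colou?r$/ } qw(color colour colouur);

# *: zero occurrences are fine; +: at least one is required
my $star = ("ac" =~ /^ab*c$/) ? 1 : 0;
my $plus = ("ac" =~ /^ab+c$/) ? 1 : 0;

# {n,m}: at least n but not more than m
my @two_or_three = grep { /^\d{2,3}$/ } qw(1 12 123 1234);

# {n,}: at least n (here, a run of two or more spaces)
my ($run) = "xx   yy" =~ /( {2,})/;

print "is_date=$is_date star=$star plus=$plus\n";
print "spellings: @spellings\n";
print "two or three digits: @two_or_three\n";
print "run length: ", length($run), "\n";
```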

Last Modified: $Date: 1997/05/02 07:17:48 $