Regular Expressions (Regex)
Where Did They Come From?*
Two neurophysiologists, W. McCulloch and W. Pitts, developed models of
how they believed the nervous system worked at the neuron level. A
mathematician, Stephen Kleene, took the ball and ran with it, describing
these models in an algebra he called regular sets. It wasn't
until 1968 that the first reference to their use in computer science was
made, in the article "Regular Expression Search Algorithm" by the
top dog hacker, Ken Thompson.
He wrote a version of qed, an editor
that was the basis for the UNIX editor ed...not to be
confused with Mr. Ed.
Perl's Regex Flavor
For the record, Perl uses a traditional Nondeterministic Finite
Automaton (NFA) regex match engine, should someone quiz you on the
street. It differs somewhat from a POSIX NFA and differs greatly (in
underlying mechanics) from a Deterministic Finite Automaton (DFA).
Note: See the recommended reading for more details on regex flavors and
usage.
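One visible consequence of the traditional NFA design is that alternation is ordered. The sketch below uses a made-up sample string to show it; the variable names are illustrative only. A POSIX engine, which must report the leftmost-longest match, would be expected to return the longer word in both cases.

```perl
use strict;
use warnings;

my $text = "tonight";   # hypothetical sample string

# A traditional NFA tries alternatives left to right and keeps the first
# one that lets the overall match succeed, even if a longer one exists.
my ($first_wins) = $text =~ /(to|tonight)/;

# Reordering the alternatives changes the result.
my ($longer_first) = $text =~ /(tonight|to)/;

print "first_wins=$first_wins longer_first=$longer_first\n";
```

So in Perl, put the longer alternative first when it should take priority.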
What Does A Regex Do?
A regex matches patterns in text. A pattern is written in a small language
in which certain characters, called metacharacters, have special meaning.
What Are These Metacharacters?
The metacharacters (not including those used inside character classes,
to be covered later) are listed below:
\     Quote the next metacharacter (backslash)
^     Match the beginning of the line (caret)
.     Match any character, except newline (period)
$     Match the end of the line (dollar)
|     Alternation (vertical bar)
()    Grouping, and capturing for backreferences (parentheses)
[]    Character class (square brackets)
\w    Match a "word" character (alphanumeric plus "_", same as [a-zA-Z0-9_])
\W    Match a non-word character
\s    Match a whitespace character
\S    Match a non-whitespace character
\d    Match a digit character
\D    Match a non-digit character
\b    Match a word boundary
\B    Match a non-(word boundary)
\A    Match only at beginning of string
\Z    Match only at end of string, or just before a newline at the end
\G    Match only where previous m//g left off
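A few of the entries above can be tried out in a minimal sketch like the following. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $line = "order 42 shipped to cat_lover";

# \d+ matches a run of digits; the parentheses capture it.
my ($number) = $line =~ /(\d+)/;

# ^ anchors the match to the start of the string.
my $starts_with_order = $line =~ /^order/ ? 1 : 0;

# \b marks a word boundary. "cat" in "cat_lover" is not a whole word,
# because the underscore counts as a \w character.
my $has_word_cat = $line =~ /\bcat\b/ ? 1 : 0;

# | offers alternatives; \. is a literal dot because \ quotes it;
# $ anchors the match to the end.
my $is_image = "photo.jpg" =~ /\.(jpg|png)$/ ? 1 : 0;

print "number=$number starts=$starts_with_order ",
      "cat=$has_word_cat image=$is_image\n";
# prints: number=42 starts=1 cat=0 image=1
```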
How Perl Uses These Metacharacters
Perl uses these metacharacters with the following constructs:
$var =~ m/pattern/;   # true if "pattern" matches somewhere in $var, false otherwise
$var =~ /pattern/;    # same as above
$string =~ s/look_for_this_pattern/replace_with_this/;   # replace first occurrence
$name =~ tr/A-Z/a-z/;   # make lowercase (tr takes plain ranges, not [ ] classes)
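Here is a small runnable sketch of the three constructs, using invented sample values. Note that tr/// is a character-by-character translation rather than a regex, so its ranges are written without square brackets.

```perl
use strict;
use warnings;

my $var = "The quick brown fox";   # hypothetical sample text

# m// (or bare //) yields true when the pattern matches.
my $found = ($var =~ /quick/) ? "yes" : "no";

# s/// edits the variable in place and returns the number of substitutions.
my $string = "look_for_this_pattern here";
my $count  = ($string =~ s/look_for_this_pattern/replace_with_this/);

# tr/// maps characters one-for-one; copy first to keep the original intact.
my $name = "Ken THOMPSON";
(my $lower = $name) =~ tr/A-Z/a-z/;

print "$found | $string | $count | $lower\n";
# prints: yes | replace_with_this here | 1 | ken thompson
```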
m// and s/// have the general form:
m/pattern/modifier; # pattern match operation
s/pattern/replacement/modifier; # substitution operation
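where modifier is optional and may combine any of the following letters:
i   Do case-insensitive pattern matching.
m   Treat string as multiple lines (^ and $ also match at embedded newlines).
s   Treat string as single line (. also matches newline).
x   Use extended regular expressions (whitespace and # comments are allowed in the pattern).
g   Match globally, that is, find all occurrences.
o   Only compile pattern once.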
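The effect of the i, g, m, s and x modifiers can be seen in a short sketch like this one. The test strings are invented for illustration.

```perl
use strict;
use warnings;

my $text = "Perl\nperl\nPERL";   # hypothetical three-line string

# /i ignores case; /g in list context returns every match.
my @all = $text =~ /perl/ig;

# Without /m, ^ matches only at the very start of the string.
my @starts_plain = $text =~ /^p/ig;
# With /m, ^ also matches right after each embedded newline.
my @starts_multi = $text =~ /^p/img;

# Without /s, . will not match a newline; with /s it will.
my $dot_plain  = "a\nb" =~ /a.b/  ? 1 : 0;
my $dot_single = "a\nb" =~ /a.b/s ? 1 : 0;

# /x lets a pattern be spaced out and commented.
my $date_ok = "1997-05-02" =~ /
    ^ \d{4}    # year
    - \d{2}    # month
    - \d{2} $  # day
/x ? 1 : 0;

print scalar(@all), " ", scalar(@starts_plain), " ", scalar(@starts_multi),
      " $dot_plain $dot_single $date_ok\n";
# prints: 3 1 3 0 1 1
```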
Character Classes
A character class, denoted by "[stuff_inside]",
matches any one of the characters listed inside it.
Character Class Metacharacters
Within character classes, certain characters have special meaning,
such as:
-   range thingy (defines a range of digits or alpha chars,
    unless it's the first or last char, then take literally)
^   negates the class (NOT these characters), but only when it's the first char
.   just a dot (no special meaning)
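The class rules above can be checked with a few one-line matches. This is only a sketch, and the sample strings are invented.

```perl
use strict;
use warnings;

# [aeiou] matches exactly one character from the set.
my @vowels = "regular" =~ /[aeiou]/g;

# A leading ^ negates the class: one character that is NOT a digit.
my ($first_non_digit) = "2024x7" =~ /([^0-9])/;

# - between two characters forms a range...
my $hex_ok = "c0ffee" =~ /^[0-9a-f]+$/ ? 1 : 0;
# ...but placed first in the class it is a literal hyphen.
my $has_dash = "a-b" =~ /[-_]/ ? 1 : 0;

# Inside a class, . is an ordinary dot, not "any character".
my $dot_in_abc = "abc" =~ /[.]/ ? 1 : 0;

print join("", @vowels), " $first_non_digit $hex_ok $has_dash $dot_in_abc\n";
# prints: eua x 1 1 0
```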
Metacharacter Quantifiers
The following standard quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
and come directly after whatever they are meant to quantify, such as:
/\.\.?/; # match one or two dots
$sp =~ /^ +/; # match string beginning with one or more spaces
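To see each quantifier accept or reject input, a minimal sketch with invented test strings might look like this:

```perl
use strict;
use warnings;

# * allows zero occurrences; + needs at least one.
my $star = "ac" =~ /^ab*c$/ ? 1 : 0;
my $plus = "ac" =~ /^ab+c$/ ? 1 : 0;

# ? makes the preceding item optional (0 or 1 times).
my $optional = join ",", map { /^colou?r$/ ? 1 : 0 } qw(color colour colouur);

# {n}, {n,} and {n,m} count repetitions explicitly.
my $exactly_4  = "2024"  =~ /^\d{4}$/   ? 1 : 0;
my $at_least_3 = "12"    =~ /^\d{3,}$/  ? 1 : 0;
my $two_to_4   = "12345" =~ /^\d{2,4}$/ ? 1 : 0;

# The leading-spaces pattern from the text, with a capture added
# so the matched spaces can be counted.
my ($indent) = "   hello" =~ /^( +)/;

print "$star $plus $optional $exactly_4 $at_least_3 $two_to_4 ",
      length($indent), "\n";
# prints: 1 0 1,1,0 1 0 0 3
```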
* Courtesy of
"Mastering Regular Expressions" by Jeffrey Friedl.
Last Modified: $Date: 1997/05/02 07:17:48 $