Effective Perl by Joseph N. Hall

Observations and Tips from the author of Effective Perl Programming

Thursday, February 02, 2006

Tip: Regular Expression Precedence Made Simple (as Arithmetic)

The basic operations in Perl regular expressions are repetition, sequence, and alternation. That is also their order of precedence, from highest to lowest (tightest-binding to loosest-binding). A super-quick review first:

a*      # repetition - the character a repeated zero or more times
b+      # repetition - the character b repeated one or more times
x{1,3}  # repetition - the character x repeated one to three times
abc     # sequence - the character a, then the character b, then the character c
a|b|c   # alternation - the character a, or the character b, or the character c
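
To see the review in action, here is a minimal sketch that checks each of the five patterns against one string that should match and one that should not. The test strings and the @cases table are my own illustrative choices, and the ^ and $ anchors are added only so that each pattern has to account for the whole string.

```perl
use strict;
use warnings;

# Each entry: [anchored pattern, string that should match, string that should not].
# The sample strings are illustrative assumptions, not part of the original tip.
my @cases = (
    [ qr/^a*$/,        '',    'b'    ],   # zero or more a's
    [ qr/^b+$/,        'bbb', ''     ],   # one or more b's
    [ qr/^x{1,3}$/,    'xx',  'xxxx' ],   # one to three x's
    [ qr/^abc$/,       'abc', 'acb'  ],   # a, then b, then c
    [ qr/^(?:a|b|c)$/, 'b',   'd'    ],   # a, or b, or c
);

my $all_ok = 1;
for my $case (@cases) {
    my ($re, $yes, $no) = @$case;
    my $ok = ($yes =~ $re && $no !~ $re) ? 1 : 0;
    $all_ok &&= $ok;
    print $re, ': ', ($ok ? 'as expected' : 'unexpected'), "\n";
}
```

Every line should report "as expected" if the descriptions above are right.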

It's important to understand precedence in regular expressions. For example:

abc{3}

means the characters 'ab' followed by three instances of the character 'c'. When I see something like abc{3} I usually think that the author really meant "three instances of the characters 'abc'" - which is written differently:

(abc){3}
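
A quick way to convince yourself of the difference is to run both patterns against both candidate strings. This is only a sketch: the two sample strings and the variable names are mine, and the patterns are anchored with ^ and $ so that a match has to cover the entire string.

```perl
use strict;
use warnings;

# Collect one report line per sample string (sample strings are assumptions).
my @results;
for my $s ('abccc', 'abcabcabc') {
    my $tight   = $s =~ /^abc{3}$/   ? 1 : 0;   # {3} applies to the c only
    my $grouped = $s =~ /^(abc){3}$/ ? 1 : 0;   # {3} applies to the whole group
    push @results, sprintf('%s: abc{3}=%d (abc){3}=%d', $s, $tight, $grouped);
}
print "$_\n" for @results;
# prints:
# abccc: abc{3}=1 (abc){3}=0
# abcabcabc: abc{3}=0 (abc){3}=1
```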

As you can see, you can use parentheses to control the order in which the bits of a regular expression are interpreted. I like to make an analogy to mathematical (algebraic) expressions. Even though a regular expression isn't a mathematical expression, the syntax is at least somewhat similar, especially where precedence is concerned. From the standpoint of precedence, you can think of a{3} as being something like x^10: exponentiation, the highest-precedence operation in algebraic notation. abc is like xyz (the variables x, y, and z multiplied together), multiplication having intermediate precedence, and a|b|c is like x + y + z, addition having low precedence. This becomes useful when you try to figure out things like:

a|b|c       # the character a, or the character b, or the character c
a|b|c{2}    # the character a, the character b, or two c's in a row
            # like a + b + c^2
(a|b|c){2}  # one of a or b or c, followed by one of a or b or c
            # like (a + b + c)^2
(a|b|c)+    # one or more a or b or c
(abc)+      # abc one or more times in a row (abc, abcabc, abcabcabc, etc.)
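
If you want to check these readings yourself, something like the following sketch may help. The helper full_matches and the list of test strings are hypothetical, made up for this illustration. The helper wraps each pattern in (?:...) before anchoring it, because ^ and $ are part of the sequence and would otherwise bind more tightly than the alternation.

```perl
use strict;
use warnings;

# Hypothetical helper: return the test strings that a pattern matches in full.
# Wrapping in (?:...) keeps the anchors from attaching to a single alternative.
sub full_matches {
    my ($re, @strings) = @_;
    return join ' ', grep { /^(?:$re)$/ } @strings;
}

my @strings = qw(a b c cc ab ca abc);   # illustrative inputs only

my $alt      = full_matches(qr/a|b|c{2}/,   @strings);  # like a + b + c^2
my $alt_sq   = full_matches(qr/(a|b|c){2}/, @strings);  # like (a + b + c)^2
my $alt_plus = full_matches(qr/(a|b|c)+/,   @strings);
my $seq_plus = full_matches(qr/(abc)+/,     @strings);

print "a|b|c{2}   : $alt\n";       # a b cc
print "(a|b|c){2} : $alt_sq\n";    # cc ab ca
print "(a|b|c)+   : $alt_plus\n";  # a b c cc ab ca abc
print "(abc)+     : $seq_plus\n";  # abc
```

Note that a single c is not matched by a|b|c{2}: the {2} belongs to the c alone, just as the exponent in a + b + c^2 belongs only to c.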

So, think:

  • Repetition: exponentiation (highest)
  • Sequence: multiplication (middle)
  • Alternation: addition (lowest)

Now, the usefulness of all this depends on arithmetic (or algebra) being easy, which may be something else altogether.