Best Practice: Regular Expressions and their Pitfalls
Regular expressions are a powerful tool in any engineer's toolkit - but with Falcon LogScale backtracking implementation, they can sometimes take a toll on the time and resources available. The following is a collection of ways to avoid some of the most common pitfalls and patterns.
Overview
There is always an implicit
.*?
appended to every regex, there is no need to add it yourself.Never use back references such as
\1
.Using
*?
instead of*
is usually better, because*
is problematic, and the more you use it the worse the problem gets.However,
*?
is only better than*
when the desired input matches. Both are equally bad when the input doesn't match. To both match and not match effectively, avoid overlap between quantifiers and any input that comes after.
Common pitfalls
Regular expressions can be tricky. Below are a few examples of pitfalls you may see with explanations why.
Note
They aren't necessarily "bad". Rather, based on their use and their input, they have the potential to be bad.
Backreferences
Backreferences are the ability to match against a capture from
earlier on in the input using a regular expression. They are written
as \n
where
n
is an integer. Here is an example:
(\w*) \1
This configuration matches a word repeated twice with a space
between the two, i.e. it matches
boutros
but not
boutros ghali
.
So, why are backreferences a problem?
This means that backreferences are a problem, in that backreferences are demonstrably impossible to implement efficiently, leading to considerable expense.
Due to maintenance and implementation costs, backreferences will eventually be deprecated. As a result, dependence on them is not recommended, particularly for customers.
A note on efficiently using regexes
You can reduce satisfiability, which is NP hard, to regex matching with backreferences.
Matching this regex:
[01]??(?<a>[01])[01]? [01]??(?<b>[01])[01]? <[01 ]*\1([01])[01 ]*> <[01 ]*\3\2([01])[01 ]*> <[01 ]*\1\4([01])[01 ]*> \5
against this input:
01 01 <01 10> <000 010 100 111> <000 011 101 111> 1
solves
a OR b (NOT a AND b)
Proof is left as a fun exercise for the reader.
Using *, the greedy quantifier
Operators like *
,
+
, and
{}
consume a tremendous amount of
input as they work. In particular,
.*
is the most inefficient. For
example, say you want to match input that contains the letter
A
and the letter
D
. You might start with something
easy like
A.*D
While this does work, it does so inefficiently. Consider this input:
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
This input matches the parameters, and does so in the first 4 characters. However, because the regex uses greedy repetition, the process to get to that match looks something like this:
- none
A.*D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
- none
A.*D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
- none
A.*D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
In step 2, notice that the greedy quantifier consumes as much input
as possible before 'noticing' that the
D
is right there. It will only step
backward a character at a time
after consuming that input. As
a result, the second 'D' won't be discovered until much later. While
in some cases this level of consumption is required, that is not the
case here.
Now let's change the regex to use the reluctant qualifier
*?
, which looks like this:
A.*?D
Now, results look like this:
- none
A.*?D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
- none
A.*?D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
- none
A.*?D ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
The difference between greedy quantifiers and reluctant quantifiers
is that reluctant ones consume as little as possible to make what
comes after match. Unless you have a specific reason why
.*
is the right choice, you should
use .*?
instead.
Whereas .*
is almost always
wasteful, there are many cases where using greedy quantifiers can be
appropriate. For example, using \s+
to skip a sequence of spaces is an appropriate use, because it's
likely that there will only be a few spaces- scanning over them is
probably not an issue. However, \s+?
has the same effect, so there is no harm in applying the same
methodology: unless you have a specific reason, always use the
reluctant version of a quantifier.
Overlapping and its issues
There are a few potential issues with regular expression elements overlapping each other.
Overlapping *
At this point we already know that *
can be problematic. With that being said, there are some cases that
support its use.
For example, say you want to recognize three words in a row separated by spaces, like this:
\w*\s+\w*\s+\w*
Matching against this is very efficient, even though it's full of greedy quantifiers:
- none
\w*\s+\w*\s+\w* foo bar baz
- none
\w*\s+\w*\s+\w* foo_bar baz
- none
\w*\s+\w*\s+\w* foo_bar baz ...
In this example, each part of the regex exactly matches the input it
corresponds to and no more. The reason why it works here whereas in
the previous example it doesn't is because
the repeating parts don't
overlap. There are no characters that are both in
\w
and
\s
, which means there is no
confusion about what parts of the input belong to
\w.
, and which belong to
\s+
. This is the opposite of
A.*D
, because
.
matches
D
, so the
.*
consumes the
D
that is needed for the regex to
match and then has to backtrack.
Now say we want to match not just three words in a row, but also
allow other characters like numbers, underscores, etc. We might
accomplish this by changing \w
to
.
to allow anything:
.*\s+.*\s+.*
While it may look like it does more or less the same thing, because
there are now characters that are both within
.
and
\s
(even
.
contains
\s
as a subset), the process of
matching becomes dramatically less efficient:
- none
1 .*\s+.*\s+.*
- none
foo bar baz 2 .*\s+.*\s+.*
- none
foo bar_baz 3 .*\s+.*\s+.* foo bar_baz
- none
...
As you can see, .
also matches
spaces, creating confusion about how much the first quantifier
should consume and which part the spaces belong to. It starts off
matching everything, which doesn't work. Then it backtracks and
tries to match the second word in the wrong place, requiring that it
backtrack again, and so on. It turns out that for each
.*
you add to the regex, the amount
of work to match a string more than doubles. So while 3 words is
manageable, and 10 maybe too, it doesn't take long before a regex
takes hours or days to match. (Note: LogScale doesn't allow this and
so the match will fail instead).
The general rule is: any time you have a repeating part of a regex, it shouldn't be able to match what comes after it. When repeating, it should be clear when to stop. While, that isn't always possible, with some creative thinking it is often achievable.
That rule also gives us a way to fix another problem: ensuring the
repeating parts don't overlap. In the example, we should replace
.
with something that doesn't
overlap with \s
. The obvious choice
is [^\s]
, as that is the character
class that doesn't contain characters in
\s
, which can also be written as
\S
:
\S*\s+\S*\s+\S*
Similarly, revising A.*D
to
A[^D]*D
also solves the problem
using that regex.
Another solution is to use *?
and
+?
instead of
*
and
+
, but as you'll soon find out,
that option has its own issues that removing overlap doesn't.
Avoid overlapping *?
Now we already know that *?
is
better than *
. However, there is a
caveat; it's only better when the input matches the regex, because
the difference between *
and
*?
is the order in which they try
matching, not in what they match. The way input fails to match a
regex is by attempting all ways to potentially match and not finding
any, and while this can be a quick process if the input clearly
doesn't match, it can also be costly for input that
almost matches. So when trying
all all potential possibilities, the set that needs to be tried is
the exact same for *
and
*?
.
As discussed, this regex can potentially be costly to match against, both when the input matches and when it doesn't. Let's look at an example:
.*\s+.*\s+.*\s+.*\s+.*\s+.*\s+.*\s+.*
Using reluctant quantifiers helps tremendously to find input that matches:
.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?
Using the simulator at regex101.com to match against the following:
foo bar baz qux quux quuz corge grault
matches in 952 steps using the greedy version, whereas the reluctant takes only 51, which is clearly an improvement. Conversely, here is a string that almost but not quite matches:
foo bar baz qux quux quuz corgegrault
With this input the greedy version takes 1132 steps to fail, while
the reluctant version takes — guessed it, 1132 steps, because
*
and
*?
always take the same number of
steps to fail.
In conclusion, while it's better to use
*?
instead of
*
, you're still vulnerable to
costly failures if you don't also consider overlap between
quantifiers and what comes after them.
If we rewrite the regex again to also remove overlap on the same inputs:
\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?
it takes only 16 steps (!) to match and 236 to not match.