Best Practice: Regular Expressions and their Pitfalls

Regular expressions are a powerful tool in any engineer's toolkit - but with Falcon LogScale backtracking implementation, they can sometimes take a toll on the time and resources available. The following is a collection of ways to avoid some of the most common pitfalls and patterns.

Overview

  • There is always an implicit .*? appended to every regex, there is no need to add it yourself.

  • Never use back references such as \1 .

  • Using *? instead of * is usually better, because * is problematic, and the more you use it the worse the problem gets.

    • However, *? is only better than * when the desired input matches. Both are equally bad when the input doesn't match. To both match and not match effectively, avoid overlap between quantifiers and any input that comes after.

Common pitfalls

Regular expressions can be tricky. Below are a few examples of pitfalls you may see with explanations why.

Note

They aren't necessarily "bad". Rather, based on their use and their input, they have the potential to be bad.

Backreferences

Backreferences are the ability to match against a capture from earlier on in the input using a regular expression. They are written as \n where n is an integer. Here is an example:

none
(\w*) \1

This configuration matches a word repeated twice with a space between the two, i.e. it matches boutros but not boutros ghali .

So, why are backreferences a problem?

  • This means that backreferences are a problem, in that backreferences are demonstrably impossible to implement efficiently, leading to considerable expense.

  • Due to maintenance and implementation costs, backreferences will eventually be deprecated. As a result, dependence on them is not recommended, particularly for customers.

A note on efficiently using regexes

You can reduce satisfiability, which is NP hard, to regex matching with backreferences.

Matching this regex:

[01]??(?<a>[01])[01]? [01]??(?<b>[01])[01]? <[01 ]*\1([01])[01 ]*> <[01 ]*\3\2([01])[01 ]*> <[01 ]*\1\4([01])[01 ]*> \5

against this input:

01 01 <01 10> <000 010 100 111> <000 011 101 111> 1

solves

a OR b (NOT a AND b)

Proof is left as a fun exercise for the reader.

Using *, the greedy quantifier

Operators like * , + , and {} consume a tremendous amount of input as they work. In particular, .* is the most inefficient. For example, say you want to match input that contains the letter A and the letter D . You might start with something easy like

A.*D

While this does work, it does so inefficiently. Consider this input:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ

This input matches the parameters, and does so in the first 4 characters. However, because the regex uses greedy repetition, the process to get to that match looks something like this:

  1. none
    A.*D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
  2. none
    A.*D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
  3. none
    A.*D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ

In step 2, notice that the greedy quantifier consumes as much input as possible before 'noticing' that the D is right there. It will only step backward a character at a time after consuming that input. As a result, the second 'D' won't be discovered until much later. While in some cases this level of consumption is required, that is not the case here.

Now let's change the regex to use the reluctant qualifier *? , which looks like this:

none
A.*?D

Now, results look like this:

  1. none
    A.*?D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
  2. none
    A.*?D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
  3. none
    A.*?D
    ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ

The difference between greedy quantifiers and reluctant quantifiers is that reluctant ones consume as little as possible to make what comes after match. Unless you have a specific reason why .* is the right choice, you should use .*? instead.

Whereas .* is almost always wasteful, there are many cases where using greedy quantifiers can be appropriate. For example, using \s+ to skip a sequence of spaces is an appropriate use, because it's likely that there will only be a few spaces- scanning over them is probably not an issue. However, \s+? has the same effect, so there is no harm in applying the same methodology: unless you have a specific reason, always use the reluctant version of a quantifier.

Overlapping and its issues

There are a few potential issues with regular expression elements overlapping each other.

Overlapping *

At this point we already know that * can be problematic. With that being said, there are some cases that support its use.

For example, say you want to recognize three words in a row separated by spaces, like this:

none
\w*\s+\w*\s+\w*

Matching against this is very efficient, even though it's full of greedy quantifiers:

  1. none
    \w*\s+\w*\s+\w* foo bar baz
  2. none
    \w*\s+\w*\s+\w* foo_bar baz
  3. none
    \w*\s+\w*\s+\w* foo_bar baz ...

In this example, each part of the regex exactly matches the input it corresponds to and no more. The reason why it works here whereas in the previous example it doesn't is because the repeating parts don't overlap. There are no characters that are both in \w and \s , which means there is no confusion about what parts of the input belong to \w. , and which belong to \s+ . This is the opposite of A.*D , because . matches D , so the .* consumes the D that is needed for the regex to match and then has to backtrack.

Now say we want to match not just three words in a row, but also allow other characters like numbers, underscores, etc. We might accomplish this by changing \w to . to allow anything:

none
.*\s+.*\s+.*

While it may look like it does more or less the same thing, because there are now characters that are both within . and \s (even . contains \s as a subset), the process of matching becomes dramatically less efficient:

  1. none
    1 .*\s+.*\s+.*
  2. none
    foo bar baz 2 .*\s+.*\s+.*
  3. none
    foo bar_baz 3 .*\s+.*\s+.* foo bar_baz
  4. none
    ...

As you can see, . also matches spaces, creating confusion about how much the first quantifier should consume and which part the spaces belong to. It starts off matching everything, which doesn't work. Then it backtracks and tries to match the second word in the wrong place, requiring that it backtrack again, and so on. It turns out that for each .* you add to the regex, the amount of work to match a string more than doubles. So while 3 words is manageable, and 10 maybe too, it doesn't take long before a regex takes hours or days to match. (Note: LogScale doesn't allow this and so the match will fail instead).

The general rule is: any time you have a repeating part of a regex, it shouldn't be able to match what comes after it. When repeating, it should be clear when to stop. While, that isn't always possible, with some creative thinking it is often achievable.

That rule also gives us a way to fix another problem: ensuring the repeating parts don't overlap. In the example, we should replace . with something that doesn't overlap with \s . The obvious choice is [^\s] , as that is the character class that doesn't contain characters in \s , which can also be written as \S :

none
\S*\s+\S*\s+\S*

Similarly, revising A.*D to A[^D]*D also solves the problem using that regex.

Another solution is to use *? and +? instead of * and + , but as you'll soon find out, that option has its own issues that removing overlap doesn't.

Avoid overlapping *?

Now we already know that *? is better than * . However, there is a caveat; it's only better when the input matches the regex, because the difference between * and *? is the order in which they try matching, not in what they match. The way input fails to match a regex is by attempting all ways to potentially match and not finding any, and while this can be a quick process if the input clearly doesn't match, it can also be costly for input that almost matches. So when trying all all potential possibilities, the set that needs to be tried is the exact same for * and *? .

As discussed, this regex can potentially be costly to match against, both when the input matches and when it doesn't. Let's look at an example:

none
.*\s+.*\s+.*\s+.*\s+.*\s+.*\s+.*\s+.*

Using reluctant quantifiers helps tremendously to find input that matches:

none
.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?\s+?.*?

Using the simulator at regex101.com to match against the following:

none
foo bar baz qux quux quuz corge grault

matches in 952 steps using the greedy version, whereas the reluctant takes only 51, which is clearly an improvement. Conversely, here is a string that almost but not quite matches:

none
foo bar baz qux quux quuz corgegrault

With this input the greedy version takes 1132 steps to fail, while the reluctant version takes — guessed it, 1132 steps, because * and *? always take the same number of steps to fail.

In conclusion, while it's better to use *? instead of * , you're still vulnerable to costly failures if you don't also consider overlap between quantifiers and what comes after them.

If we rewrite the regex again to also remove overlap on the same inputs:

none
\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?\s+?\S*?

it takes only 16 steps (!) to match and 236 to not match.