Where Did the Bug Probably Happen? Regular Expressions - Jan Švec

Regular expressions are popular among many programmers. In this part of the debugging series we will look at what is wrong with the regular expression [+-0123456789].

Regular expressions are a relatively powerful formalism for searching and replacing text. I use them sometimes, but I also count them among write-only programming languages together with Perl and Brainfuck. In Python, string operations and methods are often clearer and can avoid exactly the kind of bug shown here.

The Task Is Clear

We want to use a regular expression to process a string that should contain an int or float, positive or negative. We work in Python and only match the expression against the input string.

import re

RE = r'[+-0123456789]+(\.[0123456789]+)?'

for i in ['+1.0', '-1.0', '1.0', 'abcd', 'a.b']:
    matches = re.match(RE, i)
    print i, bool(matches)

The output looks right:

+1.0 True
-1.0 True
1.0 True
abcd False
a.b False

Later in Development

Now we want to process the integer part and the decimal part separately. The only group in the regular expression should tell us whether the number has a decimal part:

for i in ['+1.0', '-1.0', '1.0', '10']:
    m = re.match(RE, i)
    desetiny = m.group(1)
    print i, bool(desetiny)

The output is surprisingly wrong:

+1.0 False
-1.0 False
1.0 False
10 False

Where Did the Bug Probably Happen?

The practical problem where I hit this bug was much less friendly to experimentation. One debugging experiment was to remove the group from the regular expression:

RE = r'[+-0123456789]+'

for i in ['+1.0', '-1.0', '1.0', 'abcd', 'a.b']:
    matches = re.match(RE, i)
    print i, bool(matches)

The result was already interesting:

+1.0 True
-1.0 True
1.0 True
abcd False
a.b False

Then I tried:

for i in ['1,0', '1/0']:
    matches = re.match(RE, i)
    print i, bool(matches)

And got:

1,0 True
1/0 True

After many attempts near the end of the working day, one starts believing in bugs in Python’s regular-expression implementation. After another reading of the source code, the cause finally appeared: the minus sign in the regular expression is not a minus sign, it defines a range of characters. The expression accepts characters 1 to 9 and also any ASCII character between + and 0, which includes comma, minus, dot, and slash.

Now everything is clear. The group never matched, because regular expressions are greedy and the first part already matched the dot and the following digits. Because the character range contained the desired minus sign, the first examples seemed to work. The expression also matched 1,0 and 1/0.

The fix is simple: escape the minus sign. In a more cryptic, write-only style, it is enough to swap the order of plus and minus:

RE = r'[-+0123456789]+(\.[0123456789]+)?'

for i in ['+1.0', '-1.0', '1.0', '10']:
    m = re.match(RE, i)
    desetiny = m.group(1)
    print i, bool(desetiny)

And the output is as expected:

+1.0 True
-1.0 True
1.0 True
10 False

Conclusion

The fix by swapping two characters feels very cryptic to me. I cannot help thinking that a combination of split(), isdigit(), and other string methods could achieve this behaviour more elegantly in many cases. Regular expressions do have their place in modern programming, usually with high rent and expensive maintenance fees.

Footnote

Although formal language theory tells us that a regular grammar is accepted by a finite automaton, it is important to realise that practical regular expressions are not regular grammars. Most current implementations use backtracking and support features such as named backreferences, lookahead assertions, positive lookbehind assertions, and more; see the re module documentation.

Try this example:

re.search("(a+)+b", "aaaaaaaaaaaaaaaaaaaaaaaaaa")

How long does that single match take? A hundredth of a second? A second? A hundred seconds? It depends.