Reimplementing
Regular Expressions

How and Why

Created by Andrew Clarkson @_andrewclarkson

Why in the F***?

As a learning opportunity

Because everybody's doing it wrong

Backtracking

Example:


                        import re
                        csv = "first,last,phone"
                        re.match("^(.*,){2}p", csv)
                        

Backtracking:


                        csv = "first,last,something else"
                        

Asymptotic Complexity

Exponential time is fine for small n

Let's Be Pathological!


                        import re
                        n = 5
                        text = "a" * n
                        regex = "^(a?){n}a{n}".replace("n", str(n))
                        re.match(regex, text)
                        

P.S. I've never actually finished n = 30

Exponential Time

Written Using Default Regex

./pathological-backtracking.py

Denial of Service?

Enter Finite State Automata

(aw-tom-uh-tuh)

Written Using Pure Python

./pathological-nfa.py

Let's Compare

What is a Finite State Automaton?

Turnstile Example:

Thompson Construction Algorithm

Generates a Non-Deterministic Finite State Automaton

1. Convert to Postfix

(ab|c)*|d

becomes

ab.c|*d|

2. Break Into Subexpressions

3. Compose Automata

Atomic:

a

Concatenation:

ab.

Union:

ab|

Kleene Closure:

a*

Kleene Plus:

a+

Optional:

a?

Final Thoughts

  • Backtracking is NP-complete
  • But it can be useful (longest match)
  • Questions?