More Explanation for Regular Expression Animator

Doug's dog theory of regular expression matching.

My metaphor for regular expression (RE) matching is a pack of dogs running around the RE. The red letters in the RE of the animator show where the dogs are. Here's what the dogs are:

Imagine a dog at the start of RE abc. If the input string (IS) is also abc, then the dog will go from a to b to c, and reach the accept-state, signifying a match.

The dog disappears if it's at a RE character that doesn't match an IS character. If the RE is abc and the IS is axy then the dog will go from a to b, but instead of going on to c, it will disappear.

A dog can split into two dogs, which is why I said pack of dogs, above. A dog splits up whenever you have alternatives. So for RE ab(cd|ef|gh) if the IS is abef then a dog will go from a to b, then it will split up to c, e, and g. Only the dog at e will match, though, so the dogs at c and g will disappear.

A new dog appears at the beginning, and any top level alternatives, at each step, because any substring of the IS can match the RE. If I allowed an anchor at the beginning of the RE, as in ^abc, then new dogs wouldn't appear at each step.

Two dogs merge when they meet. Consider the RE (a|aa)bc and the IS aabc. When the IS goes from the second a to the b, there will be dogs at the first a and the third a in the RE. They will merge and go to the b in the RE.

Each dog represents a potential match of a substring of the IS with the RE. Each dog can have a different substring. It starts where the IS was when the dog entered the RE. They all end where the IS is now. When a dog reaches the accept-state, its substring is the matching substring. When two dogs merge, the longer substring goes with the merged dog.

Special characters in the regular expression animator.

+
Matches 1 or more repetitions of the regular expression preceding the +.
*
Matches 0 or more repetitions of the regular expression preceding the *.
?
Makes the regular expression preceding the ? optional.
|
Match either the regular expression to the left or to the right of the |.
()
Change precedences. Normally the +, *, and ? apply to the preceding character. To apply them to more, surround the more with parentheses. Normally | applies to everything up to the ends of the regular expression, or another |. To limit it, use parentheses.

This being a simple demo to demonstrate Doug's dog theory of regular expression matching, it doesn't handle character classes (e.g. [^a-z]), anchors (i.e. ^ and $), or empty regular expressions (e.g. () or (foo|) or (|foo)).

Matching in the regular expression animator.

Sometimes you just want to know if any substring of an input string matches a regular expression or not. You don't care what the substring was, nor which part of the regular expression matched. The purpose of the regular expression animator is to demonstrate Doug's dog theory of regular expression matching (see above), so it just does this simple yes-or-no matching.

So when a dog reaches the accept-state, the animator stops. If we wanted to see the matched substring, then the animator would keep going whenever there was a dog with a substring that started no later than the substring of the dog that reached the accept-state. That would return the match that started earliest, and, for ties, went the longest.

What if you wanted to see the matches in progress as the animator ran? It would be messy to show the matches in the original strings because of all the overlap, but splitting them out individually would be easy. I did that for the input string matches for a client. I put numbers above the regular expression to identify each dog, then in a separate list I showed the substring matched by each dog. These substrings grew, split, and disappeared as the dogs moved, split, and disappeared. I didn't do that here because I wanted to concentrate on animating the basic idea of the dogs moving about, matching in parallel.

Is this an NFA or a DFA?

This assumes you already know what an NFA and DFA are, basically.

A DFA is susceptible to an exponential explosion of states, requiring exponential time to generate the DFA and exponential space to store it. An NFA is susceptible to exponential backtracking, requiring exponential time to analyze an input string. The regular expression engine shown in the animator is in the gray area between the two.

If it has to be one or the other, then maybe it's a parallel NFA that obviates backtracking and eliminates redundant paths, or maybe it's a DFA that only generates the states it needs. Each dog (see dog theory, above) represents a different backtrack in a non-parallel NFA-based matcher. Each combination of dogs represents a state in the DFA.

Comments to doug[at]DougAndJean[dot]com
Back to regular expression animator
Back to DougAndJean.com