When you're working with regular languages specified in regular expression form, there's a really cool idea that you can use for building regular expression matchers, and for describing how to convert from a regular expression to a NFA. It's called the Brzozozwksi derivative of a regular expression - or just simply the derivative of a regexp.
The basic idea of the derivative is that given a regular expression, \(r\), you can derive a new regular expression called the derivative with respect to symbol \(c\), \(D_c(r)\). \(D_c(r)\) is a regular expression describing the string matched by \(r\) after it's matched an \(r\).
To define the derivative, we first need a helper, which we'll call \(delta\). What \(delta\) does is tell us if a given regular expression canmatch the empty string. We'll use it a few ways - both as a part of the derivative, and as part of the process of turning a regular expression into a finite state machine. For convenience, we'll define it so that \(delta(r)\) returns \(epsilon\) (a pattern matching the empty string) if \(r\) can match the empty string, or \(emptyset\) (the null pattern, a regular expression which never matches anything) if it can't.
Given a regular expression \(r\), we define \(delta(r)\) as follows:
- if \(r\) is \(emptyset\), then \(delta(r) = emptyset\); since the void pattern can't ever match anything, it doesn't match the empty string.
- if \(r\) is \(epsilon\), then \(delta(r) = epsilon\); obviously, the pattern that (by definition) matches only the empty string does match the empty string.
- if \(r\) is a single-character pattern, then \(delta(r) = emptyset\). A pattern which matches a specific single character can't match anything but that single character - so it can't match the empty string.
- if \(r\) is a sequence \(r_1r_2\) then \(delta(r) = delta(r_1)delta(r_2)\). A sequence matches the empty string if all of the elements of the sequence match the empty string.
- if \(r\) is a choice \(r_1 | r_2\), then \(delta(r) = delta(r_1) | delta(r_2)\). A sequence matches empty if any of its elements match empty.
- if \(r\) is a starred regular expression, \(r_1^*\), then \(delta(r) = epsilon\). Starred regular expressions are sequences of zero or more repetitions of some other pattern. Zero repetitions is the same as the empty string - so all starred patterns match empty.
To make that useful, we also need to define how empty and void patterns combine with other patterns:
- For any regular expression \(r\), the regular expression \(epsilon r = r\); in a sequence, concatenating an empty pattern with any regular expression is equivalent to the regular expression without the pattern.
- For any regular expression \(r\), \(emptyset r = epsilon\). If you've got a sequence of void with a pattern, it's equivalent to void.
- For any regular expression \(r\), \(emptyset | r = r\). A choice between void and \(r\) is equivalent to just \(r\).
- \(emptyset^* = emptyset\).
- \(epsilon^* = epsilon\).
Now, it's really easy to define the derivative:
- If r is the void pattern, then any derivative of it is void.
- If r is the empty pattern, then any derivative of it is void.
- If r is a character pattern matching character \(c\), then \(D_c(r)==epsilon\), and the derivative of \(r\) with respect to any other character is void.
- If \(r\) is a choice pattern between \(r_1\) and \(r_2\), then for all characters c, \(D_c(r) = D_c(r_1) | D_c(r_2)\).
- If \(r\) is a sequence pattern consisting of \(r_1\) followed by \(r_2\), then for all characters \(c\), \(D_c(r) = D_c(r_1)r_2 | delta(r_1)D_c(r_2)\). This one might need a bit of explanation. What that means is that for two patterns put together sequentially, you've got a choice. You could match the first pattern in the sequence - producing \(D_c(r_1)\) followed by \(r_2\). Or, if \(r_1\) could match the empty pattern, then you can drop it, and match \(r_2\). With the rules for combining empty and void patterns with other patterns, the statement above using \(delta\) is the same thing as this explanation.
The beauty of this is that it is really simple. A lot of the earlier mechanisms for decomposing regular expressions were rather complicated. This simple construct makes it very easy. For example, to convert a regular expression \(r\) to a finite state machine, you do the following:
- Create an initial state, labeled with the complete regular expression \(r\).
- While there are states \(r_i\) in the machine which haven't been processed yet:
- For each character, \(c\) in the alphabet, compute the derivative r_i'.
- If there is a state \(r_i'\) already in the machine, then add a transition from \(r_i\) to \(r_i'\) labeled with symbol \(c\).
- If there is no state \(r_i'\), then add it, and add a transition from \(r_i\) to \(r_i'\) labeled with the character \(c\).
- For each state in the machine labeled with a regular expression \(r\), it is a final state if and only if \(delta(r) = epsilon\).
For your amusement, I threw together a really quick implementation of the regular expression derivative in Haskell:
data Regexp = CharRE Char | ChoiceRE [ Regexp ] | SeqRE [ Regexp ] | StarRE Regexp | VoidRE | EmptyRE deriving (Eq, Show) delta :: Regexp -> Bool delta (CharRE c) = False delta (ChoiceRE (re:res)) = if (delta re) then True else (delta (ChoiceRE res)) delta (ChoiceRE ) = False delta (SeqRE ) = True delta (SeqRE (re:res)) = (delta re) && (delta (SeqRE res)) delta VoidRE = False delta EmptyRE = True delta (StarRE r) = True derivative :: Regexp -> Char -> Regexp derivative VoidRE c = VoidRE derivative EmptyRE c = VoidRE derivative (CharRE c) d = if c == d then EmptyRE else VoidRE derivative (SeqRE (re:res)) c = let re' = (derivative re c) in case re' of VoidRE -> VoidRE EmptyRE -> (SeqRE res) _ -> (SeqRE (re':res)) derivative (SeqRE ) c = VoidRE derivative (ChoiceRE ) c = VoidRE derivative (ChoiceRE res) c = let derivs = filter ( x -> x /= VoidRE) (map ( r -> derivative r c) res) in case derivs of  -> VoidRE [re] -> re (r:res) -> (ChoiceRE (r:res)) derivative (StarRE r) c = let r' = derivative r c in case r' of EmptyRE -> (StarRE r) VoidRE -> VoidRE _ -> SeqRE [r', (StarRE r)]
This can easily be used to implement a regular expression matcher. In fact, it can be used to build an RE matcher that's nearly (not quite, but nearly) as efficient as a traditional table-based FSM implementation - only without the extra step of generating the table, and building code that can interpret it. Basically, for each input symbol, you just take the derivative of the expression. If it's not the void expression, then go on to the next character, using the derivative to process the rest of the string. When you get to the end of your input, if \(delta\) of the final RE is empty, then you accept the string.
If you do this intelligently - i.e., you do something like memoize the derivative function, so that you're not constantly recomputing derivatives, this ends up being a reasonably efficient way of processing regular expressions. (Memoization is a technique where you save the results of every invocation of a function, so that if you call it repeatedly with the same input, it doesn't redo the computation, but just returns the same result as the last time it wass called with that input.)
The obvious question about all of this is: why is this called the derivative?
If you think about differential calculus in continuous number mathematics, a simple explanation of the derivative of a function \(f\) is a function \(f'\), which tells you how the the value of \(f(x)\) will change. The derivative of a regular expression is sort-of similar, if you squint at it just right: what it does is take a regular expression \(r\), and show you how it changes.