
Functions with Infinite domains, Automata, and Regular expressions

  • Define functions on unbounded-length inputs, which cannot be described by a finite-size table of inputs and outputs.
  • Equivalence between computing a function and deciding membership in a language.
  • Deterministic finite automata (optional): a simple example of a model for unbounded computation.
  • Equivalence of deterministic finite automata and regular expressions.

“An algorithm is a finite answer to an infinite number of questions.” (attributed to Stephen Kleene)

The model of Boolean circuits (or equivalently, the NAND-CIRC programming language) has one very significant drawback: a Boolean circuit can only compute a finite function $f$. In particular, since every gate has two inputs, a size $s$ circuit can compute on an input of length at most $2s$. Thus this model does not capture our intuitive notion of an algorithm as a single recipe to compute a potentially infinite function. For example, the standard elementary school multiplication algorithm is a single algorithm that multiplies numbers of all lengths. However, we cannot express this algorithm as a single circuit, but rather need a different circuit (or equivalently, a NAND-CIRC program) for every input length (see Figure 6.1).

6.1: Once you know how to multiply multi-digit numbers, you can do so for every number $n$ of digits, but if you had to describe multiplication using Boolean circuits or NAND-CIRC programs, you would need a different program/circuit for every length $n$ of the input.

In this chapter, we extend our definition of computational tasks to consider functions with the unbounded domain of $\{0,1\}^*$. We focus on the question of defining what tasks to compute, mostly leaving the question of how to compute them to later chapters, where we will see Turing machines and other computational models for computing on unbounded inputs. However, we will see one example of a simple restricted model of computation - deterministic finite automata (DFAs).

In this chapter, we discuss functions that take as input strings of arbitrary length. We will often focus on the special case of Boolean functions, where the output is a single bit. These are still infinite functions since their inputs have unbounded length and hence such a function cannot be computed by any single Boolean circuit.

In the second half of this chapter, we discuss finite automata, a computational model that can compute unbounded length functions. Finite automata are not as powerful as Python or other general-purpose programming languages but can serve as an introduction to these more general models. We also show a beautiful result - the functions computable by finite automata are precisely the ones that correspond to regular expressions. However, the reader can also feel free to skip automata and go straight to our discussion of Turing machines in Chapter 7.

Functions with inputs of unbounded length

Up until now, we considered the computational task of mapping some string of length $n$ into a string of length $m$. However, in general, computational tasks can involve inputs of unbounded length. For example, the following Python function computes the function $\mathit{XOR}:\{0,1\}^* \rightarrow \{0,1\}$, where $\mathit{XOR}(x)$ equals $1$ iff the number of $1$'s in $x$ is odd. (In other words, $\mathit{XOR}(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x\in \{0,1\}^*$.) As simple as it is, the $\mathit{XOR}$ function cannot be computed by a Boolean circuit. Rather, for every $n$, we can compute $\mathit{XOR}_n$ (the restriction of $\mathit{XOR}$ to $\{0,1\}^n$) using a different circuit (e.g., see Figure 6.2).

def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
6.2: The NAND circuit and NAND-CIRC program for computing the XOR of $5$ bits. Note how the circuit for $\mathit{XOR}_5$ merely repeats four times the circuit to compute the XOR of $2$ bits.

Previously in this book, we studied the computation of finite functions $f:\{0,1\}^n \rightarrow \{0,1\}^m$. Such a function $f$ can always be described by listing all the $2^n$ values it takes on inputs $x\in \{0,1\}^n$. In this chapter, we consider functions such as $\mathit{XOR}$ that take inputs of unbounded size. While we can describe $\mathit{XOR}$ using a finite number of symbols (in fact, we just did so above), it takes infinitely many possible inputs, and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks, including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on. To contrast with the finite case, we will sometimes call a function $F:\{0,1\}^* \rightarrow \{0,1\}$ (or $F:\{0,1\}^* \rightarrow \{0,1\}^*$) infinite. However, this does not mean that $F$ takes as input strings of infinite length! It just means that $F$ can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of $F$ on different inputs.

A function $F:\{0,1\}^* \rightarrow \{0,1\}^*$ specifies the computational task mapping an input $x\in \{0,1\}^*$ into the output $F(x)$.

As we have seen before, restricting attention to functions that use binary strings as inputs and outputs does not detract from our generality, since other objects, including numbers, lists, matrices, images, videos, and more, can be encoded as binary strings.

As before, it is essential to differentiate between specification and implementation. For example, consider the following function:

$$\mathit{TWINP}(x) = \begin{cases} 1 & \exists_{p \in \N} \text{ s.t. } p,p+2 \text{ are primes and } p>|x| \\ 0 & \text{otherwise} \end{cases}$$

This is a mathematically well-defined function. For every $x$, $\mathit{TWINP}(x)$ has a unique value which is either $0$ or $1$. However, at the moment, no one knows of a Python program that computes this function. The twin prime conjecture posits that for every $n$ there exists $p>n$ such that both $p$ and $p+2$ are primes. If this conjecture is true, then $\mathit{TWINP}$ is easy to compute indeed - the program def T(x): return 1 will do the trick. However, mathematicians have tried unsuccessfully to prove this conjecture since 1849. That said, whether or not we know how to implement the function $\mathit{TWINP}$, the definition above provides its specification.
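To see the gap between specification and implementation concretely, here is a sketch of the obvious search strategy. It is guaranteed to halt (with output $1$) on every input only if the twin prime conjecture is true; the names below are our own, not from the text.

```python
def is_prime(p):
    """Trial division; fine for illustration purposes."""
    return p >= 2 and all(p % d for d in range(2, int(p**0.5) + 1))

def TWINP_search(x):
    """Naive search for a twin prime pair (p, p+2) with p > len(x).
    If the twin prime conjecture holds, this always halts and returns 1;
    if it fails, the loop could run forever on long enough inputs."""
    p = len(x) + 1
    while True:
        if is_prime(p) and is_prime(p + 2):
            return 1
        p += 1
```

So this program computes $\mathit{TWINP}$ only under an unproven conjecture, which is exactly why the specification does not immediately give an implementation.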

Varying inputs and outputs

Many of the functions that interest us take more than one input. For example, the function

$$\mathit{MULT}(x,y) = x \cdot y$$

takes the binary representation of a pair of integers $x,y \in \N$, and outputs the binary representation of their product $x \cdot y$. However, since we can represent a pair of strings as a single string, we will consider functions such as $\mathit{MULT}$ as mapping $\{0,1\}^*$ to $\{0,1\}^*$. We will typically not be concerned with low-level details such as the precise way to represent a pair of integers as a string, since virtually all choices will be equivalent for our purposes.
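The text deliberately leaves the pair representation unspecified. As one concrete (and entirely hypothetical) choice, we can double every bit of the first string and then insert $01$ as a separator, which makes decoding unambiguous:

```python
def encode_pair(x, y):
    """Encode a pair (x, y) of binary strings as one binary string:
    each bit of x is doubled, then '01' marks the boundary, then y follows."""
    return "".join(b + b for b in x) + "01" + y

def decode_pair(s):
    """Invert encode_pair by scanning pairs of bits until the '01' marker."""
    i, x = 0, ""
    while s[i] == s[i + 1]:   # doubled bits belong to x
        x += s[i]
        i += 2
    return x, s[i + 2:]       # skip the '01' separator
```

Any other prefix-free encoding would serve equally well, which is the sense in which "virtually all choices are equivalent."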

Another example of a function we want to compute is

$$\mathit{PALINDROME}(x) = \begin{cases} 1 & \forall_{i \in [|x|]}\; x_i = x_{|x|-i-1} \\ 0 & \text{otherwise} \end{cases}$$

$\mathit{PALINDROME}$ has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.
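As a sanity check, $\mathit{PALINDROME}$ has a one-line Python implementation (a sketch; note that it handles inputs of every length with a single piece of code, unlike a circuit):

```python
def PALINDROME(x):
    """Output 1 iff the binary string x reads the same forwards and backwards."""
    return 1 if x == x[::-1] else 0
```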

“Booleanizing” functions. Sometimes it might be convenient to obtain a Boolean variant for a non-Boolean function. For example, the following is a Boolean variant of $\mathit{MULT}$.

$$\mathit{BMULT}(x,y,i) = \begin{cases} i^{th} \text{ bit of } x\cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases}$$

If we can compute $\mathit{BMULT}$ via any programming language such as Python, C, Java, etc., we can compute $\mathit{MULT}$ as well, and vice versa.
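Here is a sketch of one direction, under the assumption (ours, not the text's) that "bit $i$" means the $i$-th least significant bit: since the product of an $a$-bit and a $b$-bit number has at most $a+b$ bits, finitely many queries to $\mathit{BMULT}$ recover the whole product. The helper bmult is only a stand-in reference implementation.

```python
def bmult(x, y, i):
    """Stand-in for BMULT: bit i (least significant first) of x*y,
    where x and y are binary strings written most significant bit first."""
    p = int(x, 2) * int(y, 2)
    return (p >> i) & 1

def MULT_from_bmult(x, y):
    """Recover the full product by querying every potentially nonzero bit:
    an |x|-bit times |y|-bit product has at most |x|+|y| bits."""
    n = len(x) + len(y)
    val = sum(bmult(x, y, i) << i for i in range(n))
    return bin(val)[2:]
```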

Show that for every function $F:\{0,1\}^* \rightarrow \{0,1\}^*$, there exists a Boolean function $\mathit{BF}:\{0,1\}^* \rightarrow \{0,1\}$ such that a Python program to compute $\mathit{BF}$ can be transformed into a program to compute $F$ and vice versa.

For every $F:\{0,1\}^* \rightarrow \{0,1\}^*$, we can define

$$\mathit{BF}(x,i,b) = \begin{cases} F(x)_i & i<|F(x)|,\ b=0 \\ 1 & i<|F(x)|,\ b=1 \\ 0 & i \geq |F(x)| \end{cases}$$

to be the function that on input $x \in \{0,1\}^*$, $i \in \N$, $b\in \{0,1\}$ outputs the $i^{th}$ bit of $F(x)$ if $b=0$ and $i<|F(x)|$. If $b=1$, then $\mathit{BF}(x,i,b)$ outputs $1$ iff $i<|F(x)|$, which allows us to compute the length of $F(x)$.

Computing $\mathit{BF}$ from $F$ is straightforward. For the other direction, given a Python function BF that computes $\mathit{BF}$, we can compute $F$ as follows:

def F(x):
    res = []
    i = 0
    while BF(x,i,1):
        res.append(BF(x,i,0))
        i += 1
    return res

Formal Languages

For every Boolean function $F:\{0,1\}^* \rightarrow \{0,1\}$, we can define the set $L_F = \{ x \mid F(x) = 1 \}$ of strings on which $F$ outputs $1$. Such sets are known as languages. This name is rooted in formal language theory as pursued by linguists such as Noam Chomsky. A formal language is a subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet $\Sigma$). The membership or decision problem for a language $L$ is the task of determining, given $x\in \{0,1\}^*$, whether or not $x\in L$. If we can compute the function $F$, then we can decide membership in the language $L_F$ and vice versa. Hence, many texts such as (Sipser, 1997) refer to the task of computing a Boolean function as “deciding a language”. In this book, we mostly describe computational tasks using the function notation, which is easier to generalize to computation with more than one bit of output. However, since the language terminology is so popular in the literature, we will sometimes mention it.
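The equivalence between computing $F$ and deciding $L_F$ is immediate in code (a toy sketch, using the XOR function from before):

```python
def F(x):
    """The XOR function: 1 iff the number of 1's in x is odd."""
    return x.count("1") % 2

def in_L_F(x):
    """Deciding membership in L_F = {x : F(x) = 1} is the same task."""
    return F(x) == 1
```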

Restrictions of functions

If $F:\{0,1\}^* \rightarrow \{0,1\}$ is a Boolean function and $n\in \N$ then the restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function $f:\{0,1\}^n \rightarrow \{0,1\}$ such that $f(x) = F(x)$ for every $x\in \{0,1\}^n$. That is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be computed by a Boolean circuit, implying the following theorem:

Let $F:\{0,1\}^* \rightarrow \{0,1\}$. Then there is a collection $\{ C_n \}_{n\in \{1,2,\ldots\}}$ of circuits such that for every $n>0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of length $n$.

This is an immediate corollary of the universality of Boolean circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of this circuit is at most $c \cdot 2^n / n$ gates for some constant $c \leq 10$.

In particular, Theorem 6.1 implies that there exists such a circuit collection $\{ C_n \}$ even for the $\mathit{TWINP}$ function we described before, even though we do not know of any program to compute it. Indeed, this is not that surprising: for every particular $n\in \N$, $\mathit{TWINP}_n$ is either the constant zero function or the constant one function, both of which can be computed by very simple Boolean circuits. Hence a collection of circuits $\{ C_n \}$ that computes $\mathit{TWINP}$ certainly exists. The difficulty in computing $\mathit{TWINP}$ using Python or any other programming language arises from the fact that we do not know for each particular $n$ what is the circuit $C_n$ in this collection.

Deterministic finite automata (optional)

All our computational models so far - Boolean circuits and straight-line programs - were only applicable to finite functions.

In Chapter 7, we will present Turing machines, which are the central model of computation for functions of unbounded input length. However, in this section we present the more basic model of deterministic finite automata (DFAs). Automata can serve as a good stepping stone toward Turing machines, though they will not be used much in later parts of this book, and so the reader can feel free to skip ahead to Chapter 7. DFAs turn out to be equivalent in power to regular expressions: a powerful mechanism for specifying patterns, which is widely used in practice. Our treatment of automata is relatively brief. There are plenty of resources to help you get more comfortable with DFAs. In particular, Chapter 1 of Sipser’s book (Sipser, 1997) contains an excellent exposition of this material. There are also many websites with online simulators for automata, as well as translators from regular expressions to automata and vice versa.

At a high level, an algorithm is a recipe for computing an output from an input via a combination of the following steps:

  1. Read a bit from the input
  2. Update the state (working memory)
  3. Stop and produce an output

For example, recall the Python program that computes the $\mathit{XOR}$ function:

def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result

In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is $1$ and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is finite. (Indeed, in this case, result can either be $0$ or $1$.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is finite state machines). We can think of such an algorithm as a “machine” that can be in one of $C$ states, for some constant $C$. The machine starts in some initial state and then reads its input $x\in \{0,1\}^*$ one bit at a time. Whenever the machine reads a bit $\sigma \in \{0,1\}$, it transitions into a new state based on $\sigma$ and its prior state. The output of the machine is based on the final state. Every single-pass constant-memory algorithm corresponds to such a machine. If an algorithm uses $c$ bits of memory, then the contents of its memory can be represented as a string of length $c$. Therefore such an algorithm can be in one of at most $2^c$ states at any point in the execution.

We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule will be of the form “If the DFA is in state $v$ and the bit read from the input is $\sigma$ then the new state is $v'$”. At the end of the computation, we will also have a rule of the form “If the final state is one of the following … then output $1$, otherwise output $0$”. For example, the Python program above can be represented by a two-state automaton for computing $\mathit{XOR}$ of the following form:

  • Initialize in the state $0$.
  • For every state $s \in \{0,1\}$ and input bit $\sigma$ read, if $\sigma =1$ then change to state $1-s$, otherwise stay in state $s$.
  • At the end output $1$ iff $s=1$.

We can also describe a $C$-state DFA as a labeled graph of $C$ vertices. For every state $s$ and bit $\sigma$, we add a directed edge labeled with $\sigma$ between $s$ and the state $s'$ such that if the DFA is at state $s$ and reads $\sigma$ then it transitions to $s'$. (If the state stays the same then this edge will be a self-loop; similarly, if $s$ transitions to $s'$ in both the case $\sigma=0$ and $\sigma=1$ then the graph will contain two parallel edges.) We also label the set $\mathcal{S}$ of states on which the automaton will output $1$ at the end of the computation. This set is known as the set of accepting states. See Figure 6.3 for the graphical representation of the XOR automaton.

6.3: A deterministic finite automaton that computes the $\mathit{XOR}$ function. It has two states $0$ and $1$, and when it observes $\sigma$ it transitions from $v$ to $v \oplus \sigma$.

Formally, a DFA is specified by (1) the table of the $C \cdot 2$ rules, which can be represented as a transition function $T$ that maps a state $s \in [C]$ and bit $\sigma \in \{0,1\}$ to the state $s' \in [C]$ which the DFA will transition to from state $s$ on input $\sigma$, and (2) the set $\mathcal{S}$ of accepting states. This leads to the following definition.

A deterministic finite automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T,\mathcal{S})$ with $T:[C]\times \{0,1\} \rightarrow [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known as the transition function of the DFA. The set $\mathcal{S}$ is known as the set of accepting states.

Let $F:\{0,1\}^* \rightarrow \{0,1\}$ be a Boolean function with the infinite domain $\{0,1\}^*$. We say that $(T,\mathcal{S})$ computes $F$ if for every $n\in\N$ and $x\in \{0,1\}^n$, if we define $s_0=0$ and $s_{i+1} = T(s_i,x_i)$ for every $i\in [n]$, then

$$s_n \in \mathcal{S} \Leftrightarrow F(x)=1$$

Make sure not to confuse the transition function of an automaton ($T$ in Definition 6.2), which is a finite function specifying the table of “rules” which it follows, with the function the automaton computes ($F$ in Definition 6.2), which is an infinite function.
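Definition 6.2 translates directly into a few lines of Python (a sketch; we represent $T$ as a dictionary and $\mathcal{S}$ as a set):

```python
def run_DFA(T, S, x):
    """Simulate a DFA on input string x, per the definition:
    start at state 0, apply the transition function bit by bit,
    and accept (output 1) iff the final state lies in S."""
    s = 0
    for bit in x:
        s = T[(s, int(bit))]
    return 1 if s in S else 0

# the two-state XOR automaton described above: T(s, b) = s XOR b, accept at 1
T_xor = {(s, b): s ^ b for s in (0, 1) for b in (0, 1)}
S_xor = {1}
```

Note that the simulator itself is a finite piece of code, while the input x can be arbitrarily long.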

Deterministic finite automata can be defined in several equivalent ways. In particular Sipser (Sipser, 1997) defines a DFA as a five-tuple $(Q,\Sigma,\delta,q_0,F)$ where $Q$ is the set of states, $\Sigma$ is the alphabet, $\delta$ is the transition function, $q_0$ is the initial state, and $F$ is the set of accepting states. In this book the set of states is always of the form $Q=\{0,\ldots,C-1 \}$ and the initial state is always $q_0 = 0$, but this makes no difference to the computational power of these models. Also, we restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$.

Prove that there is a DFA that computes the following function $F$:

F(x)={13 divides x and i[x/3]x3ix3i+1x3i+2=0100otherwiseF(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } \forall_{i\in [|x|/3]} x_{3i} x_{3i+1} x_{3i+2} = 010 \\ 0 & \text{otherwise} \end{cases}

When asked to construct a deterministic finite automaton, it is often useful to start by constructing a single-pass constant-memory algorithm using a more general formalism (for example, using pseudocode or a Python program). Once we have such an algorithm, we can mechanically translate it into a DFA. Here is a simple Python program for computing $F$:

def F(X):
    '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
    if len(X) % 3 != 0:
        return 0
    ultimate = 0
    penultimate = 1
    antepenultimate = 0
    for idx, b in enumerate(X):
        antepenultimate = penultimate
        penultimate = ultimate
        ultimate = b
        if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0,1,0)):
            return 0
    return 1

Since we keep three Boolean variables, the working memory can be in one of $2^3 = 8$ configurations, and so the program above can be directly translated into an $8$-state DFA. While this is not needed to solve the question, by examining the resulting DFA, we can see that we can merge some states and obtain a $4$-state automaton, described in Figure 6.4. See also Figure 6.5, which depicts the execution of this DFA on a particular input.
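One possible transition table for such a $4$-state automaton (the state numbering here is our own and may differ from the figure's) tracks the position inside the current $010$ block, with a fourth state acting as a rejecting sink:

```python
# state 0: at a block boundary (accepting), expecting '0'
# state 1: just read '0', expecting '1'
# state 2: just read '01', expecting '0'
# state 3: rejecting sink (some prefix already violated the pattern)
T = {(0, 0): 1, (0, 1): 3,
     (1, 0): 3, (1, 1): 2,
     (2, 0): 0, (2, 1): 3,
     (3, 0): 3, (3, 1): 3}
S = {0}

def accepts(x):
    s = 0
    for b in x:
        s = T[(s, int(b))]
    return 1 if s in S else 0
```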

6.4: A DFA that outputs $1$ only on inputs $x\in \{0,1\}^*$ that are a concatenation of zero or more copies of $010$. The state $0$ is both the starting state and the only accepting state. The table denotes the transition function $T$, which maps the current state and symbol read to the new state.

Anatomy of an automaton (finite vs. unbounded)

Now that we are considering computational tasks with unbounded input sizes, it is crucial to distinguish between the components of our algorithm that have fixed length and the components that grow with the input size. For the case of DFAs these are the following:

Constant size components: Given a DFA $A$, the following quantities are fixed independent of the input size:

  • The number of states $C$ in $A$.

  • The transition function $T$ (which has $2C$ possible inputs, and so can be specified by a table of $2C$ rows, each entry of which is a number in $[C]$).

  • The set $\mathcal{S} \subseteq [C]$ of accepting states. This set can be described by a string in $\{0,1\}^C$ specifying which states are in $\mathcal{S}$ and which are not.

Together, the above means that we can fully describe an automaton using finitely many symbols. This is a property we require of any notion of “algorithm”: we should be able to write down a complete specification of how it produces an output from an input.

Components of unbounded size: The following quantities relating to a DFA are not bounded by any constant. We stress that these are still finite for any given input.

  • The length of the input $x\in \{0,1\}^*$ that the DFA is provided. The input length is always finite, but not a priori bounded.

  • The number of steps that the DFA takes can grow with the length of the input. Indeed, a DFA makes a single pass on the input and so it takes precisely $|x|$ steps on an input $x\in \{0,1\}^*$.

6.5: Execution of the DFA of Figure 6.4. The number of states and the transition function size are bounded, but the input can be arbitrarily long. If the DFA is at state $s$ and observes the value $\sigma$ then it moves to the state $T(s,\sigma)$. At the end of the execution the DFA accepts iff the final state is in $\mathcal{S}$.

DFA-computable functions

We say that a function $F:\{0,1\}^* \rightarrow \{0,1\}$ is DFA computable if there exists some DFA that computes $F$. In Chapter 4 we saw that every finite function is computable by some Boolean circuit. Thus, at this point, you might expect that every infinite function is computable by some DFA. However, this is very much not the case. We will soon see some simple examples of infinite functions that are not computable by DFAs, but for starters, let us prove that such functions exist.

Let $\mathit{DFACOMP}$ be the set of all Boolean functions $F:\{0,1\}^* \rightarrow \{0,1\}$ such that there exists a DFA computing $F$. Then $\mathit{DFACOMP}$ is countable.

Every DFA can be described by a finite length string, which yields an onto map from $\{0,1\}^*$ to $\mathit{DFACOMP}$: namely, the function that maps a string describing an automaton $A$ to the function that it computes.

Every DFA can be described by a finite string, representing the transition function $T$ and the set of accepting states, and every DFA $A$ computes some function $F:\{0,1\}^* \rightarrow \{0,1\}$. Thus we can define the following function $\mathit{StDC}:\{0,1\}^* \rightarrow \mathit{DFACOMP}$:

$$\mathit{StDC}(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ \mathit{ONE} & \text{otherwise} \end{cases}$$
where $\mathit{ONE}:\{0,1\}^* \rightarrow \{0,1\}$ is the constant function that outputs $1$ on all inputs (and is a member of $\mathit{DFACOMP}$). Since by definition, every function $F$ in $\mathit{DFACOMP}$ is computable by some automaton, $\mathit{StDC}$ is an onto function from $\{0,1\}^*$ to $\mathit{DFACOMP}$, which means that $\mathit{DFACOMP}$ is countable (see Section 2.4.2).

Since the set of all Boolean functions is uncountable, we get the following corollary:

There exists a Boolean function $F:\{0,1\}^* \rightarrow \{0,1\}$ that is not computable by any DFA.

If every Boolean function $F$ is computable by some DFA, then $\mathit{DFACOMP}$ equals the set $\mathit{ALL}$ of all Boolean functions, but by Theorem 2.12, the latter set is uncountable, contradicting Theorem 6.4.

Regular expressions

Searching for a piece of text is a common task in computing. At its heart, the search problem is quite simple. We have a collection $X = \{ x_0, \ldots, x_k \}$ of strings (e.g., files on a hard-drive, or student records in a database), and the user wants to find out the subset of all the $x \in X$ that are matched by some pattern (e.g., all files whose names end with the string .txt). In full generality, we can allow the user to specify the pattern by specifying a (computable) function $F:\{0,1\}^* \rightarrow \{0,1\}$, where $F(x)=1$ corresponds to the pattern matching $x$. That is, the user provides a program $P$ in a programming language such as Python, and the system returns all $x \in X$ such that $P(x)=1$. For example, one could search for all text files that contain the string important document or perhaps (letting $P$ correspond to a neural-network based classifier) all images that contain a cat. However, we don’t want our system to get into an infinite loop just trying to evaluate the program $P$! For this reason, typical systems for searching files or databases do not allow users to specify the patterns using full-fledged programming languages. Rather, such systems use restricted computational models that on the one hand are rich enough to capture many of the queries needed in practice (e.g., all filenames ending with .txt, or all phone numbers of the form (617) xxx-xxxx), but on the other hand are restricted enough so that queries can be evaluated very efficiently on huge files and in particular cannot result in an infinite loop.

One of the most popular such computational models is regular expressions. If you ever used an advanced text editor, a command-line shell, or have done any kind of manipulation of text files, then you have probably come across regular expressions.

A regular expression over some alphabet $\Sigma$ is obtained by combining elements of $\Sigma$ with the operation of concatenation, as well as | (corresponding to or) and * (corresponding to repetition zero or more times). (Common implementations of regular expressions in programming languages and shells typically include some extra operations on top of | and *, but these operations can be implemented as “syntactic sugar” using the operators | and *.) For example, the following regular expression over the alphabet $\{0,1\}$ corresponds to the set of all strings $x\in \{0,1\}^*$ where every digit is repeated at least twice:

$$(00(0^*)|11(1^*))^* \;.$$

The following regular expression over the alphabet $\{ a,\ldots,z,0,\ldots,9 \}$ corresponds to the set of all strings that consist of a sequence of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero):

$$(a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^* \;. \;\;(6.1)$$

Formally, regular expressions are defined by the following recursive definition:

A regular expression $e$ over an alphabet $\Sigma$ is a string over $\Sigma \cup \{ (,),|,*,\emptyset, \texttt{""} \}$ that has one of the following forms:

  1. $e = \sigma$ where $\sigma \in \Sigma$

  2. $e = (e' | e'')$ where $e', e''$ are regular expressions.

  3. $e = (e')(e'')$ where $e',e''$ are regular expressions. (We often drop the parentheses when there is no danger of confusion and so write this as $e' \; e''$.)

  4. $e = (e')^*$ where $e'$ is a regular expression.

Finally we also allow the following “edge cases”: $e = \emptyset$ and $e = \texttt{""}$. These are the regular expressions corresponding to accepting no strings, and accepting only the empty string, respectively.

We will drop parentheses when they can be inferred from the context. We also use the convention that OR and concatenation are left-associative, and we give highest precedence to *, then concatenation, and then OR. Thus for example we write $00^*|11$ instead of $((0)(0^*))|((1)(1))$.

Every regular expression $e$ corresponds to a function $\Phi_{e}:\Sigma^* \rightarrow \{0,1\}$ where $\Phi_{e}(x)=1$ if $x$ matches the regular expression. For example, if $e = (00|11)^*$ then $\Phi_e(110011)=1$ but $\Phi_e(101)=0$ (can you see why?).

The formal definition of $\Phi_{e}$ is one of those definitions that is more cumbersome to write than to grasp. Thus it might be easier for you first to work out the definition on your own, and then check that it matches what is written below.

Let $e$ be a regular expression over the alphabet $\Sigma$. The function $\Phi_{e}:\Sigma^* \rightarrow \{0,1\}$ is defined as follows:

  1. If $e = \sigma$ then $\Phi_{e}(x)=1$ iff $x=\sigma$.

  2. If $e = (e' | e'')$ then $\Phi_{e}(x) = \Phi_{e'}(x) \vee \Phi_{e''}(x)$ where $\vee$ is the OR operator.

  3. If $e = (e')(e'')$ then $\Phi_{e}(x) = 1$ iff there are some $x',x'' \in \Sigma^*$ such that $x$ is the concatenation of $x'$ and $x''$ and $\Phi_{e'}(x')=\Phi_{e''}(x'')=1$.

  4. If $e= (e')^*$ then $\Phi_{e}(x)=1$ iff there is some $k\in \N$ and some $x_0,\ldots,x_{k-1} \in \Sigma^*$ such that $x$ is the concatenation $x_0 \cdots x_{k-1}$ and $\Phi_{e'}(x_i)=1$ for every $i\in [k]$.

  5. Finally, for the edge cases, $\Phi_{\emptyset}$ is the constant zero function, and $\Phi_{\texttt{""}}$ is the function that only outputs $1$ on the empty string $\texttt{""}$.

We say that a regular expression $e$ over $\Sigma$ matches a string $x \in \Sigma^*$ if $\Phi_{e}(x)=1$.

The definitions above are not inherently difficult but are a bit cumbersome. So you should pause here and go over them again until you understand why they correspond to our intuitive notion of regular expressions. This is important not just for understanding regular expressions themselves (which are used time and again in a great many applications) but also for getting better at understanding recursive definitions in general.

A Boolean function is called “regular” if it outputs $1$ on precisely the set of strings that are matched by some regular expression. That is,

Let Σ\Sigma be a finite set and F:Σ{0,1}F:\Sigma^* \rightarrow \{0,1\} be a Boolean function. We say that FF is regular if F=ΦeF=\Phi_{e} for some regular expression ee.

Similarly, for every formal language LΣL \subseteq \Sigma^*, we say that LL is regular if and only if there is a regular expression ee such that xLx\in L iff ee matches xx.

Let Σ={a,b,c,d,0,1,2,3,4,5,6,7,8,9}\Sigma=\{ a,b,c,d,0,1,2,3,4,5,6,7,8,9 \} and F:Σ{0,1}F:\Sigma^* \rightarrow \{0,1\} be the function such that F(x)F(x) outputs 11 iff xx consists of one or more of the letters aa-dd followed by a sequence of one or more digits (without a leading zero). Then FF is a regular function, since F=ΦeF=\Phi_e where

e=(abcd)(abcd)(123456789)(0123456789)e = (a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^*
is the expression we saw in Equation 6.1.

If we wanted to verify, for example, that Φe(abc12078)=1\Phi_e(abc12078)=1, we can do so by noticing that the expression (abcd)(a|b|c|d) matches the string aa, (abcd)(a|b|c|d)^* matches bcbc, (123456789)(1|2|3|4|5|6|7|8|9) matches the string 11, and the expression (0123456789)(0|1|2|3|4|5|6|7|8|9)^* matches the string 20782078. Each one of those boils down to a simpler expression. For example, the expression (abcd)(a|b|c|d)^* matches the string bcbc because both of the one-character strings bb and cc are matched by the expression abcda|b|c|d.
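As a quick sanity check, this expression can be run through Python's `re` module, whose `|` and `*` operators behave like the ones defined here on such simple patterns (Python's syntax has many extra features that are not part of our definition; `re.fullmatch` plays the role of \Phi_e, since it requires the whole string to match):

```python
import re

# The expression from the example above, written in Python's `re` syntax.
# `fullmatch` requires the entire string to match, mirroring Phi_e.
pattern = "(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"

print(re.fullmatch(pattern, "abc12078") is not None)  # True
print(re.fullmatch(pattern, "ab0") is not None)       # False: leading zero
```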

Regular expressions can be defined over any finite alphabet \Sigma, but as usual, we will mostly focus our attention on the binary case, where \Sigma = \{0,1\}. Most (if not all) of the theoretical and practical general insights about regular expressions can be gleaned from studying the binary case.

Algorithms for matching regular expressions

Regular expressions would not be very useful for search if we could not evaluate, given a regular expression ee, whether a string xx is matched by ee. Luckily, there is an algorithm to do so. Specifically, there is an algorithm (think “Python program” though later we will formalize the notion of algorithms using Turing machines) that on input a regular expression ee over the alphabet {0,1}\{0,1\} and a string x{0,1}x\in \{0,1\}^*, outputs 11 iff ee matches xx (i.e., outputs Φe(x)\Phi_e(x)).

Indeed, Definition 6.7 actually specifies a recursive algorithm for computing \Phi_{e}. Specifically, each one of our operations (concatenation, OR, and star) can be thought of as reducing the task of testing whether an expression e matches a string x to testing whether some sub-expressions of e match substrings of x. Since these sub-expressions are always shorter than the original expression, this yields a recursive algorithm for checking if e matches x, which will eventually terminate at the base cases of the expressions that correspond to a single symbol or the empty string.

Algorithm 6.10 Regular expression matching

Input: Regular expression e over \Sigma, x\in \Sigma^*

Output: Φe(x)\Phi_e(x)

Procedure Match\mathsf{Match}(ee,xx)

if {e=e=\emptyset} return 00 ;

if {x=""x=""} return MatchEmpty(e)\mathsf{MatchEmpty}(e);

if {eΣe \in \Sigma} return 11 iff x=ex=e ;

if {e=(ee)e = (e' | e'')} return {Match(e,x)\mathsf{Match}(e',x) or Match(e,x)\mathsf{Match}(e'',x)} ;

if {e=(e)(e)e= (e')(e'')}

for {i \in [|x|+1]}

if {\mathsf{Match}(e',x_0 \cdots x_{i-1}) and \mathsf{Match}(e'',x_i \cdots x_{|x|-1})} return 1 ;

endfor

endif

if {e=(e)e = (e')^*}

if {e=""e'=""} return Match("",x)\mathsf{Match}("",x) ;

# ("")("")^* is the same as """"

for {i[x]i \in [|x|]}

# x0xi1x_0 \cdots x_{i-1} is shorter than xx

if {\mathsf{Match}(e,x_0 \cdots x_{i-1}) and \mathsf{Match}(e',x_i \cdots x_{|x|-1})} return 1 ;

endfor

endif

return 00

endproc

We assume above that we have a procedure \mathsf{MatchEmpty} that on input a regular expression e outputs 1 if and only if e matches the empty string "".

The key observation is that in our recursive definition of regular expressions, whenever e is made up of one or two expressions e',e'', these expressions are smaller than e. Eventually (when they have size 1) they must correspond to the non-recursive case of a single alphabet symbol. Correspondingly, the recursive calls made in Algorithm 6.10 always involve a shorter expression or (in the case of an expression of the form (e')^*) a shorter input string. Thus, we can prove the correctness of Algorithm 6.10 on inputs of the form (e,x) by induction over \min \{ |e|, |x| \}. The base case is when either x="" or e is a single alphabet symbol, "" or \emptyset. In case the expression is of the form e=(e'|e'') or e=(e')(e''), we make recursive calls with the shorter expressions e',e''. In case the expression is of the form e=(e')^*, we make recursive calls with either a shorter string x and the same expression, or with the shorter expression e' and a string x' that is equal in length or shorter than x.

Give an algorithm that on input a regular expression ee, outputs 11 if and only if Φe("")=1\Phi_e(\ensuremath{\text{\texttt{""}}})=1.

We can obtain such a recursive algorithm by using the following observations:

  1. An expression of the form ""\ensuremath{\text{\texttt{""}}} or (e)(e')^* always matches the empty string.

  2. An expression of the form σ\sigma, where σΣ\sigma \in \Sigma is an alphabet symbol, never matches the empty string.

  3. The regular expression \emptyset does not match the empty string.

  4. An expression of the form eee'|e'' matches the empty string if and only if one of ee' or ee'' matches it.

  5. An expression of the form (e)(e)(e')(e'') matches the empty string if and only if both ee' and ee'' match it.

Given the above observations, we see that the following algorithm will check if ee matches the empty string:

Algorithm 6.11 Check for empty string

Input: Regular expression e over \Sigma

Output: 1 iff e matches the empty string.

Procedure MatchEmpty\mathsf{MatchEmpty}(ee)

if {e=""e=""} return 11 ;

if {e=e=\emptyset or eΣe \in \Sigma} return 00 ;

if {e=(ee)e=(e'|e'')} return MatchEmpty(e)\mathsf{MatchEmpty}(e') or MatchEmpty(e)\mathsf{MatchEmpty}(e'') ;

if {e=(e)(e)e=(e')(e'')} return MatchEmpty(e)\mathsf{MatchEmpty}(e') and MatchEmpty(e)\mathsf{MatchEmpty}(e'') ;

if {e=(e)e=(e')^*} return 11 ;

endproc
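Algorithm 6.10 and Algorithm 6.11 can be transcribed into Python more or less line by line. The sketch below represents expressions as nested tuples: ("sym", c) for a single symbol, ("or", e1, e2), ("cat", e1, e2), ("star", e1), plus the strings "eps" for "" and "null" for the empty set. This encoding is our own choice, not part of the chapter's definitions.

```python
def match_empty(e):
    # Algorithm 6.11: does e match the empty string?
    if e == "eps": return True
    if e == "null" or e[0] == "sym": return False
    if e[0] == "or":  return match_empty(e[1]) or match_empty(e[2])
    if e[0] == "cat": return match_empty(e[1]) and match_empty(e[2])
    return True  # star always matches ""

def match(e, x):
    # Algorithm 6.10: does e match the string x?
    if e == "null": return False
    if x == "": return match_empty(e)
    if e == "eps": return False  # a nonempty string never matches ""
    if e[0] == "sym": return x == e[1]
    if e[0] == "or": return match(e[1], x) or match(e[2], x)
    if e[0] == "cat":
        # try every split of x, including the empty prefix and suffix
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    # star: the suffix matched by e' is nonempty, so the recursive
    # call on the prefix is made with a strictly shorter string
    return any(match(e, x[:i]) and match(e[1], x[i:])
               for i in range(len(x)))

# (00|11)*: even-length strings built from the blocks 00 and 11
e = ("star", ("or", ("cat", ("sym", "0"), ("sym", "0")),
              ("cat", ("sym", "1"), ("sym", "1"))))
print(match(e, "110011"))  # True
print(match(e, "101"))     # False
```

For instance, with e encoding (00|11)^*, the calls above reproduce the example from the beginning of the section.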

Efficient matching of regular expressions (optional)

Algorithm 6.10 is not very efficient. For example, given an expression involving concatenation or the “star” operation and a string of length nn, it can make nn recursive calls, and hence it can be shown that in the worst case Algorithm 6.10 can take time exponential in the length of the input string xx. Fortunately, it turns out that there is a much more efficient algorithm that can match regular expressions in linear (i.e., O(n)O(n)) time. Since we have not yet covered the topics of time and space complexity, we describe this algorithm in high level terms, without making the computational model precise. Rather we will use the colloquial notion of O(n)O(n) running time as used in introduction to programming courses and whiteboard coding interviews. We will see a formal definition of time complexity in Chapter 13.

Let ee be a regular expression. Then there is an O(n)O(n) time algorithm that computes Φe\Phi_{e}.

The implicit constant in the O(n)O(n) term of Theorem 6.12 depends on the expression ee. Thus, another way to state Theorem 6.12 is that for every expression ee, there is some constant cc and an algorithm AA that computes Φe\Phi_e on nn-bit inputs using at most cnc\cdot n steps. This makes sense since in practice we often want to compute Φe(x)\Phi_e(x) for a small regular expression ee and a large document xx. Theorem 6.12 tells us that we can do so with running time that scales linearly with the size of the document, even if it has (potentially) worse dependence on the size of the regular expression.

We prove Theorem 6.12 by exhibiting a more efficient recursive algorithm that determines whether e matches a string x\in \{0,1\}^n by reducing this task to determining whether a related expression e' matches x_0,\ldots,x_{n-2}. This will result in an expression for the running time of the form T(n) = T(n-1) + O(1), which solves to T(n)=O(n).

Restrictions of regular expressions. The central definition for the algorithm behind Theorem 6.12 is the notion of a restriction of a regular expression. The idea is that for every regular expression ee and symbol σ\sigma in its alphabet, it is possible to define a regular expression e[σ]e[\sigma] such that e[σ]e[\sigma] matches a string xx if and only if ee matches the string xσx\sigma. For example, if ee is the regular expression (01)(01)(01)^*(01) (i.e., one or more occurrences of 0101) then e[1]e[1] is equal to (01)0(01)^*0 and e[0]e[0] will be \emptyset. (Can you see why?)

Algorithm 6.13 computes the restriction e[σ]e[\sigma] given a regular expression ee and an alphabet symbol σ\sigma. It always terminates, since the recursive calls it makes are always on expressions smaller than the input expression. Its correctness can be proven by induction on the length of the regular expression ee, with the base cases being when ee is ""\ensuremath{\text{\texttt{""}}}, \emptyset, or a single alphabet symbol τ\tau.

Algorithm 6.13 Restricting regular expression

Input: Regular expression ee over Σ\Sigma, symbol σΣ\sigma \in \Sigma

Output: Regular expression e=e[σ]e'=e[\sigma] such that Φe(x)=Φe(xσ)\Phi_{e'}(x) = \Phi_e(x \sigma) for every xΣx\in \Sigma^*

Procedure Restrict\mathsf{Restrict}(ee,σ\sigma)

if {e=""e="" or e=e=\emptyset} return \emptyset ;

if {e=τe=\tau for τΣ\tau \in \Sigma} return """" if τ=σ\tau=\sigma and return \emptyset otherwise ;

if {e=(ee)e=(e'|e'')} return (Restrict(e,σ)Restrict(e,σ))(\mathsf{Restrict}(e',\sigma)| \mathsf{Restrict}(e'',\sigma)) ;

if {e=(e)e=(e')^*} return (e)(Restrict(e,σ))(e')^* (\mathsf{Restrict}(e',\sigma)) ;

if {e=(e)(e)e= (e')(e'') and Φe("")=0\Phi_{e''}("")=0} return (e)(Restrict(e,σ))(e')(\mathsf{Restrict}(e'',\sigma)) ;

if {e= (e')(e'') and \Phi_{e''}("")=1} return (e')(\mathsf{Restrict}(e'',\sigma)) \;|\; \mathsf{Restrict}(e',\sigma) ;

endproc

Using this notion of restriction, we can define the following recursive algorithm for regular expression matching:

Algorithm 6.14 Regular expression matching in linear time

Input: Regular expression e over \Sigma, x\in \Sigma^n where n\in\mathbb{N}

Output: Φe(x)\Phi_e(x)

Procedure FMatch\mathsf{FMatch}(ee,xx)

if {x=""x=""} return MatchEmpty(e)\mathsf{MatchEmpty}(e) ;

Let e' \leftarrow \mathsf{Restrict}(e,x_{n-1})

return \mathsf{FMatch}(e',x_0 \cdots x_{n-2})

endproc
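Here is one way Algorithm 6.13 and Algorithm 6.14 might look in Python, using a nested-tuple encoding of expressions (("sym", c), ("or", e1, e2), ("cat", e1, e2), ("star", e1), "eps" for "" and "null" for the empty set) that is our own choosing rather than part of the chapter; the MatchEmpty helper is included so the sketch is self-contained:

```python
def match_empty(e):
    # MatchEmpty of Solved Exercise 6.3
    if e == "eps": return True
    if e == "null" or e[0] == "sym": return False
    if e[0] == "or":  return match_empty(e[1]) or match_empty(e[2])
    if e[0] == "cat": return match_empty(e[1]) and match_empty(e[2])
    return True  # star

def restrict(e, s):
    # Algorithm 6.13: returns e[s], which matches x iff e matches x + s
    if e in ("eps", "null"): return "null"
    if e[0] == "sym": return "eps" if e[1] == s else "null"
    if e[0] == "or": return ("or", restrict(e[1], s), restrict(e[2], s))
    if e[0] == "star": return ("cat", e, restrict(e[1], s))
    # concatenation: two cases, depending on whether e'' matches ""
    if not match_empty(e[2]):
        return ("cat", e[1], restrict(e[2], s))
    return ("or", ("cat", e[1], restrict(e[2], s)), restrict(e[1], s))

def fmatch(e, x):
    # Algorithm 6.14: peel off the last symbol of x at each step
    for s in reversed(x):
        e = restrict(e, s)
    return match_empty(e)

# (00|11)*: same example expression as before
e = ("star", ("or", ("cat", ("sym", "0"), ("sym", "0")),
              ("cat", ("sym", "1"), ("sym", "1"))))
print(fmatch(e, "110011"), fmatch(e, "101"))  # True False
```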

By the definition of a restriction, for every \sigma\in \Sigma and x'\in \Sigma^*, the expression e matches x'\sigma if and only if e[\sigma] matches x'. Hence for every e and x\in \Sigma^n, \Phi_{e[x_{n-1}]}(x_0\cdots x_{n-2}) = \Phi_e(x), and Algorithm 6.14 does return the correct answer. The only remaining task is to analyze its running time. Note that Algorithm 6.14 uses the \mathsf{MatchEmpty} procedure of Solved Exercise 6.3 in the base case that x="". However, this is OK since that procedure's running time depends only on e and is independent of the length of the original input.

For simplicity, let us restrict our attention to the case that the alphabet Σ\Sigma is equal to {0,1}\{0,1\}. Define C()C(\ell) to be the maximum number of operations that Algorithm 6.13 takes when given as input a regular expression ee over {0,1}\{0,1\} of at most \ell symbols. The value C()C(\ell) can be shown to be polynomial in \ell, though this is not important for this theorem, since we only care about the dependence of the time to compute Φe(x)\Phi_e(x) on the length of xx and not about the dependence of this time on the length of ee.

Algorithm 6.14 is a recursive algorithm that, on input an expression e and a string x\in \{0,1\}^n, performs at most C(|e|) steps of computation and then calls itself on some expression e' and a string x' of length n-1. It terminates after n steps, when it reaches a string of length 0. So, the running time T(e,n) that it takes for Algorithm 6.14 to compute \Phi_e for inputs of length n satisfies the recursive equation:

T(e,n)=max{T(e[0],n1),T(e[1],n1)}+C(e)    (6.2)T(e,n) = \max \{ T(e[0],n-1) , T(e[1],n-1) \} + C(|e|) \;\;(6.2)

(In the base case n=0n=0, T(e,0)T(e,0) is equal to some constant depending only on ee.) To get some intuition for the expression Equation 6.2, let us open up the recursion for one level, writing T(e,n)T(e,n) as

T(e,n)=max{T(e[0][0],n2)+C(e[0]),T(e[0][1],n2)+C(e[0]),T(e[1][0],n2)+C(e[1]),T(e[1][1],n2)+C(e[1])}+C(e)  .\begin{aligned}T(e,n) &= \max \{ T(e[0][0],n-2) + C(|e[0]|), \\ &T(e[0][1],n-2) + C(|e[0]|), \\ &T(e[1][0],n-2) + C(|e[1]|), \\ &T(e[1][1],n-2) + C(|e[1]|) \} + C(|e|)\;.\end{aligned}

Continuing this way, we can see that T(e,n)nC(L)+O(1)T(e,n) \leq n \cdot C(L) + O(1) where LL is the largest length of any expression ee' that we encounter along the way. Therefore, the following claim suffices to show that Algorithm 6.14 runs in O(n)O(n) time:

Claim: Let ee be a regular expression over {0,1}\{0,1\}, then there is a number L(e)NL(e) \in \N, such that for every sequence of symbols α0,,αn1\alpha_0,\ldots,\alpha_{n-1}, if we define e=e[α0][α1][αn1]e' = e[\alpha_0][\alpha_1]\cdots [\alpha_{n-1}] (i.e., restricting ee to α0\alpha_0, and then α1\alpha_1 and so on and so forth), then eL(e)|e'| \leq L(e).

Proof of claim: For a regular expression ee over {0,1}\{0,1\} and α{0,1}m\alpha\in \{0,1\}^m, we denote by e[α]e[\alpha] the expression e[α0][α1][αm1]e[\alpha_0][\alpha_1]\cdots [\alpha_{m-1}] obtained by restricting ee to α0\alpha_0 and then to α1\alpha_1 and so on. We let S(e)={e[α]α{0,1}}S(e) = \{ e[\alpha] | \alpha \in \{0,1\}^* \}. We will prove the claim by showing that for every ee, the set S(e)S(e) is finite, and hence so is the number L(e)L(e) which is the maximum length of ee' for eS(e)e'\in S(e).

We prove this by induction on the structure of e. If e is a symbol, the empty string, or the empty set, then this is straightforward to show, as the only expressions S(e) can contain are the expression itself, "", and \emptyset. Otherwise we split into the cases (i) e = (e')^*, (ii) e = (e')(e''), and (iii) e = e'|e'', where e',e'' are smaller expressions (and hence by the induction hypothesis S(e') and S(e'') are finite). In case (i), if e = (e')^* then e[\alpha] is either equal to (e')^* e'[\alpha], or it is simply the empty set if e'[\alpha]=\emptyset. Since e'[\alpha] is in the set S(e'), the number of distinct expressions in S(e) is at most |S(e')|+1. In case (ii), if e = e' e'' then all the restrictions of e to strings \alpha will either have the form e' e''[\alpha] or the form e' e''[\alpha] | e'[\alpha'] where \alpha' is some string such that \alpha = \alpha' \alpha'' and e''[\alpha''] matches the empty string. Since e''[\alpha] \in S(e'') and e'[\alpha'] \in S(e'), the number of possible distinct expressions of the form e[\alpha] is at most |S(e'')| + |S(e'')|\cdot |S(e')|. In case (iii), if e = e'|e'' then e[\alpha] = e'[\alpha] | e''[\alpha], and so the number of distinct expressions in S(e) is at most |S(e')|\cdot |S(e'')|. This completes the proof of the claim.

The bottom line is that while running Algorithm 6.14 on a regular expression ee, all the expressions we ever encounter are in the finite set S(e)S(e), no matter how large the input xx is, and so the running time of Algorithm 6.14 satisfies the equation T(n)=T(n1)+CT(n) = T(n-1) + C' for some constant CC' depending on ee. This solves to O(n)O(n) where the implicit constant in the O notation can (and will) depend on ee but crucially, not on the length of the input xx.

Matching regular expressions using DFAs

Theorem 6.12 is already quite impressive, but we can do even better. Specifically, no matter how long the string x is, we can compute \Phi_e(x) by maintaining only a constant amount of memory and, moreover, making a single pass over x. That is, the algorithm will scan the input x once from start to finish, and then determine whether or not x is matched by the expression e. This is important in the common case of trying to match a short regular expression over a huge file or document that might not even fit in our computer's memory. Of course, as we have seen before, a single-pass constant-memory algorithm is simply a deterministic finite automaton. As we will see in Theorem 6.17, a function can be computed by a regular expression if and only if it can be computed by a DFA. We start with showing the "only if" direction:

Let ee be a regular expression. Then there is an algorithm that on input x{0,1}x\in \{0,1\}^* computes Φe(x)\Phi_e(x) while making a single pass over xx and maintaining a constant amount of memory.

The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.16. The idea is to replace the recursive algorithm of Algorithm 6.14 with a dynamic program, using the technique of memoization. If you haven't yet taken an algorithms course, you might not know these techniques. This is OK; while this more efficient algorithm is crucial for the many practical applications of regular expressions, it is not of great importance for this book.

Algorithm 6.16 Regular expression matching by a DFA

Input: Regular expression e over \Sigma, x\in \Sigma^n where n\in\mathbb{N}

Output: Φe(x)\Phi_e(x)

Procedure DFAMatch\mathsf{DFAMatch}(ee,xx)

Let SS(e)S \leftarrow S(e) be the set {e[α]αΣ}\{ e[\alpha] | \alpha\in \Sigma^* \} as defined in the proof of the linear-time matching theorem.

for {eSe' \in S}

Let ve1v_{e'} \leftarrow 1 if Φe("")=1\Phi_{e'}("")=1 and ve0v_{e'} \leftarrow 0 otherwise

endfor

for {i[n]i \in [n]}

Let lastevelast_{e'} \leftarrow v_{e'} for all eSe' \in S

Let velaste[xi]v_{e'} \leftarrow last_{e'[x_i]} for all eSe' \in S

endfor

return vev_e

endproc

Algorithm 6.16 checks if a given string x\in \Sigma^* is matched by the regular expression e. For every regular expression e, this algorithm has a constant number of Boolean variables (specifically a variable v_{e'} and a variable last_{e'} for every e' \in S(e), using the fact that e'[x_i] is in S(e) for every e'\in S(e)). It makes a single pass over the input string. Hence it corresponds to a DFA. We prove its correctness by induction on the length n of the input. Specifically, we will argue that before reading x_i, the variable v_{e'} is equal to \Phi_{e'}(x_0 \cdots x_{i-1}) for every e' \in S(e). In the case i=0 this holds since we initialize v_{e'} = \Phi_{e'}("") for all e' \in S(e). For i>0 this holds by induction since the inductive hypothesis implies that last_{e'} = \Phi_{e'}(x_0 \cdots x_{i-2}) for all e' \in S(e), and by the definition of the set S(e), for every e' \in S(e) and x_{i-1} \in \Sigma, the expression e'' = e'[x_{i-1}] is in S(e) and satisfies \Phi_{e'}(x_0 \cdots x_{i-1}) = \Phi_{e''}(x_0 \cdots x_{i-2}).

Equivalence of regular expressions and automata

Recall that a Boolean function F:{0,1}{0,1}F:\{0,1\}^* \rightarrow \{0,1\} is defined to be regular if it is equal to Φe\Phi_e for some regular expression ee. (Equivalently, a language L{0,1}L \subseteq \{0,1\}^* is defined to be regular if there is a regular expression ee such that ee matches xx iff xLx\in L.) The following theorem is the central result of automata theory:

Let F:{0,1}{0,1}F:\{0,1\}^* \rightarrow \{0,1\}. Then FF is regular if and only if there exists a DFA (T,S)(T,\mathcal{S}) that computes FF.

One direction follows from Theorem 6.15, which shows that for every regular expression e, the function \Phi_e can be computed by a DFA (see for example Figure 6.6). For the other direction, we show that given a DFA (T,\mathcal{S}) with C states, for every v,w \in [C] we can find a regular expression that matches x\in \{0,1\}^* if and only if the DFA, starting in state v, will end up in state w after reading x.

6.6: A deterministic finite automaton that computes the function Φ(01)\Phi_{(01)^*}.
6.7: Given a DFA of CC states, for every v,w[C]v,w \in [C] and number t{0,,C}t\in \{0,\ldots,C\} we define the function Fv,wt:{0,1}{0,1}F^t_{v,w}:\{0,1\}^* \rightarrow \{0,1\} to output one on input x{0,1}x\in \{0,1\}^* if and only if when the DFA is initialized in the state vv and is given the input xx, it will reach the state ww while going only through the intermediate states {0,,t1}\{0,\ldots,t-1\}.

Since Theorem 6.15 proves the “only if” direction, we only need to show the “if” direction. Let A=(T,S)A=(T,\mathcal{S}) be a DFA with CC states that computes the function FF. We need to show that FF is regular.

For every v,w[C]v,w \in [C], we let Fv,w:{0,1}{0,1}F_{v,w}:\{0,1\}^* \rightarrow \{0,1\} be the function that maps x{0,1}x\in \{0,1\}^* to 11 if and only if the DFA AA, starting at the state vv, will reach the state ww if it reads the input xx. We will prove that Fv,wF_{v,w} is regular for every v,wv,w. This will prove the theorem, since by Definition 6.2, F(x)F(x) is equal to the OR of F0,w(x)F_{0,w}(x) for every wSw\in \mathcal{S}. Hence if we have a regular expression for every function of the form Fv,wF_{v,w} then (using the | operation), we can obtain a regular expression for FF as well.

To give regular expressions for the functions Fv,wF_{v,w}, we start by defining the following functions Fv,wtF_{v,w}^t: for every v,w[C]v,w \in [C] and 0tC0 \leq t \leq C, Fv,wt(x)=1F_{v,w}^t(x)=1 if and only if starting from vv and observing xx, the automata reaches ww with all intermediate states being in the set [t]={0,,t1}[t]=\{0,\ldots, t-1\} (see Figure 6.7). That is, while v,wv,w themselves might be outside [t][t], Fv,wt(x)=1F_{v,w}^t(x)=1 if and only if throughout the execution of the automaton on the input xx (when initiated at vv) it never enters any of the states outside [t][t] and still ends up at ww. If t=0t=0 then [t][t] is the empty set, and hence Fv,w0(x)=1F^0_{v,w}(x)=1 if and only if the automaton reaches ww from vv directly on xx, without any intermediate state. If t=Ct=C then all states are in [t][t], and hence Fv,wt=Fv,wF_{v,w}^t= F_{v,w}.

We will prove the theorem by induction on tt, showing that Fv,wtF^t_{v,w} is regular for every v,wv,w and tt. For the base case of t=0t=0, Fv,w0F^0_{v,w} is regular for every v,wv,w since it can be described as one of the expressions ""\ensuremath{\text{\texttt{""}}}, \emptyset, 00, 11 or 010|1. Specifically, if v=wv=w then Fv,w0(x)=1F^0_{v,w}(x)=1 if and only if xx is the empty string. If vwv\neq w then Fv,w0(x)=1F^0_{v,w}(x)=1 if and only if xx consists of a single symbol σ{0,1}\sigma \in \{0,1\} and T(v,σ)=wT(v,\sigma)=w. Therefore in this case Fv,w0F^0_{v,w} corresponds to one of the four regular expressions 010|1, 00, 11 or \emptyset, depending on whether AA transitions to ww from vv when it reads either 00 or 11, only one of these symbols, or neither.

Inductive step: Now that we've seen the base case, let us prove the general case by induction. Assume, via the induction hypothesis, that for every v',w' \in [C], we have a regular expression R_{v',w'}^t that computes F_{v',w'}^t. We need to prove that F_{v,w}^{t+1} is regular for every v,w. If the automaton arrives from v at w using only intermediate states in [t+1], then it visits the t-th state zero or more times. If the path labeled by x causes the automaton to get from v to w without visiting the t-th state at all, then x is matched by the regular expression R_{v,w}^t. If the path labeled by x causes the automaton to get from v to w while visiting the t-th state k>0 times, then we can think of this path as follows:

  • First travel from v to t using only intermediate states in [t].

  • Then go from t back to itself k-1 times using only intermediate states in [t].

  • Then go from t to w using only intermediate states in [t].

Therefore in this case the string xx is matched by the regular expression Rv,tt(Rt,tt)Rt,wtR_{v,t}^t(R_{t,t}^t)^* R_{t,w}^t. (See also Figure 6.8.)

Therefore we can compute Fv,wt+1F_{v,w}^{t+1} using the regular expression

Rv,wt    Rv,tt(Rt,tt)Rt,wt  .R_{v,w}^t \;|\; R_{v,t}^t(R_{t,t}^t)^* R_{t,w}^t\;.
This completes the proof of the inductive step and hence of the theorem.

6.8: If we have regular expressions Rv,wtR_{v',w'}^{t} corresponding to Fv,wtF_{v',w'}^{t} for every v,w[C]v',w' \in [C], we can obtain a regular expression Rv,wt+1R_{v,w}^{t+1} corresponding to Fv,wt+1F_{v,w}^{t+1}. The key observation is that a path from vv to ww using {0,,t}\{0,\ldots, t \} either does not touch tt at all, in which case it is captured by the expression Rv,wtR_{v,w}^{t}, or it goes from vv to tt, comes back to tt zero or more times, and then goes from tt to ww, in which case it is captured by the expression Rv,tt(Rt,tt)Rt,wtR_{v,t}^{t}(R_{t,t}^{t})^* R_{t,w}^t.
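The inductive construction in this proof is effectively an algorithm, and it can be carried out mechanically. The sketch below builds the expressions R^t_{v,w} bottom-up for a DFA given as a transition table, emitting them in Python's `re` syntax so the result can be tested directly; the 3-state example DFA for (01)^* and all names here are our own illustrative choices, and `None` stands for the empty-set expression.

```python
import re
from itertools import product

def dfa_to_regex(T, accept, C):
    # T maps (state, symbol) -> state; `accept` is the set of accepting
    # states; C is the number of states; the start state is 0.
    R = [[None] * C for _ in range(C)]
    for v, w in product(range(C), repeat=2):   # base case t = 0
        syms = [s for s in "01" if T[(v, s)] == w]
        r = "|".join(syms) if syms else None
        if v == w:                             # "" also goes from v to v
            r = "" if r is None else "(?:%s)?" % r
        R[v][w] = r
    for t in range(C):                         # allow t as an intermediate state
        R2 = [[None] * C for _ in range(C)]
        for v, w in product(range(C), repeat=2):
            parts = [] if R[v][w] is None else [R[v][w]]
            if R[v][t] is not None and R[t][w] is not None:
                loop = "" if R[t][t] == "" else "(?:%s)*" % R[t][t]
                parts.append("(?:%s)%s(?:%s)" % (R[v][t], loop, R[t][w]))
            R2[v][w] = "|".join("(?:%s)" % p for p in parts) if parts else None
        R = R2
    final = [R[0][w] for w in accept if R[0][w] is not None]
    return "|".join("(?:%s)" % p for p in final) if final else None

# Hypothetical example: a 3-state DFA for (01)* (state 2 is a dead state)
T = {(0, "0"): 1, (0, "1"): 2, (1, "0"): 2, (1, "1"): 0,
     (2, "0"): 2, (2, "1"): 2}
pattern = dfa_to_regex(T, {0}, 3)
```

The resulting expression is large but equivalent to (01)^*; one can check by brute force that it agrees with the automaton on all short strings.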

Closure properties of regular expressions

If FF and GG are regular functions computed by the expressions ee and ff respectively, then the expression efe|f computes the function H=FGH = F \vee G defined as H(x)=F(x)G(x)H(x) = F(x) \vee G(x). Another way to say this is that the set of regular functions is closed under the OR operation. That is, if FF and GG are regular then so is FGF \vee G. An important corollary of Theorem 6.17 is that this set is also closed under the NOT operation:

If F:{0,1}{0,1}F:\{0,1\}^* \rightarrow \{0,1\} is regular then so is the function F\overline{F}, where F(x)=1F(x)\overline{F}(x) = 1 - F(x) for every x{0,1}x\in \{0,1\}^*.

If F is regular then by Theorem 6.15 it can be computed by a DFA A. But we can then construct a DFA \overline{A} which does the same computation but flips the set of accepted states. The DFA \overline{A} will compute \overline{F}. By Theorem 6.17 this implies that \overline{F} is regular as well.

Since ab=aba \wedge b = \overline{\overline{a} \vee \overline{b}}, Lemma 6.18 implies that the set of regular functions is closed under the AND operation as well. Moreover, since OR, NOT and AND are a universal basis, this set is also closed under NAND, XOR, and any other finite function. That is, we have the following corollary:

Let f:{0,1}k{0,1}f:\{0,1\}^k \rightarrow \{0,1\} be any finite Boolean function, and let F0,,Fk1:{0,1}{0,1}F_0,\ldots,F_{k-1} : \{0,1\}^* \rightarrow \{0,1\} be regular functions. Then the function G(x)=f(F0(x),F1(x),,Fk1(x))G(x) = f(F_0(x),F_1(x),\ldots,F_{k-1}(x)) is regular.

This is a direct consequence of the closure of regular functions under OR and NOT (and hence AND), combined with Theorem 4.13, that states that every ff can be computed by a Boolean circuit (which is simply a combination of the AND, OR, and NOT operations).
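For a concrete (if tiny) illustration, take F(x)=1 iff x starts with 1 and G(x)=1 iff x ends with 0. Both are regular, and the single expression 1(0|1)*0 computes their AND (a string that starts with 1 and ends with 0 must have length at least 2). The sketch below checks this equivalence by brute force using Python's `re` module; the particular expressions are our own illustrative choice:

```python
import re
from itertools import product

F = "1(0|1)*"   # starts with 1
G = "(0|1)*0"   # ends with 0
H = "1(0|1)*0"  # claimed: matches x iff both F and G match x

# brute-force check over all binary strings of length < 8
for n in range(8):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        both = bool(re.fullmatch(F, x)) and bool(re.fullmatch(G, x))
        assert both == bool(re.fullmatch(H, x))
print("1(0|1)*0 computes the AND of the two functions")
```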

Limitations of regular expressions and the pumping lemma

The efficiency of regular expression matching makes them very useful. This is why operating systems and text editors often restrict their search interface to regular expressions and do not allow searching by specifying an arbitrary function. However, this efficiency comes at a cost. As we have seen, regular expressions cannot compute every function. In fact, there are some very simple (and useful!) functions that they cannot compute. Here is one example:

Let \Sigma = \{\langle ,\rangle \} and \mathit{MATCHPAREN}:\Sigma^* \rightarrow \{0,1\} be the function that given a string of parentheses, outputs 1 if and only if every opening parenthesis is matched by a corresponding closing one. Then there is no regular expression over \Sigma that computes \mathit{MATCHPAREN}.

Lemma 6.20 is a consequence of the following result, which is known as the pumping lemma:

Let ee be a regular expression over some alphabet Σ\Sigma. Then there is some number n0n_0 such that for every wΣw\in \Sigma^* with w>n0|w|>n_0 and Φe(w)=1\Phi_{e}(w)=1, we can write w=xyzw=xyz for strings x,y,zΣx,y,z \in \Sigma^* satisfying the following conditions:

  1. y1|y| \geq 1.

  2. xyn0|xy| \leq n_0.

  3. Φe(xykz)=1\Phi_{e}(xy^kz)=1 for every kNk\in \N.

6.9: To prove the “pumping lemma” we look at a word ww that is much larger than the regular expression ee that matches it. In such a case, part of ww must be matched by some sub-expression of the form (e)(e')^*, since this is the only operator that allows matching words longer than the expression. If we look at the “leftmost” such sub-expression and define yky^k to be the string that is matched by it, we obtain the partition needed for the pumping lemma.

The idea behind the proof is the following. Let n_0 be twice the number of symbols used in the expression e. The only way that there can be some w with |w|>n_0 and \Phi_{e}(w)=1 is that e contains the * (i.e., star) operator and that there is a nonempty substring y of w that was matched by (e')^* for some sub-expression e' of e. We can then repeat y any number of times and still get a matching string. See also Figure 6.9.

The pumping lemma is a bit cumbersome to state, but one way to remember it is that it simply says the following: “if a string matching a regular expression is long enough, one of its substrings must be matched using the * operator”.
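This slogan can be seen in action on a concrete expression. For e = (01)^*, a long matched word must contain a block matched under the star, and repeating that block preserves membership. In the sketch below the decomposition w = xyz is chosen by hand, mirroring the proof's choice of a substring matched by a starred sub-expression (the particular expression and split are our own illustrative choices):

```python
import re

pattern = "(01)*"
w = "01" * 10                 # a matched word much longer than the expression
assert re.fullmatch(pattern, w)

x, y, z = "", "01", "01" * 9  # w = xyz with |y| >= 1 and |xy| small
for k in range(5):            # "pumping" y any number of times keeps a match
    assert re.fullmatch(pattern, x + y * k + z)
```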

To prove the lemma formally, we use induction on the length of the expression. Like all induction proofs, this will be somewhat lengthy, but at the end of the day it directly follows the intuition above that somewhere we must have used the star operation. Reading this proof, and in particular understanding how the formal proof below corresponds to the intuitive idea above, is a very good way to get more comfortable with inductive proofs of this form.

Our inductive hypothesis is that for an expression of n symbols, n_0=2n satisfies the conditions of the lemma. The base case is when the expression is a single symbol \sigma \in \Sigma, the expression \emptyset, or the expression "". In all these cases the conditions of the lemma are satisfied simply because n_0=2, and there exists no string w of length larger than n_0 that is matched by the expression.

We now prove the inductive step. Let e be a regular expression with n>1 symbols. We set n_0=2n and let w\in \Sigma^* be a string satisfying |w|>n_0. Since e has more than one symbol, it has one of the forms (a) e' | e'', (b) (e')(e''), or (c) (e')^*, where in all these cases the subexpressions e' and e'' have fewer symbols than e and hence satisfy the induction hypothesis.

In case (a), every string w matched by e must be matched by either e' or e''. If e' matches w then, since |w|>2|e'|, by the induction hypothesis there exist x,y,z with |y| \geq 1 and |xy| \leq 2|e'| < n_0 such that w=xyz and e' (and therefore also e=e'|e'') matches xy^kz for every k. The same argument works in the case that e'' matches w.

In case (b), if w is matched by (e')(e'') then we can write w=w'w'' where e' matches w' and e'' matches w''. We split into subcases. If |w'|>2|e'| then by the induction hypothesis there exist x,y,z' with |y| \geq 1, |xy| \leq 2|e'| < n_0 such that w'=xyz' and e' matches xy^kz' for every k\in \N. This completes the proof since if we set z=z'w'' then we see that w=w'w''=xyz and e=(e')(e'') matches xy^kz for every k\in \N. Otherwise, if |w'| \leq 2|e'| then since |w|=|w'|+|w''|>n_0=2(|e'|+|e''|), it must be that |w''|>2|e''|. Hence by the induction hypothesis there exist x',y,z such that |y| \geq 1, |x'y| \leq 2|e''| and e'' matches x'y^kz for every k\in \N. But now if we set x=w'x' we see that |xy| = |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0 and on the other hand the expression e=(e')(e'') matches xy^kz = w'x'y^kz for every k\in \N.

In case (c), if $w$ is matched by $(e')^*$ then $w = w_0\cdots w_t$ where for every $i\in [t]$, $w_i$ is a nonempty string matched by $e'$. If $|w_0|>2|e'|$, then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if $x$ is the empty string, $y=w_0$, and $z=w_1\cdots w_t$ then $|xy| \leq n_0$ and $xy^kz$ is matched by $(e')^*$ for every $k\in \N$.
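To see the pumping phenomenon concretely, here is a short Python check on the expression $(01)^*$ (an expression of our own choosing for illustration; it does not appear in the text). The star operator is what forces pumpability: the substring matched by one iteration of the starred subexpression can be repeated any number of times.

```python
import re

# Concrete illustration of pumping on the expression (01)*: a long matched
# string splits as w = x y z with |y| >= 1 so that x + y*k + z is matched
# for every k >= 0.
pattern = re.compile(r"(01)*")
w = "01" * 10                  # a long string matched by (01)*

x, y, z = "", "01", w[2:]      # split given by one iteration of the star
assert x + y + z == w

for k in range(5):             # pump y: remove it (k=0) or repeat it
    assert pattern.fullmatch(x + y * k + z) is not None
```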

When an object is recursively defined (as in the case of regular expressions), it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property $P$, then it is natural to use an inductive step that says that if $o',o'',o'''$ etc. have property $P$, then so does an object $o$ that is obtained by composing them.

Using the pumping lemma, we can easily prove Lemma 6.20 (i.e., the non-regularity of the “matching parenthesis” function):

Suppose, towards the sake of contradiction, that there is an expression $e$ such that $\Phi_{e}= \mathit{MATCHPAREN}$. Let $n_0$ be the number obtained from Theorem 6.21 and let $w =\langle^{n_0}\rangle^{n_0}$ (i.e., $n_0$ left parentheses followed by $n_0$ right parentheses). Then we see that if we write $w=xyz$ as in Theorem 6.21, the condition $|xy| \leq n_0$ implies that $y$ consists solely of left parentheses. Hence the string $xy^2z$ will contain more left parentheses than right parentheses. Hence $\mathit{MATCHPAREN}(xy^2z)=0$, but by the pumping lemma $\Phi_{e}(xy^2z)=1$, contradicting our assumption that $\Phi_{e}=\mathit{MATCHPAREN}$.

The pumping lemma is a very useful tool to show that certain functions are not computable by a regular expression. However, it is not an "if and only if" condition for regularity: there are non-regular functions that still satisfy the pumping lemma conditions. To understand the pumping lemma, it is crucial to follow the order of quantifiers in Theorem 6.21. In particular, the number $n_0$ in the statement of Theorem 6.21 depends on the regular expression (in the proof we chose $n_0$ to be twice the number of symbols in the expression). So, if we want to use the pumping lemma to rule out the existence of a regular expression $e$ computing some function $F$, we need to be able to choose an appropriate input $w\in \{0,1\}^*$ that can be arbitrarily large and satisfies $F(w)=1$. This makes sense if you think about the intuition behind the pumping lemma: we need $w$ to be large enough as to force the use of the star operator.

6.10: A cartoon of a proof using the pumping lemma that a function $F$ is not regular. The pumping lemma states that if $F$ is regular then there exists a number $n_0$ such that for every large enough $w$ with $F(w)=1$, there exists a partition of $w$ into $w=xyz$ satisfying certain conditions such that for every $k\in \N$, $F(xy^kz)=1$. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every "there exists" quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every "for every" quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction: a choice of $k$ that results in $F(xy^kz)=0$, violating the conclusion of the pumping lemma.

Prove that the following function over the alphabet $\{0,1,;\}$ is not regular: $\mathit{PAL}(w)=1$ if and only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ "reversed": the string $u_{|u|-1}\cdots u_0$. (The palindrome function is most often defined without an explicit separator character $;$, but the version with such a separator is a bit cleaner, and so we use it here. This does not make much difference, as one can easily encode the separator as a special binary string instead.)

We use the pumping lemma. Suppose toward the sake of contradiction that there is a regular expression $e$ computing $\mathit{PAL}$, and let $n_0$ be the number obtained by the pumping lemma (Theorem 6.21). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse of the all-zero string is the all-zero string, $\mathit{PAL}(w)=1$. Now, by the pumping lemma, if $\mathit{PAL}$ is computed by $e$, then we can write $w=xyz$ such that $|xy| \leq n_0$, $|y|\geq 1$ and $\mathit{PAL}(xy^kz)=1$ for every $k\in \N$. In particular, it must hold that $\mathit{PAL}(xz)=1$, but this is a contradiction, since $xz=0^{n_0-|y|};0^{n_0}$ and so its two parts are not of the same length and in particular are not the reverse of one another.

For yet another example of a pumping-lemma based proof, see Figure 6.10, which illustrates a cartoon of the proof of the non-regularity of the function $F:\{0,1\}^* \rightarrow \{0,1\}$ which is defined as $F(x)=1$ iff $x=0^n1^n$ for some $n\in \N$ (i.e., $x$ consists of a string of consecutive zeroes, followed by a string of consecutive ones of the same length).
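This argument can also be checked mechanically. The following Python sketch (with $n_0=8$ as an arbitrary concrete choice, since the true $n_0$ depends on the hypothetical expression) verifies that every split of $0^{n_0}1^{n_0}$ satisfying the lemma's conditions breaks when pumped:

```python
# Verify that no split of w = 0^n0 1^n0 with |xy| <= n0 and |y| >= 1
# survives pumping, where F(s)=1 iff s = 0^n 1^n. Here n0 = 8 is an
# arbitrary concrete choice for the demonstration.
def F(s: str) -> bool:
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

n0 = 8
w = "0" * n0 + "1" * n0
assert F(w)

for xy_len in range(1, n0 + 1):         # |xy| <= n0
    for y_len in range(1, xy_len + 1):  # |y| >= 1
        x = w[: xy_len - y_len]
        y = w[xy_len - y_len : xy_len]  # y consists only of zeroes
        z = w[xy_len:]
        assert not F(x + y * 2 + z)     # pumping once unbalances the string
```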

Answering semantic questions about regular expressions

Regular expressions have applications beyond search. For example, regular expressions are often used to define tokens (such as what is a valid variable identifier, or keyword) in the design of parsers, compilers and interpreters for programming languages. Regular expressions have other applications too: for example, in recent years, the world of networking moved from fixed topologies to "software defined networks". Such networks are routed by programmable switches that can implement policies such as "if packet is secured by SSL then forward it to A, otherwise forward it to B". To represent such policies we need a language that is on one hand sufficiently expressive to capture the policies we want to implement, but on the other hand sufficiently restrictive so that we can quickly execute them at network speed and also be able to answer questions such as "can C see the packets moved from A to B?". The NetKAT network programming language uses a variant of regular expressions to achieve precisely that. For this application, it is important that we are not merely able to answer whether an expression $e$ matches a string $x$ but also answer semantic questions about regular expressions such as "do expressions $e$ and $e'$ compute the same function?" and "does there exist a string $x$ that is matched by the expression $e$?". The following theorem shows that we can answer the latter question:

There is an algorithm that given a regular expression $e$, outputs $1$ if and only if $\Phi_{e}$ is the constant zero function.

The idea is that we can directly observe this from the structure of the expression. The only way a regular expression $e$ computes the constant zero function is if $e$ has the form $\emptyset$ or is obtained by concatenating $\emptyset$ with other expressions.

Define a regular expression to be "empty" if it computes the constant zero function. Given a regular expression $e$, we can determine if $e$ is empty using the following rules:

  • If $e$ has the form $\sigma$ or `""` then it is not empty.

  • If $e$ is not empty then $e|e'$ is not empty for every $e'$.

  • For every $e$, the expression $e^*$ is not empty, since $e^*$ always matches the empty string `""` (this holds even if $e$ itself is empty).

  • If $e$ and $e'$ are both not empty then $e\; e'$ is not empty.

  • \emptyset is empty.

Using these rules, it is straightforward to come up with a recursive algorithm to determine emptiness.
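For concreteness, here is a minimal Python sketch of such a recursive algorithm. The tuple-based encoding of expressions (`"empty"`, `"eps"`, `"sym"`, `"or"`, `"concat"`, `"star"`) is our own ad-hoc choice, not a representation from the text; note that it treats a starred expression as never empty, since $(e')^*$ always matches the empty string.

```python
# Recursive emptiness check for regular expressions, following the rules
# above. Expressions are encoded as nested tuples (an ad-hoc encoding):
#   ("empty",)          the empty-set expression
#   ("eps",)            the "" expression
#   ("sym", c)          a single symbol c
#   ("or", e1, e2), ("concat", e1, e2), ("star", e1)
def is_empty(e) -> bool:
    op = e[0]
    if op == "empty":
        return True
    if op in ("eps", "sym"):
        return False
    if op == "or":              # e1|e2 is empty iff both parts are empty
        return is_empty(e[1]) and is_empty(e[2])
    if op == "concat":          # e1 e2 is empty iff either part is empty
        return is_empty(e[1]) or is_empty(e[2])
    if op == "star":            # e1* matches the empty string, so never empty
        return False
    raise ValueError(f"unknown operator {op!r}")

# (0|empty-set) followed by (1 empty-set): the second part concatenates
# with the empty-set expression, so the whole expression is empty.
e = ("concat",
     ("or", ("sym", "0"), ("empty",)),
     ("concat", ("sym", "1"), ("empty",)))
assert is_empty(e)
assert not is_empty(("star", ("empty",)))  # star of empty set still matches ""
```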

Using Theorem 6.23, we can obtain an algorithm that determines whether or not two regular expressions $e$ and $e'$ are equivalent, in the sense that they compute the same function.

Let $\mathit{REGEQ}:\{0,1\}^* \rightarrow \{0,1\}$ be the function that on input (a string representing) a pair of regular expressions $e,e'$ satisfies $\mathit{REGEQ}(e,e')=1$ if and only if $\Phi_{e} = \Phi_{e'}$. Then there is an algorithm that computes $\mathit{REGEQ}$.

The idea is to show that given a pair of regular expressions $e$ and $e'$, we can find an expression $e''$ such that $\Phi_{e''}(x)=1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to determine equivalence of $e$ and $e'$.

We will prove Theorem 6.24 from Theorem 6.23. (The two theorems are in fact equivalent: it is easy to prove Theorem 6.23 from Theorem 6.24, since checking for emptiness is the same as checking equivalence with the expression $\emptyset$.) Given two regular expressions $e$ and $e'$, we will compute an expression $e''$ such that $\Phi_{e''}(x) =1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is equivalent to $e'$ if and only if $e''$ is empty.

We start with the observation that for every pair of bits $a,b \in \{0,1\}$, $a \neq b$ if and only if

$$(a \wedge \overline{b}) \;\vee\; (\overline{a} \wedge b) = 1\;.$$
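This is just the XOR of the two bits, and the identity can be confirmed by exhausting all four cases in a couple of lines of Python:

```python
# Check: a != b  iff  (a AND (NOT b)) OR ((NOT a) AND b), over all bit pairs.
for a in (0, 1):
    for b in (0, 1):
        xor = (a == 1 and b == 0) or (a == 0 and b == 1)
        assert xor == (a != b)
```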

Hence we need to construct $e''$ such that for every $x$,

$$\Phi_{e''}(x) = (\Phi_{e}(x) \wedge \overline{\Phi_{e'}(x)}) \;\vee\; (\overline{\Phi_{e}(x)} \wedge \Phi_{e'}(x)) \;. \;\;(6.3)$$

To construct the expression $e''$, we will show how given any pair of expressions $e$ and $e'$, we can construct expressions $e\wedge e'$ and $\overline{e}$ that compute the functions $\Phi_{e} \wedge \Phi_{e'}$ and $\overline{\Phi_{e}}$ respectively. (Computing the expression for $e \vee e'$ is straightforward using the $|$ operation of regular expressions.)

Specifically, by Lemma 6.18, regular functions are closed under negation, which means that for every regular expression $e$, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_{e}(x)$ for every $x\in \{0,1\}^*$. Now, for every two expressions $e$ and $e'$, the expression

$$e \wedge e' = \overline{(\overline{e} \,|\, \overline{e'})}$$

computes the AND of the two expressions. Given these two transformations, we see that for every pair of regular expressions $e$ and $e'$ we can find a regular expression $e''$ satisfying Equation 6.3 such that $e''$ is empty if and only if $e$ and $e'$ are equivalent.
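The construction in the proof yields an exact decision procedure, but complementing a regular expression by hand is cumbersome. As a quick sanity check (not a decision procedure, since agreement on short strings does not imply equivalence in general), one can compare two expressions on all strings up to a given length using Python's `re` module:

```python
import itertools
import re

# Heuristic equivalence check: compare two regular expressions on every
# binary string of length at most max_len. Agreement here does NOT prove
# equivalence; it is only a sanity check.
def agree_up_to(e1: str, e2: str, max_len: int = 8) -> bool:
    p1, p2 = re.compile(e1), re.compile(e2)
    for n in range(max_len + 1):
        for bits in itertools.product("01", repeat=n):
            s = "".join(bits)
            if (p1.fullmatch(s) is None) != (p2.fullmatch(s) is None):
                return False
    return True

assert agree_up_to(r"(0|1)*", r"(0*1*)*")   # two ways to match all strings
assert not agree_up_to(r"0*", r"(00)*")     # these differ already on "0"
```

For DFAs, differences (if any) show up on strings of length bounded by the product of the number of states, so a bounded search can in principle be made exhaustive; the sketch above makes no such guarantee.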

  • We model computational tasks on arbitrarily large inputs using infinite functions $F:\{0,1\}^* \rightarrow \{0,1\}^*$.
  • Such functions take an arbitrarily long (but still finite!) string as input, and cannot be described by a finite table of inputs and outputs.
  • A function with a single bit of output is known as a Boolean function, and the task of computing it is equivalent to deciding a language $L\subseteq \{0,1\}^*$.
  • Deterministic finite automata (DFAs) are one simple model for computing (infinite) Boolean functions.
  • There are some functions that cannot be computed by DFAs.
  • The set of functions computable by DFAs is the same as the set of functions computable by regular expressions.

Exercises

Suppose that $F,G:\{0,1\}^* \rightarrow \{0,1\}$ are regular. For each one of the following definitions of the function $H$, either prove that $H$ is always regular or give a counterexample for regular $F,G$ that would make $H$ not regular.

  1. $H(x) = F(x) \vee G(x)$.

  2. $H(x) = F(x) \wedge G(x)$.

  3. $H(x) = \mathit{NAND}(F(x),G(x))$.

  4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1}x_{n-2} \cdots x_0$ for $n=|x|$.

  5. $H(x) = \begin{cases}1 & x=uv \text{ s.t. } F(u)=G(v)=1 \\ 0 & \text{otherwise} \end{cases}$

  6. $H(x) = \begin{cases}1 & x=uu \text{ s.t. } F(u)=G(u)=1 \\ 0 & \text{otherwise} \end{cases}$

  7. $H(x) = \begin{cases}1 & x=uu^R \text{ s.t. } F(u)=G(u)=1 \\ 0 & \text{otherwise} \end{cases}$

One of the following two functions mapping $\{0,1\}^*$ to $\{0,1\}$ can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.

  • $F(x)=1$ if $4$ divides $\sum_{i=0}^{|x|-1} x_i$, and $F(x)=0$ otherwise.

  • $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$, and $G(x)=0$ otherwise.

  1. Prove that the following function $F:\{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x\in \{0,1\}^*$, $F(x)=1$ iff $x$ is of the form $x=1^{3^i}$ for some $i>0$.

  2. Prove that the following function $F:\{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x\in \{0,1\}^*$, $F(x)=1$ iff $\sum_j x_j = 3^i$ for some $i>0$.

Bibliographical notes

The relation of regular expressions to finite automata is a beautiful topic, which we only touch upon in this text. It is covered more extensively in (Sipser, 1997), (Hopcroft, Motwani, Ullman, 2014), and (Kozen, 1997). These texts also discuss topics such as non-deterministic finite automata (NFAs) and the relation between context-free grammars and pushdown automata.

The automaton of Figure 6.4 was generated using the FSM simulator of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.12 is closely related to the Myhill-Nerode Theorem. One direction of the Myhill-Nerode theorem can be stated as saying that if $e$ is a regular expression then there is at most a finite number of strings $z_0,\ldots,z_{k-1}$ such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i\neq j < k$.


Copyright 2023, Boaz Barak. Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
