Functions with Infinite domains, Automata, and Regular expressions
- Define functions on unbounded length inputs, that cannot be described by a finite size table of inputs and outputs.
- Equivalence with the task of deciding membership in a language.
- Deterministic finite automata (optional): A simple example of a model for unbounded computation.
- Equivalence with regular expressions.
“An algorithm is a finite answer to an infinite number of questions.”, Attributed to Stephen Kleene.
The model of Boolean circuits (or equivalently, the NAND-CIRC programming language) has one very significant drawback: a Boolean circuit can only compute a finite function $f:\{0,1\}^n \rightarrow \{0,1\}^m$. In particular, since every gate has two inputs, a circuit of size $s$ can compute on an input of length at most $2s$. Thus this model does not capture our intuitive notion of an algorithm as a single recipe to compute a potentially infinite function. For example, the standard elementary school multiplication algorithm is a single algorithm that multiplies numbers of all lengths. However, we cannot express this algorithm as a single circuit, but rather need a different circuit (or equivalently, a NAND-CIRC program) for every input length (see Figure 6.1).

In this chapter, we extend our definition of computational tasks to consider functions with the unbounded domain $\{0,1\}^*$. We focus on the question of defining what tasks to compute, mostly leaving the question of how to compute them to later chapters, where we will see Turing machines and other computational models for computing on unbounded inputs. However, we will see one example of a simple restricted model of computation - deterministic finite automata (DFAs).
In this chapter, we discuss functions that take as input strings of arbitrary length. We will often focus on the special case of Boolean functions, where the output is a single bit. These are still infinite functions since their inputs have unbounded length and hence such a function cannot be computed by any single Boolean circuit.
In the second half of this chapter, we discuss finite automata, a computational model that can compute unbounded length functions. Finite automata are not as powerful as Python or other general-purpose programming languages but can serve as an introduction to these more general models. We also show a beautiful result - the functions computable by finite automata are precisely the ones that correspond to regular expressions. However, the reader can also feel free to skip automata and go straight to our discussion of Turing machines in Chapter 7.
Functions with inputs of unbounded length
Up until now, we considered the computational task of mapping some string of length $n$ into a string of length $m$. However, in general, computational tasks can involve inputs of unbounded length. For example, the following Python function computes the function $XOR:\{0,1\}^* \rightarrow \{0,1\}$, where $XOR(x)$ equals $1$ iff the number of $1$'s in $x$ is odd. (In other words, $XOR(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x\in \{0,1\}^*$.) As simple as it is, the $XOR$ function cannot be computed by a single Boolean circuit. Rather, for every $n$, we can compute $XOR_n$ (the restriction of $XOR$ to $\{0,1\}^n$) using a different circuit (e.g., see Figure 6.2).
def XOR(X):
    '''Takes list X of 0's and 1's
    Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result

Previously in this book, we studied the computation of finite functions $f:\{0,1\}^n \rightarrow \{0,1\}^m$. Such a function $f$ can always be described by listing all the $2^n$ values it takes on inputs $x\in \{0,1\}^n$. In this chapter, we consider functions such as $XOR$ that take inputs of unbounded size. While we can describe $XOR$ using a finite number of symbols (in fact, we just did so above), it takes infinitely many possible inputs, and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks, including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on. To contrast with the finite case, we will sometimes call a function $F:\{0,1\}^* \rightarrow \{0,1\}$ (or $F:\{0,1\}^* \rightarrow \{0,1\}^*$) infinite. However, this does not mean that $F$ takes as input strings of infinite length! It just means that $F$ can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of $F$ on different inputs.
A function $F:\{0,1\}^* \rightarrow \{0,1\}^*$ specifies the computational task of mapping an input $x\in \{0,1\}^*$ into the output $F(x)$.
As we have seen before, restricting attention to functions that use binary strings as inputs and outputs does not detract from our generality, since other objects, including numbers, lists, matrices, images, videos, and more, can be encoded as binary strings.
As before, it is essential to differentiate between specification and implementation. For example, consider the following function:
$$TWINP(x) = \begin{cases} 1 & \text{there exists } p > |x| \text{ such that } p \text{ and } p+2 \text{ are both prime} \\ 0 & \text{otherwise} \end{cases}$$
This is a mathematically well-defined function. For every $x$, $TWINP(x)$ has a unique value which is either $0$ or $1$. However, at the moment, no one knows of a Python program that computes this function. The Twin prime conjecture posits that for every $n$ there exists $p>n$ such that both $p$ and $p+2$ are primes. If this conjecture is true, then $TWINP$ is easy to compute indeed - the program def T(x): return 1 will do the trick. However, mathematicians have tried unsuccessfully to prove this conjecture since 1849. That said, whether or not we know how to implement the function $TWINP$, the definition above provides its specification.
Varying inputs and outputs
Many of the functions that interest us take more than one input. For example, the function
$$MULT(x,y) = x\cdot y$$
takes the binary representation of a pair of integers $x,y \in \mathbb{N}$, and outputs the binary representation of their product $x \cdot y$. However, since we can represent a pair of strings as a single string, we will consider functions such as $MULT$ as mapping $\{0,1\}^*$ to $\{0,1\}^*$. We will typically not be concerned with low-level details such as the precise way to represent a pair of integers as a string, since virtually all choices will be equivalent for our purposes.
Another example of a function we want to compute is
$$PALINDROME(x) = \begin{cases} 1 & x_i = x_{|x|-1-i} \text{ for every } i \in [|x|] \\ 0 & \text{otherwise} \end{cases}$$
$PALINDROME$ has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.
“Booleanizing” functions. Sometimes it might be convenient to obtain a Boolean variant for a non-Boolean function. For example, the following is a Boolean variant of $MULT$.
$$BMULT(x,y,i) = \begin{cases} i\text{-th bit of } x\cdot y & i < |x\cdot y| \\ 0 & \text{otherwise} \end{cases}$$
If we can compute $BMULT$ via any programming language such as Python, C, Java, etc., we can compute $MULT$ as well, and vice versa.
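Show that for every function $F:\{0,1\}^* \rightarrow \{0,1\}^*$, there exists a Boolean function $BF:\{0,1\}^* \rightarrow \{0,1\}$ such that a Python program to compute $BF$ can be transformed into a program to compute $F$ and vice versa.
For every $F:\{0,1\}^* \rightarrow \{0,1\}^*$, we can define
$$BF(x,i,b) = \begin{cases} F(x)_i & i<|F(x)|,\; b=0 \\ 1 & i<|F(x)|,\; b=1 \\ 0 & i \geq |F(x)| \end{cases}$$
to be the function that on input $x \in \{0,1\}^*$, $i \in \mathbb{N}$, $b\in \{0,1\}$ outputs the $i$-th bit of $F(x)$ if $b=0$ and $i<|F(x)|$. If $b=1$, then $BF(x,i,b)$ outputs $1$ iff $i<|F(x)|$, and hence this allows us to compute the length of $F(x)$.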
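Computing $BF$ from $F$ is straightforward. For the other direction, given a Python function BF that computes $BF$, we can compute $F(x)$ by querying BF(x,i,1) for $i=0,1,2,\ldots$ until it returns $0$ (which tells us the length of $F(x)$), and reading off the $i$-th output bit as BF(x,i,0) along the way.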
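A minimal sketch of this direction, assuming BF is a Python function taking the arguments (x, i, b) as in the definition above. The names compute_F, make_BF, and reverse are hypothetical helpers introduced only for this illustration.

```python
def compute_F(BF, x):
    """Recover F(x) as a list of bits using only calls to the Boolean function BF."""
    result = []
    i = 0
    # BF(x, i, 1) is 1 exactly when i < |F(x)|, so the loop stops at the length of F(x).
    while BF(x, i, 1):
        result.append(BF(x, i, 0))  # BF(x, i, 0) is the i-th bit of F(x)
        i += 1
    return result


def make_BF(F):
    """The easy direction: build the Boolean variant BF from a function F."""
    def BF(x, i, b):
        y = F(x)
        if i >= len(y):
            return 0
        return 1 if b == 1 else y[i]
    return BF


def reverse(x):
    """Hypothetical example of a non-Boolean function: reverse the input bits."""
    return x[::-1]
```

For instance, compute_F(make_BF(reverse), [1, 0, 0]) rebuilds the reversed list one bit at a time, making one length query and one bit query per output position.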
Formal Languages
For every Boolean function $F:\{0,1\}^* \rightarrow \{0,1\}$, we can define the set $L_F = \{ x \;|\; F(x)=1 \}$ of strings on which $F$ outputs $1$. Such sets are known as languages. This name is rooted in formal language theory as pursued by linguists such as Noam Chomsky. A formal language is a subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet $\Sigma$). The membership or decision problem for a language $L$ is the task of determining, given $x\in \{0,1\}^*$, whether or not $x\in L$. If we can compute the function $F$, then we can decide membership in the language $L_F$ and vice versa. Hence, many texts such as (Sipser, 1997) refer to the task of computing a Boolean function as “deciding a language”. In this book, we mostly describe computational tasks using the function notation, which is easier to generalize to computation with more than one bit of output. However, since the language terminology is so popular in the literature, we will sometimes mention it.
Restrictions of functions
If $F:\{0,1\}^* \rightarrow \{0,1\}$ is a Boolean function and $n\in \mathbb{N}$ then the restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function $f:\{0,1\}^n \rightarrow \{0,1\}$ such that $f(x) = F(x)$ for every $x\in \{0,1\}^n$. That is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be computed by a Boolean circuit, implying the following theorem:
Let $F:\{0,1\}^* \rightarrow \{0,1\}$. Then there is a collection $\{ C_n \}_{n\in \{1,2,\ldots\}}$ of circuits such that for every $n>0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of length $n$.
This is an immediate corollary of the universality of Boolean circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of this circuit is at most $c \cdot 2^n / n$ gates for some constant $c$.
In particular, Theorem 6.1 implies that there exists such a circuit collection $\{ C_n \}$ even for the $TWINP$ function we described before, even though we do not know of any program to compute it. Indeed, this is not that surprising: for every particular $n\in \mathbb{N}$, $TWINP_n$ is either the constant zero function or the constant one function, both of which can be computed by very simple Boolean circuits. Hence a collection of circuits $\{ C_n \}$ that computes $TWINP$ certainly exists. The difficulty in computing $TWINP$ using Python or any other programming language arises from the fact that we do not know for each particular $n$ what is the circuit $C_n$ in this collection.
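To make the notion of a restriction concrete, the following sketch tabulates $F_n$ for a given infinite function $F$ and a fixed input length $n$. The helper name restriction_table is hypothetical, and XOR is repeated here only so that the block is self-contained. The table has $2^n$ rows, which is exactly why a single table cannot describe $F$ itself.

```python
from itertools import product


def XOR(X):
    """Parity of a list of bits, as in the chapter's running example."""
    result = 0
    for bit in X:
        result = (result + bit) % 2
    return result


def restriction_table(F, n):
    """Truth table of the restriction F_n: maps each n-bit tuple to F's output on it."""
    return {bits: F(list(bits)) for bits in product((0, 1), repeat=n)}


table = restriction_table(XOR, 2)
```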
Deterministic finite automata (optional)
All our computational models so far - Boolean circuits and straight-line programs - were only applicable for finite functions.
In Chapter 7, we will present Turing machines, which are the central models of computation for unbounded input length functions. However, in this section we present the more basic model of deterministic finite automata (DFA). Automata can serve as a good stepping-stone for Turing machines, though they will not be used much in later parts of this book, and so the reader can feel free to skip ahead to Chapter 7. DFAs turn out to be equivalent in power to regular expressions: a powerful mechanism to specify patterns, which is widely used in practice. Our treatment of automata is relatively brief. There are plenty of resources that help you get more comfortable with DFAs. In particular, Chapter 1 of Sipser’s book (Sipser, 1997) contains an excellent exposition of this material. There are also many websites with online simulators for automata, as well as translators from regular expressions to automata and vice versa.
At a high level, an algorithm is a recipe for computing an output from an input via a combination of the following steps:
- Read a bit from the input
- Update the state (working memory)
- Stop and produce an output
For example, recall the Python program that computes the $XOR$ function:
def XOR(X):
    '''Takes list X of 0's and 1's
    Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is $1$ and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is finite. (Indeed, in this case, result can either be $0$ or $1$.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is a finite state machine). We can think of such an algorithm as a “machine” that can be in one of $C$ states, for some constant $C$. The machine starts in some initial state and then reads its input $x\in \{0,1\}^*$ one bit at a time. Whenever the machine reads a bit $\sigma \in \{0,1\}$, it transitions into a new state based on $\sigma$ and its prior state. The output of the machine is based on the final state. Every single-pass constant-memory algorithm corresponds to such a machine. If an algorithm uses $c$ bits of memory, then the contents of its memory can be represented as a string of length $c$. Therefore such an algorithm can be in one of at most $2^c$ states at any point in the execution.
We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule will be of the form “If the DFA is in state $v$ and the bit read from the input is $\sigma$ then the new state is $v'$”. At the end of the computation, we will also have a rule of the form “If the final state is one of the following … then output $1$, otherwise output $0$”. For example, the Python program above can be represented by a two-state automaton for computing $XOR$ of the following form:
- Initialize in the state $0$.
- For every state $s \in \{0,1\}$ and input bit $\sigma$ read, if $\sigma=1$ then change to state $1-s$, otherwise stay in state $s$.
- At the end output $1$ iff $s=1$.
We can also describe a $C$-state DFA as a labeled graph of $C$ vertices. For every state $s$ and bit $\sigma$, we add a directed edge labeled with $\sigma$ between $s$ and the state $s'$ such that if the DFA is at state $s$ and reads $\sigma$ then it transitions to $s'$. (If the state stays the same then this edge will be a self-loop; similarly, if $s$ transitions to $s'$ in both the case $\sigma=0$ and $\sigma=1$ then the graph will contain two parallel edges.) We also label the set $\mathcal{S}$ of states on which the automaton will output $1$ at the end of the computation. This set is known as the set of accepting states. See Figure 6.3 for the graphical representation of the XOR automaton.

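Formally, a DFA is specified by (1) the table of the $C \cdot 2$ rules, which can be represented as a transition function $T$ that maps a state $s \in [C]$ and bit $\sigma \in \{0,1\}$ to the state $s' \in [C]$ which the DFA will transition to from state $s$ on input $\sigma$ and (2) the set $\mathcal{S}$ of accepting states. This leads to the following definition.
A deterministic finite automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T,\mathcal{S})$ with $T:[C]\times \{0,1\} \rightarrow [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known as the transition function of the DFA. The set $\mathcal{S}$ is known as the set of accepting states.
Let $F:\{0,1\}^* \rightarrow \{0,1\}$ be a Boolean function with the infinite domain $\{0,1\}^*$. We say that $(T,\mathcal{S})$ computes the function $F$ if for every $n\in \mathbb{N}$ and $x\in \{0,1\}^n$, if we define $s_0=0$ and $s_{i+1} = T(s_i,x_i)$ for every $i\in [n]$, then
$$s_n \in \mathcal{S} \;\Leftrightarrow\; F(x)=1$$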
Make sure not to confuse the transition function of an automaton ($T$ in Definition 6.2), which is a finite function specifying the table of “rules” which it follows, with the function the automaton computes ($F$ in Definition 6.2), which is an infinite function.
Deterministic finite automata can be defined in several equivalent ways. In particular Sipser (Sipser, 1997) defines a DFA as a five-tuple $(Q,\Sigma,\delta,q_0,F)$ where $Q$ is the set of states, $\Sigma$ is the alphabet, $\delta$ is the transition function, $q_0$ is the initial state, and $F$ is the set of accepting states. In this book the set of states is always of the form $Q=\{0,\ldots,C-1\}$ and the initial state is always $q_0=0$, but this makes no difference to the computational power of these models. Also, we restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$.
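The definition of a DFA and of the function it computes translates almost line by line into code. The following is a minimal sketch, assuming (as a representational choice of this example, not something fixed by the text) that the transition function is stored as a dictionary keyed by (state, bit) pairs and that the accepting states are a Python set. The name run_dfa is hypothetical.

```python
def run_dfa(T, S, x):
    """Return 1 iff the DFA (T, S) is in an accepting state after reading x.

    T maps (state, bit) -> state, S is the set of accepting states,
    and the initial state is always 0, following the chapter's convention.
    """
    s = 0
    for bit in x:
        s = T[(s, bit)]  # s_{i+1} = T(s_i, x_i)
    return 1 if s in S else 0


# The two-state XOR automaton: reading a 1 flips the state, reading a 0 keeps it.
XOR_T = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
XOR_S = {1}
```

Note that T and S have fixed size, while the loop runs once per input bit, so the same finite description handles inputs of every length.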
Prove that there is a DFA that computes the following function $F$:
$$F(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } x_{3i}\,x_{3i+1}\,x_{3i+2} = 010 \text{ for every } i \in [|x|/3] \\ 0 & \text{otherwise} \end{cases}$$
When asked to construct a deterministic finite automaton, it is often useful to start by constructing a single-pass constant-memory algorithm using a more general formalism (for example, using pseudocode or a Python program). Once we have such an algorithm, we can mechanically translate it into a DFA. Here is a simple Python program for computing $F$:
def F(X):
    '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
    if len(X) % 3 != 0:
        return False
    ultimate = 0
    penultimate = 1
    antepenultimate = 0
    for idx, b in enumerate(X):
        antepenultimate = penultimate
        penultimate = ultimate
        ultimate = b
        if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0,1,0)):
            return False
    return True
Since we keep three Boolean variables, the working memory can be in one of $2^3 = 8$ configurations, and so the program above can be directly translated into an $8$ state DFA. While this is not needed to solve the question, by examining the resulting DFA, we can see that we can merge some states and obtain a $4$ state automaton, described in Figure 6.4. See also Figure 6.5, which depicts the execution of this DFA on a particular input.
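One possible four-state automaton for this function can be written down directly. This is a sketch under our own assumptions: the state numbering below (0 = at a block boundary, 1 = just read "0", 2 = just read "01", 3 = a rejecting sink) is chosen for this illustration and may differ from the labeling used in Figure 6.4. The names T010, S010, and run_dfa are hypothetical.

```python
def run_dfa(T, S, x):
    """Simulate the DFA (T, S) on the bit list x, starting from state 0."""
    s = 0
    for bit in x:
        s = T[(s, bit)]
    return 1 if s in S else 0


# State 0: start of a block (accepting); state 1: read "0"; state 2: read "01";
# state 3: the input has already deviated from 010010... and can never be accepted.
T010 = {
    (0, 0): 1, (0, 1): 3,
    (1, 0): 3, (1, 1): 2,
    (2, 0): 0, (2, 1): 3,
    (3, 0): 3, (3, 1): 3,
}
S010 = {0}
```

The length check from the Python program comes for free here: the automaton is in state 0 only after reading a whole number of 010 blocks.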

Anatomy of an automaton (finite vs. unbounded)
Now that we are considering computational tasks with unbounded input sizes, it is crucial to distinguish between the components of our algorithm that have fixed length and the components that grow with the input size. For the case of DFAs these are the following:
Constant size components: Given a DFA $(T,\mathcal{S})$, the following quantities are fixed independent of the input size:
The number of states $C$ in the DFA.
The transition function $T$ (which has $2C$ inputs, and so can be specified by a table of $2C$ rows, each entry in which is a number in $[C]$).
The set $\mathcal{S} \subseteq [C]$ of accepting states. This set can be described by a string in $\{0,1\}^C$ specifying which states are in $\mathcal{S}$ and which are not.
Together the above means that we can fully describe an automaton using finitely many symbols. This is a property we require out of any notion of “algorithm”: we should be able to write down a complete specification of how it produces an output from an input.
Components of unbounded size: The following quantities relating to a DFA are not bounded by any constant. We stress that these are still finite for any given input.
The length of the input that the DFA is provided. The input length is always finite, but not a priori bounded.
The number of steps that the DFA takes can grow with the length of the input. Indeed, a DFA makes a single pass on the input and so it takes precisely $|x|$ steps on an input $x\in \{0,1\}^*$.

DFA-computable functions
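We say that a function $F:\{0,1\}^* \rightarrow \{0,1\}$ is DFA computable if there exists some DFA that computes $F$. In Chapter 4 we saw that every finite function is computable by some Boolean circuit. Thus, at this point, you might expect that every infinite function is computable by some DFA. However, this is very much not the case. We will soon see some simple examples of infinite functions that are not computable by DFAs, but for starters, let us prove that such functions exist.
Let $DFACOMP$ be the set of all Boolean functions $F:\{0,1\}^* \rightarrow \{0,1\}$ such that there exists a DFA computing $F$. Then $DFACOMP$ is countable.
Every DFA can be described by a finite length string, which yields an onto map from $\{0,1\}^*$ to $DFACOMP$: namely, the function that maps a string describing an automaton $A$ to the function that it computes.
Every DFA can be described by a finite string, representing the transition function $T$ and the set of accepting states, and every DFA $A$ computes some function $F:\{0,1\}^* \rightarrow \{0,1\}$. Thus we can define the following function $StDC:\{0,1\}^* \rightarrow DFACOMP$:
$$StDC(a) = \begin{cases} F & a \text{ represents an automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ ONE & \text{otherwise} \end{cases}$$
where $ONE$ is the constant function that outputs $1$ on every input (and which is itself computable by a DFA). Since every function in $DFACOMP$ is computed by some automaton, $StDC$ is onto, and hence $DFACOMP$ is countable.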
Since the set of all Boolean functions is uncountable, we get the following corollary:
There exists a Boolean function that is not computable by any DFA.
If every Boolean function $F$ is computable by some DFA, then $DFACOMP$ equals the set of all Boolean functions, but by Theorem 2.12, the latter set is uncountable, contradicting Theorem 6.4.
Regular expressions
Searching for a piece of text is a common task in computing. At its heart, the search problem is quite simple. We have a collection $X = \{ x_0, \ldots, x_k \}$ of strings (e.g., files on a hard-drive, or student records in a database), and the user wants to find out the subset of all the $x \in X$ that are matched by some pattern (e.g., all files whose names end with the string .txt). In full generality, we can allow the user to specify the pattern by specifying a (computable) function $F:\{0,1\}^* \rightarrow \{0,1\}$, where $F(x)=1$ corresponds to the pattern matching $x$. That is, the user provides a program $P$ in a programming language such as Python, and the system returns all $x \in X$ such that $P(x)=1$. For example, one could search for all text files that contain the string important document or perhaps (letting $P$ correspond to a neural-network based classifier) all images that contain a cat. However, we don’t want our system to get into an infinite loop just trying to evaluate the program $P$! For this reason, typical systems for searching files or databases do not allow users to specify the patterns using full-fledged programming languages. Rather, such systems use restricted computational models that on the one hand are rich enough to capture many of the queries needed in practice (e.g., all filenames ending with .txt, or all phone numbers of the form (617) xxx-xxxx), but on the other hand are restricted enough so that queries can be evaluated very efficiently on huge files and in particular cannot result in an infinite loop.
One of the most popular such computational models is regular expressions. If you ever used an advanced text editor, a command-line shell, or have done any kind of manipulation of text files, then you have probably come across regular expressions.
A regular expression over some alphabet $\Sigma$ is obtained by combining elements of $\Sigma$ with the operation of concatenation, as well as $|$ (corresponding to or) and $*$ (corresponding to repetition zero or more times). (Common implementations of regular expressions in programming languages and shells typically include some extra operations on top of $|$ and $*$, but these operations can be implemented as “syntactic sugar” using the operators $|$ and $*$.) For example, the following regular expression over the alphabet $\{0,1\}$ corresponds to the set of all strings $x\in \{0,1\}^*$ where every digit is repeated at least twice:
$$(00(0^*)|11(1^*))^*$$
The following regular expression over the alphabet $\{ a,\ldots,z,0,\ldots,9 \}$ corresponds to the set of all strings that consist of a sequence of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero):
$$(a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^*$$
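Patterns built only from concatenation, $|$, and $*$ can be tried out directly in Python, since the standard re module uses the same symbols for these three operations. The sketch below assumes the two expressions are $(00(0^*)|11(1^*))^*$ and $(a|b|c|d)(a|b|c|d)^*(1|\cdots|9)(0|\cdots|9)^*$, and uses re.fullmatch so that the whole string must match, mirroring the chapter's notion of matching. The names doubled, ident, and matches are hypothetical.

```python
import re

# Every digit is repeated at least twice.
doubled = re.compile(r"(00(0*)|11(1*))*")

# One or more of the letters a-d, then one or more digits with no leading zero.
ident = re.compile(r"(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*")


def matches(pattern, s):
    """True iff the entire string s is matched by the compiled pattern."""
    return pattern.fullmatch(s) is not None
```

Python's re engine also supports many extensions beyond these three operators, so it should be seen as a convenient testing tool rather than as the formal model defined in this section.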
Formally, regular expressions are defined by the following recursive definition:
A regular expression $e$ over an alphabet $\Sigma$ is a string over $\Sigma \cup \{ (,),|,*,\emptyset, "" \}$ that has one of the following forms:
$e = \sigma$ where $\sigma \in \Sigma$.
$e = (e' | e'')$ where $e', e''$ are regular expressions.
$e = (e')(e'')$ where $e', e''$ are regular expressions. (We often drop the parentheses when there is no danger of confusion and so write this as $e'\, e''$.)
$e = (e')^*$ where $e'$ is a regular expression.
Finally we also allow the following “edge cases”: $e = \emptyset$ and $e = ""$. These are the regular expressions corresponding to accepting no strings, and accepting only the empty string respectively.
We will drop parentheses when they can be inferred from the context. We also use the convention that OR and concatenation are left-associative, and we give highest precedence to $*$, then concatenation, and then OR. Thus for example we write $00^*|11$ instead of $((0)(0^*))|((1)(1))$.
Every regular expression $e$ corresponds to a function $\Phi_{e}:\Sigma^* \rightarrow \{0,1\}$ where $\Phi_{e}(x)=1$ if $x$ matches the regular expression. For example, if $e = (00|11)^*$ then $\Phi_e(110011)=1$ but $\Phi_e(101)=0$ (can you see why?).
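The formal definition of $\Phi_{e}$ is one of those definitions that is more cumbersome to write than to grasp. Thus it might be easier for you first to work out the definition on your own, and then check that it matches what is written below.
Let $e$ be a regular expression over the alphabet $\Sigma$. The function $\Phi_{e}:\Sigma^* \rightarrow \{0,1\}$ is defined as follows:
If $e = \sigma$ then $\Phi_{e}(x)=1$ iff $x=\sigma$.
If $e = (e' | e'')$ then $\Phi_{e}(x) = \Phi_{e'}(x) \vee \Phi_{e''}(x)$ where $\vee$ is the OR operator.
If $e = (e')(e'')$ then $\Phi_{e}(x) = 1$ iff there is some $x',x'' \in \Sigma^*$ such that $x$ is the concatenation of $x'$ and $x''$ and $\Phi_{e'}(x')=\Phi_{e''}(x'')=1$.
If $e = (e')^*$ then $\Phi_{e}(x)=1$ iff there is some $k\in \mathbb{N}$ and some $x_0,\ldots,x_{k-1} \in \Sigma^*$ such that $x$ is the concatenation $x_0 \cdots x_{k-1}$ and $\Phi_{e'}(x_i)=1$ for every $i\in [k]$.
Finally, for the edge cases $\Phi_{\emptyset}$ is the constant zero function, and $\Phi_{""}$ is the function that only outputs $1$ on the empty string $""$.
We say that a regular expression $e$ over $\Sigma$ matches a string $x \in \Sigma^*$ if $\Phi_{e}(x)=1$.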
The definitions above are not inherently difficult but are a bit cumbersome. So you should pause here and go over it again until you understand why it corresponds to our intuitive notion of regular expressions. This is important not just for understanding regular expressions themselves (which are used time and again in a great many applications) but also for getting better at understanding recursive definitions in general.
A Boolean function is called “regular” if it outputs on precisely the set of strings that are matched by some regular expression. That is,
Let $\Sigma$ be a finite set and $F:\Sigma^* \rightarrow \{0,1\}$ be a Boolean function. We say that $F$ is regular if $F=\Phi_{e}$ for some regular expression $e$.
Similarly, for every formal language $L \subseteq \Sigma^*$, we say that $L$ is regular if and only if there is a regular expression $e$ such that $x\in L$ iff $e$ matches $x$.
Let $\Sigma=\{ a,b,c,d,0,1,2,3,4,5,6,7,8,9 \}$ and $F:\Sigma^* \rightarrow \{0,1\}$ be the function such that $F(x)$ outputs $1$ iff $x$ consists of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero). Then $F$ is a regular function, since $F=\Phi_e$ where
$$e = (a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^*$$
If we wanted to verify, for example, that $\Phi_e(abc12078)=1$, we can do so by noticing that the expression $(a|b|c|d)$ matches the string $a$, $(a|b|c|d)^*$ matches $bc$, $(1|2|3|4|5|6|7|8|9)$ matches the string $1$, and the expression $(0|1|2|3|4|5|6|7|8|9)^*$ matches the string $2078$. Each one of those boils down to a simpler expression. For example, the expression $(a|b|c|d)^*$ matches the string $bc$ because both of the one-character strings $b$ and $c$ are matched by the expression $a|b|c|d$.
Regular expressions can be defined over any finite alphabet $\Sigma$, but as usual, we will mostly focus our attention on the binary case, where $\Sigma = \{0,1\}$. Most (if not all) of the theoretical and practical general insights about regular expressions can be gleaned from studying the binary case.
Algorithms for matching regular expressions
Regular expressions would not be very useful for search if we could not evaluate, given a regular expression $e$, whether a string $x$ is matched by $e$. Luckily, there is an algorithm to do so. Specifically, there is an algorithm (think “Python program” though later we will formalize the notion of algorithms using Turing machines) that on input a regular expression $e$ over the alphabet $\Sigma$ and a string $x\in \Sigma^*$, outputs $1$ iff $e$ matches $x$ (i.e., outputs $\Phi_e(x)$).
Indeed, Definition 6.7 actually specifies a recursive algorithm for computing $\Phi_{e}$. Specifically, each one of our operations - concatenation, OR, and star - can be thought of as reducing the task of testing whether an expression $e$ matches a string $x$ to testing whether some sub-expressions of $e$ match substrings of $x$. Since these sub-expressions are always shorter than the original expression, this yields a recursive algorithm for checking if $e$ matches $x$, which will eventually terminate at the base cases of the expressions that correspond to a single symbol or the empty string.
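Algorithm 6.10 Regular expression matching
Input: Regular expression $e$ over $\Sigma$, $x\in \Sigma^*$
Output: $\Phi_e(x)$
Procedure Match($e$, $x$)
    if {$e = \emptyset$} return $0$ ;
    if {$x = ""$} return MatchEmpty($e$) ;
    if {$e \in \Sigma$} return $1$ iff $x=e$ ;
    if {$e = (e' | e'')$} return {Match($e'$, $x$) or Match($e''$, $x$)} ;
    if {$e = (e')(e'')$}
        for {$i \in [|x|+1]$}
            if {Match($e'$, $x_0 \cdots x_{i-1}$) and Match($e''$, $x_i \cdots x_{|x|-1}$)} return $1$ ;
        endfor
    endif
    if {$e = (e')^*$}
        if {$e' = ""$} return Match($""$, $x$) ;
        # $("")^*$ is the same as $""$
        for {$i \in [|x|]$}
            # $x_0 \cdots x_{i-1}$ is shorter than $x$
            if {Match($e$, $x_0 \cdots x_{i-1}$) and Match($e'$, $x_i \cdots x_{|x|-1}$)} return $1$ ;
        endfor
    endif
    return $0$
endproc
We assume above that we have a procedure MatchEmpty that on input a regular expression $e$ outputs $1$ if and only if $e$ matches the empty string $""$.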
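The recursive matching procedure can be sketched in Python as follows. This is an illustration under our own assumptions, not the book's reference implementation: expressions are represented as nested tuples such as ("or", e1, e2), ("cat", e1, e2), ("star", e1), ("sym", c), with ("empty",) standing for $""$ and ("null",) for $\emptyset$. All of these names, and the function names match_regex and match_empty, are hypothetical.

```python
EMPTY = ("empty",)  # the expression "" (matches only the empty string)
NULL = ("null",)    # the expression for the empty set (matches nothing)


def match_empty(e):
    """True iff the expression e matches the empty string."""
    kind = e[0]
    if kind in ("empty", "star"):
        return True
    if kind in ("null", "sym"):
        return False
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])  # kind == "cat"


def match_regex(e, x):
    """True iff expression e matches the whole string x (naive recursion)."""
    kind = e[0]
    if kind == "null":
        return False
    if x == "":
        return match_empty(e)
    if kind == "empty":
        return False  # x is non-empty here
    if kind == "sym":
        return x == e[1]
    if kind == "or":
        return match_regex(e[1], x) or match_regex(e[2], x)
    if kind == "cat":
        # Try every split x = prefix + suffix, including empty parts.
        return any(match_regex(e[1], x[:i]) and match_regex(e[2], x[i:])
                   for i in range(len(x) + 1))
    # kind == "star": the last block x[i:] is non-empty and matched by e',
    # and the strictly shorter prefix x[:i] is matched by e itself.
    return any(match_regex(e, x[:i]) and match_regex(e[1], x[i:])
               for i in range(len(x)))


# The expression (00|11)* from the text.
E = ("star", ("or", ("cat", ("sym", "0"), ("sym", "0")),
                    ("cat", ("sym", "1"), ("sym", "1"))))
```

Every recursive call is made either on a smaller expression or on a strictly shorter string with the same expression, which is why the recursion terminates. As in the pseudocode, no attempt is made at efficiency.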
The key observation is that in our recursive definition of regular expressions, whenever $e$ is made up of one or two expressions $e',e''$ then these two regular expressions are smaller than $e$. Eventually (when they have size $1$) they must correspond to the non-recursive case of a single alphabet symbol. Correspondingly, the recursive calls made in Algorithm 6.10 always correspond to a shorter expression or (in the case of an expression of the form $(e')^*$) a shorter input string. Thus, we can prove the correctness of Algorithm 6.10 on inputs of the form $(e,x)$ by induction over $\min \{ |e|, |x| \}$. The base case is when either $x=""$ or $e$ is a single alphabet symbol, $""$ or $\emptyset$. In the case the expression is of the form $e=(e'|e'')$ or $e=(e')(e'')$, we make recursive calls with the shorter expressions $e',e''$. In the case the expression is of the form $e=(e')^*$, we make recursive calls with either a shorter string $x$ and the same expression, or with the shorter expression $e'$ and a string $x'$ that is equal in length or shorter than $x$.
Give an algorithm that on input a regular expression $e$, outputs $1$ if and only if $\Phi_e("")=1$.
We can obtain such a recursive algorithm by using the following observations:
An expression of the form $""$ or $(e')^*$ always matches the empty string.
An expression of the form $\sigma$, where $\sigma \in \Sigma$ is an alphabet symbol, never matches the empty string.
The regular expression $\emptyset$ does not match the empty string.
An expression of the form $e'|e''$ matches the empty string if and only if one of $e'$ or $e''$ matches it.
An expression of the form $(e')(e'')$ matches the empty string if and only if both $e'$ and $e''$ match it.
Given the above observations, we see that the following algorithm will check if matches the empty string:
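Algorithm 6.11 Check for empty string
Input: Regular expression $e$ over $\Sigma$
Output: $1$ iff $e$ matches the empty string.
Procedure MatchEmpty($e$)
    if {$e = ""$} return $1$ ;
    if {$e = \emptyset$ or $e \in \Sigma$} return $0$ ;
    if {$e = (e'|e'')$} return MatchEmpty($e'$) or MatchEmpty($e''$) ;
    if {$e = (e')(e'')$} return MatchEmpty($e'$) and MatchEmpty($e''$) ;
    if {$e = (e')^*$} return $1$ ;
endproc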
Efficient matching of regular expressions (optional)
Algorithm 6.10 is not very efficient. For example, given an expression involving concatenation or the “star” operation and a string of length $n$, it can make $n$ recursive calls, and hence it can be shown that in the worst case Algorithm 6.10 can take time exponential in the length of the input string $x$. Fortunately, it turns out that there is a much more efficient algorithm that can match regular expressions in linear (i.e., $O(n)$) time. Since we have not yet covered the topics of time and space complexity, we describe this algorithm in high level terms, without making the computational model precise. Rather we will use the colloquial notion of running time as used in introduction to programming courses and whiteboard coding interviews. We will see a formal definition of time complexity in Chapter 13.
Let $e$ be a regular expression. Then there is an $O(n)$ time algorithm that computes $\Phi_{e}$.
The implicit constant in the $O(n)$ term of Theorem 6.12 depends on the expression $e$. Thus, another way to state Theorem 6.12 is that for every expression $e$, there is some constant $c$ and an algorithm $A$ that computes $\Phi_e$ on $n$-bit inputs using at most $c\cdot n$ steps. This makes sense since in practice we often want to compute $\Phi_e(x)$ for a small regular expression $e$ and a large document $x$. Theorem 6.12 tells us that we can do so with running time that scales linearly with the size of the document, even if it has (potentially) worse dependence on the size of the regular expression.
We prove Theorem 6.12 by obtaining a more efficient recursive algorithm that determines whether $e$ matches a string $x\in \{0,1\}^n$ by reducing this task to determining whether a related expression $e'$ matches $x_0,\ldots,x_{n-2}$. This will result in an expression for the running time of the form $T(n) = T(n-1) + O(1)$, which solves to $T(n)=O(n)$.
Restrictions of regular expressions. The central definition for the algorithm behind Theorem 6.12 is the notion of a restriction of a regular expression. The idea is that for every regular expression $e$ and symbol $\sigma$ in its alphabet, it is possible to define a regular expression $e[\sigma]$ such that $e[\sigma]$ matches a string $x$ if and only if $e$ matches the string $x\sigma$. For example, if $e$ is the regular expression $01|(01)^*(01)$ (i.e., one or more occurrences of $01$) then $e[1]$ is equal to $0|(01)^*0$ and $e[0]$ will be $\emptyset$. (Can you see why?)
Algorithm 6.13 computes the restriction $e[\sigma]$ given a regular expression $e$ and an alphabet symbol $\sigma$. It always terminates, since the recursive calls it makes are always on expressions smaller than the input expression. Its correctness can be proven by induction on the length of the regular expression $e$, with the base cases being when $e$ is $""$, $\emptyset$, or a single alphabet symbol $\tau$.
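Algorithm 6.13 Restricting regular expression
Input: Regular expression $e$ over $\Sigma$, symbol $\sigma \in \Sigma$
Output: Regular expression $e'=e[\sigma]$ such that $\Phi_{e'}(x) = \Phi_e(x \sigma)$ for every $x\in \Sigma^*$
Procedure Restrict($e$, $\sigma$)
    if {$e = ""$ or $e = \emptyset$} return $\emptyset$ ;
    if {$e = \tau$ for $\tau \in \Sigma$} return $""$ if $\tau=\sigma$ and return $\emptyset$ otherwise ;
    if {$e = (e'|e'')$} return (Restrict($e'$, $\sigma$) $|$ Restrict($e''$, $\sigma$)) ;
    if {$e = (e')^*$} return $(e')^*$(Restrict($e'$, $\sigma$)) ;
    if {$e = (e')(e'')$ and $\Phi_{e''}("")=0$} return $(e')$(Restrict($e''$, $\sigma$)) ;
    if {$e = (e')(e'')$ and $\Phi_{e''}("")=1$} return ($(e')$(Restrict($e''$, $\sigma$)) $|$ Restrict($e'$, $\sigma$)) ;
endproc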
Using this notion of restriction, we can define the following recursive algorithm for regular expression matching:
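Algorithm 6.14 Regular expression matching in linear time
Input: Regular expression $e$ over $\Sigma$, $x\in \Sigma^n$ where $n\in \mathbb{N}$
Output: $\Phi_e(x)$
Procedure FMatch($e$, $x$)
    if {$x = ""$} return MatchEmpty($e$) ;
    Let $e' \leftarrow$ Restrict($e$, $x_{n-1}$)
    return FMatch($e'$, $x_0 \cdots x_{n-2}$)
endproc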
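The restriction-based matcher can be sketched in Python as follows. As before, this is an illustration under our own assumptions: expressions are nested tuples, and the helper names (sym, star, alt, cat, restrict, fmatch, match_empty) are hypothetical. The constructors alt and cat apply a few semantics-preserving simplifications (for example, dropping $\emptyset$ from an OR), which is a choice of this sketch and not part of Algorithm 6.13. The sketch does not deduplicate intermediate expressions in general, so it demonstrates the correctness of the restriction idea rather than guaranteeing the linear running time analyzed in the text.

```python
EMPTY = ("empty",)  # the expression "" (matches only the empty string)
NULL = ("null",)    # the expression for the empty set (matches nothing)


def sym(c):
    return ("sym", c)


def star(e):
    return ("star", e)


def alt(a, b):
    """OR of two expressions, with simple semantics-preserving shortcuts."""
    if a == NULL:
        return b
    if b == NULL or a == b:
        return a
    return ("or", a, b)


def cat(a, b):
    """Concatenation of two expressions, with simple shortcuts."""
    if a == NULL or b == NULL:
        return NULL
    if a == EMPTY:
        return b
    if b == EMPTY:
        return a
    return ("cat", a, b)


def match_empty(e):
    """True iff e matches the empty string."""
    kind = e[0]
    if kind in ("empty", "star"):
        return True
    if kind in ("null", "sym"):
        return False
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])  # kind == "cat"


def restrict(e, sigma):
    """Return an expression matching x exactly when e matches x followed by sigma."""
    kind = e[0]
    if kind in ("empty", "null"):
        return NULL
    if kind == "sym":
        return EMPTY if e[1] == sigma else NULL
    if kind == "or":
        return alt(restrict(e[1], sigma), restrict(e[2], sigma))
    if kind == "star":
        return cat(e, restrict(e[1], sigma))
    # kind == "cat"
    r = cat(e[1], restrict(e[2], sigma))
    if match_empty(e[2]):
        r = alt(r, restrict(e[1], sigma))
    return r


def fmatch(e, x):
    """Peel symbols off the end of x, restricting e each time (loop form of the recursion)."""
    for sigma in reversed(x):
        e = restrict(e, sigma)
    return match_empty(e)


ZERO_ONE = cat(sym("0"), sym("1"))
E1 = alt(ZERO_ONE, cat(star(ZERO_ONE), ZERO_ONE))                        # 01|(01)*(01)
E2 = star(alt(cat(sym("0"), sym("0")), cat(sym("1"), sym("1"))))         # (00|11)*
```

The loop in fmatch replaces the tail recursion of the pseudocode, which avoids Python's recursion limit on long inputs while performing the same sequence of restrictions.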
By the definition of a restriction, for every and , the expression matches if and only if matches . Hence for every and , and Algorithm 6.14 does return the correct answer. The only remaining task is to analyze its running time. Note that Algorithm 6.14 uses the procedure of Solved Exercise 6.3 in the base case that . However, this is OK since this procedure’s running time depends only on and is independent of the length of the original input.
For simplicity, let us restrict our attention to the case that the alphabet is equal to . Define to be the maximum number of operations that Algorithm 6.13 takes when given as input a regular expression over of at most symbols. The value can be shown to be polynomial in , though this is not important for this theorem, since we only care about the dependence of the time to compute on the length of and not about the dependence of this time on the length of .
Algorithm 6.14 is a recursive algorithm that, on input an expression $e$ and a string $x \in \{0,1\}^n$, performs at most $C(|e|)$ steps of computation and then calls itself with input some expression $e'$ and a string $x'$ of length $n-1$. It will terminate after $n$ steps when it reaches a string of length $0$. So, the running time $T(e,n)$ that it takes for Algorithm 6.14 to compute $\Phi_e$ for inputs of length $n$ satisfies the recursive equation:
$$T(e,n) = \max\{ T(e[0],n-1),\; T(e[1],n-1) \} + C(|e|) \qquad (6.2)$$
(In the base case $n=0$, $T(e,0)$ is equal to some constant depending only on $e$.) To get some intuition for Equation 6.2, let us open up the recursion for one level, writing $T(e,n)$ as
$$T(e,n) = \max\{\, T(e[0][0],n-2) + C(|e[0]|),\; T(e[0][1],n-2) + C(|e[0]|),\; T(e[1][0],n-2) + C(|e[1]|),\; T(e[1][1],n-2) + C(|e[1]|) \,\} + C(|e|).$$
Continuing this way, we can see that $T(e,n) \leq n \cdot C(L) + O(1)$ where $L$ is the largest length of any expression $e'$ that we encounter along the way. Therefore, the following claim suffices to show that Algorithm 6.14 runs in $O(n)$ time:
Claim: Let $e$ be a regular expression over $\{0,1\}$, then there is a number $L(e) \in \mathbb{N}$, such that for every sequence of symbols $\alpha_0,\ldots,\alpha_{n-1}$, if we define $e' = e[\alpha_0][\alpha_1]\cdots[\alpha_{n-1}]$ (i.e., restricting $e$ to $\alpha_0$, and then $\alpha_1$ and so on and so forth), then $|e'| \leq L(e)$.
Proof of claim: For a regular expression $e$ over $\{0,1\}$ and $\alpha \in \{0,1\}^m$, we denote by $e[\alpha]$ the expression $e[\alpha_0][\alpha_1]\cdots[\alpha_{m-1}]$ obtained by restricting $e$ to $\alpha_0$ and then to $\alpha_1$ and so on. We let $S(e) = \{ e[\alpha] \mid \alpha \in \{0,1\}^* \}$. We will prove the claim by showing that for every $e$, the set $S(e)$ is finite, and hence so is the number $L(e)$ which is the maximum length of $e'$ for $e' \in S(e)$.
We prove this by induction on the structure of $e$. If $e$ is a symbol, the empty string, or the empty set, then this is straightforward to show as the most expressions $S(e)$ can contain are the expression itself, $""$, and $\emptyset$. Otherwise we split into the two cases (i) $e = (e')^*$ and (ii) $e = e'e''$, where $e',e''$ are smaller expressions (and hence by the induction hypothesis $S(e')$ and $S(e'')$ are finite). In the case (i), if $e = (e')^*$ then $e[\alpha]$ is either equal to $(e')^* e'[\alpha]$ or it is simply the empty set if $e'[\alpha] = \emptyset$. Since $e'[\alpha]$ is in the set $S(e')$, the number of distinct expressions in $S(e)$ is at most $|S(e')|+1$. In the case (ii), if $e = e'e''$ then all the restrictions of $e$ to strings $\alpha$ will either have the form $e'e''[\alpha]$ or the form $e'e''[\alpha] \,|\, e'[\alpha']$ where $\alpha'$ is some string such that $\alpha = \alpha'\alpha''$ and $e''[\alpha'']$ matches the empty string. Since $e''[\alpha] \in S(e'')$ and $e'[\alpha'] \in S(e')$, the number of the possible distinct expressions of the form $e[\alpha]$ is at most $|S(e'')| + |S(e'')| \cdot |S(e')|$. This completes the proof of the claim.
The bottom line is that while running Algorithm 6.14 on a regular expression $e$, all the expressions we ever encounter are in the finite set $S(e)$, no matter how large the input $x$ is, and so the running time of Algorithm 6.14 satisfies the equation $T(n) = T(n-1) + C'$ for some constant $C'$ depending on $e$. This solves to $O(n)$ where the implicit constant in the O notation can (and will) depend on $e$ but crucially, not on the length of the input $x$.
Matching regular expressions using DFAs
Theorem 6.12 is already quite impressive, but we can do even better. Specifically, no matter how long the string $x$ is, we can compute $\Phi_e(x)$ by maintaining only a constant amount of memory and moreover making a single pass over $x$. That is, the algorithm will scan the input $x$ once from start to finish, and then determine whether or not $x$ is matched by the expression $e$. This is important in the common case of trying to match a short regular expression over a huge file or document that might not even fit in our computer’s memory. Of course, as we have seen before, a single-pass constant-memory algorithm is simply a deterministic finite automaton. As we will see in Theorem 6.17, a function can be computed by a regular expression if and only if it can be computed by a DFA. We start with showing the “only if” direction:
Let $e$ be a regular expression. Then there is an algorithm that on input $x \in \{0,1\}^*$ computes $\Phi_e(x)$ while making a single pass over $x$ and maintaining a constant amount of memory.
The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.16. The idea is to replace the recursive algorithm of Algorithm 6.14 with a dynamic program, using the technique of memoization. If you haven’t yet taken an algorithms course, you might not know these techniques. This is OK; while this more efficient algorithm is crucial for the many practical applications of regular expressions, it is not of great importance for this book.
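Algorithm 6.16 Regular expression matching by a DFA
Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^n$ where $n \in \mathbb{N}$
Output: $\Phi_e(x)$
Procedure DFAMatch($e$,$x$)
Let $S \leftarrow S(e)$ be the set $\{ e[\alpha] \mid \alpha \in \{0,1\}^* \}$ as defined in the proof of the linear-time matching theorem (Theorem 6.12).
for {$e' \in S$}
Let $v_{e'} \leftarrow 1$ if $\Phi_{e'}("") = 1$ and $v_{e'} \leftarrow 0$ otherwise
endfor
for {$i \in [n]$}
Let $last_{e'} \leftarrow v_{e'}$ for all $e' \in S$
Let $v_{e'} \leftarrow last_{e'[x_i]}$ for all $e' \in S$
endfor
return $v_e$
endproc
Algorithm 6.16 checks if a given string $x \in \Sigma^*$ is matched by the regular expression $e$. For every regular expression $e$, this algorithm has a constant number of Boolean variables (specifically a variable $v_{e'}$ for every $e' \in S(e)$ and a variable $last_{e'}$ for every $e'$ in $S(e)$, using the fact that $e'[\sigma]$ is in $S(e)$ for every $e' \in S(e)$). It makes a single pass over the input string. Hence it corresponds to a DFA. We prove its correctness by induction on the length $n$ of the input. Specifically, we will argue that before reading $x_i$, the variable $v_{e'}$ is equal to $\Phi_{e'}(x_0 \cdots x_{i-1})$ for every $e' \in S(e)$. In the case $i=0$ this holds since we initialize $v_{e'} = \Phi_{e'}("")$ for all $e' \in S(e)$. For $i>0$ this holds by induction since the inductive hypothesis implies that $last_{e'} = \Phi_{e'}(x_0 \cdots x_{i-2})$ for all $e' \in S(e)$ and by the definition of the set $S(e)$, for every $e' \in S(e)$ and $x_{i-1} \in \Sigma$, $e'' = e'[x_{i-1}]$ is in $S(e)$ and $\Phi_{e'}(x_0 \cdots x_{i-1}) = \Phi_{e''}(x_0 \cdots x_{i-2})$.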
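A minimal Python sketch of this single-pass matcher follows. It is an illustration under assumptions, not the book's code: the tuple encoding and the helper names (`alt`, `cat`, `build_table`, `dfa_match`) are hypothetical. The constructors normalize expressions by flattening unions, dropping $\emptyset$, and absorbing $""$, so that equal restrictions compare equal and the computed set $S(e)$ stays finite.

```python
EMPTY, EPS = ("empty",), ("eps",)


def sym(c):
    return ("sym", c)


def star(a):
    return ("star", a)


def alt(a, b):
    """Union a|b, stored as a flattened frozenset of alternatives."""
    parts = set()
    for e in (a, b):
        if e[0] == "or":
            parts |= e[1]
        elif e != EMPTY:
            parts.add(e)
    if not parts:
        return EMPTY
    if len(parts) == 1:
        return next(iter(parts))
    return ("or", frozenset(parts))


def cat(a, b):
    """Concatenation, simplified when a part is the empty set or ""."""
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("cat", a, b)


def match_empty(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "or":
        return any(match_empty(p) for p in e[1])
    return match_empty(e[1]) and match_empty(e[2])


def restrict(e, s):
    """e[s]: matches x iff e matches x followed by the symbol s."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == s else EMPTY
    if tag == "or":
        out = EMPTY
        for p in e[1]:
            out = alt(out, restrict(p, s))
        return out
    if tag == "star":
        return cat(e, restrict(e[1], s))
    out = cat(e[1], restrict(e[2], s))
    if match_empty(e[2]):
        out = alt(out, restrict(e[1], s))
    return out


def build_table(e, alphabet):
    """Map every e' in S(e) to its restrictions {s: e'[s]}."""
    table, todo = {}, [e]
    while todo:
        cur = todo.pop()
        if cur not in table:
            table[cur] = {s: restrict(cur, s) for s in alphabet}
            todo.extend(table[cur].values())
    return table


def dfa_match(e, x, alphabet="01"):
    table = build_table(e, alphabet)           # depends on e only, not on x
    v = {ep: match_empty(ep) for ep in table}  # v[e'] = does e' match ""?
    for s in x:                                # single left-to-right pass
        last = v
        v = {ep: last[table[ep][s]] for ep in table}
    return v[e]


ends_with_one = cat(star(alt(sym("0"), sym("1"))), sym("1"))  # (0|1)*1
print(dfa_match(ends_with_one, "0101"), dfa_match(ends_with_one, "0110"))
```

The dictionary `v` plays the role of the DFA state: its size depends only on the expression, and each input symbol triggers one table lookup per entry.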
Equivalence of regular expressions and automata
Recall that a Boolean function $F:\{0,1\}^* \rightarrow \{0,1\}$ is defined to be regular if it is equal to $\Phi_e$ for some regular expression $e$. (Equivalently, a language $L \subseteq \{0,1\}^*$ is defined to be regular if there is a regular expression $e$ such that $e$ matches $x$ iff $x \in L$.) The following theorem is the central result of automata theory:
Let $F:\{0,1\}^* \rightarrow \{0,1\}$. Then $F$ is regular if and only if there exists a DFA $(T,\mathcal{S})$ that computes $F$.
One direction follows from Theorem 6.15, which shows that for every regular expression $e$, the function $\Phi_e$ can be computed by a DFA (see for example Figure 6.6). For the other direction, we show that given a DFA $(T,\mathcal{S})$ with $C$ states, for every $v,w \in [C]$ we can find a regular expression that would match $x \in \{0,1\}^*$ if and only if the DFA, starting in state $v$, will end up in state $w$ after reading $x$.


Since Theorem 6.15 proves the “only if” direction, we only need to show the “if” direction. Let $A=(T,\mathcal{S})$ be a DFA with $C$ states that computes the function $F$. We need to show that $F$ is regular.
For every $v,w \in [C]$, we let $F_{v,w}:\{0,1\}^* \rightarrow \{0,1\}$ be the function that maps $x \in \{0,1\}^*$ to $1$ if and only if the DFA $A$, starting at the state $v$, will reach the state $w$ if it reads the input $x$. We will prove that $F_{v,w}$ is regular for every $v,w$. This will prove the theorem, since by Definition 6.2, $F(x)$ is equal to the OR of $F_{0,w}(x)$ for every $w \in \mathcal{S}$. Hence if we have a regular expression for every function of the form $F_{v,w}$ then (using the $|$ operation), we can obtain a regular expression for $F$ as well.
To give regular expressions for the functions $F_{v,w}$, we start by defining the following functions $F^t_{v,w}$: for every $v,w \in [C]$ and $0 \leq t \leq C$, $F^t_{v,w}(x)=1$ if and only if starting from $v$ and observing $x$, the automaton reaches $w$ with all intermediate states being in the set $[t]=\{0,\ldots,t-1\}$ (see Figure 6.7). That is, while $v,w$ themselves might be outside $[t]$, $F^t_{v,w}(x)=1$ if and only if throughout the execution of the automaton on the input $x$ (when initiated at $v$) it never enters any of the states outside $[t]$ and still ends up at $w$. If $t=0$ then $[t]$ is the empty set, and hence $F^0_{v,w}(x)=1$ if and only if the automaton reaches $w$ from $v$ directly on $x$, without any intermediate state. If $t=C$ then all states are in $[t]$, and hence $F^t_{v,w}=F_{v,w}$.
We will prove the theorem by induction on $t$, showing that $F^t_{v,w}$ is regular for every $t$ and $v,w$. For the base case of $t=0$, $F^0_{v,w}$ is regular for every $v,w$ since it can be described using the expressions $""$, $\emptyset$, $0$, $1$, and $0|1$. Specifically, if $v \neq w$ then $F^0_{v,w}(x)=1$ if and only if $x$ consists of a single symbol $\sigma \in \{0,1\}$ and $T(v,\sigma)=w$. Therefore in this case $F^0_{v,w}$ corresponds to one of the four regular expressions $0|1$, $0$, $1$, or $\emptyset$, depending on whether $A$ transitions to $w$ from $v$ when it reads either $0$ or $1$, only one of these symbols, or neither. If $v=w$ then $F^0_{v,w}(x)=1$ also when $x$ is the empty string, so in this case we take the OR of $""$ with the expression for the symbols (if any) on which $A$ stays at $v$.
Inductive step: Now that we’ve seen the base case, let us prove the general case by induction. Assume, via the induction hypothesis, that for every $v',w' \in [C]$, we have a regular expression $R^t_{v',w'}$ that computes $F^t_{v',w'}$. We need to prove that $F^{t+1}_{v,w}$ is regular for every $v,w$. If the automaton arrives from $v$ to $w$ using the intermediate states $[t+1]$, then it visits the $t$-th state zero or more times. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ without visiting the $t$-th state at all, then $x$ is matched by the regular expression $R^t_{v,w}$. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ while visiting the $t$-th state $k>0$ times, then we can think of this path as:
First travel from $v$ to $t$ using only intermediate states in $[t]$.
Then go from $t$ back to itself $k-1$ times using only intermediate states in $[t]$.
Then go from $t$ to $w$ using only intermediate states in $[t]$.
Therefore in this case the string $x$ is matched by the regular expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$. (See also Figure 6.8.)
Therefore we can compute $F^{t+1}_{v,w}$ using the regular expression
$$R^{t+1}_{v,w} \;=\; R^t_{v,w} \;|\; R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}.$$
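The inductive construction can be sketched in Python as below. This is an illustration under stated assumptions rather than the proof's exact bookkeeping: the function name `dfa_to_regex`, the encoding of the DFA as a dictionary `delta` with start state `0`, and the use of Python's `re` syntax for the output are all choices made for this example. To keep the generated patterns simple, the table `R` tracks only nonempty strings (including self-loop symbols in the base case), and the empty string is added at the end if the start state is accepting.

```python
import re
from itertools import product


def union(a, b):
    """Regex for a|b, where None stands for the empty set."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else f"(?:{a}|{b})"


def concat(a, b, c):
    """Regex for the concatenation abc, or None if any part is the empty set."""
    if a is None or b is None or c is None:
        return None
    return a + b + c


def star(a):
    """Regex for (a)*; the star of the empty set matches only the empty string."""
    return "" if a is None else f"(?:{a})*"


def dfa_to_regex(n, delta, accept, alphabet="01"):
    """Return a pattern for the DFA's language, or None if it accepts nothing."""
    # R[v][w]: pattern for the nonempty strings that take v to w while all
    # intermediate states are smaller than t (None means no such string).
    R = [[None] * n for _ in range(n)]
    for v in range(n):
        for s in alphabet:  # base case t = 0: single symbols only
            w = delta[(v, s)]
            R[v][w] = union(R[v][w], s)
    for t in range(n):      # allow state t as an intermediate state
        R = [[union(R[v][w], concat(R[v][t], star(R[t][t]), R[t][w]))
              for w in range(n)] for v in range(n)]
    result = None
    for w in sorted(accept):
        result = union(result, R[0][w])
    if 0 in accept:         # the empty string is accepted as well
        result = "" if result is None else f"(?:{result})?"
    return result


# Example: a two-state DFA accepting strings with an even number of 1s.
delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
rx = dfa_to_regex(2, delta, accept={0})
for k in range(7):
    for bits in product("01", repeat=k):
        x = "".join(bits)
        assert bool(re.fullmatch(rx, x)) == (x.count("1") % 2 == 0)
```

The patterns produced this way grow quickly with the number of states, which is expected: the construction is meant to show that a regular expression exists, not to produce a short one.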

Closure properties of regular expressions
If $F$ and $G$ are regular functions computed by the expressions $e$ and $f$ respectively, then the expression $e|f$ computes the function $H = F \vee G$ defined as $H(x) = F(x) \vee G(x)$. Another way to say this is that the set of regular functions is closed under the OR operation. That is, if $F$ and $G$ are regular then so is $F \vee G$. An important corollary of Theorem 6.17 is that this set is also closed under the NOT operation:
If $F:\{0,1\}^* \rightarrow \{0,1\}$ is regular then so is the function $\overline{F}$, where $\overline{F}(x) = 1 - F(x)$ for every $x \in \{0,1\}^*$.
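If $F$ is regular then by Theorem 6.17 it can be computed by a DFA $A=(T,\mathcal{S})$ with some $C$ states. But we can then construct a DFA $\overline{A}=(T,[C] \setminus \mathcal{S})$ which does the same computation but flips the set of accepted states. The DFA $\overline{A}$ will compute $\overline{F}$. By Theorem 6.17 this implies that $\overline{F}$ is regular as well.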
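As a small illustration of this proof, the following Python sketch complements a DFA by flipping its accepting set. The encoding (a transition dictionary `delta`, states numbered `0` to `n_states - 1`, start state `0`) and the function names are assumptions made for this example.

```python
def run_dfa(delta, accept, x, start=0):
    """Run the DFA on x and report whether it ends in an accepting state."""
    state = start
    for s in x:
        state = delta[(state, s)]
    return state in accept


def complement_accepting(n_states, accept):
    """Accepting set of the complement DFA: every state not in `accept`."""
    return set(range(n_states)) - set(accept)


# Example: "even number of 1s" and its complement, "odd number of 1s".
delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
even = {0}
odd = complement_accepting(2, even)
print(run_dfa(delta, even, "0110"), run_dfa(delta, odd, "0110"))
```

The transition function is untouched, so both automata follow the same path on every input and only disagree on whether the final state counts as accepting.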
Since $a \wedge b = \overline{\overline{a} \vee \overline{b}}$, Lemma 6.18 implies that the set of regular functions is closed under the AND operation as well. Moreover, since OR, NOT and AND are a universal basis, this set is also closed under NAND, XOR, and any other finite function. That is, we have the following corollary:
Let $f:\{0,1\}^k \rightarrow \{0,1\}$ be any finite Boolean function, and let $F_0,\ldots,F_{k-1} : \{0,1\}^* \rightarrow \{0,1\}$ be regular functions. Then the function $G(x) = f(F_0(x),F_1(x),\ldots,F_{k-1}(x))$ is regular.
This is a direct consequence of the closure of regular functions under OR and NOT (and hence AND), combined with Theorem 4.13, which states that every $f$ can be computed by a Boolean circuit (which is simply a combination of the AND, OR, and NOT operations).
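The corollary can also be seen at the level of automata. The sketch below does not follow the circuit-based proof above; it swaps in the standard product construction, running the $k$ DFAs in lockstep and applying the finite function to their final accept/reject bits. The name `combine` and the `(delta, accept, start)` encoding of a DFA are assumptions made for this example.

```python
def combine(dfas, g):
    """Return a function x -> g(F_0(x), ..., F_{k-1}(x)).

    `dfas` is a list of (delta, accept, start) triples. The tuple of current
    states is a single state of the product automaton, so the combined
    machine still makes one pass and uses constant memory.
    """
    def run(x):
        states = tuple(start for (_, _, start) in dfas)
        for s in x:
            states = tuple(delta[(q, s)]
                           for (delta, _, _), q in zip(dfas, states))
        bits = tuple(int(q in accept)
                     for (_, accept, _), q in zip(dfas, states))
        return g(*bits)
    return run


# Example: XOR of "even number of 1s" and "ends with 1".
even_ones = ({(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}, {0}, 0)
ends_with_one = ({(0, "0"): 0, (0, "1"): 1, (1, "0"): 0, (1, "1"): 1}, {1}, 0)
xor_both = combine([even_ones, ends_with_one], lambda a, b: a ^ b)
print(xor_both("011"), xor_both("01"))
```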
Limitations of regular expressions and the pumping lemma
The efficiency of regular expression matching makes them very useful. This is why operating systems and text editors often restrict their search interface to regular expressions and do not allow searching by specifying an arbitrary function. However, this efficiency comes at a cost. As we have seen, regular expressions cannot compute every function. In fact, there are some very simple (and useful!) functions that they cannot compute. Here is one example:
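Let $\Sigma = \{ (, ) \}$ and $MATCHPAREN:\Sigma^* \rightarrow \{0,1\}$ be the function that given a string of parentheses, outputs $1$ if and only if every opening parenthesis is matched by a corresponding closed one. Then there is no regular expression over $\Sigma$ that computes $MATCHPAREN$.
Lemma 6.20 is a consequence of the following result, which is known as the pumping lemma:
Let $e$ be a regular expression over some alphabet $\Sigma$. Then there is some number $n_0$ such that for every $w \in \Sigma^*$ with $|w| > n_0$ and $\Phi_e(w)=1$, we can write $w=xyz$ for strings $x,y,z \in \Sigma^*$ satisfying the following conditions:
$|y| \geq 1$.
$|xy| \leq n_0$.
$\Phi_e(xy^kz)=1$ for every $k \in \mathbb{N}$.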

The idea behind the proof is the following. Let $n_0$ be twice the number of symbols that are used in the expression $e$, then the only way that there is some $w$ with $|w| > n_0$ and $\Phi_e(w)=1$ is that $e$ contains the $*$ (i.e. star) operator and that there is a non-empty substring $y$ of $w$ that was matched by $(e')^*$ for some sub-expression $e'$ of $e$. We can now repeat $y$ any number of times and still get a matching string. See also Figure 6.9.
The pumping lemma is a bit cumbersome to state, but one way to remember it is that it simply says the following: “if a string matching a regular expression is long enough, one of its substrings must be matched using the $*$ operator”.
To prove the lemma formally, we use induction on the length of the expression. Like all induction proofs, this will be somewhat lengthy, but at the end of the day it directly follows the intuition above that somewhere we must have used the star operation. Reading this proof, and in particular understanding how the formal proof below corresponds to the intuitive idea above, is a very good way to get more comfortable with inductive proofs of this form.
Our inductive hypothesis is that for an $n$ length expression, $n_0 = 2n$ satisfies the conditions of the lemma. The base case is when the expression is a single symbol $\sigma \in \Sigma$ or that the expression is $\emptyset$ or $""$. In all these cases the conditions of the lemma are satisfied simply because $n_0 = 2$, and there exists no string $x$ of length larger than $n_0$ that is matched by the expression.
We now prove the inductive step. Let $e$ be a regular expression with $n>1$ symbols. We set $n_0 = 2n$ and let $w \in \Sigma^*$ be a string satisfying $|w| > n_0$. Since $e$ has more than one symbol, it has one of the forms (a) $e' | e''$, (b) $(e')(e'')$, or (c) $(e')^*$, where in all these cases the subexpressions $e'$ and $e''$ have fewer symbols than $e$ and hence satisfy the induction hypothesis.
In the case (a), every string $w$ matched by $e$ must be matched by either $e'$ or $e''$. If $e'$ matches $w$ then, since $|w| > 2|e'|$, by the induction hypothesis there exist $x,y,z$ with $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such that $e'$ (and therefore also $e = e'|e''$) matches $xy^kz$ for every $k$. The same argument works in the case that $e''$ matches $w$.
In the case (b), if $w$ is matched by $(e')(e'')$ then we can write $w = w'w''$ where $e'$ matches $w'$ and $e''$ matches $w''$. We split into subcases. If $|w'| > 2|e'|$ then by the induction hypothesis there exist $x,y,z'$ with $|y| \geq 1$, $|xy| \leq 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^kz'$ for every $k \in \mathbb{N}$. This completes the proof since if we set $z = z'w''$ then we see that $w = w'w'' = xyz$ and $e = (e')(e'')$ matches $xy^kz$ for every $k \in \mathbb{N}$. Otherwise, if $|w'| \leq 2|e'|$ then since $|w| = |w'| + |w''| > n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction hypothesis there exist $x',y,z$ such that $|y| \geq 1$, $|x'y| \leq 2|e''|$ and $e''$ matches $x'y^kz$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that $|xy| \leq |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0$ and on the other hand the expression $e = (e')(e'')$ matches $xy^kz = w'x'y^kz$ for every $k \in \mathbb{N}$.
In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where for every $0 \leq i \leq t$, $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$, then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and $z = w_1 \cdots w_t$ then $|xy| \leq n_0$ and $xy^kz$ is matched by $(e')^*$ for every $k \in \mathbb{N}$.
When an object is recursively defined (as in the case of regular expressions) then it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property $P$, then it is natural to use an inductive step that says that if $o', o'', o'''$ etc. have property $P$ then so does an object $o$ that is obtained by composing them.
Using the pumping lemma, we can easily prove Lemma 6.20 (i.e., the non-regularity of the “matching parenthesis” function):
Suppose, towards a contradiction, that there is an expression $e$ such that $\Phi_e = MATCHPAREN$. Let $n_0$ be the number obtained from Theorem 6.21 and let $w = (^{n_0} )^{n_0}$ (i.e., $n_0$ left parentheses followed by $n_0$ right parentheses). Then we see that if we write $w = xyz$ as in Theorem 6.21, the condition $|xy| \leq n_0$ implies that $y$ consists solely of left parentheses. Hence the string $xy^2z$ will contain more left parentheses than right parentheses. Hence $MATCHPAREN(xy^2z) = 0$ but by the pumping lemma $\Phi_e(xy^2z) = 1$, contradicting our assumption that $\Phi_e = MATCHPAREN$.
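The key step of this argument can be checked mechanically for small values of $n_0$. The Python sketch below (function names are hypothetical) enumerates every way to split $w$ into $xyz$ with $|y| \geq 1$ and $|xy| \leq n_0$, and confirms that pumping $y$ once always breaks the balance. It illustrates the proof for specific values only and is not a substitute for it.

```python
def balanced(w):
    """Return True iff every '(' in w is matched by a later ')'."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0


def every_split_fails(w, n0):
    """True iff x+y+y+z is unbalanced for all w = xyz with |y|>=1, |xy|<=n0."""
    for i in range(n0 + 1):             # i = |x|
        for j in range(i + 1, n0 + 1):  # j = |xy|
            x, y, z = w[:i], w[i:j], w[j:]
            if balanced(x + y + y + z):
                return False
    return True


for n0 in range(1, 8):
    w = "(" * n0 + ")" * n0
    assert balanced(w) and every_split_fails(w, n0)
```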
The pumping lemma is a very useful tool to show that certain functions are not computable by a regular expression. However, it is not an “if and only if” condition for regularity: there are non-regular functions that still satisfy the pumping lemma conditions. To understand the pumping lemma, it is crucial to follow the order of quantifiers in Theorem 6.21. In particular, the number $n_0$ in the statement of Theorem 6.21 depends on the regular expression (in the proof we chose $n_0$ to be twice the number of symbols in the expression). So, if we want to use the pumping lemma to rule out the existence of a regular expression $e$ computing some function $F$, we need to be able to choose an appropriate input $w \in \{0,1\}^*$ that can be arbitrarily large and satisfies $F(w)=1$. This makes sense if you think about the intuition behind the pumping lemma: we need $w$ to be large enough as to force the use of the star operator.

Prove that the following function over the alphabet $\{0,1,;\}$ is not regular: $PAL(w)=1$ if and only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ “reversed”: the string $u_{|u|-1}u_{|u|-2} \cdots u_0$. (The Palindrome function is most often defined without an explicit separator character $;$, but the version with such a separator is a bit cleaner, and so we use it here. This does not make much difference, as one can easily encode the separator as a special binary string instead.)
We use the pumping lemma. Suppose towards a contradiction that there is a regular expression $e$ computing $PAL$, and let $n_0$ be the number obtained by the pumping lemma (Theorem 6.21). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse of the all zero string is the all zero string, $PAL(w)=1$. Now, by the pumping lemma, if $PAL$ is computed by $e$, then we can write $w = xyz$ such that $|xy| \leq n_0$, $|y| \geq 1$, and $PAL(xy^kz)=1$ for every $k \in \mathbb{N}$. In particular, it must hold that $PAL(xz)=1$, but this is a contradiction, since $xz = 0^{n_0-|y|};0^{n_0}$ and so its two parts are not of the same length and in particular are not the reverse of one another.
For yet another example of a pumping-lemma based proof, see Figure 6.10 which illustrates a cartoon of the proof of the non-regularity of the function $F:\{0,1\}^* \rightarrow \{0,1\}$ which is defined as $F(x)=1$ iff $x = 0^n1^n$ for some $n \in \mathbb{N}$ (i.e., $x$ consists of a string of consecutive zeroes, followed by a string of consecutive ones of the same length).
Answering semantic questions about regular expressions
Regular expressions have applications beyond search. For example, regular expressions are often used to define tokens (such as what is a valid variable identifier, or keyword) in the design of parsers, compilers and interpreters for programming languages. Regular expressions have other applications too: for example, in recent years, the world of networking moved from fixed topologies to “software defined networks”. Such networks are routed by programmable switches that can implement policies such as “if packet is secured by SSL then forward it to A, otherwise forward it to B”. To represent such policies we need a language that is on one hand sufficiently expressive to capture the policies we want to implement, but on the other hand sufficiently restrictive so that we can quickly execute them at network speed and also be able to answer questions such as “can C see the packets moved from A to B?”. The NetKAT network programming language uses a variant of regular expressions to achieve precisely that. For this application, it is important that we are not merely able to answer whether an expression $e$ matches a string $x$ but also answer semantic questions about regular expressions such as “do expressions $e$ and $e'$ compute the same function?” and “does there exist a string $x$ that is matched by the expression $e$?”. The following theorem shows that we can answer the latter question:
There is an algorithm that given a regular expression $e$, outputs $1$ if and only if $\Phi_e$ is the constant zero function.
The idea is that we can directly observe this from the structure of the expression. The only way a regular expression $e$ computes the constant zero function is if $e$ has the form $\emptyset$ or is obtained by concatenating $\emptyset$ with other expressions.
Define a regular expression to be “empty” if it computes the constant zero function. Given a regular expression $e$, we can determine if $e$ is empty using the following rules:
If $e$ has the form $\sigma$ or $""$ then it is not empty.
If $e$ is not empty then $e|e'$ is not empty for every $e'$.
The expression $e^*$ is not empty for every $e$, since it always matches the empty string.
If $e$ and $e'$ are both not empty then $e\,e'$ is not empty.
$\emptyset$ is empty.
Using these rules, it is straightforward to come up with a recursive algorithm to determine emptiness.
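One possible recursive algorithm is sketched below in Python. The tuple encoding of expressions and the name `is_empty` are assumptions made for this example. The sketch treats every starred expression as non-empty, which assumes the star operator allows zero repetitions and therefore always matches the empty string.

```python
# Hypothetical encoding: ("empty",), ("eps",), ("sym", c),
# ("or", a, b), ("cat", a, b), ("star", a).

def is_empty(e):
    """Return True iff e matches no string at all."""
    tag = e[0]
    if tag == "empty":
        return True
    if tag in ("eps", "sym", "star"):
        return False                      # each matches at least one string
    if tag == "or":
        return is_empty(e[1]) and is_empty(e[2])
    return is_empty(e[1]) or is_empty(e[2])   # tag == "cat"


print(is_empty(("cat", ("sym", "0"), ("empty",))))   # concatenation with the empty set
print(is_empty(("or", ("empty",), ("sym", "1"))))    # one alternative is non-empty
```

The recursion visits each sub-expression once, so the running time is linear in the length of the expression.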
Using Theorem 6.23, we can obtain an algorithm that determines whether or not two regular expressions and are equivalent, in the sense that they compute the same function.
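Let $REGEQ:\{0,1\}^* \rightarrow \{0,1\}$ be the function that on input (a string representing) a pair of regular expressions $e,e'$, $REGEQ(e,e')=1$ if and only if $\Phi_e = \Phi_{e'}$. Then there is an algorithm that computes $REGEQ$.
The idea is to show that given a pair of regular expressions $e$ and $e'$ we can find an expression $e''$ such that $\Phi_{e''}(x)=1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to determine equivalence of $e$ and $e'$.
We will prove Theorem 6.24 from Theorem 6.23. (The two theorems are in fact equivalent: it is easy to prove Theorem 6.23 from Theorem 6.24, since checking for emptiness is the same as checking equivalence with the expression $\emptyset$.) Given two regular expressions $e$ and $e'$, we will compute an expression $e''$ such that $\Phi_{e''}(x)=1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is equivalent to $e'$ if and only if $e''$ is empty.
We start with the observation that for every bit $a,b \in \{0,1\}$, $a \neq b$ if and only if
$$(a \wedge \overline{b}) \;\vee\; (\overline{a} \wedge b).$$
Hence we need to construct $e''$ such that for every $x$,
$$\Phi_{e''}(x) = \left(\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}\right) \;\vee\; \left(\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)\right).$$
To construct the expression $e''$, we will show how given any pair of expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the expression for $e \vee e'$ is straightforward using the $|$ operation of regular expressions.)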
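Specifically, by Lemma 6.18, regular functions are closed under negation, which means that for every regular expression $e$, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now, for every two expressions $e$ and $e'$, the expression $e \wedge e' = \overline{(\overline{e} \,|\, \overline{e'})}$ computes the AND of the two expressions.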
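The same idea can be carried out directly on automata. The sketch below does not build the expression $e''$ from the proof. It assumes both expressions have already been converted to DFAs (possible by Theorem 6.15) and swaps in a search over pairs of states: the two DFAs differ on some input exactly when a reachable pair has one accepting state and one rejecting state. The name `equivalent` and the `(delta, accept, start)` encoding are assumptions made for this example.

```python
from collections import deque


def equivalent(dfa_a, dfa_b, alphabet="01"):
    """Return True iff the two DFAs accept exactly the same strings."""
    (delta_a, accept_a, start_a), (delta_b, accept_b, start_b) = dfa_a, dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # Some input leads here; the DFAs disagree on it if exactly one accepts.
        if (p in accept_a) != (q in accept_b):
            return False
        for s in alphabet:
            nxt = (delta_a[(p, s)], delta_b[(q, s)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Example: "even number of 1s" with two states, and again by counting mod 4.
even2 = ({(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}, {0}, 0)
even4 = ({(q, s): (q + (s == "1")) % 4 for q in range(4) for s in "01"}, {0, 2}, 0)
print(equivalent(even2, even4))
```

The search visits at most the product of the two state counts, so it always terminates.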
- We model computational tasks on arbitrarily large inputs using infinite functions $F:\{0,1\}^* \rightarrow \{0,1\}^*$.
- Such functions take an arbitrarily long (but still finite!) string as input, and cannot be described by a finite table of inputs and outputs.
- A function with a single bit of output is known as a Boolean function, and the task of computing it is equivalent to deciding a language $L \subseteq \{0,1\}^*$.
- Deterministic finite automata (DFAs) are one simple model for computing (infinite) Boolean functions.
- There are some functions that cannot be computed by DFAs.
- The set of functions computable by DFAs is the same as the set of languages that can be recognized by regular expressions.
Exercises
Suppose that are regular. For each one of the following definitions of the function , either prove that is always regular or give a counterexample for regular that would make not regular.
.
.
where is the reverse of : for .
One among the following two functions that map to can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.
if divides and otherwise.
if and only if and otherwise.
Prove that the following function is not regular. For every , iff is of the form for some .
Prove that the following function is not regular. For every , iff for some .
Bibliographical notes
The relation of regular expressions with finite automata is a beautiful topic, which we only touch upon in this text. It is covered more extensively in (Sipser, 1997), (Hopcroft, Motwani, Ullman, 2014), and (Kozen, 1997). These texts also discuss topics such as non-deterministic finite automata (NFA) and the relation between context-free grammars and pushdown automata.
The automaton of Figure 6.4 was generated using the FSM simulator of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.12 is closely related to the Myhill-Nerode Theorem. One direction of the Myhill-Nerode theorem can be stated as saying that if $e$ is a regular expression then there is at most a finite number of strings $z_0,\ldots,z_{k-1}$ such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i \neq j < k$.
Compiled on 12/06/2023 00:07:03
Copyright 2023, Boaz Barak.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Produced using pandoc and panflute with templates derived from gitbook and bookdown.