\[ \def\sleq{\ensuremath{\preceq}} \def\sgeq{\ensuremath{\succeq}} \def\diag{\ensuremath{\mathrm{diag}}} \def\support{\ensuremath{\mathrm{support}}} \def\zo{\ensuremath{\{0,1\}}} \def\pmo{\ensuremath{\{\pm 1\}}} \def\uppersos{\ensuremath{\overline{\mathrm{sos}}}} \def\lambdamax{\ensuremath{\lambda_{\mathrm{max}}}} \def\rank{\ensuremath{\mathrm{rank}}} \def\Mslow{\ensuremath{M_{\mathrm{slow}}}} \def\Mfast{\ensuremath{M_{\mathrm{fast}}}} \def\Mdiag{\ensuremath{M_{\mathrm{diag}}}} \def\Mcross{\ensuremath{M_{\mathrm{cross}}}} \def\eqdef{\ensuremath{ =^{def}}} \def\threshold{\ensuremath{\mathrm{threshold}}} \def\vbls{\ensuremath{\mathrm{vbls}}} \def\cons{\ensuremath{\mathrm{cons}}} \def\edges{\ensuremath{\mathrm{edges}}} \def\cl{\ensuremath{\mathrm{cl}}} \def\xor{\ensuremath{\oplus}} \def\1{\ensuremath{\mathrm{1}}} \notag \]
\[ \newcommand{\transpose}[1]{\ensuremath{#1{}^{\mkern-2mu\intercal}}} \newcommand{\dyad}[1]{\ensuremath{#1#1{}^{\mkern-2mu\intercal}}} \newcommand{\nchoose}[1]{\ensuremath{{n \choose #1}}} \newcommand{\generated}[1]{\ensuremath{\langle #1 \rangle}} \notag \]

★ See also the PDF version of this chapter (better formatting/references) ★

See any bugs/typos/confusing explanations? Open a GitHub issue. You can also comment below

Mathematical Background

  • Recall basic mathematical notions such as sets, functions, numbers, logical operators and quantifiers, strings, and graphs.
  • Rigorously define Big-\(O\) notation.
  • Proofs by induction.
  • Practice with reading mathematical definitions, statements, and proofs.
  • Transform an intuitive argument into a rigorous proof.

“When you have mastered numbers, you will in fact no longer be reading numbers, any more than you read words when reading books. You will be reading meanings.”, W. E. B. Du Bois

“I found that every number, which may be expressed from one to ten, surpasses the preceding by one unit: afterwards the ten is doubled or tripled … until a hundred; then the hundred is doubled and tripled in the same manner as the units and the tens … and so forth to the utmost limit of numeration.”, Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Fredric Rosen, 1831.

In this chapter, we review some of the mathematical concepts that we will use in this course. Most of these are not very complicated, but do require some practice and exercise to get comfortable with. If you have not previously encountered some of these concepts, there are several excellent freely-available resources online that cover them. In particular, the CS 121 webpage contains a program for self study of all the needed notions using the lecture notes, videos, and assignments of MIT course 6.042j Mathematics for Computer science. (The MIT lecture notes were also used in the past in Harvard CS 20.)

A mathematician’s apology

Before explaining the math background, perhaps I should explain why is this course so “mathematically heavy”. After all, this is supposed to be a course about computation; one might think we should be talking mostly about programs, rather than more “mathematical” objects such as sets, functions, and graphs, and doing more coding on an actual computer than writing mathematical proofs with pen and paper. So, why are we doing so much math in this course? Is it just some form of hazing? Perhaps a revenge of the “math nerds” against the “hackers”?

At the end of the day, mathematics is simply a language for modeling concepts in a precise and unambiguous way. In this course, we will be mostly interested in the concept of computation. For example, we will look at questions such as “is there an efficient algorithm to find the prime factors of a given integer?”.Actually, scientists currently do not know the answer to this question, but we will see that settling it in either direction has very interesting applications touching on areas as far apart as Internet security and quantum mechanics. To even phrase such a question, we need to give a precise definition of the notion of an algorithm, and of what it means for an algorithm to be efficient. Also, if the answer to this or similar questions turns out to be negative, then this cannot be shown by simply writing and executing some code. After all, there is no empirical experiment that will prove the nonexistence of an algorithm. Thus, our only way to show this type of negative result is to use mathematical proofs. So you can see why our main tools in this course will be mathematical proofs and definitions.

Depending on your background, you can approach this chapter in different ways:

  • If you already have taken some proof-based courses, and are very familiar with the notions of discrete mathematics, you can take a quick look at secmathoverview to see the main tools we will use, and skip ahead to the rest of this book. Alternatively, you can sit back, relax, and read this chapter just to get familiar with our notation, as well as to enjoy (or not) my philosophical musings and attempts at humor. You might also want to start brushing up on discrete probability, which we’ll use later in this course.
  • If your background is not as extensive, you should lean forward, and read this chapter with a pen and paper handy, making notes and working out derivations as you go along. We cannot fit a semester-length discrete math course in a single chapter, and hence will be necessarily brief in our discussions. Thus you might want to occasionally pause to watch some discrete math lectures, read some of the resources mentioned above, and do some exercises to make sure you internalized the material.

A quick overview of mathematical prerequisites

The main notions we will use in this course are the following:

  • Proofs: First and foremost, this course will involve a heavy dose of formal mathematical reasoning, which includes mathematical definitions, statements, and proofs.
  • Sets: The basic set relationships of membership (\(\in\)), containment (\(\subseteq\)), and set operations, principally union, intersection, set difference and Cartesian product (\(\cup,\cap,\setminus\) and \(\times\)).
  • Tuples and strings: The set \(\Sigma^k\) of length-\(k\) strings/lists over elements in \(\Sigma\), where \(\Sigma\) is some finite set which is called the alphabet (quite often \(\Sigma = \{0,1\}\)). We use \(\Sigma^*\) for the set of all strings of finite length.
  • Some special sets: The set \(\N\) of natural numbers. We will index from zero in this course and so write \(\N = \{0,1,2,\ldots \}\). We will use \([n]\) for the set \(\{0,1,2,\ldots,n-1\}\). We use \(\{0,1\}^*\) for the set of all binary strings and \(\{0,1\}^n\) for the set of strings of length \(n\). If \(x\) is a string of length \(n\), then we refer to its coordinate by \(x_0,\ldots,x_{n-1}\).
  • Functions: The domain and codomain of a function, properties such as being one-to-one (also known as injective) or onto (also known as surjective) functions, as well as partial functions (that, unlike standard or “total” functions, are not necessarily defined on all elements of their domain).
  • Logical operations: The operations AND, OR, and NOT (\(\wedge,\vee,\neg\)) and the quantifiers “there exists” and “for all” (\(\exists\),\(\forall\)).
  • Basic combinatorics: Notions such as \(\binom{n}{k}\) (the number of \(k\)-sized subsets of a set of size \(n\)).
  • Graphs: Undirected and directed graphs, connectivity, paths, and cycles.
  • Big-\(O\) notation: \(O,o,\Omega,\omega,\Theta\) notation for analyzing asymptotic growth of functions.
  • Discrete probability: Later on in we will use probability theory, and specifically probability over finite samples spaces such as tossing \(n\) coins, including notions such as random variables, expectation, and concentration. We will only use probability theory in the second half of this text, and will review it beforehand. However, probabilistic reasoning is a subtle (and extremely useful!) skill, and it’s always good to start early in acquiring it.

In the rest of this chapter we briefly review the above notions. This is partially to remind the reader and reinforce material that might not be fresh in your mind, and partially to introduce our notation and conventions which might occasionally differ from those you’ve encountered before.

Reading mathematical texts

In this course, we will eventually tackle some fairly complex definitions. For example, let us consider one of the definitions that we will encounter towards the very end of this text:

If \(G:\{0,1\}^n \rightarrow \{0,1\}\) is a finite function and \(Q\) is a Quantum circuit then we say that \(Q\) computes \(G\) if for every \(x\in \{0,1\}^n\), \(\Pr[ Q(x)=G(x) ] \geq 2/3\).

The class \(\mathbf{BQP}\) (which stands for “bounded-error quantum polynomial time”) is the set of all functions \(F:\{0,1\}^* \rightarrow \{0,1\}\) such that there exists a polynomial-time Turing Machine \(M\) that satisfies the following: for every \(n\in \N\), \(M(1^n)\) is a Quantum circuit \(Q_n\) that computes \(F_n\), where \(F_n:\{0,1\}^n \rightarrow \{0,1\}\) is the restriction of \(F\) to inputs of length \(\{0,1\}^n\). That is, \(\Pr[ M(1^n)(x) = F(x)] \geq 2/3\) for every \(n\in \N\) and \(x\in \{0,1\}^n\).

We will also see the following theorem:

Let \(F:\{0,1\}^* \rightarrow \{0,1\}\) be the function that on input a string representation of a pair \((m,i)\) of natural numbers, outputs \(1\) if and only if the \(i\)-th bit of the smallest prime factor of \(m\) is equal to \(1\). Then \(F \in \mathbf{BQP}\).

While it should make sense to you by the end of the term, at the current point in time it is perfectly fine if BGPintrodef and shorsthmintro seem to you as a meaningless combination of inscrutable terms. Indeed, to a large extent they are such a combination, as they contains many terms that we have not defined (and that we would need to build on a semester’s worth of material to be able to define). Yet, even when faced with what seems like completely incomprehensible gibberish, it is still possible for us to try to make some sense of it, and try to at least to be able to “know what we don’t know”. Let’s use BGPintrodef and shorsthmintro as examples. For starters, let me tell you what this definition and this theorem are about. Quantum computing is an approach to use the peculiarities of quantum mechanics to build computing devices that can solve certain problems exponentially faster than current computers. Many large companies and governments are extremely excited about this possibility, and are investing hundreds of millions of dollars in trying to make this happen. To a first order of approximation, the reason they are so excited is Shor’s Algorithm (i.e., shorsthmintro), which says that the problem of integer factoring, with history going back thousands of years, and whose difficulty is (as we’ll see) closely tied to the security of many current encryption schemes, can be solved efficiently using quantum computers.

shorsthmintro was proven by Peter Shor in 1994. However, Shor could not even have stated this theorem, let alone prove it, without having the proper definition (i.e., BGPintrodef) in place. BGPintrodef defines the class \(\mathbf{BQP}\) of functions that can be computed in polynomial time by quantum computers. Like any mathematical definition, it defines a new concept (in this case the class \(\mathbf{BQP}\)) in terms of other concepts. In this case the concepts that are needed are

  • The notion of a function, which is a mapping of one set to another. In this particular case we use functions whose output is a single number that is either zero or one (i.e., a bit) and the input is a list of bits (i.e., a string) which can either have a fixed length \(n\) (this is denoted as the set \(\{0,1\}^n\)) or have length that is not a priori bounded (this is denoted by \(\{0,1\}^*\)).
  • Restrictions of functions. If \(F\) is a function that takes strings of arbitrary length as input (i.e., members of the set \(\{0,1\}^*\)) then \(F_n\) is the restriction of \(F\) to inputs of length \(n\) (i.e., members of \(\{0,1\}^n\)).
  • We use the notion of a Quantum circuit which will be our computational model for quantum computers, and which we will encounter later on in the course. Quantum circuits can compute functions with a fixed input length \(n\), and we define the notion of computing a function \(G\) as outputting on input \(x\) the value \(G(x)\) with probability at least \(2/3\).
  • We will also use the notion of Turing machines which will be our computational model for “classical” computers.As we’ll see, there is a great variety of ways to model “classical computers”, including RAM machines, \(\lambda\)-calculus, and NAND++ programs.
  • We require that for every \(n\in \N\), the quantum circuit \(Q_n\) for \(F_n\) can be generated efficiently, in the sense that there is a polynomial-time classical program \(P\) that on input a string of \(n\) ones (which we shorthand as \(1^n\)) outputs \(Q_n\).

The point of this example is not for you to understand BGPintrodef and shorsthmintro. Fully understanding them will require background that will take us weeks to develop. The point is to show that you should not be afraid of even the most complicated looking definitions and mathematical terminology. No matter how convoluted the notation, and how many layers of indirection, you can always look at mathematical definitions and try to at least attempt at answering the following questions:

  1. What is the intuitive notion that this definition aims at modeling?
  2. How is each new concept defined in terms of other concepts?
  3. Which of these prior concepts am I already familiar with, and which ones do I still need to look up?

Dealing with mathematical text is in many ways not so different from dealing with any other complex text, whether it’s a legal argument, a philosophical treatise, an English Renaissance play, or even the source code of an operating system. You should not expect it to be clear in a first reading, but you need not despair. Rather you should engage with the text, trying to figure out both the high level intentions as well as the underlying details. Luckily, compared to philosophers or even programmers, mathematicians have a greater discipline of introducing definitions in linear order, making sure that every new concept is defined only in terms of previously defined notions. As you read through the rest of this chapter and this text, try to ask yourself questions 1-3 above every time that you encounter a new definition.

Example: Defining a one to one function

Here is a simpler mathematical definition, which you may have encountered in the past (and will encounter again shortly):

A function \(f:S \rightarrow T\) is one to one if for every two elements \(x,x' \in S\), if \(x \neq x'\) then \(f(x) \neq f(x')\).

This definition captures a simple concept, but even so it uses quite a bit of notation. When reading this definition, or any other piece of mathematical text, it is often useful to annotate it with a pen as you’re going through it, as in onetoonedefannotatedef. For every identifier you encounter (for example \(f,S,T,x,x'\) in this case), make sure that you realize what sort of object is it: is it a set, a function, an element, a number, a gremlin? Make sure you understand how the identifiers are quantified. For example, in onetoonedef there is a universal or “for all” (sometimes denotes by \(\forall\)) quantifier over pairs \((x,x')\) of distinct elements in \(S\). Finally, and most importantly, make sure that aside from being able to parse the definition formally, you also have an intuitive understanding of what is it that the text is actually saying. For example, onetoonedef says that a one to one function is a function where every output is obtained by a unique input.

Figure 1: An annotated form of onetoonedef, marking which type is every object, and with a doodle explaining what the definition says.

Reading mathematical texts in this way takes time, but it gets easier with practice. Moreover, this is one of the most transferable skills you could take from this course. Our world is changing rapidly, not just in the realm of technology, but also in many other human endeavors, whether it is medicine, economics, law or even culture. Whatever your future aspirations, it is likely that you will encounter texts that use new concepts that you have not seen before (for semi-random recent examples from current “hot areas”, see alphagozerofig and zerocashfig). Being able to internalize and then apply new definitions can be hugely important. It is a skill that’s much easier to acquire in the relatively safe and stable context of a mathematical course, where one at least has the guarantee that the concepts are fully specified, and you have access to your teaching staff for questions.

Figure 2: A snippet from the “methods” section of the “AlphaGo Zero” paper by Silver et al, Nature, 2017.
Figure 3: A snippet from the “Zerocash” paper of Ben-Sasson et al, that forms the basis of the cryptocurrency startup Zcash.

Basic discrete math objects

We now quickly review some of the mathematical objects (the “basic data structures” of mathematics, if you will) we use in this course.


A set is an unordered collection of objects. For example, when we write \(S = \{ 2,4, 7 \}\), we mean that \(S\) denotes the set that contains the numbers \(2\), \(4\), and \(7\). (We use the notation “\(2 \in S\)” to denote that \(2\) is an element of \(S\).) Note that the set \(\{ 2, 4, 7 \}\) and \(\{ 7 , 4, 2 \}\) are identical, since they contain the same elements. Also, a set either contains an element or does not contain it – there is no notion of containing it “twice” – and so we could even write the same set \(S\) as \(\{ 2, 2, 4, 7\}\) (though that would be a little weird). The cardinality of a finite set \(S\), denoted by \(|S|\), is the number of elements it contains.Later in this course we will discuss how to extend the notion of cardinality to infinite sets. So, in the example above, \(|S|=3\). A set \(S\) is a subset of a set \(T\), denoted by \(S \subseteq T\), if every element of \(S\) is also an element of \(T\). (We can also describe this by saying that \(T\) is a superset of \(S\).) For example, \(\{2,7\} \subseteq \{ 2,4,7\}\). The set that contains no elements is known as the empty set and it is denoted by \(\emptyset\).

We can define sets by either listing all their elements or by writing down a rule that they satisfy such as \[ \text{EVEN} = \{ x \;:\; \text{ $x=2y$ for some non-negative integer $y$} \} \;. \]

Of course there is more than one way to write the same set, and often we will use intuitive notation listing a few examples that illustrate the rule. For example, we can also define \(\text{EVEN}\) as

\[ \text{EVEN} = \{ 0,2,4, \ldots \} \;. \]

Note that a set can be either finite (such as the set \(\{2,4,7\}\) ) or infinite (such as the set \(\text{EVEN}\)). Also, the elements of a set don’t have to be numbers. We can talk about the sets such as the set \(\{a,e,i,o,u \}\) of all the vowels in the English language, or the set \(\{\) New York, Los Angeles, Chicago, Houston, Philadelphia, Phoenix, San Antonio, San Diego, Dallas \(\}\) of all cities in the U.S. with population more than one million per the 2010 census. A set can even have other sets as elements, such as the set \(\{ \emptyset, \{1,2\},\{2,3\},\{1,3\} \}\) of all even-sized subsets of \(\{1,2,3\}\).

Operations on sets: The union of two sets \(S,T\), denoted by \(S \cup T\), is the set that contains all elements that are either in \(S\) or in \(T\). The intersection of \(S\) and \(T\), denoted by \(S \cap T\), is the set of elements that are both in \(S\) and in \(T\). The set difference of \(S\) and \(T\), denoted by \(S \setminus T\) (and in some texts also by \(S-T\)), is the set of elements that are in \(S\) but not in \(T\).

Tuples, lists, strings, sequences: A tuple is an ordered collection of items. For example \((1,5,2,1)\) is a tuple with four elements (also known as a \(4\)-tuple or quadruple). Since order matters, this is not the same tuple as the \(4\)-tuple \((1,1,5,2)\) or the \(3\)-tuple \((1,5,2)\). A \(2\)-tuple is also known as a pair. We use the terms tuples and lists interchangeably. A tuple where every element comes from some finite set \(\Sigma\) (such as \(\{0,1\}\)) is also known as a string. Analogously to sets, we denote the length of a tuple \(T\) by \(|T|\). Just like sets, we can also think of infinite analogues of tuples, such as the ordered collection \((1,2,4,9,\ldots )\) of all perfect squares. Infinite ordered collections are known as sequences; we might sometimes use the term “infinite sequence” to emphasize this, and use “finite sequence” as a synonym for a tuple.We can identify a sequence \((a_0,a_1,a_2,\ldots)\) of elements in some set \(S\) with a function \(A:\N \rightarrow S\) (where \(a_n = A(n)\) for every \(n\in \N\)). Similarly, we can identify a \(k\)-tuple \((a_0,\ldots,a_{k-1})\) of elements in \(S\) with a function \(A:[k] \rightarrow S\).

Cartesian product: If \(S\) and \(T\) are sets, then their Cartesian product, denoted by \(S \times T\), is the set of all ordered pairs \((s,t)\) where \(s\in S\) and \(t\in T\). For example, if \(S = \{1,2,3 \}\) and \(T = \{10,12 \}\), then \(S\times T\) contains the \(6\) elements \((1,10),(2,10),(3,10),(1,12),(2,12),(3,12)\). Similarly if \(S,T,U\) are sets then \(S\times T \times U\) is the set of all ordered triples \((s,t,u)\) where \(s\in S\), \(t\in T\), and \(u\in U\). More generally, for every positive integer \(n\) and sets \(S_0,\ldots,S_{n-1}\), we denote by \(S_0 \times S_1 \times \cdots \times S_{n-1}\) the set of ordered \(n\)-tuples \((s_0,\ldots,s_{n-1})\) where \(s_i\in S_i\) for every \(i \in \{0,\ldots, n-1\}\). For every set \(S\), we denote the set \(S\times S\) by \(S^2\), \(S\times S\times S\) by \(S^3\), \(S\times S\times S \times S\) by \(S^4\), and so on and so forth.

Sets in Python (optional)

To get more comfortable with sets, one can also play with the set data structure in Python:The set data structure only corresponds to finite sets; infinite sets are much more cumbersome to handle in programming languages, though mechanisms such as Python generators and lazy evaluation in general can be helpful.

A = { 7 , 10 , 12} B = {12 , 8 , 5 } print(A==B) # False print(A=={10,7,7,12}) # True def intersect(S,T): return {x for x in S if x in T} print(intersect(A,B)) # {12} def contains(S,T): return all({x in T for x in S}) print(contains(A,B)) # False print(contains({2,8,8,12},{12,8,2,34})) # True def product(S,T): return {(s,t) for s in S for t in T} print(product(A,B)) # {(10, 8), (10, 5), (7, 12), (12, 12), (10, 12), (12, 5), (7, 5), (7, 8), (12, 8)}

Special sets

There are several sets that we will use in this course time and again, and so find it useful to introduce explicit notation for them. For starters we define

\[ \N = \{ 0, 1,2, \ldots \} \] to be the set of all natural numbers, i.e., non-negative integers. For any natural number \(n\), we define the set \([n]\) as \(\{0,\ldots, n-1\} = \{ k\in \N : k < n \}\).We start our indexing of both \(\N\) and \([n]\) from \(0\), while many other texts index those sets from \(1\). Starting from zero or one is simply a convention that doesn’t make much difference, as long as one is consistent about it.

We will also occasionally use the set \(\Z=\{\ldots,-2,-1,0,+1,+2,\ldots \}\) of (negative and non-negative) integers,The letter Z stands for the German word “Zahlen”, which means numbers. as well as the set \(\R\) of real numbers. (This is the set that includes not just the integers, but also fractional and even irrational numbers; e.g., \(\R\) contains numbers such as \(+0.5\), \(-\pi\), etc.) We denote by \(\R_+\) the set \(\{ x\in \R : x > 0 \}\) of positive real numbers. This set is sometimes also denoted as \((0,\infty)\).

Strings: Another set we will use time and again is

\[ \{0,1\}^n = \{ (x_0,\ldots,x_{n-1}) \;:\; x_0,\ldots,x_{n-1} \in \{0,1\} \} \] which is the set of all \(n\)-length binary strings for some natural number \(n\). That is \(\{0,1\}^n\) is the set of all \(n\)-tuples of zeroes and ones. This is consistent with our notation above: \(\{0,1\}^2\) is the Cartesian product \(\{0,1\} \times \{0,1\}\), \(\{0,1\}^3\) is the product \(\{0,1\} \times \{0,1\} \times \{0,1\}\) and so on.

We will write the string \((x_0,x_1,\ldots,x_{n-1})\) as simply \(x_0x_1\cdots x_{n-1}\) and so for example

\[ \{0,1\}^3 = \{ 000 , 001, 010 , 011, 100, 101, 110, 111 \} \;. \]

For every string \(x\in \{0,1\}^n\) and \(i\in [n]\), we write \(x_i\) for the \(i^{th}\) coordinate of \(x\). If \(x\) and \(y\) are strings, then \(xy\) denotes their concatenation. That is, if \(x \in \{0,1\}^n\) and \(y\in \{0,1\}^m\), then \(xy\) is equal to the string \(z\in \{0,1\}^{n+m}\) such that for \(i\in [n]\), \(z_i=x_i\) and for \(i\in \{n,\ldots,n+m-1\}\), \(z_i = y_{i-n}\).

We will also often talk about the set of binary strings of all lengths, which is

\[ \{0,1\}^* = \{ (x_0,\ldots,x_{n-1}) \;:\; n\in\N \;,\;, x_0,\ldots,x_{n-1} \in \{0,1\} \} \;. \]

Another way to write this set is as \[ \{0,1\}^* = \{0,1\}^0 \cup \{0,1\}^1 \cup \{0,1\}^2 \cup \cdots \] or more concisely as

\[ \{0,1\}^* = \cup_{n\in\N} \{0,1\}^n \;. \]

The set \(\{0,1\}^*\) contains also the “string of length \(0\)” or “the empty string”, which we will denote by \(\mathtt{""}\).We follow programming languages in this notation; other texts sometimes use \(\epsilon\) or \(\lambda\) to denote the empty string. However, this doesn’t matter much since we will rarely encounter this “edge case”.

Generalizing the star operation: For every set \(\Sigma\), we define

\[\Sigma^* = \cup_{n\in \N} \Sigma^n \;.\] For example, if \(\Sigma = \{a,b,c,d,\ldots,z \}\) then \(\Sigma^*\) denotes the set of all finite length strings over the alphabet a-z.

Concatenation: As mentioned in specialsets, the concatenation of two strings \(x\in \Sigma^n\) and \(y\in \Sigma^m\) is the \((n+m)\)-length string \(xy\) obtained by writing \(y\) after \(x\).


If \(S\) and \(T\) are nonempty sets, a function \(F\) mapping \(S\) to \(T\), denoted by \(F:S \rightarrow T\), associates with every element \(x\in S\) an element \(F(x)\in T\). The set \(S\) is known as the domain of \(F\) and the set \(T\) is known as the codomain of \(F\). The image of a function \(F\) is the set \(\{ F(x) \;|\; x\in S\}\) which is the subset of \(F\)’s codomain consisting of all output elements that are mapped from some input.Some texts use range to denote the image of a function, while other texts use range to denote the codomain of a function. Hence we will avoid using the term “range” altogether. Just as with sets, we can write a function either by listing the table of all the values it gives for elements in \(S\) or using a rule. For example if \(S = \{0,1,2,3,4,5,6,7,8,9 \}\) and \(T = \{0,1 \}\), then the table below defines a function \(F: S \rightarrow T\). Note that this function is the same as the function defined by the rule \(F(x)= (x \mod 2)\).For two natural numbers \(x\) and \(a\), \(x \mod a\) (where \(\mod\) is shorthand for “modulo”) denotes the remainder of \(x\) when it is divided by \(a\). That is, it is the number \(r\) in \(\{0,\ldots,a-1\}\) such that \(x = ak +r\) for some integer \(k\). We sometimes also use the notation \(x = y (\mod a)\) to denote the assertion that \(x \mod a\) is the same as \(y \mod a\).

Input Output
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1

If \(F:S \rightarrow T\) satisfies that \(F(x)\neq F(y)\) for all \(x \neq y\) then we say that \(F\) is one-to-one (also known as an injective function or simply an injection). If \(F\) satisfies that for every \(y\in T\) there is some \(x\in S\) such that \(F(x)=y\) then we say that \(F\) is onto (also known as a surjective function or simply a surjection). A function that is both one-to-one and onto is known as a bijective function or simply a bijection. A bijection from a set \(S\) to itself is also known as a permutation of \(S\). If \(F:S \rightarrow T\) is a bijection then for every \(y\in T\) there is a unique \(x\in S\) s.t. \(F(x)=y\). We denote this value \(x\) by \(F^{-1}(y)\). Note that \(F^{-1}\) is itself a bijection from \(T\) to \(S\) (can you see why?).

Giving a bijection between two sets is often a good way to show they have the same size. In fact, the standard mathematical definition of the notion that “\(S\) and \(T\) have the same cardinality” is that there exists a bijection \(f:S \rightarrow T\). In particular, the cardinality of a set \(S\) is defined to be \(n\) if there is a bijection from \(S\) to the set \(\{0,\ldots,n-1\}\). As we will see later in this course, this is a definition that can generalizes to defining the cardinality of infinite sets.

Partial functions: We will sometimes be interested in partial functions from \(S\) to \(T\). A partial function is allowed to be undefined on some subset of \(S\). That is, if \(F\) is a partial function from \(S\) to \(T\), then for every \(s\in S\), either there is (as in the case of standard functions) an element \(F(s)\) in \(T\), or \(F(s)\) is undefined. For example, the partial function \(F(x)= \sqrt{x}\) is only defined on non-negative real numbers. When we want to distinguish between partial functions and standard (i.e., non-partial) functions, we will call the latter total functions. When we say “function” without any qualifier then we mean a total function.

The notion of partial functions is a strict generalization of functions, and so every function is a partial function, but not every partial function is a function. (That is, for every nonempty \(S\) and \(T\), the set of partial functions from \(S\) to \(T\) is a proper superset of the set of total functions from \(S\) to \(T\).) When we want to emphasize that a function \(f\) from \(A\) to \(B\) might not be total, we will write \(f: A \rightarrow_p B\). We can think of a partial function \(F\) from \(S\) to \(T\) also as a total function from \(S\) to \(T \cup \{ \bot \}\) where \(\bot\) is some special “failure symbol”, and so instead of saying that \(F\) is undefined at \(x\), we can say that \(F(x)=\bot\).

Basic facts about functions: Verifying that you can prove the following results is an excellent way to brush up on functions:

  • If \(F:S \rightarrow T\) and \(G:T \rightarrow U\) are one-to-one functions, then their composition \(H:S \rightarrow U\) defined as \(H(s)=G(F(s))\) is also one to one.
  • If \(F:S \rightarrow T\) is one to one, then there exists an onto function \(G:T \rightarrow S\) such that \(G(F(s))=s\) for every \(s\in S\).
  • If \(G:T \rightarrow S\) is onto then there exists a one-to-one function \(F:S \rightarrow T\) such that \(G(F(s))=s\) for every \(s\in S\).
  • If \(S\) and \(T\) are finite sets then the following conditions are equivalent to one another: (a) \(|S| \leq |T|\), (b) there is a one-to-one function \(F:S \rightarrow T\), and (c) there is an onto function \(G:T \rightarrow S\).
Figure 4: We can represent finite functions as a directed graph where we put an edge from \(x\) to \(f(x)\). The onto condition corresponds to requiring that every vertex in the codomain of the function has in-degree at least one. The one-to-one condition corresponds to requiring that every vertex in the codomain of the function has in-degree at most one. In the examples above \(F\) is an onto function, \(G\) is one to one, and \(H\) is neither onto nor one to one.

You can find the proofs of these results in many discrete math texts, including for example, section 4.5 in the Leham-Leighton-Meyer notes. However, I strongly suggest you try to prove them on your own, or at least convince yourself that they are true by proving special cases of those for small sizes (e.g., \(|S|=3,|T|=4,|U|=5\)).

Let us prove one of these facts as an example:

If \(S,T\) are non-empty sets and \(F:S \rightarrow T\) is one to one, then there exists an onto function \(G:T \rightarrow S\) such that \(G(F(s))=s\) for every \(s\in S\).

Let \(S\), \(T\) and \(F:S \rightarrow T\) be as in the Lemma’s statement, and choose some \(s_0 \in S\). We will define the function \(G:T \rightarrow S\) as follows: for every \(t\in T\), if there is some \(s\in S\) such that \(F(s)=t\) then set \(G(t)=s\) (the choice of \(s\) is well defined since by the one-to-one property of \(F\), there cannot be two distinct \(s,s'\) that both map to \(t\)). Otherwise, set \(G(t)=s_0\). Now for every \(s\in S\), by the definition of \(G\), if \(t=F(s)\) then \(G(t)=G(F(s))=s\). Moreover, this also shows that \(G\) is onto, since it means that for every \(s\in S\) there is some \(t\) (namely \(t=F(s)\)) such that \(G(t)=s\).


Graphs are ubiquitous in Computer Science, and many other fields as well. They are used to model a variety of data types including social networks, road networks, deep neural nets, gene interactions, correlations between observations, and a great many more. The formal definitions of graphs are below, but if you have not encountered them before then I urge you to read up on them in one of the sources linked above. Graphs come in two basic flavors: undirected and directed.It is possible, and sometimes useful, to think of an undirected graph as simply a directed graph with the special property that for every pair \(u,v\) either both the edges \(\overrightarrow{u v}\) and \(\overleftarrow{u v}\) are present or neither of them is. However, in many settings there is a significant difference between undirected and directed graphs, and so it’s typically best to think of them as separate categories.

Figure 5: An example of an undirected and a directed graph. The undirected graph has vertex set \(\{1,2,3,4\}\) and edge set \(\{ \{1,2\},\{2,3\},\{2,4\} \}\). The directed graph has vertex set \(\{a,b,c\}\) and the edge set \(\{ (a,b),(b,c),(c,a),(a,c) \}\).

An undirected graph \(G = (V,E)\) consists of a set \(V\) of vertices and a set \(E\) of edges. Every edge is a size two subset of \(V\). We say that two vertices \(u,v \in V\) are neighbors, denoted by \(u \sim v\), if the edge \(\{u,v\}\) is in \(E\).

Given this definition, we can define several other properties of graphs and their vertices. We define the degree of \(u\) to be the number of neighbors \(u\) has. A path in the graph is a tuple \((u_0,\ldots,u_k) \in V^k\), for some \(k>0\) such that \(u_{i+1}\) is a neighbor of \(u_i\) for every \(i\in [k]\). A simple path is a path \((u_0,\ldots,u_{k-1})\) where all the \(u_i\)’s are distinct. A cycle is a path \((u_0,\ldots,u_k)\) where \(u_0=u_{k}\). We say that two vertices \(u,v\in V\) are connected if either \(u=v\) or there is a path from \((u_0,\ldots,u_k)\) where \(u_0=u\) and \(u_k=v\). We say that the graph \(G\) is connected if every pair of vertices in it is connected.

Here are some basic facts about undirected graphs. We give some informal arguments below, but leave the full proofs as exercises. (The proofs can also be found in most basic texts on graph theory.)

In any undirected graph \(G=(V,E)\), the sum of the degrees of all vertices is equal to twice the number of edges.

degreesegeslem can be shown by seeing that every edge \(\{ u,v\}\) contributes twice to the sum of the degrees (once for \(u\) and the second time for \(v\).)

The connectivity relation is transitive, in the sense that if \(u\) is connected to \(v\), and \(v\) is connected to \(w\), then \(u\) is connected to \(w\).

conntranslem can be shown by simply attaching a path of the form \((u,u_1,u_2,\ldots,u_{k-1},v)\) to a path of the form \((v,u'_1,\ldots,u'_{k'-1},w)\) to obtain the path \((u,u_1,\ldots,u_{k-1},v,u'_1,\ldots,u'_{k'-1},w)\) that connects \(u\) to \(w\).

For every undirected graph \(G=(V,E)\) and connected pair \(u,v\), the shortest path from \(u\) to \(v\) is simple. In particular, for every connected pair there exists a simple path that connects them.

simplepathlem can be shown by “shortcutting” any non simple path of the form \((u,u_1,\ldots,u_{i-1},w,u_{i+1},\ldots,u_{j-1},w,u_{j+1},\ldots,u_{k-1},v)\) where the same vertex \(w\) appears in both the \(i\)-th and \(j\)-position, to obtain the shorter path \((u,u_1,\ldots,u_{i-1},w,u_{j+1},\ldots,u_{k-1},v)\).

If you haven’t seen these proofs before, it is indeed a great exercise to transform the above informal exercises into fully rigorous proofs.

A directed graph \(G=(V,E)\) consists of a set \(V\) and a set \(E \subseteq V\times V\) of ordered pairs of \(V\). We denote the edge \((u,v)\) also as \(\overrightarrow{u v}\). If the edge \(\overrightarrow{u v}\) is present in the graph then we say that \(v\) is an out-neighbor of \(u\) and \(u\) is an in-neigbor of \(v\).

A directed graph might contain both \(\overrightarrow{u v}\) and \(\overrightarrow{v u}\) in which case \(u\) will be both an in-neighbor and an out-neighbor of \(v\) and vice versa. The in-degree of \(u\) is the number of in-neighbors it has, and the out-degree of \(v\) is the number of out-neighbors it has. A path in the graph is a tuple \((u_0,\ldots,u_k) \in V^k\), for some \(k>0\) such that \(u_{i+1}\) is an out-neighbor of \(u_i\) for every \(i\in [k]\). As in the undirected case, a simple path is a path \((u_0,\ldots,u_{k-1})\) where all the \(u_i\)’s are distinct and a cycle is a path \((u_0,\ldots,u_k)\) where \(u_0=u_{k}\). One type of directed graphs we often care about is directed acyclic graphs or DAGs, which, as their name implies, are directed graphs without any cycles.

The lemmas we mentioned above have analogs for directed graphs. We again leave the proofs (which are essentially identical to their undirected analogs) as exercises for the reader:

In any directed graph \(G=(V,E)\), the sum of the in-degrees is equal to the sum of the out-degrees, which is equal to the number of edges.

In any directed graph \(G\), if there is a path from \(u\) to \(v\) and a path from \(v\) to \(w\), then there is a path from \(u\) to \(w\).

For every directed graph \(G=(V,E)\) and a pair \(u,v\) such that there is a path from \(u\) to \(v\), the shortest path from \(u\) to \(v\) is simple.

The word graph in the sense above was coined by the mathematician Sylvester in 1878 in analogy with the chemical graphs used to visualize molecules. There is an unfortunate confusion with the more common usage of the term as a way to plot data, and in particular a plot of some function \(f(x)\) as a function of \(x\). We can merge these two meanings by thinking of a function \(f:A \rightarrow B\) as a special case of a directed graph over the vertex set \(V= A \cup B\) where we put the edge \(\overrightarrow{x f(x)}\) for every \(x\in A\). In a graph constructed in this way every vertex in \(A\) has out-degree one.

The following lecture of Berkeley CS70 provides an excellent overview of graph theory.

Logic operators and quantifiers.

If \(P\) and \(Q\) are some statements that can be true or false, then \(P\) AND \(Q\) (denoted as \(P \wedge Q\)) is the statement that is true if and only if both \(P\) and \(Q\) are true, and \(P\) OR \(Q\) (denoted as \(P \vee Q\)) is the statement that is true if and only if either \(P\) or \(Q\) is true. The negation of \(P\), denoted as \(\neg P\) or \(\overline{P}\), is the statement that is true if and only if \(P\) is false.

Suppose that \(P(x)\) is a statement that depends on some parameter \(x\) (also sometimes known as an unbound variable) in the sense that for every instantiation of \(x\) with a value from some set \(S\), \(P(x)\) is either true or false. For example, \(x>7\) is a statement that is not a priori true or false, but does become true or false whenever we instantiate \(x\) with some real number. In such case we denote by \(\forall_{x\in S} P(x)\) the statement that is true if and only if \(P(x)\) is true for every \(x\in S\).In these notes we will place the variable that is bound by a quantifier in a subscript and so write \(\forall_{x\in S}P(x)\) whereas other texts might use \(\forall x\in S. P(x)\). We denote by \(\exists_{x\in S} P(x)\) the statement that is true if and only if there exists some \(x\in S\) such that \(P(x)\) is true.

For example, the following is a formalization of the true statement that there exists a natural number \(n\) larger than \(100\) that is not divisible by \(3\):

\[ \exists_{n\in \N} (n>100) \wedge \left(\forall_{k\in N} k+k+k \neq n\right) \;. \]

“For sufficiently large \(n\)” One expression which comes up time and again is the claim that some statement \(P(n)\) is true “for sufficiently large \(n\)”. What this means is that there exists an integer \(N_0\) such that \(P(n)\) is true for every \(n>N_0\). We can formalize this as \(\exists_{N_0\in \N} \forall_{n>N_0} P(n)\).

Quantifiers for summations and products

The following shorthands for summing up or taking products of several numbers are often convenient. If \(S = \{s_0,\ldots,s_{n-1} \}\) is a finite set and \(f:S \rightarrow \R\) is a function, then we write \(\sum_{x\in S} f(x)\) as shorthand for

\[ f(s_0) + f(s_1) + f(s_2) + \ldots + f(s_{n-1}) \;, \]

and \(\prod_{x\in S} f(x)\) as shorthand for

\[ f(s_0) \cdot f(s_1) \cdot f(s_2) \cdot \ldots \cdot f(s_{n-1}) \;. \]

For example, the sum of the squares of all numbers from \(1\) to \(100\) can be written as

\[ \sum_{i\in \{1,\ldots,100\}} i^2 \;. \label{eqsumsquarehundred} \]

Since summing up over intervals of integers is so common, there is a special notation for it, and for every two integers \(a \leq b\), \(\sum_{i=a}^b f(i)\) denotes \(\sum_{i\in S} f(i)\) where \(S =\{ x\in \Z : a \leq x \leq b \}\). Hence we can write the sum \eqref{eqsumsquarehundred} as

\[ \sum_{i=1}^{100} i^2 \;. \]

Parsing formulas: bound and free variables

In mathematics, as in coding, we often have symbolic “variables” or “parameters”. It is important to be able to understand, given some formula, whether a given variable is bound or free in this formula. For example, in the following statement \(n\) is free but \(a\) and \(b\) are bound by the \(\exists\) quantifier:

\[ \exists_{a,b \in \N} (a \neq 1) \wedge (a \neq n) \wedge (n = a \times b) \label{aboutnstmt} \]

Since \(n\) is free, it can be set to any value, and the truth of the statement \eqref{aboutnstmt} depends on the value of \(n\). For example, if \(n=8\) then \eqref{aboutnstmt} is true, but for \(n=11\) it is false. (Can you see why?)

The same issue appears when parsing code. For example, in the following snippet from the C++ programming language

for (int i=0 ; i<n ; i=i+1) { printf("*"); }

the variable i is bound to the for operator but the variable n is free.

The main property of bound variables is that we can change them to a different name (as long as it doesn’t conflict with another used variable) without changing the meaning of the statement. Thus for example the statement

\[ \exists_{x,y \in \N} (x \neq 1) \wedge (x \neq n) \wedge (n = x \times y) \label{aboutnstmttwo} \] is equivalent to \eqref{aboutnstmt} in the sense that it is true for exactly the same set of \(n\)’s. Similarly, the code

for (int j=0 ; j<n ; j=j+1) { printf("*"); }

produces the same result as the code above that used i instead of j.

Mathematical notation has a lot of similarities with programming language, and for the same reasons. Both are formalisms meant to convey complex concepts in a precise way. However, there are some cultural differences. In programming languages, we often try to use meaningful variable names such as NumberOfVertices while in math we often use short identifiers such as \(n\). (Part of it might have to do with the tradition of mathematical proofs as being handwritten and verbally presented, as opposed to typed up and compiled.)

One consequence of that is that in mathematics we often end up reusing identifier, and also “run out” of letters and hence use greek letters too, as well as distinguish between small and capital letters. Similarly, mathematical notation tends to use quite a lot of “overloading”, using operators such as \(+\) for a great variety of objects (e.g., real numbers, matrices, finite field elements, etc..), and assuming that the meaning can be inferred from the context.

Both fields have a notion of “types”, and in math we often try to reserve certain letters for variables of a particular type. For example, variables such as \(i,j,k,\ell,m,n\) will often denote integers, and \(\epsilon\) will often denote a small positive real number. When reading or writing mathematical texts, we usually don’t have the advantage of a “compiler” that will check type safety for us. Hence it is important to keep track of the type of each variable, and see that the operations that are performed on it “make sense”.

Asymptotics and Big-\(O\) notation

“\(\log\log\log n\) has been proved to go to infinity, but has never been observed to do so.”, Anonymous, quoted by Carl Pomerance (2000)

It is often very cumbersome to describe precisely quantities such as running time and is also not needed, since we are typically mostly interested in the “higher order terms”. That is, we want to understand the scaling behavior of the quantity as the input variable grows. For example, as far as running time goes, the difference between an \(n^5\)-time algorithm and an \(n^2\)-time one is much more significant than the difference between an \(100n^2 + 10n\) time algorithm and an \(10n^2\) time algorithm. For this purpose, \(O\)-notation is extremely useful as a way to “declutter” our text and focus our attention on what really matters. For example, using \(O\)-notation, we can say that both \(100n^2 + 10n\) and \(10n^2\) are simply \(\Theta(n^2)\) (which informally means “the same up to constant factors”), while \(n^2 = o(n^5)\) (which informally means that \(n^2\) is “much smaller than” \(n^5\)).While Big-\(O\) notation is often used to analyze running time of algorithms, this is by no means the only application. At the end of the day, Big-\(O\) notation is just a way to express asymptotic inequalities between functions on integers. It can be used regardless of whether these functions are a measure of running time, memory usage, or any other quantity that may have nothing to do with computation.

Generally (though still informally), if \(F,G\) are two functions mapping natural numbers to non-negative reals, then “\(F=O(G)\)” means that \(F(n) \leq G(n)\) if we don’t care about constant factors, while “\(F=o(G)\)” means that \(F\) is much smaller than \(G\), in the sense that no matter by what constant factor we multiply \(F\), if we take \(n\) to be large enough then \(G\) will be bigger (for this reason, sometimes \(F=o(G)\) is written as \(F \ll G\)). We will write \(F= \Theta(G)\) if \(F=O(G)\) and \(G=O(F)\), which one can think of as saying that \(F\) is the same as \(G\) if we don’t care about constant factors. More formally, we define Big-\(O\) notation as follows:

For \(F,G: \N \rightarrow \R_+\), we define \(F=O(G)\) if there exist numbers \(a,N_0 \in \N\) such that \(F(n) \leq a\cdot G(n)\) for every \(n>N_0\).Recall that \(\R_+\), which is also sometimes denoted as \((0,\infty)\), is the set of positive real numbers, so the above is just a way of saying that \(F\) and \(G\)’s outputs are always positive numbers. We define \(F=\Omega(G)\) if \(G=O(F)\).

We write \(F =o(G)\) if for every \(\epsilon>0\) there is some \(N_0\) such that \(F(n) <\epsilon G(n)\) for every \(n>N_0\). We write \(F =\omega(G)\) if \(G=o(F)\). We write \(F= \Theta(G)\) if \(F=O(G)\) and \(G=O(F)\).

We can also use the notion of limits to define Big- and Little-\(O\) notation. You can verify that \(F=o(G)\) (or, equivalently, \(G=\omega(F)\)) if and only if \(\lim\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)} = 0\). Similarly, if the limit \(\lim\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)}\) exists and is a finite number then \(F=O(G)\). If you are familiar with the notion of supremum, then you can verify that \(F=O(G)\) if and only if \(\limsup\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)} < \infty\).

Figure 6: If \(F(n)=o(G(n))\) then for sufficiently large \(n\), \(F(n)\) will be smaller than \(G(n)\). For example, if Algorithm \(A\) runs in time \(1000\cdot n+10^6\) and Algorithm \(B\) runs in time \(0.01\cdot n^2\) then even though \(B\) might be more efficient for smaller inputs, when the inputs get sufficiently large, \(A\) will run much faster than \(B\).

Using the equality sign for \(O\)-notation is extremely common, but is somewhat of a misnomer, since a statement such as \(F = O(G)\) really means that \(F\) is in the set \(\{ G' : \exists_{N,c} \text{ s.t. } \forall_{n>N} G'(n) \leq c G(n) \}\). For this reason, some texts write \(F \in O(G)\) instead of \(F = O(G)\). If anything, it would have made more sense use inequalities and write \(F \leq O(G)\) and \(F \geq \Omega(G)\), reserving equality for \(F = \Theta(G)\), but by now the equality notation is quite firmly entrenched. Nevertheless, you should remember that a statement such as \(F = O(G)\) means that \(F\) is “at most” \(G\) in some rough sense when we ignore constants, and a statement such as \(F = \Omega(G)\) means that \(F\) is “at least” \(G\) in the same rough sense.

It’s often convenient to use “anonymous functions” in the context of \(O\)-notation, and also to emphasize the input parameter to the function. For example, when we write a statement such as \(F(n) = O(n^3)\), we mean that \(F=O(G)\) where \(G\) is the function defined by \(G(n)=n^3\). Chapter 7 in Jim Apsnes’ notes on discrete math provides a good summary of \(O\) notation; see also this tutorial for a gentler and more programmer-oriented introduction.

Some “rules of thumb” for Big-\(O\) notation

There are some simple heuristics that can help when trying to compare two functions \(F\) and \(G\):

  • Multiplicative constants don’t matter in \(O\)-notation, and so if \(F(n)=O(G(n))\) then \(100F(n)=O(G(n))\).
  • When adding two functions, we only care about the larger one. For example, for the purpose of \(O\)-notation, \(n^3+100n^2\) is the same as \(n^3\), and in general in any polynomial, we only care about the larger exponent.
  • For every two constants \(a,b>0\), \(n^a = O(n^b)\) if and only if \(a \leq b\), and \(n^a = o(n^b)\) if and only if \(a<b\). For example, combining the two observations above, \(100n^2 + 10n + 100 = o(n^3)\).
  • Polynomial is always smaller than exponential: \(n^a = o(2^{n^\epsilon})\) for every two constants \(a>0\) and \(\epsilon>0\) even if \(\epsilon\) is much smaller than \(a\). For example, \(100n^{100} = o(2^{\sqrt{n}})\).
  • Similarly, logarithmic is always smaller than polynomial: \((\log n)^a\) (which we write as \(\log^a n\)) is \(o(n^\epsilon)\) for every two constants \(a,\epsilon>0\). For example, combining the observations above, \(100n^2 \log^{100} n = o(n^3)\).

In most (though not all!) cases we use \(O\)-notation, the constants hidden by it are not too huge and so on an intuitive level, you can think of \(F=O(G)\) as saying something like \(F(n) \leq 1000 G(n)\) and \(F=\Omega(G)\) as saying something \(F(n) \geq 0.001 G(n)\).


Many people think of mathematical proofs as a sequence of logical deductions that starts from some axioms and ultimately arrives at a conclusion. In fact, some dictionaries define proofs that way. This is not entirely wrong, but in reality a mathematical proof of a statement X is simply an argument that convinces the reader that X is true beyond a shadow of a doubt. To produce such a proof you need to:

  1. Understand precisely what X means.
  2. Convince yourself that X is true.
  3. Write your reasoning down in plain, precise and concise English (using formulas or notation only when they help clarity).

In many cases, Step 1 is the most important one. Understanding what a statement means is often more than halfway towards understanding why it is true. In Step 3, to convince the reader beyond a shadow of a doubt, we will often want to break down the reasoning to “basic steps”, where each basic step is simple enough to be “self evident”. The combination of all steps yields the desired statement.

Proofs and programs

There is a great deal of similarity between the process of writing proofs and that of writing programs, and both require a similar set of skills. Writing a program involves:

  1. Understanding what is the task we want the program to achieve.
  2. Convincing yourself that the task can be achieved by a computer, perhaps by planning on a whiteboard or notepad how you will break it up to simpler tasks.
  3. Converting this plan into code that a compiler or interpreter can understand, by breaking up each task into a sequence of the basic operations of some programming language.

In programs as in proofs, step 1 is often the most important one. A key difference is that the reader for proofs is a human being and for programs is a compiler.This difference might be eroding with time, as more proofs are being written in a machine verifiable form and progress in artificial intelligence allows expressing programs in more human friendly ways, such as “programming by example”. Interestingly, much of the progress in automatic proof verification and proof assistants relies on a much deeper correspondence between proofs and programs. We might see this correspondence later in this course. Thus our emphasis is on readability and having a clear logical flow for the proof (which is not a bad idea for programs as well…). When writing a proof, you should think of your audience as an intelligent but highly skeptical and somewhat petty reader, that will “call foul” at every step that is not well justified.

Extended example: graph connectivity

To illustrate these ideas, let us consider the following example of a true theorem:

Every connected undirected graph of \(n\) vertices has at least \(n-1\) edges.

We are going to take our time to understand how one would come up with a proof for graphconthm, and how to write such a proof down. This will not be the shortest way to prove this theorem, but hopefully following this process will give you some general insights on reading, writing, and discovering mathematical proofs.

Before trying to prove graphconthm, we need to understand what it means. Let’s start with the terms in the theorems. We defined undirected graphs and the notion of connectivity in graphsec above. In particular, an undirected graph \(G=(V,E)\) is connected if for every pair \(u,v \in V\), there is a path \((u_0,u_1,\ldots,u_k)\) such that \(u_0=u\), \(u_k=v\), and \(\{ u_i,u_{i+1} \} \in E\) for every \(i\in [k]\).

It is crucial that at this point you pause and verify that you completely understand the definition of connectivity. Indeed, you should make a habit of pausing after any statement of a theorem, even before looking at the proof, and verifying that you understand all the terms that the theorem refers to.

To prove graphconthm we need to show that there is no \(2\)-vertex connected graph with fewer than \(1\) edges, \(3\)-vertex connected graph with fewer than \(2\) edges, and so on and so forth. One of the best ways to prove a theorem is to first try to disprove it. By trying and failing to come up with a counterexample, we often understand why the theorem can not be false. For example, if you try to draw a \(4\)-vertex graph with only two edges, you can see that there are basically only two choices for such a graph as depicted in figurefourvertexgraph, and in both there will remain some vertices that cannot be connected.

Figure 7: In a four vertex graph with two edges, either both edges have a shared vertex or they don’t. In both cases the graph will not be connected.

In fact, we can see that if we have a budget of \(2\) edges and we choose some vertex \(u\), we will not be able to connect to \(u\) more than two other vertices, and similarly with a budget of \(3\) edges we will not be able to connect to \(u\) more than three other vertices. We can keep trying to draw such examples until we convince ourselves that the theorem is probably true, at which point we want to see how we can prove it.

If you have not seen the proof of this theorem before (or don’t remember it), this would be an excellent point to pause and try to prove it yourself. One way to do it would be to describe an algorithm that on input a graph \(G\) on \(n\) vertices and \(n-2\) or fewer edges, finds a pair \(u,v\) of vertices such that \(u\) is disconnected from \(v\).

Mathematical induction

There are several ways to prove graphconthm. One approach to do is to start by proving it for small graphs, such as graphs with 2,3 or 4 edges, for which we can check all the cases, and then try to extend the proof for larger graphs. The technical term for this proof approach is proof by induction.

Induction is simply an application of the self-evident Modus Ponens rule that says that if (a) \(P\) is true and (b) \(P\) implies \(Q\) then \(Q\) is true. In the setting of proofs by induction we typically have a statement \(Q(k)\) that is parameterized by some integer \(k\), and we prove that (a) \(Q(0)\) is true and (b) For every \(k>0\), if \(Q(0),\ldots,Q(k-1)\) are all true then \(Q(k)\) is true.Usually proving (b) is the hard part, though there are examples where the “base case” (a) is quite subtle. By repeatedly applying Modus Ponens, we can deduce from (a) and (b) that \(Q(1)\) is true, and then from (a),(b) and \(Q(1)\) that \(Q(2)\) is true, and so on and so forth to obtain that \(Q(k)\) is true for every \(k\). The statement (a) is called the “base case”, while (b) is called the “inductive step”. The assumption in (b) that \(Q(i)\) holds for \(i<k\) is called the “inductive hypothesis”.

Proofs by inductions are closely related to algorithms by recursion. In both cases we reduce solving a larger problem to solving a smaller instance of itself. In a recursive algorithm to solve some problem P on an input of length \(k\) we ask ourselves “what if someone handed me a way to solve P on instances smaller than \(k\)?”. In an inductive proof to prove a statement Q parameterized by a number \(k\), we ask ourselves “what if I already knew that \(Q(k')\) is true for \(k'<k\)”. Both induction and recursion are crucial concepts for this course and Computer Science at large (and even other areas of inquiry, including not just mathematics but other sciences as well). Both can be initially (and even post-initially) confusing, but with time and practice they become clearer. For more on proofs by induction and recursion, you might find the following Stanford CS 103 handout, this MIT 6.00 lecture or this excerpt of the Lehman-Leighton book useful.

Proving the theorem by induction

There are several ways to use induction to prove graphconthm. We will do so by following our intuition above that with a budget of \(k\) edges, we cannot connect to a vertex more than \(k\) other vertices. That is, we will define the statement \(Q(k)\) as follows:

\(Q(k)\) is “For every graph \(G=(V,E)\) with at most \(k\) edges and every \(u\in V\), the number of vertices that are connected to \(u\) (including \(u\) itself) is at most \(k+1\)”

Note that \(Q(n-2)\) implies our theorem, since it means that in an \(n\) vertex graph of \(n-2\) edges, there would be at most \(n-1\) vertices that are connected to \(u\), and hence in particular there would be some vertex that is not connected to \(u\). More formally, if we define, given any undirected graph \(G\) and vertex \(u\) of \(G\), the set \(C_G(u)\) to contain all vertices connected to \(u\), then the statement \(Q(k)\) is that for every undirected graph \(G=(V,E)\) with \(|E|=k\) and \(u\in V\), \(|C_G(u)| \leq k+1\).

To prove that \(Q(k)\) is true for every \(k\) by induction, we will first prove that (a) \(Q(0)\) is true, and then prove (b) if \(Q(0),\ldots,Q(k-1)\) are true then \(Q(k)\) is true as well. In fact, we will prove the stronger statement (b’) that if \(Q(k-1)\) is true then \(Q(k)\) is true as well. ((b’) is a stronger statement than (b) because it has same conclusion with a weaker assumption.) Thus, if we show both (a) and (b’) then we complete the proof of graphconthm.

Proving (a) (i.e., the “base case”) is actually quite easy. The statement \(Q(0)\) says that if \(G\) has zero edges, then \(|C_G(u)|=1\), but this is clear because in a graph with zero edges, \(u\) is only connected to itself. The heart of the proof is, as typical with induction proofs, is in proving a statement such as (b’) (or even the weaker statement (b)). Since we are trying to prove an implication, we can assume the so-called “inductive hypothesis” that \(Q(k-1)\) is true and need to prove from this assumption that \(Q(k)\) is true. So, suppose that \(G=(V,E)\) is a graph of \(k\) edges, and \(u\in V\). Since we can use induction, a natural approach would be to remove an edge \(e\in E\) from the graph to create a new graph \(G'\) of \(k-1\) edges. We can use the induction hypothesis to argue that \(|C_{G'}(u)| \leq k\). Now if we could only argue that removing the edge \(e\) reduced the connected component of \(u\) by at most a single vertex, then we would be done, as we could argue that \(|C_G(u)| \leq |C_{G'}(u)|+1 \leq k+1\).

Please ensure that you understand why showing that \(|C_G(u)| \leq |C_{G'}(u)|+1\) completes the inductive proof.

Figure 8: Removing a single edge \(e\) can greatly decrease the number of vertices that are connected to a vertex \(u\).

Alas, this might not be the case. It could be that removing a single edge \(e\) will greatly reduce the size of \(C_{G}(u)\). For example that edge might be a “bridge” between two large connected components; such a situation is illustrated in effectofoneedgefig. This might seem as a real stumbling block, and at this point we might go back to the drawing board to see if perhaps the theorem is false after all. However, if we look at various concrete examples, we see that in any concrete example, there is always a “good” choice of an edge, adding which will increase the component connect to \(u\) by at most one vertex.

Figure 9: Removing an edge \(e=\{s,w\}\) where \(w\in C_G(u)\) has degree one removes only \(w\) from \(C_G(u)\).

The crucial observation is that this always holds if we choose an edge \(e = \{ s, w\}\) where \(w \in C_G(u)\) has degree one in the graph \(G\), see addingdegreeonefig. The reason is simple. Since every path from \(u\) to \(w\) must pass through \(s\) (which is \(w\)’s only neighbor), removing the edge \(\{ s,w \}\) merely has the effect of disconnecting \(w\) from \(u\), and hence \(C_{G'}(u) = C_G(u) \setminus \{ w \}\) and in particular \(|C_{G'}(u)|=|C_G(u)|-1\), which is exactly the condition we needed.

Now the question is whether there will always be a degree one vertex in \(C_G(u) \setminus \{u \}\). Of course generally we are not guaranteed that a graph would have a degree one vertex, but we are not dealing with a general graph here but rather a graph with a small number of edges. We can assume that \(|C_G(u)| > k+1\) (otherwise we’re done) and each vertex in \(C_G(u)\) must have degree at least one (as otherwise it would not be connected to \(u\)). Thus, the only case where there is no vertex \(w\in C_G(u) \setminus \{u\}\) of degree one, is when the degrees of all vertices in \(C_G(u)\) are at least \(2\). But then by degreesegeslem the number of edges in the graph is at least \(\tfrac{1}{2}\cdot 2 \cdot (k+1)>k\), which contradicts our assumption that the graph \(G\) has at most \(k\) edges. Thus we can conclude that either \(|C_G(u)| \leq k+1\) (in which case we’re done) or there is a degree one vertex \(w\neq u\) that is connected to \(u\). By removing the single edge \(e\) that touches \(w\), we obtain a \(k-1\) edge graph \(G'\) which (by the inductive hypothesis) satisfies \(|C_{G'}(u)| \leq k\), and hence \(|C_G(u)|=|C_{G'}(u) \cup \{ w \}| \leq k+1\). This suffices to complete an inductive proof of statement \(Q(k)\).

Writing down the proof

All of the above was a discussion of how we discover the proof, and convince ourselves that the statement is true. However, once we do that, we still need to write it down. When writing the proof, we use the benefit of hindsight, and try to streamline what was a messy journey into a linear and easy-to-follow flow of logic that starts with the word “Proof:” and ends with “QED” or the symbol \(\blacksquare\).QED stands for “quod erat demonstrandum”, which is “What was to be demonstrated.” or “The very thing it was required to have shown.” in Latin. All our discussions, examples and digressions can be very insightful, but we keep them outside the space delimited between these two words, where (as described by this excellent handout) “every sentence must be load bearing”. Just like we do in programming, we can break the proof into little “subroutines” or “functions” (known as lemmas or claims in math language), which will be smaller statements that help us prove the main result. However, it should always be crystal-clear to the reader in what stage we are of the proof. Just like it should always be clear to which function a line of code belongs to, it should always be clear whether an individual sentence is part of a proof of some intermediate result, or is part of the argument showing that this intermediate result implies the theorem. Sometimes we highlight this partition by noting after each occurrence of “QED” to which lemma or claim it belongs.

Let us see how the proof of graphconthm looks in this streamlined fashion. We start by repeating the theorem statement

Every connected undirected graph of \(n\) vertices has at least \(n-1\) edges.

The proof will follow from the following lemma:

For every \(k\in \N\), undirected graph \(G=(V,E)\) of at most \(k\) edges, and \(u\in V\), the number of vertices connected to \(u\) in \(G\) is at most \(k+1\).

We start by showing that graphcontlem implies the theorem:

Proof of graphconthmpf from graphcontlem: We will show that for undirected graph \(G=(V,E)\) of \(n\) vertices and at most \(n-2\) edges, there is a pair \(u,v\) of vertices that are disconnected in \(G\). let \(G\) be such a graph and \(u\) be some vertex of \(G\). By graphcontlem, the number of vertices connected to \(u\) is at most \(n-1\), and hence (since \(|V|=n\)) there is a vertex \(v\in V\) that is not connected to \(u\), thus completing the proof. QED (Proof of graphconthmpf from graphcontlem)

We now turn to proving graphcontlem. Let \(G=(V,E)\) be an undirected graph of \(k\) edges and \(u\in V\). We define \(C_G(u)\) to be the set of vertices connected to \(u\). To complete the proof of graphcontlem, we need to prove that \(|C_G(u)| \leq k+1\). We will do so by induction on \(k\).

The base case that \(k=0\) is true because a graph with zero edges, \(u\) is only connected to itself.

Now suppose that graphcontlem is true for \(k-1\) and we will prove it for \(k\). Let \(G=(V,E)\) and \(u\in V\) be as above, where \(|E|=k\), and suppose (towards a contradiction) that \(|C_G(u)| \geq k+2\). Let \(S = C_G(u) \setminus \{u \}\). Denote by \(deg(v)\) the degree of any vertex \(v\). By degreesegeslem, \(\sum_{v\in S} deg(v) \leq \sum_{v\in V} deg(v) = 2|E|=2k\). Hence in particular, under our assumption that \(|S|+1=|C_G(u)| \geq k+2\), we get that \(\tfrac{1}{|S|}\sum_{v\in S} deg(v) \leq 2k/(k+1)< 2\). In other words, the average degree of a vertex in \(S\) is smaller than \(2\), and hence in particular there is some vertex \(w\in S\) with degree smaller than \(2\). Since \(w\) is connected to \(u\), it must have degree at least one, and hence (since \(w\)’s degree is smaller than two) degree exactly one. In other words, \(w\) has a single neighbor which we denote by \(s\).

Let \(G'\) be the graph obtained by removing the edge \(\{ s, w\}\) from \(G\). Since \(G'\) has at most \(k-1\) edges, by the inductive hypothesis we can assume that \(|C_{G'}(u)| \leq k\). The proof of the lemma is concluded by showing the following claim:

Claim: Under the above assumptions, \(|C_G(u)| \leq |C_{G'}(u)|+1\).

Proof of claim: The claim says that \(C_{G'}(u)\) has at most one fewer element than \(C_G(u)\). Thus it follows from the following statement \((*)\): \(C_{G'}(u) \supseteq C_G(u) \setminus \{ w \}\). To prove (*) we need to show that for every \(v \neq w\) that is connected to \(u\), \(v \in C_{G'}(u)\). Indeed for every such \(v\), simplepathlem implies that there must be some simple path \((t_0,t_1,\ldots,t_{i-1},t_i)\) in the graph \(G\) where \(t_0=u\) and \(t_i=v\). But \(w\) cannot belong to this path, since \(w\) is different from the endpoints \(u\) and \(v\) of the path and can’t equal one of the intermediate points either, since it has degree one and that would make the path not simple. More formally, if \(w=t_j\) for \(0 < j < i\), then since \(w\) has only a single neighbor \(s\), it would have to hold that \(w\)’s neighbor \(s\) satisfies \(s=t_{j-1}=t_{j+1}\), contradicting the simplicity of the path. Hence the path from \(u\) to \(v\) is also a path in the graph \(G'\), which means that \(v \in C_{G'}(u)\), which is what we wanted to prove. QED (claim)

The claim implies graphcontlem since by the inductive assumption, \(|C_{G'}(u)| \leq k\), and hence by the claim \(|C_G(u)| \leq k+1\), which is what we wanted to prove. This concludes the proof of graphcontlem and hence also of graphconthmpf. QED (graphcontlem), QED (graphconthmpf)

The proof above used the observation that if the average of some \(n\) numbers \(x_0,\ldots,x_{n-1}\) is at most \(X\), then there must exists at least a single number \(x_i \leq X\). (In this particular proof, the numbers were the degrees of vertices in \(S\).) This is known as the averaging principle, and despite its simplicity, it is often extremely useful.

Reading a proof is no less of an important skill than producing one. In fact, just like understanding code, it is a highly non-trivial skill in itself. Therefore I strongly suggest that you re-read the above proof, asking yourself at every sentence whether the assumption it makes are justified, and whether this sentence truly demonstrates what it purports to achieve. Another good habit is to ask yourself when reading a proof for every variable you encounter (such as \(u\), \(t_i\), \(G'\), etc. in the above proof) the following questions: (1) What type of variable is it? is it a number? a graph? a vertex? a function? and (2) What do we know about it? Is it an arbitrary member of the set? Have we shown some facts about it?, and (3) What are we trying to show about it?.

Proof writing style

A mathematical proof is a piece of writing, but it is a specific genre of writing with certain conventions and preferred styles. As in any writing, practice makes perfect, and it is also important to revise your drafts for clarity.

In a proof for the statement \(X\), all the text between the words “Proof:” and “QED” should be focused on establishing that \(X\) is true. Digressions, examples, or ruminations should be kept outside these two words, so they do not confuse the reader. The proof should have a clear logical flow in the sense that every sentence or equation in it should have some purpose and it should be crystal-clear to the reader what this purpose is. When you write a proof, for every equation or sentence you include, ask yourself:

  1. Is this sentence or equation stating that some statement is true?
  2. If so, does this statement follow from the previous steps, or are we going to establish it in the next step?
  3. What is the role of this sentence or equation? Is it one step towards proving the original statement, or is it a step towards proving some intermediate claim that you have stated before?
  4. Finally, would the answers to questions 1-3 be clear to the reader? If not, then you should reorder, rephrase or add explanations.

Some helpful resources on mathematical writing include this handout by Lee, this handout by Hutching, as well as several of the excellent handouts in Stanford’s CS 103 class.

Patterns in proofs

“If it was so, it might be; and if it were so, it would be; but as it isn’t, it ain’t. That’s logic.”, Lewis Carroll, Through the looking-glass.

Just like in programming, there are several common patterns of proofs that occur time and again. Here are some examples:

Proofs by contradiction: One way to prove that \(X\) is true is to show that if \(X\) was false then we would get a contradiction as a result. Such proofs often start with a sentence such as “Suppose, towards a contradiction, that \(X\) is false” and end with deriving some contradiction (such as a violation of one of the assumptions in the theorem statement). Here is an example:

There are no natural numbers \(a,b\) such that \(\sqrt{2} = \tfrac{a}{b}\).

Suppose, towards the sake of contradiction that this is false, and so let \(a\in \N\) be the smallest number such that there exists some \(b\in\N\) satisfying \(\sqrt{2}=\tfrac{a}{b}\). Squaring this equation we get that \(2=a^2/b^2\) or \(a^2=2b^2\) \((*)\). But this means that \(a^2\) is even, and since the product of two odd numbers is odd, it means that \(a\) is even as well, or in other words, \(a = 2a'\) for some \(a' \in \N\). Yet plugging this into \((*)\) shows that \(4a'^2 = 2b^2\) which means \(b^2 = 2a'^2\) is an even number as well. By the same considerations as above we gat that \(b\) is even and hence \(a/2\) and \(b/2\) are two natural numbers satisfying \(\tfrac{a/2}{b/2}=\sqrt{2}\), contradicting the minimality of \(a\).

Proofs of a universal statement: Often we want to prove a statement \(X\) of the form “Every object of type \(O\) has property \(P\).” Such proofs often start with a sentence such as “Let \(o\) be an object of type \(O\)” and end by showing that \(o\) has the property \(P\). Here is a simple example:

For every natural number \(n\in N\), either \(n\) or \(n+1\) is even.

Let \(n\in N\) be some number. If \(n/2\) is a whole number then we are done, since then \(n=2(n/2)\) and hence it is even. Otherwise, \(n/2+1/2\) is a whole number, and hence \(2(n/2+1/2)=n+1\) is even.

Proofs of an implication: Another common case is that the statement \(X\) has the form “\(A\) implies \(B\)”. Such proofs often start with a sentence such as “Assume that \(A\) is true” and end with a derivation of \(B\) from \(A\). Here is a simple example:

If \(b^2 \geq 4ac\) then there is a solution to the quadratic equation \(ax^2 + bx + c =0\).

Suppose that \(b^2 \geq 4ac\). Then \(d = b^2 - 4ac\) is a non-negative number and hence it has a square root \(s\). Thus \(x = (-b+s)/(2a)\) satisfies \[ \begin{aligned} ax^2 + bx + c &= a(-b+s)^2/(4a^2) + b(-b+s)/(2a) + c \\ &= (b^2-2bs+s^2)/(4a)+(-b^2+bs)/(2a)+c \;. \label{eq:quadeq} \end{aligned} \]

Rearranging the terms of \eqref{eq:quadeq} we get \[ s^2/(4a)+c- b^2/(4a) = (b^2-4ac)/(4a) + c - b^2/(4a) = 0 \]

Proofs of equivalence: If a statement has the form “\(A\) if and only if \(B\)” (often shortened as “\(A\) iff \(B\)”) then we need to prove both that \(A\) implies \(B\) and that \(B\) implies \(A\). We call the implication that \(A\) implies \(B\) the “only if” direction, and the implication that \(B\) implies \(A\) the “if” direction.

Proofs by combining intermediate claims: When a proof is more complex, it is often helpful to break it apart into several steps. That is, to prove the statement \(X\), we might first prove statements \(X_1\),\(X_2\), and \(X_3\) and then prove that \(X_1 \wedge X_2 \wedge X_3\) implies \(X\).As mentioned below, \(\wedge\) denotes the logical AND operator. Our proof of graphconthm had this form.

Proofs by case distinction: This is a special case of the above, where to prove a statement \(X\) we split into several cases \(C_1,\ldots,C_k\), and prove that (a) the cases are exhaustive, in the sense that one of the cases \(C_i\) must happen and (b) go one by one and prove that each one of the cases \(C_i\) implies the result \(X\) that we are after.

“Without loss of generality (w.l.o.g)”: This term can be initially quite confusing to students. It is essentially a way to shorten case distinctions such as the above. The idea is that if Case 1 is equal to Case 2 up to a change of variables or a similar transformation, then the proof of Case 1 will also imply the proof of case 2. It is always a statement that should be viewed with suspicion. Whenever you see it in a proof, ask yourself if you understand why the assumption made is truly without loss of generality, and when you use it, try to see if the use is indeed justified. Sometimes it might be easier to just repeat the proof of the second case (adding a remark that the proof is very similar to the first one).

Proofs by induction: We can think of such proofs as a variant of the above, where we have an unbounded number of intermediate claims \(X_0,X_2,\ldots,X_k\), and we prove that \(X_0\) is true, as well that \(X_0\) implies \(X_1\), and that \(X_0 \wedge X_1\) implies \(X_2\), and so on and so forth. The website for CMU course 15-251 contains a useful handout on potential pitfalls when making proofs by induction.

Mathematical proofs are ultimately written in English prose. The well-known computer scientist Leslie Lamport argued that this is a problem, and proofs should be written in a more formal and rigorous way. In his manuscript he proposes an approach for structured hierarchical proofs, that have the following form:

  • A proof for a statement of the form “If \(A\) then \(B\)” is a sequence of numbered claims, starting with the assumption that \(A\) is true, and ending with the claim that \(B\) is true.
  • Every claim is followed by a proof showing how it is derived from the previous assumptions or claims.
  • The proof for each claim is itself a sequence of subclaims.

The advantage of Lamport’s format is that it is very clear for every sentence in the proof what is the role that it plays. It is also much easier to transform such proofs into machine-checkable format. The disadvantage is that such proofs can be more tedious to read and write, with less differentiation on the important parts of the arguments versus the more routine ones.

Non-standard notation

Most of the notation we discussed above is standard and is used in most mathematical texts. The main points where we diverge are:

  • We index the natural numbers \(\N\) starting with \(0\) (though many other texts, especially in computer science, do the same).
  • We also index the set \([n]\) starting with \(0\), and hence define it as \(\{0,\ldots,n-1\}\). In most texts it is defined as \(\{1,\ldots, n \}\). Similarly, we index coordinates of our strings starting with \(0\), and hence a string \(x\in \{0,1\}^n\) is written as \(x_0x_1\cdots x_{n-1}\).
  • We use partial functions which are functions that are not necessarily defined on all inputs. When we write \(f:A \rightarrow B\) this will refer to a total function unless we say otherwise. When we want to emphasize that \(f\) can be a partial function, we will sometimes write \(f: A \rightarrow_p B\).
  • As we will see later on in the course, we will mostly describe our computational problems in the terms of computing a Boolean function \(f: \{0,1\}^* \rightarrow \{0,1\}\). In contrast, most textbooks will refer to this as the task of deciding a language \(L \subseteq \{0,1\}^*\). These two viewpoints are equivalent, since for every set \(L\subseteq \{0,1\}^*\) there is a corresponding function \(f = 1_L\) such that \(f(x)=1\) if and only if \(x\in L\). Computing partial functions corresponds to the task known in the literature as a solving a promise problem.Because the language notation is so prevalent in textbooks, we will occasionally remind the reader of this correspondence.
  • Some other notation we use is \(\ceil{x}\) and \(\floor{x}\) for the “ceiling” and “floor” operators that correspond to “rounding up” or “rounding down” a number to the nearest integer. We use \((x \mod y)\) to denote the “remainder” of \(x\) when divided by \(y\). That is, \((x \mod y) = x - y\floor{x/y}\). In context when an integer is expected we’ll typically “silently round” the quantities to an integer. For example, if we say that \(x\) is a string of length \(\sqrt{n}\) then we’ll typically mean that \(x\) is of length \(\lceil \sqrt{n} \rceil\). (In most such cases, it will not make a difference whether we round up or down.)
  • Like most Computer Science texts, we default to the logarithm in base two. Thus, \(\log n\) is the same as \(\log_2 n\).
  • We will also use the notation \(f(n)=poly(n)\) as a short hand for \(f(n)=n^{O(1)}\) (i.e., as shorthand for saying that there are some constants \(a,b\) such that \(f(n) \leq a\cdot n^b\) for every sufficiently large \(n\)). Similarly, we will use \(f(n)=polylog(n)\) as shorthand for \(f(n)=poly(\log n)\) (i.e., as shorthand for saying that there are some constants \(a,b\) such that \(f(n) \leq a\cdot (\log n)^b\) for every sufficiently large \(n\)).
  • The basic “mathematical data structures” we’ll need are numbers, sets, tuples, strings, graphs and functions.
  • We can use basic objects to define more complex notions. For example, graphs can be defined as a list of pairs.
  • Given precise definitions of objects, we can state unambiguous and precise statements. We can then use mathematical proofs to determine whether these statements are true or false.
  • A mathematical proof is not a formal ritual but rather a clear, precise and “bulletproof” argument certifying the truth of a certain statement.
  • Big-\(O\) notation is an extremely useful formalism to suppress less significant details and allow us to focus on the high level behavior of quantities of interest.
  • The only way to get comfort with mathematical notions is to apply them in the contexts of solving problems. You should expect to need to go back time and again to the definitions and notation in this lecture as you work through problems in this course.


Most of the exercises have been written in the summer of 2018 and haven’t yet been fully debugged. While I would prefer people do not post online solutions to the exercises, I would greatly appreciate if you let me know of any bugs. You can do so by posting a GitHub issue about the exercise, and optionally complement this with an email to me with more details about the attempted solution.

  1. Write a logical expression \(\varphi(x)\) involving the variables \(x_0,x_1,x_2\) and the operators \(\wedge\) (AND), \(\vee\) (OR), and \(\neg\) (NOT), such that \(\varphi(x)\) is true if the majority of the inputs are True.
  2. Write a logical expression \(\varphi(x)\) involving the variables \(x_0,x_1,x_2\) and the operators \(\wedge\) (AND), \(\vee\) (OR), and \(\neg\) (NOT), such that \(\varphi(x)\) is true if the sum \(\sum_{i=0}^{2} x_i\) (identifying “true” with \(1\) and “false” with \(0\)) is odd.

Use the logical quantifiers \(\forall\) (for all), \(\exists\) (there exists), as well as \(\wedge,\vee,\neg\) and the arithmetic operations \(+,\times,=,>,<\) to write the following:

  1. An expression \(\varphi(n,k)\) such that for every natural numbers \(n,k\), \(\varphi(n,k)\) is true if and only if \(k\) divides \(n\).
  2. An expression \(\varphi(n)\) such that for every natural number \(n\), \(\varphi(n)\) is true if and only if \(n\) is a power of three.

Describe in words the following sets:

  1. \(S = \{ x\in \{0,1\}^{100} : \forall_{i\in \{0,\ldots, 99\}} x_i = x_{99-i} \}\)
  2. \(T = \{ x\in \{0,1\}^* : \forall_{i,j \in \{2,\ldots,|x|-1 \} } i\cdot j \neq |x| \}\)

For each one of the following pairs of sets \((S,T)\), prove or disprove the following statement: there is a one to one function \(f\) mapping \(S\) to \(T\).

  1. Let \(n>10\). \(S = \{0,1\}^n\) and \(T= [n] \times [n] \times [n]\).
  2. Let \(n>10\). \(S\) is the set of all functions mapping \(\{0,1\}^n\) to \(\{0,1\}\). \(T = \{0,1\}^{n^3}\).
  3. Let \(n>100\). \(S = \{k \in [n] \;|\; k \text{ is prime} \}\), \(T = \{0,1\}^{\ceil{\log n -1}}\).
  1. Let \(A,B\) be finite sets. Prove that \(|A\cup B| = |A|+|B|-|A\cap B|\).
  2. Let \(A_0,\ldots,A_{k-1}\) be finite sets. Prove that \(|A_0 \cup \cdots \cup A_{k-1}| \geq \sum_{i=0}^{k-1} |A_i| - \sum_{0 \leq i < j < k} |A_i \cap A_j|\).
  3. Let \(A_0,\ldots,A_{k-1}\) be finite subsets of \(\{1,\ldots, n\}\), such that \(|A_i|=m\) for every \(i\in [k]\). Prove that if \(k>100n\), then there exist two distinct sets \(A_i,A_j\) s.t. \(|A_i \cap A_j| \geq m^2/(10n)\).

Prove that if \(S,T\) are finite and \(F:S \rightarrow T\) is one to one then \(|S| \leq |T|\).

Prove that if \(S,T\) are finite and \(F:S \rightarrow T\) is onto then \(|S| \geq |T|\).

Prove that for every finite \(S,T\), there are \((|T|+1)^{|S|}\) partial functions from \(S\) to \(T\).

Suppose that \(\{ S_n \}_{n\in \N}\) is a sequence such that \(S_0 \leq 10\) and for \(n>1\) \(S_n \leq 5 S_{\lfloor \tfrac{n}{5} \rfloor} + 2n\). Prove by induction that \(S_n \leq 100 n \log n\) for every \(n\).

Describe the following statement in English words: \(\forall_{n\in\N} \exists_{p>n} \forall{a,b \in \N} (a\times b \neq p) \vee (a=1)\).

Prove that for every undirected graph \(G\) of \(100\) vertices, if every vertex has degree at most \(4\), then there exists a subset \(S\) of at \(20\) vertices such that no two vertices in \(S\) are neighbors of one another.

For every pair of functions \(F,G\) below, determine which of the following relations holds: \(F=O(G)\), \(F=\Omega(G)\), \(F=o(G)\) or \(F=\omega(G)\).

  1. \(F(n)=n\), \(G(n)=100n\).
  2. \(F(n)=n\), \(G(n)=\sqrt{n}\).
  3. \(F(n)=n\log n\), \(G(n)=2^{(\log (n))^2}\).
  4. \(F(n)=\sqrt{n}\), \(G(n)=2^{\sqrt{\log n}}\)
  5. \(F(n) = \binom{n}{\ceil{0.2 n}}\) , \(G(n) = 2^{0.1 n}\).

Give an example of a pair of functions \(F,G:\N \rightarrow \N\) such that neither \(F=O(G)\) nor \(G=O(F)\) holds.

Prove that for every directed acyclic graph (DAG) \(G=(V,E)\), there exists a map \(f:V \rightarrow \N\) such that \(f(u)<f(v)\) for every edge \(\overrightarrow{u \; v}\) in the graph.Hint: Use induction on the number of vertices. You might want to first prove the claim that every DAG contains a sink: a vertex without an outgoing edge.

Prove that for every undirected graph \(G\) on \(n\) vertices, if \(G\) has at least \(n\) edges then \(G\) contains a cycle.

Bibliographical notes

The section heading “A Mathematician’s Apology”, refers of course to Hardy’s classic book. Even when Hardy is wrong, he is very much worth reading.


Comments are posted on the GitHub repository using the utteranc.es app. A GitHub login is required to comment. If you don't want to authorize the app to post on your behalf, you can also comment directly on the GitHub issue for this page.

Copyright 2018, Boaz Barak.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

HTML version is produced using the Distill template, Copyright 2018, The Distill Template Authors.