Before we can explore the grand ideas of what can and cannot be computed, we need to agree on a language. Not a programming language, but a formal, mathematical language that lets us speak about computation with absolute precision. Today is about laying that foundation. We’re building our vocabulary from the ground up, starting with the most basic element: the symbol.
Alphabet
Just like any written language, we begin with an alphabet. In our formal world, the definition is simple and clean:
An alphabet is any non-empty, finite set of symbols.
We usually denote an alphabet with the Greek letter $\Sigma$ (Sigma).
Let’s break down the two conditions:
- Non-empty: If we have no symbols, we can’t write anything down. We need at least one symbol to start describing things.
- Finite: This matches our intuition. The English alphabet has 26 letters. The binary system has 2 symbols. We work with a fixed, finite set of building blocks and combine them to create complexity.
The choice of alphabet depends entirely on what we want to describe.
Examples of Alphabets
- The binary alphabet: $\Sigma_{\text{bool}} = \{0, 1\}$
- The lowercase English alphabet: $\Sigma_{\text{lat}} = \{a, b, c, \dots, z\}$
- The set of decimal digits: $\Sigma_{10} = \{0, 1, 2, \dots, 9\}$
- The symbols on a keyboard (including the spacebar): $\Sigma_{\text{keyboard}}$
Words
Once we have an alphabet, we can form words.
A word over an alphabet $\Sigma$ is a finite sequence of symbols from $\Sigma$.
For example, if our alphabet is $\Sigma = \{0, 1\}$, then $01101$ is a word. When we write words, we simply concatenate the symbols. We don't need commas like $(0, 1, 1, 0, 1)$, because our alphabet consists of distinct, atomic symbols. There's no ambiguity.
The Empty Word
There is one special word: the empty word. It's a word with zero symbols, a sequence of length 0. We denote it with $\lambda$ (Lambda). It's a crucial concept, an object in its own right, not just "nothing."
Length of a Word
The length of a word $w$, denoted $|w|$, is the number of symbols in its sequence.
- If $w = 01101$ over $\Sigma = \{0, 1\}$, then $|w| = 5$. Note that $|\lambda| = 0$.
We can also count the occurrences of a specific symbol. We use $|w|_a$ to denote the number of times the symbol $a$ appears in $w$.
- For $w = 01101$, we have $|w|_0 = 2$ and $|w|_1 = 3$.
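In Python terms, if we represent words as strings (an assumption made purely for illustration), these two measures are just `len` and `str.count`:

```python
w = "01101"

print(len(w))        # |w| = 5
print(w.count("0"))  # |w|_0 = 2
print(w.count("1"))  # |w|_1 = 3
print(len(""))       # |λ| = 0: the empty word has length zero
```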
The Set of All Words: $\Sigma^*$
The set of all possible words that can be formed from an alphabet $\Sigma$ is denoted by $\Sigma^*$ (pronounced "Sigma-star"). This set is infinite, but every element within it is a finite-length word. $\Sigma^*$ always includes the empty word $\lambda$.
We also define $\Sigma^+$ as the set of all non-empty words over $\Sigma$. That is, $\Sigma^+ = \Sigma^* \setminus \{\lambda\}$.
Listing Words: Order Matters
How would you list all the words in $\Sigma^*$ for $\Sigma = \{0, 1\}$? A naive lexicographical approach runs into a problem: $\lambda, 0, 00, 000, 0000, \dots$ You'd never get to a word containing a $1$!
To properly enumerate all words, we use the canonical ordering (or shortlex order):
- First, order words by length, with shorter words coming first.
- For words of the same length, order them lexicographically (like in a dictionary).
For $\Sigma = \{0, 1\}$ with the order $0 < 1$, the canonical ordering is: $\lambda, 0, 1, 00, 01, 10, 11, 000, 001, \dots$
This method guarantees that every word will eventually appear in the list. This proves that $\Sigma^*$ is a countably infinite set, a fact with profound implications we'll see later.
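As a sanity check, here is a minimal Python sketch of the canonical enumeration (the function name `canonical_order` is our own; words are strings, with `""` standing in for $\lambda$):

```python
from itertools import count, islice, product

def canonical_order(alphabet: str):
    """Yield every word over `alphabet` in canonical (shortlex) order:
    by length first, then lexicographically in the given symbol order."""
    for length in count(0):                  # lengths 0, 1, 2, ...
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)           # length 0 yields "" (the empty word)

# The first eight words over {0, 1}:
print(list(islice(canonical_order("01"), 8)))
# ['', '0', '1', '00', '01', '10', '11', '000']
```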
An Algebra of Words
We can define operations on words, giving them a nice algebraic structure.
Concatenation
The most fundamental operation is concatenation, which is simply placing one word after another. If $u = 01$ and $v = 101$, their concatenation is $uv = 01101$.
This operation, together with the set $\Sigma^*$, forms a structure called a monoid:
- Associativity: For any words $u, v, w$, it holds that $(uv)w = u(vw)$. The way we group them doesn't matter.
- Neutral Element: The empty word is the neutral element. For any word $w$, $\lambda w = w\lambda = w$.
Reverse
The reverse of a word $w$, denoted $w^R$, is the word written backwards.
- If $w = a_1 a_2 \dots a_n$, then $w^R = a_n \dots a_2 a_1$.
- Example: $(01101)^R = 10110$.
A neat property to prove is that for any two words $u$ and $v$: $(uv)^R = v^R u^R$.
This is like putting on socks and then shoes; to reverse the process, you must take off the shoes first, then the socks.
Powers
We can define powers of a word through repeated concatenation:
- $w^0 = \lambda$, and $w^{i+1} = w^i w$ for $i \geq 0$. For example, $(01)^3 = 010101$.
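A quick Python sketch of these operations on strings (again under our string representation; slicing with `[::-1]` plays the role of $w^R$, and `*` gives powers):

```python
u, v = "01", "101"

# Concatenation and the monoid laws
assert (u + v) + "0" == u + (v + "0")  # associativity
assert "" + u == u + "" == u           # λ is the neutral element

# Reverse, and the "socks and shoes" property (uv)^R = v^R u^R
assert (u + v)[::-1] == v[::-1] + u[::-1]

# Powers by repeated concatenation
assert "01" * 3 == "010101"            # (01)^3
assert "01" * 0 == ""                  # w^0 = λ
```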
Deconstructing Words: Subwords, Prefixes, and Suffixes
We often need to talk about parts of a word.
- A prefix is an initial segment of a word. $v$ is a prefix of $w$ if $w = vx$ for some word $x$.
- A suffix is a final segment of a word. $v$ is a suffix of $w$ if $w = xv$ for some word $x$.
- A subword (or substring) is a contiguous block anywhere inside a word. $v$ is a subword of $w$ if $w = xvy$ for some words $x$ and $y$.
For the word $w = \text{banana}$:
- Prefixes: $\lambda$, b, ba, ban, bana, banan, banana
- Suffixes: $\lambda$, a, na, ana, nana, anana, banana
- Subwords: All prefixes and suffixes, plus others like nan and anan.
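These definitions translate directly into Python slicing (a sketch; the helper names are ours):

```python
def prefixes(w: str) -> list[str]:
    """All initial segments of w, from λ up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def suffixes(w: str) -> list[str]:
    """All final segments of w, from w itself down to λ."""
    return [w[i:] for i in range(len(w) + 1)]

def subwords(w: str) -> set[str]:
    """All contiguous blocks w[i:j] inside w, including λ and w."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

print(prefixes("banana"))          # ['', 'b', 'ba', 'ban', 'bana', 'banan', 'banana']
print("nan" in subwords("banana")) # True
```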
Languages: Sets of Words
Now we get to the main event. We use our vocabulary of words to define languages.
A language $L$ over an alphabet $\Sigma$ is any subset of $\Sigma^*$.
That’s it. A language is simply a set of words. This definition is incredibly broad and powerful.
Examples of Languages
- Let $\Sigma_{\text{bool}} = \{0, 1\}$. The language of prime numbers is the set $L_{\text{prim}} = \{w \in \Sigma_{\text{bool}}^* \mid \text{Number}(w) \text{ is prime}\}$, where $\text{Number}(w)$ is the number whose binary representation is $w$. This is a subset of $\Sigma_{\text{bool}}^*$.
- Let $\Sigma_{\text{keyboard}}$ be the alphabet of keyboard characters. The language $L_{\text{Java}}$ is the set of all syntactically correct Java programs.
- The empty set, $\emptyset$, is a language. It contains no words.
- The set $\{\lambda\}$, containing only the empty word, is also a language. Note that $\{\lambda\} \neq \emptyset$.
The crucial idea is this: We use languages to formally represent problems. The "primality testing problem" is equivalent to the "language membership problem" for $L_{\text{prim}}$: given a word $w$, is $w \in L_{\text{prim}}$? The task of a Java compiler is to decide if a given source file (a word) is in the language $L_{\text{Java}}$.
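To make "language membership" concrete, here is a small Python sketch of a decider for $L_{\text{prim}}$ (the helper names and the convention $\text{Number}(\lambda) = 0$ are our own assumptions):

```python
def number(w: str) -> int:
    """Interpret a word over {0, 1} as the binary number it spells."""
    return int(w, 2) if w else 0  # convention Number(λ) = 0 (our assumption)

def is_prime(n: int) -> bool:
    """Trial division: slow, but fine for an illustration."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def in_L_prim(w: str) -> bool:
    """Decide the membership problem for L_prim: is Number(w) prime?"""
    return is_prime(number(w))

print(in_L_prim("101"))   # Number(101) = 5, prime      -> True
print(in_L_prim("1001"))  # Number(1001) = 9 = 3 * 3    -> False
```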
An Algebra of Languages
Since languages are sets, we can use standard set operations like union ($\cup$), intersection ($\cap$), and complement ($\Sigma^* \setminus L$). But we can also extend our word operations to languages.
Concatenation of Languages
The concatenation of two languages $L_1$ and $L_2$ is the set of all words formed by taking a word from $L_1$ and sticking a word from $L_2$ onto its end: $L_1 L_2 = \{uv \mid u \in L_1, v \in L_2\}$.
Example: Let $L_1 = \{\lambda, 0\}$ and $L_2 = \{0, 00\}$. Then $L_1 L_2 = \{0, 00, 000\}$.
Notice that the size of the resulting language can be smaller than $|L_1| \cdot |L_2|$ if duplicates are formed: here $|L_1 L_2| = 3 < 4$, because $00$ arises both as $0 \cdot 0$ and as $\lambda \cdot 00$.
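In Python, with languages as sets of strings, this is a one-line set comprehension (a sketch; `concat` is our own name):

```python
def concat(L1: set[str], L2: set[str]) -> set[str]:
    """Concatenation of languages: { uv | u in L1, v in L2 }."""
    return {u + v for u in L1 for v in L2}

L1 = {"", "0"}    # "" plays the role of the empty word λ
L2 = {"0", "00"}
print(concat(L1, L2))     # {'0', '00', '000'}: three words, not four
print(len(L1) * len(L2))  # 4: the duplicate 00 collapses in the set
```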
Powers and the Kleene Star
We can extend powers to languages as well:
- $L^0 = \{\lambda\}$ (this is a convention), and $L^{i+1} = L^i L$ for $i \geq 0$.
This leads to one of the most important operations in formal language theory, the Kleene Star (or Kleene Closure).
The Kleene Star of a language $L$, denoted $L^*$, is the union of all its powers: $L^* = \bigcup_{i \geq 0} L^i$. It represents "zero or more" concatenations of words from $L$.
Similarly, the Kleene Plus, $L^+$, represents "one or more" concatenations: $L^+ = \bigcup_{i \geq 1} L^i = L L^*$.
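Since $L^*$ is infinite whenever $L$ contains a non-empty word, we can only ever materialize a finite slice of it. Here is a Python sketch that collects all words of $L^*$ up to a length bound by iterating the powers (the function name and truncation strategy are our own):

```python
def kleene_star_up_to(L: set[str], max_len: int) -> set[str]:
    """All words of L* with length at most max_len.
    Iterates L^0, L^1, L^2, ... and stops once a power adds nothing new."""
    result = {""}   # L^0 = {λ}
    power = {""}
    while True:
        # Next (truncated) power: extend each word by one word from L.
        power = {u + v for u in power for v in L if len(u + v) <= max_len}
        if power <= result:   # no new words can ever appear: converged
            return result
        result |= power

print(sorted(kleene_star_up_to({"0", "11"}, 4), key=lambda w: (len(w), w)))
# ['', '0', '00', '11', '000', '011', '110', '0000', '0011', '0110', '1100', '1111']
```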
Playing with the Algebra
We can now ask if familiar algebraic laws hold. For example, does concatenation distribute over union? That is, does $L_1(L_2 \cup L_3) = L_1 L_2 \cup L_1 L_3$?
Yes, it does. We can prove this by showing that any word in the left-hand set must also be in the right-hand set, and vice-versa. This often involves translating the language operations into their logical definitions.
What about intersection? Does $L_1(L_2 \cap L_3) = L_1 L_2 \cap L_1 L_3$? Let's check. The inclusion from left to right, $L_1(L_2 \cap L_3) \subseteq L_1 L_2 \cap L_1 L_3$, holds: if $w = uv$ with $u \in L_1$ and $v \in L_2 \cap L_3$, then $w \in L_1 L_2$ and $w \in L_1 L_3$.
But the other direction fails. A word might be in $L_1 L_2 \cap L_1 L_3$ but have different decompositions for each part. Counterexample: let $L_1 = \{\lambda, 0\}$, $L_2 = \{0\}$, and $L_3 = \{00\}$.
Let's compute both sides:
- Left side: $L_2 \cap L_3 = \emptyset$. So, $L_1(L_2 \cap L_3) = \emptyset$.
- Right side: $L_1 L_2 = \{0, 00\}$ and $L_1 L_3 = \{00, 000\}$, so $L_1 L_2 \cap L_1 L_3 = \{00\}$. The word $00$ gets in via two different decompositions: $0 \cdot 0$ on the $L_2$ side and $\lambda \cdot 00$ on the $L_3$ side.
Since $\emptyset \neq \{00\}$, the equality does not hold in general.
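The counterexample is easy to check mechanically (same set-of-strings representation as before; this particular triple of languages is our own choice, not a canonical one):

```python
def concat(L1: set[str], L2: set[str]) -> set[str]:
    """Concatenation of languages: { uv | u in L1, v in L2 }."""
    return {u + v for u in L1 for v in L2}

L1, L2, L3 = {"", "0"}, {"0"}, {"00"}     # "" stands for the empty word λ

left = concat(L1, L2 & L3)                # L1(L2 ∩ L3)
right = concat(L1, L2) & concat(L1, L3)   # L1L2 ∩ L1L3

print(left)   # set(), the empty language
print(right)  # {'00'}: 00 = 0·0 (via L2) and 00 = λ·00 (via L3)
```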
Problems, Programs, and Algorithms: An Informal Start
We’ve built our formal language. How do we use it to talk about algorithms before we’ve even defined what an algorithm is? We’ll start with an informal, but useful, distinction.
- Program: A text that is syntactically correct according to the rules of some programming language (like Java). A compiler can check this. A program might be nonsense, it might crash, or it might run forever.
- Algorithm: A special kind of program that solves a specific problem. For us, this means it’s a program that, for any valid input, is guaranteed to halt in finite time and produce a correct output.
This distinction is critical. We can write an algorithm (a compiler) to decide if a given text is a syntactically valid program. However, as we will prove later, it is impossible to write a general algorithm that decides if a given program is an algorithm (i.e., if it halts for all inputs).
We can check syntax automatically. We cannot, in general, check semantics (meaning, correctness, termination) automatically. This is one of the deepest results in computer science.
Continue here: 03 Formalizing Algorithmic Problems and Information