Before we can explore the grand ideas of what can and cannot be computed, we need to agree on a language. Not a programming language, but a formal, mathematical language that lets us speak about computation with absolute precision. Today is about laying that foundation. We’re building our vocabulary from the ground up, starting with the most basic element: the symbol.
Alphabet
Just like any written language, we begin with an alphabet. In our formal world, the definition is simple and clean:
An alphabet is any non-empty, finite set of symbols.
We usually denote an alphabet with the Greek letter $\Sigma$ (Sigma).
Let’s break down the two conditions:
- Non-empty: If we have no symbols, we can’t write anything down. We need at least one symbol to start describing things.
- Finite: This matches our intuition. The English alphabet has 26 letters. The binary system has 2 symbols. We work with a fixed, finite set of building blocks and combine them to create complexity.
The choice of alphabet depends entirely on what we want to describe.
Examples of Alphabets
- The binary alphabet: $\Sigma_{\text{bool}} = \{0, 1\}$
- The lowercase English alphabet: $\Sigma_{\text{lat}} = \{a, b, c, \dots, z\}$
- The set of decimal digits: $\Sigma_{10} = \{0, 1, 2, \dots, 9\}$
- The symbols on a keyboard (including the spacebar): $\Sigma_{\text{keyboard}}$
Words
Once we have an alphabet, we can form words.
A word over an alphabet $\Sigma$ is a finite sequence of symbols from $\Sigma$.
For example, if our alphabet is $\Sigma = \{0, 1\}$, then $01101$ is a word. When we write words, we simply concatenate the symbols. We don't need commas like $(0, 1, 1, 0, 1)$, because our alphabet consists of distinct, atomic symbols. There's no ambiguity.
The Empty Word
There is one special word: the empty word. It's a word with zero symbols, a sequence of length 0. We denote it with $\lambda$ (Lambda). It's a crucial concept, an object in its own right, not just "nothing."
Length of a Word
The length of a word $w$, denoted $|w|$, is the number of symbols in its sequence.
- If $w = 01101$ over $\Sigma = \{0, 1\}$, then $|w| = 5$. Note that $|\lambda| = 0$.
We can also count the occurrences of a specific symbol. We use $|w|_a$ to denote the number of times the symbol $a$ appears in $w$.
- For $w = 01101$, we have $|w|_0 = 2$ and $|w|_1 = 3$.
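In Python terms, if we represent words as strings (an assumption made purely for illustration), these two measures are just `len` and `str.count`:

```python
w = "01101"

print(len(w))        # |w| = 5
print(w.count("0"))  # |w|_0 = 2
print(w.count("1"))  # |w|_1 = 3
print(len(""))       # |λ| = 0: the empty word has length zero
```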
The Set of All Words: $\Sigma^*$
The set of all possible words that can be formed from an alphabet $\Sigma$ is denoted by $\Sigma^*$ (pronounced "Sigma-star"). This set is infinite, but every element within it is a finite-length word. $\Sigma^*$ always includes the empty word $\lambda$.
We also define $\Sigma^+$ as the set of all non-empty words over $\Sigma$. That is, $\Sigma^+ = \Sigma^* \setminus \{\lambda\}$.
Listing Words: Order Matters
How would you list all the words in $\Sigma^*$ for $\Sigma = \{0, 1\}$? A naive lexicographical approach runs into a problem: $\lambda, 0, 00, 000, 0000, \dots$ You'd never get to a word containing a $1$!
To properly enumerate all words, we use the canonical ordering (or shortlex order):
- First, order words by length, with shorter words coming first.
- For words of the same length, order them lexicographically (like in a dictionary).
For $\Sigma = \{0, 1\}$ with the order $0 < 1$, the canonical ordering is: $\lambda, 0, 1, 00, 01, 10, 11, 000, 001, \dots$
This method guarantees that every word will eventually appear in the list. This proves that $\Sigma^*$ is a countably infinite set, a fact with profound implications we'll see later.
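As a sanity check, here is a minimal Python sketch of the canonical enumeration (the function name `canonical_order` is our own; words are strings, with `""` standing in for $\lambda$):

```python
from itertools import count, islice, product

def canonical_order(alphabet: str):
    """Yield every word over `alphabet` in canonical (shortlex) order:
    by length first, then lexicographically in the given symbol order."""
    for length in count(0):                  # lengths 0, 1, 2, ...
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)           # length 0 yields "" (the empty word)

# The first eight words over {0, 1}:
print(list(islice(canonical_order("01"), 8)))
# ['', '0', '1', '00', '01', '10', '11', '000']
```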
An Algebra of Words
We can define operations on words, giving them a nice algebraic structure.
Concatenation
The most fundamental operation is concatenation, which is simply placing one word after another. If $u = 01$ and $v = 101$, their concatenation is $uv = 01101$.
This operation, together with the set $\Sigma^*$, forms a structure called a monoid:
- Associativity: For any words $u, v, w$, it holds that $(uv)w = u(vw)$. The way we group them doesn't matter.
- Neutral Element: The empty word is the neutral element. For any word $w$, $\lambda w = w\lambda = w$.
Reverse
The reverse of a word $w$, denoted $w^R$, is the word written backwards.
- If $w = a_1 a_2 \dots a_n$, then $w^R = a_n \dots a_2 a_1$.
- Example: $(01101)^R = 10110$.
A neat property to prove is that for any two words $u$ and $v$: $(uv)^R = v^R u^R$.
This is like putting on socks and then shoes; to reverse the process, you must take off the shoes first, then the socks.
Powers
We can define powers of a word through repeated concatenation:
- $w^0 = \lambda$, and $w^{i+1} = w^i w$ for $i \geq 0$. For example, $(01)^3 = 010101$.
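A quick Python sketch of these operations on strings (again under our string representation; slicing with `[::-1]` plays the role of $w^R$, and `*` gives powers):

```python
u, v = "01", "101"

# Concatenation and the monoid laws
assert (u + v) + "0" == u + (v + "0")  # associativity
assert "" + u == u + "" == u           # λ is the neutral element

# Reverse, and the "socks and shoes" property (uv)^R = v^R u^R
assert (u + v)[::-1] == v[::-1] + u[::-1]

# Powers by repeated concatenation
assert "01" * 3 == "010101"            # (01)^3
assert "01" * 0 == ""                  # w^0 = λ
```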
Deconstructing Words: Subwords, Prefixes, and Suffixes
We often need to talk about parts of a word.
- A prefix is an initial segment of a word. $v$ is a prefix of $w$ if $w = vx$ for some word $x$.
- A suffix is a final segment of a word. $v$ is a suffix of $w$ if $w = xv$ for some word $x$.
- A subword (or substring) is a contiguous block anywhere inside a word. $v$ is a subword of $w$ if $w = xvy$ for some words $x$ and $y$.
For the word $w = \text{banana}$:
- Prefixes: $\lambda$, b, ba, ban, bana, banan, banana
- Suffixes: $\lambda$, a, na, ana, nana, anana, banana
- Subwords: All prefixes and suffixes, plus others like nan and anan.
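These definitions translate directly into Python slicing (a sketch; the helper names are ours):

```python
def prefixes(w: str) -> list[str]:
    """All initial segments of w, from λ up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def suffixes(w: str) -> list[str]:
    """All final segments of w, from w itself down to λ."""
    return [w[i:] for i in range(len(w) + 1)]

def subwords(w: str) -> set[str]:
    """All contiguous blocks w[i:j] inside w, including λ and w."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

print(prefixes("banana"))          # ['', 'b', 'ba', 'ban', 'bana', 'banan', 'banana']
print("nan" in subwords("banana")) # True
```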
Languages: Sets of Words
Now we get to the main event. We use our vocabulary of words to define languages.
A language $L$ over an alphabet $\Sigma$ is any subset of $\Sigma^*$.
That’s it. A language is simply a set of words. This definition is incredibly broad and powerful.
Examples of Languages
- Let $\Sigma_{\text{bool}} = \{0, 1\}$. The language of prime numbers is the set $L_{\text{prim}} = \{w \in \Sigma_{\text{bool}}^* \mid \text{Number}(w) \text{ is prime}\}$, where $\text{Number}(w)$ is the number whose binary representation is $w$. This is a subset of $\Sigma_{\text{bool}}^*$.
- Let $\Sigma_{\text{keyboard}}$ be the alphabet of keyboard characters. The language $L_{\text{Java}}$ is the set of all syntactically correct Java programs.
- The empty set, $\emptyset$, is a language. It contains no words.
- The set $\{\lambda\}$, containing only the empty word, is also a language. Note that $\{\lambda\} \neq \emptyset$.
The crucial idea is this: We use languages to formally represent problems. The "primality testing problem" is equivalent to the "language membership problem" for $L_{\text{prim}}$: given a word $w$, is $w \in L_{\text{prim}}$? The task of a Java compiler is to decide if a given source file (a word) is in the language $L_{\text{Java}}$.
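To make "language membership" concrete, here is a small Python sketch of a decider for $L_{\text{prim}}$ (the helper names and the convention $\text{Number}(\lambda) = 0$ are our own assumptions):

```python
def number(w: str) -> int:
    """Interpret a word over {0, 1} as the binary number it spells."""
    return int(w, 2) if w else 0  # convention Number(λ) = 0 (our assumption)

def is_prime(n: int) -> bool:
    """Trial division: slow, but fine for an illustration."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def in_L_prim(w: str) -> bool:
    """Decide the membership problem for L_prim: is Number(w) prime?"""
    return is_prime(number(w))

print(in_L_prim("101"))   # Number(101) = 5, prime      -> True
print(in_L_prim("1001"))  # Number(1001) = 9 = 3 * 3    -> False
```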
An Algebra of Languages
Since languages are sets, we can use standard set operations like union ($\cup$), intersection ($\cap$), and complement ($\Sigma^* \setminus L$). But we can also extend our word operations to languages.
Concatenation of Languages
The concatenation of two languages $L_1$ and $L_2$ is the set of all words formed by taking a word from $L_1$ and sticking a word from $L_2$ onto its end: $L_1 L_2 = \{uv \mid u \in L_1, v \in L_2\}$.
Example: Let $L_1 = \{\lambda, 0\}$ and $L_2 = \{0, 00\}$. Then $L_1 L_2 = \{0, 00, 000\}$.
Notice that the size of the resulting language can be smaller than $|L_1| \cdot |L_2|$ if duplicates are formed: here $|L_1 L_2| = 3 < 4$, because $00$ arises both as $0 \cdot 0$ and as $\lambda \cdot 00$.
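In Python, with languages as sets of strings, this is a one-line set comprehension (a sketch; `concat` is our own name):

```python
def concat(L1: set[str], L2: set[str]) -> set[str]:
    """Concatenation of languages: { uv | u in L1, v in L2 }."""
    return {u + v for u in L1 for v in L2}

L1 = {"", "0"}    # "" plays the role of the empty word λ
L2 = {"0", "00"}
print(concat(L1, L2))     # {'0', '00', '000'}: three words, not four
print(len(L1) * len(L2))  # 4: the duplicate 00 collapses in the set
```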
Powers and the Kleene Star
We can extend powers to languages as well:
- $L^0 = \{\lambda\}$ (this is a convention), and $L^{i+1} = L^i L$ for $i \geq 0$.
This leads to one of the most important operations in formal language theory, the Kleene Star (or Kleene Closure).
The Kleene Star of a language $L$, denoted $L^*$, is the union of all its powers: $L^* = \bigcup_{i \geq 0} L^i$. It represents "zero or more" concatenations of words from $L$.
Similarly, the Kleene Plus, $L^+$, represents "one or more" concatenations: $L^+ = \bigcup_{i \geq 1} L^i = L L^*$.
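Since $L^*$ is infinite whenever $L$ contains a non-empty word, we can only ever materialize a finite slice of it. Here is a Python sketch that collects all words of $L^*$ up to a length bound by iterating the powers (the function name and truncation strategy are our own):

```python
def kleene_star_up_to(L: set[str], max_len: int) -> set[str]:
    """All words of L* with length at most max_len.
    Iterates L^0, L^1, L^2, ... and stops once a power adds nothing new."""
    result = {""}   # L^0 = {λ}
    power = {""}
    while True:
        # Next (truncated) power: extend each word by one word from L.
        power = {u + v for u in power for v in L if len(u + v) <= max_len}
        if power <= result:   # no new words can ever appear: converged
            return result
        result |= power

print(sorted(kleene_star_up_to({"0", "11"}, 4), key=lambda w: (len(w), w)))
# ['', '0', '00', '11', '000', '011', '110', '0000', '0011', '0110', '1100', '1111']
```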
Playing with the Algebra
We can now ask if familiar algebraic laws hold. For example, does concatenation distribute over union? That is, does $L_1(L_2 \cup L_3) = L_1 L_2 \cup L_1 L_3$?
Yes, it does. We can prove this by showing that any word in the left-hand set must also be in the right-hand set, and vice-versa. This often involves translating the language operations into their logical definitions.
What about intersection? Does $L_1(L_2 \cap L_3) = L_1 L_2 \cap L_1 L_3$? Let's check. The inclusion from left to right, $L_1(L_2 \cap L_3) \subseteq L_1 L_2 \cap L_1 L_3$, holds: if $w = uv$ with $u \in L_1$ and $v \in L_2 \cap L_3$, then $w \in L_1 L_2$ and $w \in L_1 L_3$.
But the other direction fails. A word might be in $L_1 L_2 \cap L_1 L_3$ but have different decompositions for each part. Counterexample: let $L_1 = \{\lambda, 0\}$, $L_2 = \{0\}$, and $L_3 = \{00\}$.
Let's compute both sides:
- Left side: $L_2 \cap L_3 = \emptyset$. So, $L_1(L_2 \cap L_3) = \emptyset$.
- Right side: $L_1 L_2 = \{0, 00\}$ and $L_1 L_3 = \{00, 000\}$, so $L_1 L_2 \cap L_1 L_3 = \{00\}$. The word $00$ gets in via two different decompositions: $0 \cdot 0$ on the $L_2$ side and $\lambda \cdot 00$ on the $L_3$ side.
Since $\emptyset \neq \{00\}$, the equality does not hold in general.
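The counterexample is easy to check mechanically (same set-of-strings representation as before; this particular triple of languages is our own choice, not a canonical one):

```python
def concat(L1: set[str], L2: set[str]) -> set[str]:
    """Concatenation of languages: { uv | u in L1, v in L2 }."""
    return {u + v for u in L1 for v in L2}

L1, L2, L3 = {"", "0"}, {"0"}, {"00"}     # "" stands for the empty word λ

left = concat(L1, L2 & L3)                # L1(L2 ∩ L3)
right = concat(L1, L2) & concat(L1, L3)   # L1L2 ∩ L1L3

print(left)   # set(), the empty language
print(right)  # {'00'}: 00 = 0·0 (via L2) and 00 = λ·00 (via L3)
```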
Problems, Programs, and Algorithms: An Informal Start
We’ve built our formal language. How do we use it to talk about algorithms before we’ve even defined what an algorithm is? We’ll start with an informal, but useful, distinction.
- Program: A text that is syntactically correct according to the rules of some programming language (like Java). A compiler can check this. A program might be nonsense, it might crash, or it might run forever.
- Algorithm: A special kind of program that solves a specific problem. For us, this means it’s a program that, for any valid input, is guaranteed to halt in finite time and produce a correct output.
This distinction is critical. We can write an algorithm (a compiler) to decide if a given text is a syntactically valid program. However, as we will prove later, it is impossible to write a general algorithm that decides if a given program is an algorithm (i.e., if it halts for all inputs).
We can check syntax automatically. We cannot, in general, check semantics (meaning, correctness, termination) automatically. This is one of the deepest results in computer science.
Continue here: 03 Formalizing Algorithmic Problems and Information