OHJ-2206 project topic 2009:
Determinization of NFAs

The problem

In mathematics, an alphabet is a finite set of elements that are called characters. A word (or string) is a finite number of characters picked from the alphabet put next to each other. The alphabet of the English language {a, b, c, …, z} is an alphabet also in the mathematical sense, and written words, such as "program", are words.

In mathematics, also the empty word that consists of zero characters is needed. It has been given the special symbol ε, that is, epsilon, because the logical way of writing it, that is, zero characters in a row, is invisible. Epsilon does not belong to the alphabet.

A nondeterministic finite automaton, that is, NFA, is a mathematical structure that consists of five parts:

namemathematical symbolcondition
statesQfinite set
alphabetΣ finite set that does not contain epsilon, that is, ε ∉ Σ
transitionsΔ each transition starts at the tail state, ends at the head state, and is labelled by an element of the alphabet or epsilon, that is, Δ ⊆ Q × (Σ∪{ε}) × Q
initial stateqˆ the initial state is a state, that is, qˆ ∈ Q
final statesF final states are states, that is F ⊆ Q

(In qˆ, the "ˆ" should be immediately above the "q", but that is not easy to do on a web page.)

An NFA accepts the word a1a2an, if and only if there is a path in the NFA that starts at the initial state, ends at a final state, and the sequence of the labels of the transitions along the path with epsilons removed is a1a2an. Thus epsilon as a name of a transition means that no character is added to the word at that transition, and if the label is not epsilon, then the label is added to the word at that transition.

However, usually the way of thinking is not that a path is traversed and the resulting word is printed; instead, it is that a word is read and it leads to some state of the NFA. This is why instead of producing characters and words, we speak about reading characters and accepting words.

The language of an NFA is the set of the words that it accepts.

A deterministic finite automaton, that is, DFA, is otherwise the same as an NFA, but the transitions obey the additional condition that there are no epsilon-transitions and no two transitions from the same state have the same label.

q: ∀a: ∀q1: ∀q2: ( (q,a,q1) ∈ Δ ∧ (q,a,q2) ∈ Δ ⇒ a ≠ ε ∧ q1 = q2 )

The set of transitions of a DFA is often denoted with a partial function δ such that (q,a,q′) ∈ Δ ⇔ q′ = δ(q,a).

Your program must read an NFA and print a DFA that accepts the same language as the input NFA. This operation is known as the determinization of an NFA.

The subset construction is used for the determinization of NFAs. In it, for every word, the set of states is computed to which reading the word starting at the initial state may take the NFA. These sets will be the states of the DFA. If the set contains at least one final state of the NFA, then the set becomes a final state of the DFA. The initial state of the DFA is the set of those states of the NFA that can be reached from the initial state of the NFA with epsilon-transitions. This set contains the initial state of the NFA, because it can be reached from itself with zero transitions.

The above-mentioned sets are not constructed separately for each word, because there are usually infinitely many words. The initial state of the DFA is constructed as was mentioned above. The other sets are constructed by taking an already constructed set, choosing some label a, and finding those states that can be reached from at least one state of the current set by reading a. This means that the new set contains precisely those states to which there is a path from some state in the old set that consists of zero or more epsilon-transitions followed by an a-transition followed by zero or more epsilon-transitions. If the new state is the same as some earlier set, then they are the same state of the DFA. To implement this, the new set must be compared to all sets that have already been constructed. From the state of the DFA that corresponds to the old set an a-transition is drawn to the state of the DFA that corresponds to the new set.

This construction has the problem that very many sets could be constructed and the DFA may thus become very big. This cannot be always avoided, because there are NFAs whose smallest corresponding DFA contains 2n–1 states, where n is the number of states in the NFA. However, sometimes the construction produces many more states than would be necessary, and simple improvements to the construction may reduce the number of sets significantly. Using this kind of improvements is considered as a merit, but one must be careful to not accidentally change the accepted language. That is, the output is allowed to contain fewer states than the subset construction produces, if it accepts precisely the same words.

The goal is to write a program whose running time and memory usage are as small as possible. The teacher does not look at the number of states in the output when assessing the performance of the program. Therefore, reducing the size of the DFA afterwards usually does not improve the program, because it does not reduce memory consumption at all, and it reduces time consumption only if the savings obtained from writing fewer states to the output outweigh the extra time consumed in the reduction.

One must assume that the input NFA usually contains much fewer transitions than an NFA with the same alphabet and set of states may maximally have.

A lot of information on NFAs and DFAs can be found on any introduction to theoretical computer science.

The input

The input consists of natural numbers. It consists of the following parts:

<The input>::= <Header> <Final states> <Transitions>
<Header>::= <number of states> <size of alphabet> <number of transitions> <number of final states> <initial state>
<Final states>::= <state>*
<Transitions>::= (<tail state> <label> <head state>)*

The states are given as numbers in the range 1, 2, …, <number of states>. The labels are numbered 1, 2, …, <size of alphabet>. Epsilon-transitions are labelled with 0. Final states are given in increasing order. Transitions are primarily given in the order of the numbers of tail states, secondarily in the order of the label, and lastly in the order of the numbers of head states. For instance, transition 1 5 7 is given before transition 2 1 1, and 2 1 1 is given before 2 1 2.

The output

The output obeys the same syntax as the input.

Read also general instructions of the projects.