Project objective
    The objective of this project is to implement the "Elimination of States Method"
to obtain a regular expression for an automaton. Although
the regular language of the automaton is unique, the expression obtained for describing this language is
not. However, the Elimination of States Method, when judiciously applied, tends to produce a more
economical expression than some other known methods. Your implementation can be done using any programming
language you wish. However, you may not use any built-in regular expression features of the language
(as in Perl or Python - but I don't see how this could help any way), or
in fact any special feature not commonly found in high level languages.
Details about the assignment
Caution: Regular expressions should be implemented as strings in your program. However, a
regular expression represents a language, i.e. a set of strings. This can lead to
potential confusion in the following discussion.
Step 1. Using a programming language of your choosing,
start by writing a program that reads information from a text file,
describing an automaton (NFA, not a GNFA), and which creates a matrix
of strings that holds the initial regular expressions between the
various states that are obtained from the labels of the transition
arrow. When there are no arrows between two different states, then the
string that specifies the regular expression between these should be
string "p", representing the null language phi. We will let
"e" be the regular expression that represents the language
{epsilon}, that is, the set containing the empty string. When there are
no explicit transition arrows from a state to itself (loops), then the
corresponding regular expression could either be labeled "p" or
"e" (representing phi or {epsilon}) since this won't affect the
language. (You should understand this. I choose to label it "p"
below.) When there is a single transition from a state to another state
(or the same state) whose label is a symbol from the language, then the
corresponding regular expression in the matrix should just be string of
length one consisting of this symbol. If the transition is instead an
epsilon-transition, then the regular expression should instead be
"e". If there are multiple transitions between two states, then
use a regular expression
"(s1|s2|
...)" consisting of symbols (or "e") separated by
vertical bars and enclosed in parentheses.
    The file that describes the automaton should simply consist of a list of integers. The first number
indicates the number of states in the NFA being described. However, since you ultimately will want to
be able to go through the complete state elimination (essentially using a GNFA),
we want to have a non-accepting starting state that cannot be
reentered and a unique accepting state that cannot be exited. This might not be true of the NFA
being described in the file. So, instead of working with the
original automaton, described by the file (detailed below),
supplement this automaton with a new starting state with an
an epsilon-transition to the old starting state, and with a new accepting state with epsilon transitions
from the old accepting states (and make the old ones non-accepting). The matrix of regular expressions
that you generate should describe this augmented automaton, not the original automaton
described in the file.
The states of the original automaton will be understood to be numbered 1, 2, 3, ..., n, with 1
being the starting state. The second
number from the file indicates the number of accepting states, except that it is negative or positive, depending
on whether or not the (original) starting state is accepting. Apart from this starting state, it is understood that the
accepting states will be numbered less than the non-accepting states. For example, if the first two numbers
read from the file are 7 and 3, then it is implied that states 2, 3 and 4 are accepting, while states 1, 5, 6
and 7 are non-accepting. But if the first two numbers in the file are 7 and -3, then states 1, 2 and 3 are
accepting, while states 4, 5, 6 and 7 are non-accepting. Now, the augmented automaton that you actually
work with should have 0 as its new starting state and n+1 as its new (and only) accepting state.
    The third number read from the file is the number of symbols in the language, and should be positive.
If this number is n then it will be understood that the input symbols are 1, 2, ...., n.
The fourth number read from the file is the number of transitions,
which is then followed by a list of transitions.
Each transition is simply represented by three numbers,
the first being the number of the state where the transition originates,
the second being the number of the
state where the transition goes to, and the third being the input symbol that causes this transition
or else is zero for an epsilon-transition.
    As a complete example of how the list of numbers in the file is to be interpretted, consider the
four-state automaton whose states are labeled 1, 2, 3 and 4, with state 1 as the starting state, and with states 1 and 2 being
accepting. Assume that the input symbols are 1 and 2 (and don't confuse these with the states 1 and 2).
Suppose that there are the following transitions in the original automaton:
- from state 1 to state 2 with input symbol 2,
- from state 1 to state 3 with input symbol 2,
- from state 3 to state 3 with input symbol 1,
- from state 3 to state 4 with input symbol 1,
- from state 3 to state 4 with input symbol 2,
- from state 4 to state 1 with input symbol 2.
- epsilon-transition from state 3 to state 4.
- epsilon-transition from state 4 to state 3.
In this case, the list of integers in the file would be 4, -2, 2, 8, 1, 2, 2, 1, 3, 2, 3, 3, 1, 3, 4, 1, 3, 4, 2,
4, 1, 2, 3, 4, 0, 4, 3, 0; although as a practical matter, you wouldn't want to include the commas in the file, and you might
use carriage returns for better readability, like this:
4 -2 2 8
1 2 2
1 3 2
3 3 1
3 4 1
3 4 2
4 1 2
3 4 0
4 3 0
    The matrix (two-dimensional array) of strings that you need to construct would be doubly indexed by
the states of the automaton. The first index would indicate the source of the transition(s), and the second
would represent the destination. If we call this matrix RegExpr, and use C (Java) notation, then in
the above example we should end up with the following:
RegExpr[0][0] = "e"
RegExpr[0][1] = "e"
RegExpr[0][2] = "p"
RegExpr[0][3] = "p"
RegExpr[0][4] = "p"
RegExpr[0][5] = "p"
RegExpr[1][0] = "p"
RegExpr[1][1] = "e"
RegExpr[1][2] = "2"
RegExpr[1][3] = "2"
RegExpr[1][4] = "p"
RegExpr[1][5] = "e"
RegExpr[2][0] = "p"
RegExpr[2][1] = "p"
RegExpr[2][2] = "e"
RegExpr[2][3] = "p"
RegExpr[2][4] = "p"
RegExpr[2][5] = "e"
RegExpr[3][0] = "p"
RegExpr[3][1] = "p"
RegExpr[3][2] = "p"
RegExpr[3][3] = "1"
RegExpr[3][4] = "1|2|e"
RegExpr[3][5] = "p"
RegExpr[4][0] = "p"
RegExpr[4][1] = "2"
RegExpr[4][2] = "p"
RegExpr[4][3] = "e"
RegExpr[4][4] = "e"
RegExpr[4][5] = "p"
RegExpr[5][0] = "p"
RegExpr[5][1] = "p"
RegExpr[5][2] = "p"
RegExpr[5][3] = "p"
RegExpr[5][4] = "p"
RegExpr[5][5] = "e"
Check this. Again, notice that "p" means no transitions between states, while "e"
signifies an epsilon-transition.
Also, I augmented the original automaton by adding the new starting state and the new accepting state.
Step 2. Write three functions (= static methods) that take strings representing regular expressions as
input and which return another such string. These should implement the three basic regular expression operations:
union, concatenate and star. The function union should take two strings and
concatenate them, with "|" in between. However, if one of the
two arguments is "p", then the function should just return a copy of the other argument.
For example,
- union("121", "1|2") should result in the string "121|1|2"
- union("121", "p") should return "121".
The function concatenate should likewise take two input strings and concatenate these.
However, if one of the two arguments is p, then it should just return p, while
if one of the two arguments if e, it should return a copy of the other argument.
For example,
- concatenate("121", "1|2") should result in the string "121(1|2)"
- concatenate("121", "1(1|2)") should result in the string "1211(1|2)"
- concatenate("121", "1*") should result in the string "1211*"
- concatenate("121", "p") should result in the string "p"
- concatenate("121", "e") should result in the string "121"
(As a technical detail, we are regarding that star (*) takes precedence over concatenation,
and that concatenation takes precedence over union (|). For example,
12|1* means (12)|(1*). We want to impose parentheses as
little as possible. However, it was necessary in the first concatenation example to put them around the second
argument ("1|2") before concatenating. This was not the case in the second example.
In the first example, notice that if we scan the second argument from left to right,
we encounter a vertical bar but no left parenthesis before that. This means the second operand needs to
be wrapped in parentheses before concatenating. Similarly, scanning from right to left in the first
argument, if we encountered a vertical bar before a right parenthesis, we would need to wrap the first
argument in parentheses.)
Finally, the function star should just take a single string as input, wrap this in parentheses
(unless it is a single character), and attach an asterisk to the end of it. However, if the input string is
either e or p, then it should simply return (a fresh copy of) e. For example,
- star("p") should result in the string "e".
- star("e") should result in the string "e".
- star("2") should result in the string "2*".
- star("12") should result in the string "(12)*".
- star("1|2") should result in the string "(1|2)*".
Test these functions thoroughly before proceding. (Another technical fact:
C programmers will need to use malloc, Java programmers will need to use new String
(similarly for C++ programmers),
in order to dynamically allocate memory for the strings returned by these three functions.)
Step 3.
Write a function called saveRegExpr that takes no arguments, returns nothing, but which copies the matrix
regExpr to another such matrix holdRegExpr. Also write a function restoreRegExpr
which does the reverse. Also write a function displayRegExpr that displays the contents of regExpr
in a reasonably readable way (i.e. should be able to tell which regular expression belongs to which
pair of states).
Step 4. Write a function eliminateState that takes an integer argument,
representing a state, returns nothing, but which alters the matrix regExpr so as to capture the
situation that results from eliminating the given state. Test this thoroughly.
Step 5. Write a function called completelyReduce that takes no arguments and
returns nothing, but which iteratively uses eliminateState to eliminate all the
states (in some order) except the starting state and the final (accepting) state.
Have it also display the resulting regular expression between the starting and final start.