CS 139 - Programming Project One
Obtaining Regular Expressions for Automata

Drake University
Spring, 2006



Due date: 3/3




Project objective

    The objective of this project is to implement the "Elimination of States Method" to obtain a regular expression for an automaton. Although the regular language of the automaton is unique, the expression obtained for describing this language is not. However, the Elimination of States Method, when judiciously applied, tends to produce a more economical expression than some other known methods. Your implementation can be done using any programming language you wish. However, you may not use any built-in regular expression features of the language (as in Perl or Python - but I don't see how this could help any way), or in fact any special feature not commonly found in high level languages.



Details about the assignment


Caution: Regular expressions should be implemented as strings in your program. However, a regular expression represents a language, i.e. a set of strings. This can lead to potential confusion in the following discussion.

Step 1. Using a programming language of your choosing, start by writing a program that reads information from a text file, describing an automaton (NFA, not a GNFA), and which creates a matrix of strings that holds the initial regular expressions between the various states that are obtained from the labels of the transition arrow. When there are no arrows between two different states, then the string that specifies the regular expression between these should be string "p", representing the null language phi. We will let "e" be the regular expression that represents the language {epsilon}, that is, the set containing the empty string. When there are no explicit transition arrows from a state to itself (loops), then the corresponding regular expression could either be labeled "p" or "e" (representing phi or {epsilon}) since this won't affect the language. (You should understand this. I choose to label it "p" below.) When there is a single transition from a state to another state (or the same state) whose label is a symbol from the language, then the corresponding regular expression in the matrix should just be string of length one consisting of this symbol. If the transition is instead an epsilon-transition, then the regular expression should instead be "e". If there are multiple transitions between two states, then use a regular expression "(s1|s2| ...)" consisting of symbols (or "e") separated by vertical bars and enclosed in parentheses.

    The file that describes the automaton should simply consist of a list of integers. The first number indicates the number of states in the NFA being described. However, since you ultimately will want to be able to go through the complete state elimination (essentially using a GNFA), we want to have a non-accepting starting state that cannot be reentered and a unique accepting state that cannot be exited. This might not be true of the NFA being described in the file. So, instead of working with the original automaton, described by the file (detailed below), supplement this automaton with a new starting state with an an epsilon-transition to the old starting state, and with a new accepting state with epsilon transitions from the old accepting states (and make the old ones non-accepting). The matrix of regular expressions that you generate should describe this augmented automaton, not the original automaton described in the file. The states of the original automaton will be understood to be numbered 1, 2, 3, ..., n, with 1 being the starting state. The second number from the file indicates the number of accepting states, except that it is negative or positive, depending on whether or not the (original) starting state is accepting. Apart from this starting state, it is understood that the accepting states will be numbered less than the non-accepting states. For example, if the first two numbers read from the file are 7 and 3, then it is implied that states 2, 3 and 4 are accepting, while states 1, 5, 6 and 7 are non-accepting. But if the first two numbers in the file are 7 and -3, then states 1, 2 and 3 are accepting, while states 4, 5, 6 and 7 are non-accepting. Now, the augmented automaton that you actually work with should have 0 as its new starting state and n+1 as its new (and only) accepting state.

    The third number read from the file is the number of symbols in the language, and should be positive. If this number is n then it will be understood that the input symbols are 1, 2, ...., n. The fourth number read from the file is the number of transitions, which is then followed by a list of transitions. Each transition is simply represented by three numbers, the first being the number of the state where the transition originates, the second being the number of the state where the transition goes to, and the third being the input symbol that causes this transition or else is zero for an epsilon-transition.

    As a complete example of how the list of numbers in the file is to be interpretted, consider the four-state automaton whose states are labeled 1, 2, 3 and 4, with state 1 as the starting state, and with states 1 and 2 being accepting. Assume that the input symbols are 1 and 2 (and don't confuse these with the states 1 and 2). Suppose that there are the following transitions in the original automaton: In this case, the list of integers in the file would be 4, -2, 2, 8, 1, 2, 2, 1, 3, 2, 3, 3, 1, 3, 4, 1, 3, 4, 2, 4, 1, 2, 3, 4, 0, 4, 3, 0; although as a practical matter, you wouldn't want to include the commas in the file, and you might use carriage returns for better readability, like this:
    4 -2 2 8
    1 2 2
    1 3 2
    3 3 1
    3 4 1
    3 4 2
    4 1 2
    3 4 0
    4 3 0

    The matrix (two-dimensional array) of strings that you need to construct would be doubly indexed by the states of the automaton. The first index would indicate the source of the transition(s), and the second would represent the destination. If we call this matrix RegExpr, and use C (Java) notation, then in the above example we should end up with the following:
    RegExpr[0][0] = "e"
    RegExpr[0][1] = "e"
    RegExpr[0][2] = "p"
    RegExpr[0][3] = "p"
    RegExpr[0][4] = "p"
    RegExpr[0][5] = "p"
    RegExpr[1][0] = "p"
    RegExpr[1][1] = "e"
    RegExpr[1][2] = "2"
    RegExpr[1][3] = "2"
    RegExpr[1][4] = "p"
    RegExpr[1][5] = "e"
    RegExpr[2][0] = "p"
    RegExpr[2][1] = "p"
    RegExpr[2][2] = "e"
    RegExpr[2][3] = "p"
    RegExpr[2][4] = "p"
    RegExpr[2][5] = "e"
    RegExpr[3][0] = "p"
    RegExpr[3][1] = "p"
    RegExpr[3][2] = "p"
    RegExpr[3][3] = "1"
    RegExpr[3][4] = "1|2|e"
    RegExpr[3][5] = "p"
    RegExpr[4][0] = "p"
    RegExpr[4][1] = "2"
    RegExpr[4][2] = "p"
    RegExpr[4][3] = "e"
    RegExpr[4][4] = "e"
    RegExpr[4][5] = "p"
    RegExpr[5][0] = "p"
    RegExpr[5][1] = "p"
    RegExpr[5][2] = "p"
    RegExpr[5][3] = "p"
    RegExpr[5][4] = "p"
    RegExpr[5][5] = "e"
Check this. Again, notice that "p" means no transitions between states, while "e" signifies an epsilon-transition. Also, I augmented the original automaton by adding the new starting state and the new accepting state.


Step 2. Write three functions (= static methods) that take strings representing regular expressions as input and which return another such string. These should implement the three basic regular expression operations: union, concatenate and star. The function union should take two strings and concatenate them, with "|" in between. However, if one of the two arguments is "p", then the function should just return a copy of the other argument. For example,


The function concatenate should likewise take two input strings and concatenate these. However, if one of the two arguments is p, then it should just return p, while if one of the two arguments if e, it should return a copy of the other argument. For example,


(As a technical detail, we are regarding that star (*) takes precedence over concatenation, and that concatenation takes precedence over union (|). For example, 12|1* means (12)|(1*). We want to impose parentheses as little as possible. However, it was necessary in the first concatenation example to put them around the second argument ("1|2") before concatenating. This was not the case in the second example. In the first example, notice that if we scan the second argument from left to right, we encounter a vertical bar but no left parenthesis before that. This means the second operand needs to be wrapped in parentheses before concatenating. Similarly, scanning from right to left in the first argument, if we encountered a vertical bar before a right parenthesis, we would need to wrap the first argument in parentheses.)

Finally, the function star should just take a single string as input, wrap this in parentheses (unless it is a single character), and attach an asterisk to the end of it. However, if the input string is either e or p, then it should simply return (a fresh copy of) e. For example,


Test these functions thoroughly before proceding. (Another technical fact: C programmers will need to use malloc, Java programmers will need to use new String (similarly for C++ programmers), in order to dynamically allocate memory for the strings returned by these three functions.)

Step 3. Write a function called saveRegExpr that takes no arguments, returns nothing, but which copies the matrix regExpr to another such matrix holdRegExpr. Also write a function restoreRegExpr which does the reverse. Also write a function displayRegExpr that displays the contents of regExpr in a reasonably readable way (i.e. should be able to tell which regular expression belongs to which pair of states).

Step 4. Write a function eliminateState that takes an integer argument, representing a state, returns nothing, but which alters the matrix regExpr so as to capture the situation that results from eliminating the given state. Test this thoroughly.

Step 5. Write a function called completelyReduce that takes no arguments and returns nothing, but which iteratively uses eliminateState to eliminate all the states (in some order) except the starting state and the final (accepting) state. Have it also display the resulting regular expression between the starting and final start.