CS 139 - Project Two
Parsing XML using a DOM Tree

Drake University
Spring, 2006



Due date: 4/3

Important: You are allowed to have one teammate (partner) on this project. Let me know if you wish to have a partner.



Project objective

    This project gives some practical exposure to using the eXtensible Markup Language (XML), which is becoming an essential part of web development, and in particular web-enabled database management. Data presented in an XML document is syntactically analyzed by comparing it to a context-free grammer defined in a Document-Type Definitions (DTD) file. The DTD file defines the component parts of the data, and potentially relationships between the data (which are themselves just data!), much as database management system would include a data definition language for this purpose.

    You might get a better sense of context-free grammers as a result, but you'll see that XML can only handle restricted context-free grammers, ones that are guaranteed to be unambiguous whenever they are valid. You will also get exposure to a validating XML parser called Xerces. This parser generates a data structure which is a rooted (but non-binary) tree called a Document Object Model (DOM) Tree. Each node of this tree is an object that can be used in a C++ program (a Java version is available too). Each node in this tree either represents an element in the XML file, or else an attribute of an element. You will experiment with all this, and pick up the lingo as you go.



Submitting

    Copy your XML and DTD files in to the directory
/home/public/cs139/drop
by means of a command like
cp ... /home/public/cs139/drop
where "..." must be replaced with the name of the file to be copied. You should name your file ...ProjectTwo.c where ... is replaced with the last names of the team members.



Required example text files

    You will need to do this project in C++ in the Sheppard Lab. The files you will require to get started on this project are contained in the directory
/home/public/cs139/project2
We will discuss the code in detail in class before you start working on this project. Start by copying these files to your own directory, one set up for this project. The files needed are as follows:


The README file say the following:
    Copy the source file myDOMExample.cpp to your working directory. In fact,
    copy all the files from home/public/cs139/project2 to your working
    directory by means of the following command (which along with the other
    commands that follow, you might prefer to copy and paste, rather than
    typing them yourself):

        cp /home/public/cs139/project2/* .

    Here is one approach to compiling and running myDOMExample. You'll need to
    set some system parameters. Using bash, you would type the following
    (which you can copy and paste into the terminal window):

        export XERCESCROOT=/home/public/xerces-c-src_2_5_0
        export XERCES_INCLUDE=$XERCESCROOT/include
        export XERCES_LIBRARY=$XERCESCROOT/lib
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$XERCES_LIBRARY

    Now compile the source code for my example using the command

        g++ -c -w -O -I$XERCES_INCLUDE myDOMExample.cpp -o myDOMExample.o

    Then link using

        g++ myDOMExample.o -L$XERCES_LIBRARY -lxerces-c -lc -o myDOMExample

    Assuming you have my sample .dtd and .xlm files in your working directory,
    try the following:

        ./myDOMExample structures.xml

    Please let me know right away if there's any trouble here.

    M. Rieck

The myDOMExample.cpp file looks like this:

	// MyDOMExample.cpp

	// Derived from an example (DOMWordCount.java) I found.

	// M. Rieck

	// 9/16/2000
        // updated 3/25/2004

#include < xercesc/util/PlatformUtils.hpp>
#include < xercesc/util/XMLString.hpp>
#include < xercesc/util/TranscodingException.hpp>
#include < xercesc/sax/SAXException.hpp>
#include < xercesc/sax/SAXParseException.hpp>
#include < xercesc/dom/DOM.hpp>
#include < xercesc/framework/XMLFormatter.hpp>
#include < xercesc/parsers/XercesDOMParser.hpp>
#include < stdlib.h>
#include < string>
#include < iostream>

using namespace std;
using namespace xercesc;

void process(const DOMNode& node);
void process_attributes(const DOMNode& node);
void process_children(const DOMNode& node);
void process_first_child(const DOMNode& node);

int main(int argC, char* argV[])
{
  try {

    XMLPlatformUtils::Initialize();
    XercesDOMParser *parser = new XercesDOMParser;

    for (int i = 1; i < argC; i++) {
      parser->parse(argV[i]);
      DOMNode& doc = *(parser->getDocument());
      process(doc);
    }
  }

  catch(const XMLException& toCatch) {

    cerr << toCatch.getMessage() << endl;

  }
}

void process(const DOMNode& node) {

  switch(node.getNodeType()) {

    case DOMNode::DOCUMENT_NODE:

      process_first_child(node);

    case DOMNode::ELEMENT_NODE:

      cout << "<";
      cout << XMLString::transcode(node.getNodeName());
      process_attributes(node);
      cout << ">" << endl;
      process_children(node);
      cout << "" << endl;
      break;

    case DOMNode::ATTRIBUTE_NODE:

      cout << XMLString::transcode(node.getNodeName());
      cout << "=\"";
      cout << XMLString::transcode(node.getNodeValue());
      cout << "\"" << endl;
      break;

    default: break;

  }
}

void process_attributes(const DOMNode& node) {

  DOMNamedNodeMap *nnm = node.getAttributes();
  if (nnm != NULL) {
    int numAttributes = nnm->getLength();
    for (int j = 0; j < numAttributes; j++) process(*(nnm->item(j)));
  }
}

void process_children(const DOMNode& node) {

  if (node.hasChildNodes()) {
    DOMNodeList& children = *(node.getChildNodes());
    for (int i = 0; i < children.getLength(); i++)
      process(*(children.item(i)));
  }
}

void process_first_child(const DOMNode& node) {

  if (node.hasChildNodes()) {
    DOMNodeList& children = *(node.getChildNodes());
    if (children.getLength() > 0) process(*(children.item(0)));
  }
}

The robotics.dtd file looks like this:
    < !-- Document Type Definitions for all robotics project XML           -->

    < !-- A ROBOT element includes numerous attributes. Some of these like -->
    < !-- NAME must be explicitly given. Some, like HEIGHT, are optional.  -->
    < !-- Some, like MAX_SPEED, are optional, but have default values.     -->

    < !ELEMENT ROBOT (#PCDATA) >
    < !ATTLIST ROBOT
      NAME CDATA #REQUIRED
      CREATION_DATE CDATA #IMPLIED
      HEIGHT CDATA #IMPLIED
      WEIGHT CDATA #IMPLIED
      COLOR CDATA #IMPLIED
      MAX_SPEED CDATA "10"
      ACCELERATION CDATA "2"
      DECELERATION CDATA "5"
      TURN_FACTOR CDATA "10"
      LEFT_TURN_FACTOR CDATA "50"
      SCAN_DISTANCE CDATA "10"
      STEP_UP_DELAY CDATA "3"
      STEP_UP_RISK_FACTOR CDATA "5"
      STEP_DOWN_DELAY CDATA "3"
      STEP_DOWN_RISK_FACTOR CDATA "5"
      RECOVERY_DELAY CDATA "20"
    >

    < !-- A ROBOTS_LIST is just a list of ROBOTs, but also has an optional  -->
    < !-- NAME and and optional CREATION_DATE. 			           -->

    < !ELEMENT ROBOTS_LIST (ROBOT*) >
    < !ATTLIST ROBOTS_LIST
      NAME CDATA #IMPLIED
      CREATION_DATE CDATA #IMPLIED
    >

    < !-- A RECTANGLE has two required dimensions. It can also have offset  -->
    < !-- values, for use when placing inside another rectangle (or inside  -->
    < !-- a rectangular STRUCTURE). The offsets default to zero, and this   -->
    < !-- corresponds to positioning in the extreme left and down position. -->

    < !ELEMENT RECTANGLE (#PCDATA) >
    < !ATTLIST RECTANGLE
      X_LENGTH CDATA #REQUIRED
      Y_LENGTH CDATA #REQUIRED
      X_OFFSET CDATA "0"
      Y_OFFSET CDATA "0"
    >

    < !-- A HOLE is really just a RECTANGLE (I guess). 		            -->

    < !ELEMENT HOLE (RECTANGLE) >

    < !-- A HOLE_LIST is a list of HOLEs.				            -->

    < !ELEMENT HOLES_LIST (HOLE*) >

    < !-- A STRUCTURE is, conceptually, a rectangular slab (1 unit high)    -->
    < !-- that might contain HOLEs as well as holding other STRUCTUREs.     -->

    < !ELEMENT STRUCTURE (RECTANGLE, STRUCTURES_LIST?, HOLES_LIST?) >
    < !ATTLIST STRUCTURE
      NAME CDATA #IMPLIED
      CREATION_DATE CDATA #IMPLIED
    >

    < !-- A STRUCTURES_LIST is just a list of STRUCTURESs, but also has an  -->
    < !-- optional NAME and and optional CREATION_DATE.		            -->

    < !ELEMENT STRUCTURES_LIST (STRUCTURE*) >
    < !ATTLIST STRUCTURES_LIST
      NAME CDATA #IMPLIED
      CREATION_DATE CDATA #IMPLIED
    >

The robots.xml file looks like this:
    < ?xml version="1.0"? >
    < !DOCTYPE ROBOTS_LIST SYSTEM "robotics.dtd" >

    < ROBOTS_LIST NAME="AllRobots" CREATION_DATE="09/15/2000" >

    < !-- The first ROBOT example has all of the default values.   -->

      < ROBOT
	NAME="Default"
      > Hi there. I am a friendly robot!</ROBOT >

    < !-- The next ROBOT, named "Robbie", has all attributes	  -->
    < !-- explicitly defined.					        -->

      < ROBOT
	NAME="Robbie"
	CREATION_DATE="09/10/2000"
	HEIGHT="22"
	WEIGHT="55"
	COLOR="green"
	MAX_SPEED="20"
	ACCELERATION="3"
	DECELERATION="7"
	TURN_FACTOR="33"
	LEFT_TURN_FACTOR="50"
        SCAN_DISTANCE="20"
        STEP_UP_DELAY="7"
        STEP_UP_RISK_FACTOR="4"
        STEP_DOWN_DELAY="9"
        STEP_DOWN_RISK_FACTOR="3"
        RECOVERY_DELAY="23"
      > < /ROBOT >

    < !-- The last ROBOT is "Marvin".                              -->

      < ROBOT
        NAME="Marvin"
        CREATION_DATE="09/15/2000"
        HEIGHT="15"
        WEIGHT="38"
        COLOR="purple"
        MAX_SPEED="15"
        ACCELERATION="1"
        DECELERATION="5"
      > < /ROBOT >

    < /ROBOTS_LIST >

The structures.xml file looks like this:
    < ?xml version="1.0"? >
    < !DOCTYPE STRUCTURES_LIST SYSTEM "robotics.dtd" >

    < STRUCTURES_LIST NAME="AllStructures" CREATION_DATE="09/15/2000" >

    < !-- The first structure is just a flat square slab.                   -->

      < STRUCTURE NAME="FlatLand" CREATION_DATE="09/15/2000" >
        < RECTANGLE X_LENGTH="500" Y_LENGTH="500" > < /RECTANGLE >
      < /STRUCTURE >

    < !-- The next structure has a big hole in the middle. Notice that      -->
    < !-- the 500x500 slab has a 400x400 hole in it. The horizontal offset  -->
    < !-- of 100 means that the hole is positioned 100 units to the right   -->
    < !-- of the extreme left positioning. Similarly, the vertical offset   -->
    < !-- of 100 means a shift up (north) from the extreme down (south)     -->
    < !-- position.                                                         -->

      < STRUCTURE NAME="Frame" CREATION_DATE="09/15/2000" >
        < RECTANGLE X_LENGTH="500" Y_LENGTH="500" > < /RECTANGLE >
        < HOLES_LIST >
          < HOLE >
            < RECTANGLE X_LENGTH="400" Y_LENGTH="400" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
          < /HOLE >
        < /HOLES_LIST >
      < /STRUCTURE >

    < !-- The next structure is a stack containing four slabs.              -->

      < STRUCTURE NAME="Pyramid" CREATION_DATE="09/15/2000" >
        < RECTANGLE X_LENGTH="400" Y_LENGTH="400" > < /RECTANGLE >
        < STRUCTURES_LIST >
          < STRUCTURE >
            < RECTANGLE X_LENGTH="300" Y_LENGTH="300" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
            < STRUCTURES_LIST >
              < STRUCTURE >
                < RECTANGLE X_LENGTH="200" Y_LENGTH="200" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
                < STRUCTURES_LIST >
                  < STRUCTURE >
                    < RECTANGLE X_LENGTH="100" Y_LENGTH="100" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
                  < /STRUCTURE >
                < /STRUCTURES_LIST >
              < /STRUCTURE >
            < /STRUCTURES_LIST >
          < /STRUCTURE >
        < /STRUCTURES_LIST >
      < /STRUCTURE >

    < !-- The last example is pretty complicated. Numerous slabs and holes. Can you draw it? -->

      < STRUCTURE NAME="RoboWorld" CREATION_DATE="09/10/2000" >
        < RECTANGLE X_LENGTH="500" Y_LENGTH="400" > < /RECTANGLE >
        < STRUCTURES_LIST >
          < STRUCTURE >
            < RECTANGLE X_LENGTH="200" Y_LENGTH="200" X_OFFSET="200" Y_OFFSET="100" > < /RECTANGLE >
            < STRUCTURES_LIST >
              < STRUCTURE >
                < RECTANGLE X_LENGTH="100" Y_LENGTH="100" X_OFFSET="50" Y_OFFSET="50" > < /RECTANGLE >
                < HOLES_LIST >
                  < HOLE >
                    < RECTANGLE X_LENGTH="10" Y_LENGTH="30" X_OFFSET="40" Y_OFFSET="20" > < /RECTANGLE >
                  < /HOLE >
                  < HOLE >
                    < RECTANGLE X_LENGTH="20" Y_LENGTH="20" X_OFFSET="70" Y_OFFSET="70" > < /RECTANGLE >
                  < /HOLE >
                < /HOLES_LIST >
              < /STRUCTURE >
              < STRUCTURE >
                < RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="1" Y_OFFSET="1" > < /RECTANGLE >
                < STRUCTURES_LIST >
                  < STRUCTURE >
                    < RECTANGLE X_LENGTH="1" Y_LENGTH="1" X_OFFSET="1" Y_OFFSET="1" > < /RECTANGLE >
                  < /STRUCTURE >
                < /STRUCTURES_LIST >
              < /STRUCTURE >
            < /STRUCTURES_LIST >
            < HOLES_LIST >
              < HOLE >
                < RECTANGLE X_LENGTH="10" Y_LENGTH="50" X_OFFSET="10" Y_OFFSET="10" > < /RECTANGLE >
              < /HOLE >
              < HOLE >
                < RECTANGLE X_LENGTH="180" Y_LENGTH="20" X_OFFSET="10" Y_OFFSET="50" > < /RECTANGLE >
              < /HOLE >
              < HOLE >
                < RECTANGLE X_LENGTH="190" Y_LENGTH="190" X_OFFSET="4" Y_OFFSET="3" > < /RECTANGLE >
              < /HOLE >
            < /HOLES_LIST >
          < /STRUCTURE >
          < STRUCTURE >
            < RECTANGLE X_LENGTH="100" Y_LENGTH="300" X_OFFSET="50" Y_OFFSET="50" > < /RECTANGLE >
          < /STRUCTURE >
        < /STRUCTURES_LIST >
        < HOLES_LIST >
          < HOLE >
            < RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="300" Y_OFFSET="350" > < /RECTANGLE >
          < /HOLE >
          < HOLE >
            < RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="310" Y_OFFSET="350" > < /RECTANGLE>
          < /HOLE >
          < HOLE >
            < RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="320" Y_OFFSET="350" > < /RECTANGLE>
          < /HOLE >
          < HOLE >
            < RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="330" Y_OFFSET="350" > < /RECTANGLE>
          < /HOLE >
        < /HOLES_LIST >
      < /STRUCTURE >

    < /STRUCTURES_LIST >

Details for the assignment


Part 1. Copy the various files described in the previous section into your directory for Project Two. Go through the steps in README, and check that you get the expected results. Pay particular attention to the order in which the information is presented, and the reason for this, viz. the program. You might want to try using myDOMExample on robots.xml first though, since this XML file is simpler than structures.xml. Submit nothing for grading here.

Part 2. Study the code in myDOMExample. If it succussfully processes a valid XML file, a DOM tree will be created. The function main() will pass the root node of this tree to the function process(). This function will identify the nature and details of the node passed to it, and will call the functions process_attributes() and process_children() to recursively process the child nodes of the passed node, where I am here regarding attributes as special child nodes (leaves) in the tree. Notice the order in which the nodes of the tree are processed, and the information which is displayed. Try out some other small DTD and XML files on your own to get a strong feel for what is happening here. Submit nothing.

Part 3. Write a DTD file to capture the following situation. A "part" can be characterized by name, weight and identification number attributes, where the first two attributes are optional, but the ID is not. A "parts" is a list of one or more "part" ' s. A "basic_assembly" has (optional) name, weight and (required) identification number atttibutes, and is comprised of a "parts" (list). A "widget" is (made up of) either a "part" or a "basic_assembly". A "complex_assembly" has (optional) name, weight and (required) identification number attributes. It consists of at least two "assembly" ' s. An "assembly" is (made up of) either a "widget" or a "complex_assembly". Submit a printout of your DTD file.

Note: XML will not actually allow a "widget" to be either a "part" or a "basic_assembly". It must rather consist of (be built form) either a "part" or a "basic_assembly". This is actually a serious downside to XML, which is therefore not really sufficiently expressive to capture context-free grammers in general. On the other hand, it allows valid XML to be unambiguously (deterministically) parsed.

Part 4. Write increasingly more complicated XML files to test out your DTD file. Test these using myDOMExample, and make sure that all the information is being displayed as expected. If not, something is syntactically wrong with either the DTD or the XML file. You should aim at ending up with an XML file for an "assembly" that is complicated enough to exhibit all the parts of all the definitions made in the DTD file. When you get there, submit a printout of your XML file.

Part 5. Hand draw (or use some software for this purpose if you wish) the DOM tree for your example in Part 4. Show the entity nodes and the attribute nodes. (I regard attribute nodes as special child nodes of their corresponding element nodes, and these child nodes are always leaves.)

Part 6. Write a program, based on MyDOMExample, that processes an XML file, creating a DOM tree. Have the program display only the CDATA that appears between double quotes that is assigned to attributes of elements. Have it do so in a postfix fashion, in the sense that attribute information for each child element is displayed before the attribute information for a parent element. Furthermore, have all this information displayed on one line (i.e. without introducing carriage returns). Test this out on my structures.xml example. Submit only the source code for your program. (This should be easy if you understand the original program.)

Part 7. Write a DTD file to represent the following context-free grammer which is similar to Example 2.3 on page 95 of Sipser's book:
     < EXPR > --> < TERM > | < SUM >
     < SUM >  --> < EXPR > < PLUS > < TERM >
     < TERM > --> < FACTOR > | < PRODUCT >
     < PRODUCT > --> < TERM > < TIMES > < FACTOR >
     < FACTOR > --> < IDENTIFIER > | < ENCLOSED_EXPRESSION >
     < ENCLOSED_EXPRESSION > --> < LEFT_PAREN> < EXPRESSION > < RIGHT_PAREN >
where < IDENTIFIER >, < PLUS >, < TIMES >, < LEFT_PAREN >, and < RIGHT_PAREN > are regarded as terminal symbols in the grammer, and < EXPR >, < TERM > and < FACTOR > are the variable symbols as in Example 2.3. A significant difference here is that we will let the < IDENTIFIER > tag represent an arbitrary "variable" in the sense of an algebra expression.

The < IDENTIFIER > tag in your DTD file should have a "name" attribute that will be set to the name of the variable (in the sense of an algebra expression) it represents. Also, the < PLUS >, < TIMES >, < LEFT_PAREN >, and < RIGHT_PAREN > tags should have a "symbol" attribute whose default values should respectively be: + , * , ( , ) . Also, the remaining tags (< EXPR >, < SUM >, < TERM >, < PRODUCT >, < FACTOR >, and < ENCLOSED_EXPRESSION >) should also have a "symbol" attribute, but here the default value should be the empty string.

    Write the DTD to capture the above situation. Then write an XML file that represents the expression
3 + X + 7 * ( Y * Z + 4 ) * X * X + ( 2 * Y + 5 * X * Z )
using the language you defined in the DTD file. Then use the program you wrote in Part 6 in order to read this XML file and display the above algebraic expression. Once this works properly, submit your DTD and XML files.

Part 8. Change your DTD file from Part 7 so that the default symbol for PLUS, TIMES, LEFT_PAREN, and RIGHT_PAREN is the null string, but the default symbol for < SUM > is +, and the default symbol for < PRODUCT > is * . Again run the program you wrote in Part 6. Indicate the result you got after making this change, describe these results and explain why they were expected. Submit only this, not the DTD code.