CS 139 - Project Two
Parsing XML using a DOM Tree
Drake University
Spring, 2006
Due date: 4/3
Important: You are allowed to have one teammate (partner) on this
project. Let me know if you wish to have a partner.
Project objective
    This project gives some practical exposure to using the eXtensible Markup Language (XML), which is
becoming an essential part of web development, and in particular
web-enabled database management. Data presented in an XML document is syntactically analyzed by comparing it
to a context-free grammer defined in a Document-Type Definitions (DTD) file. The DTD file defines the component
parts of the data, and potentially relationships between the data (which are themselves just data!),
much as database management system would include a data definition language for this purpose.
    You might get a better sense of context-free grammers as a result, but you'll see that XML can only handle
restricted context-free grammers, ones that are guaranteed to be unambiguous whenever they are valid. You will
also get exposure to a validating XML parser called Xerces. This parser generates a data structure which is
a rooted (but non-binary) tree called a Document Object Model (DOM) Tree. Each node of this tree is an object that
can be used in a C++ program (a Java version is available too). Each node in this tree either represents an
element in the XML file, or else an attribute of an element. You will experiment with all this,
and pick up the lingo as you go.
Submitting
    Copy your XML and DTD files in to the directory
/home/public/cs139/drop
by means of a command like
cp ... /home/public/cs139/drop
where "..." must be replaced with the name of the file to be copied.
You should name your file ...ProjectTwo.c where ... is
replaced with the last names of the team members.
Required example text files
    You will need to do this project in C++ in the Sheppard Lab.
The files you will require to get started on this project are contained in the directory
/home/public/cs139/project2
We will discuss the code in detail in class before you start working on this project.
Start by copying these files to your own directory, one set up for this project. The files needed are as follows:
- README
- myDOMProject.cpp
- robotics.dtd
- robots.xml
- structures.xml
The README file say the following:
Copy the source file myDOMExample.cpp to your working directory. In fact,
copy all the files from home/public/cs139/project2 to your working
directory by means of the following command (which along with the other
commands that follow, you might prefer to copy and paste, rather than
typing them yourself):
cp /home/public/cs139/project2/* .
Here is one approach to compiling and running myDOMExample. You'll need to
set some system parameters. Using bash, you would type the following
(which you can copy and paste into the terminal window):
export XERCESCROOT=/home/public/xerces-c-src_2_5_0
export XERCES_INCLUDE=$XERCESCROOT/include
export XERCES_LIBRARY=$XERCESCROOT/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$XERCES_LIBRARY
Now compile the source code for my example using the command
g++ -c -w -O -I$XERCES_INCLUDE myDOMExample.cpp -o myDOMExample.o
Then link using
g++ myDOMExample.o -L$XERCES_LIBRARY -lxerces-c -lc -o myDOMExample
Assuming you have my sample .dtd and .xlm files in your working directory,
try the following:
./myDOMExample structures.xml
Please let me know right away if there's any trouble here.
M. Rieck
The myDOMExample.cpp file looks like this:
// MyDOMExample.cpp
// Derived from an example (DOMWordCount.java) I found.
// M. Rieck
// 9/16/2000
// updated 3/25/2004
#include < xercesc/util/PlatformUtils.hpp>
#include < xercesc/util/XMLString.hpp>
#include < xercesc/util/TranscodingException.hpp>
#include < xercesc/sax/SAXException.hpp>
#include < xercesc/sax/SAXParseException.hpp>
#include < xercesc/dom/DOM.hpp>
#include < xercesc/framework/XMLFormatter.hpp>
#include < xercesc/parsers/XercesDOMParser.hpp>
#include < stdlib.h>
#include < string>
#include < iostream>
using namespace std;
using namespace xercesc;
void process(const DOMNode& node);
void process_attributes(const DOMNode& node);
void process_children(const DOMNode& node);
void process_first_child(const DOMNode& node);
int main(int argC, char* argV[])
{
try {
XMLPlatformUtils::Initialize();
XercesDOMParser *parser = new XercesDOMParser;
for (int i = 1; i < argC; i++) {
parser->parse(argV[i]);
DOMNode& doc = *(parser->getDocument());
process(doc);
}
}
catch(const XMLException& toCatch) {
cerr << toCatch.getMessage() << endl;
}
}
void process(const DOMNode& node) {
switch(node.getNodeType()) {
case DOMNode::DOCUMENT_NODE:
process_first_child(node);
case DOMNode::ELEMENT_NODE:
cout << "<";
cout << XMLString::transcode(node.getNodeName());
process_attributes(node);
cout << ">" << endl;
process_children(node);
cout << "";
cout << XMLString::transcode(node.getNodeName());
cout << ">" << endl;
break;
case DOMNode::ATTRIBUTE_NODE:
cout << XMLString::transcode(node.getNodeName());
cout << "=\"";
cout << XMLString::transcode(node.getNodeValue());
cout << "\"" << endl;
break;
default: break;
}
}
void process_attributes(const DOMNode& node) {
DOMNamedNodeMap *nnm = node.getAttributes();
if (nnm != NULL) {
int numAttributes = nnm->getLength();
for (int j = 0; j < numAttributes; j++) process(*(nnm->item(j)));
}
}
void process_children(const DOMNode& node) {
if (node.hasChildNodes()) {
DOMNodeList& children = *(node.getChildNodes());
for (int i = 0; i < children.getLength(); i++)
process(*(children.item(i)));
}
}
void process_first_child(const DOMNode& node) {
if (node.hasChildNodes()) {
DOMNodeList& children = *(node.getChildNodes());
if (children.getLength() > 0) process(*(children.item(0)));
}
}
The robotics.dtd file looks like this:
< !-- Document Type Definitions for all robotics project XML -->
< !-- A ROBOT element includes numerous attributes. Some of these like -->
< !-- NAME must be explicitly given. Some, like HEIGHT, are optional. -->
< !-- Some, like MAX_SPEED, are optional, but have default values. -->
< !ELEMENT ROBOT (#PCDATA) >
< !ATTLIST ROBOT
NAME CDATA #REQUIRED
CREATION_DATE CDATA #IMPLIED
HEIGHT CDATA #IMPLIED
WEIGHT CDATA #IMPLIED
COLOR CDATA #IMPLIED
MAX_SPEED CDATA "10"
ACCELERATION CDATA "2"
DECELERATION CDATA "5"
TURN_FACTOR CDATA "10"
LEFT_TURN_FACTOR CDATA "50"
SCAN_DISTANCE CDATA "10"
STEP_UP_DELAY CDATA "3"
STEP_UP_RISK_FACTOR CDATA "5"
STEP_DOWN_DELAY CDATA "3"
STEP_DOWN_RISK_FACTOR CDATA "5"
RECOVERY_DELAY CDATA "20"
>
< !-- A ROBOTS_LIST is just a list of ROBOTs, but also has an optional -->
< !-- NAME and and optional CREATION_DATE. -->
< !ELEMENT ROBOTS_LIST (ROBOT*) >
< !ATTLIST ROBOTS_LIST
NAME CDATA #IMPLIED
CREATION_DATE CDATA #IMPLIED
>
< !-- A RECTANGLE has two required dimensions. It can also have offset -->
< !-- values, for use when placing inside another rectangle (or inside -->
< !-- a rectangular STRUCTURE). The offsets default to zero, and this -->
< !-- corresponds to positioning in the extreme left and down position. -->
< !ELEMENT RECTANGLE (#PCDATA) >
< !ATTLIST RECTANGLE
X_LENGTH CDATA #REQUIRED
Y_LENGTH CDATA #REQUIRED
X_OFFSET CDATA "0"
Y_OFFSET CDATA "0"
>
< !-- A HOLE is really just a RECTANGLE (I guess). -->
< !ELEMENT HOLE (RECTANGLE) >
< !-- A HOLE_LIST is a list of HOLEs. -->
< !ELEMENT HOLES_LIST (HOLE*) >
< !-- A STRUCTURE is, conceptually, a rectangular slab (1 unit high) -->
< !-- that might contain HOLEs as well as holding other STRUCTUREs. -->
< !ELEMENT STRUCTURE (RECTANGLE, STRUCTURES_LIST?, HOLES_LIST?) >
< !ATTLIST STRUCTURE
NAME CDATA #IMPLIED
CREATION_DATE CDATA #IMPLIED
>
< !-- A STRUCTURES_LIST is just a list of STRUCTURESs, but also has an -->
< !-- optional NAME and and optional CREATION_DATE. -->
< !ELEMENT STRUCTURES_LIST (STRUCTURE*) >
< !ATTLIST STRUCTURES_LIST
NAME CDATA #IMPLIED
CREATION_DATE CDATA #IMPLIED
>
The robots.xml file looks like this:
< ?xml version="1.0"? >
< !DOCTYPE ROBOTS_LIST SYSTEM "robotics.dtd" >
< ROBOTS_LIST NAME="AllRobots" CREATION_DATE="09/15/2000" >
< !-- The first ROBOT example has all of the default values. -->
< ROBOT
NAME="Default"
> Hi there. I am a friendly robot!</ROBOT >
< !-- The next ROBOT, named "Robbie", has all attributes -->
< !-- explicitly defined. -->
< ROBOT
NAME="Robbie"
CREATION_DATE="09/10/2000"
HEIGHT="22"
WEIGHT="55"
COLOR="green"
MAX_SPEED="20"
ACCELERATION="3"
DECELERATION="7"
TURN_FACTOR="33"
LEFT_TURN_FACTOR="50"
SCAN_DISTANCE="20"
STEP_UP_DELAY="7"
STEP_UP_RISK_FACTOR="4"
STEP_DOWN_DELAY="9"
STEP_DOWN_RISK_FACTOR="3"
RECOVERY_DELAY="23"
> < /ROBOT >
< !-- The last ROBOT is "Marvin". -->
< ROBOT
NAME="Marvin"
CREATION_DATE="09/15/2000"
HEIGHT="15"
WEIGHT="38"
COLOR="purple"
MAX_SPEED="15"
ACCELERATION="1"
DECELERATION="5"
> < /ROBOT >
< /ROBOTS_LIST >
The structures.xml file looks like this:
< ?xml version="1.0"? >
< !DOCTYPE STRUCTURES_LIST SYSTEM "robotics.dtd" >
< STRUCTURES_LIST NAME="AllStructures" CREATION_DATE="09/15/2000" >
< !-- The first structure is just a flat square slab. -->
< STRUCTURE NAME="FlatLand" CREATION_DATE="09/15/2000" >
< RECTANGLE X_LENGTH="500" Y_LENGTH="500" > < /RECTANGLE >
< /STRUCTURE >
< !-- The next structure has a big hole in the middle. Notice that -->
< !-- the 500x500 slab has a 400x400 hole in it. The horizontal offset -->
< !-- of 100 means that the hole is positioned 100 units to the right -->
< !-- of the extreme left positioning. Similarly, the vertical offset -->
< !-- of 100 means a shift up (north) from the extreme down (south) -->
< !-- position. -->
< STRUCTURE NAME="Frame" CREATION_DATE="09/15/2000" >
< RECTANGLE X_LENGTH="500" Y_LENGTH="500" > < /RECTANGLE >
< HOLES_LIST >
< HOLE >
< RECTANGLE X_LENGTH="400" Y_LENGTH="400" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
< /HOLE >
< /HOLES_LIST >
< /STRUCTURE >
< !-- The next structure is a stack containing four slabs. -->
< STRUCTURE NAME="Pyramid" CREATION_DATE="09/15/2000" >
< RECTANGLE X_LENGTH="400" Y_LENGTH="400" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="300" Y_LENGTH="300" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="200" Y_LENGTH="200" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="100" Y_LENGTH="100" X_OFFSET="100" Y_OFFSET="100" > < /RECTANGLE >
< /STRUCTURE >
< /STRUCTURES_LIST >
< /STRUCTURE >
< /STRUCTURES_LIST >
< /STRUCTURE >
< /STRUCTURES_LIST >
< /STRUCTURE >
< !-- The last example is pretty complicated. Numerous slabs and holes. Can you draw it? -->
< STRUCTURE NAME="RoboWorld" CREATION_DATE="09/10/2000" >
< RECTANGLE X_LENGTH="500" Y_LENGTH="400" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="200" Y_LENGTH="200" X_OFFSET="200" Y_OFFSET="100" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="100" Y_LENGTH="100" X_OFFSET="50" Y_OFFSET="50" > < /RECTANGLE >
< HOLES_LIST >
< HOLE >
< RECTANGLE X_LENGTH="10" Y_LENGTH="30" X_OFFSET="40" Y_OFFSET="20" > < /RECTANGLE >
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="20" Y_LENGTH="20" X_OFFSET="70" Y_OFFSET="70" > < /RECTANGLE >
< /HOLE >
< /HOLES_LIST >
< /STRUCTURE >
< STRUCTURE >
< RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="1" Y_OFFSET="1" > < /RECTANGLE >
< STRUCTURES_LIST >
< STRUCTURE >
< RECTANGLE X_LENGTH="1" Y_LENGTH="1" X_OFFSET="1" Y_OFFSET="1" > < /RECTANGLE >
< /STRUCTURE >
< /STRUCTURES_LIST >
< /STRUCTURE >
< /STRUCTURES_LIST >
< HOLES_LIST >
< HOLE >
< RECTANGLE X_LENGTH="10" Y_LENGTH="50" X_OFFSET="10" Y_OFFSET="10" > < /RECTANGLE >
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="180" Y_LENGTH="20" X_OFFSET="10" Y_OFFSET="50" > < /RECTANGLE >
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="190" Y_LENGTH="190" X_OFFSET="4" Y_OFFSET="3" > < /RECTANGLE >
< /HOLE >
< /HOLES_LIST >
< /STRUCTURE >
< STRUCTURE >
< RECTANGLE X_LENGTH="100" Y_LENGTH="300" X_OFFSET="50" Y_OFFSET="50" > < /RECTANGLE >
< /STRUCTURE >
< /STRUCTURES_LIST >
< HOLES_LIST >
< HOLE >
< RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="300" Y_OFFSET="350" > < /RECTANGLE >
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="310" Y_OFFSET="350" > < /RECTANGLE>
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="320" Y_OFFSET="350" > < /RECTANGLE>
< /HOLE >
< HOLE >
< RECTANGLE X_LENGTH="3" Y_LENGTH="3" X_OFFSET="330" Y_OFFSET="350" > < /RECTANGLE>
< /HOLE >
< /HOLES_LIST >
< /STRUCTURE >
< /STRUCTURES_LIST >
Details for the assignment
Part 1. Copy the various files described in the previous section into your directory for Project Two.
Go through the steps in README, and check that you get the expected results. Pay particular attention
to the order in which the information is presented, and the reason for this, viz. the program.
You might want to try using myDOMExample on robots.xml first though, since this XML file
is simpler than structures.xml. Submit nothing for grading here.
Part 2. Study the code in myDOMExample. If it succussfully processes a valid
XML file, a DOM tree will be created. The function main() will pass the root node of this tree to the function
process(). This function will identify the nature and details of the node passed to it, and will call the functions
process_attributes() and process_children() to recursively process the child nodes of the
passed node, where I am here regarding attributes as special child nodes (leaves) in the tree. Notice the order
in which the nodes of the tree are processed, and the information which is displayed. Try out some other small
DTD and XML files on your own to get a strong feel for what is happening here. Submit nothing.
Part 3. Write a DTD file to capture the following situation. A "part" can be characterized by
name, weight and identification number attributes, where the first two attributes are optional, but the ID is not.
A "parts" is a list of one or more "part" ' s. A "basic_assembly" has (optional) name, weight and (required)
identification number atttibutes, and is comprised of a "parts" (list). A "widget" is (made up of) either a "part"
or a "basic_assembly". A "complex_assembly" has (optional) name, weight and (required) identification number
attributes. It consists of at least two "assembly" ' s. An "assembly" is (made up of) either a "widget" or a
"complex_assembly". Submit a printout of your DTD file.
Note: XML will not actually allow a "widget" to be either a "part" or a "basic_assembly". It
must rather consist of (be built form) either a "part" or a "basic_assembly". This is actually a serious
downside to XML, which is therefore not really sufficiently expressive to capture context-free grammers in
general. On the other hand, it allows valid XML to be unambiguously (deterministically) parsed.
Part 4. Write increasingly more complicated XML files to test out your DTD file. Test these
using myDOMExample, and make sure that all the information is being displayed as expected. If not,
something is syntactically wrong with either the DTD or the XML file. You should aim at
ending up with an XML file for an "assembly" that is complicated enough to exhibit all the parts of all
the definitions made in the DTD file. When you get there, submit a printout of your XML file.
Part 5. Hand draw (or use some software for this purpose if you wish) the DOM tree for your
example in Part 4. Show the entity nodes and the attribute nodes. (I regard attribute nodes as special
child nodes of their corresponding element nodes, and these child nodes are always leaves.)
Part 6. Write a program, based on MyDOMExample, that processes an XML file, creating
a DOM tree. Have the program display only the CDATA that appears between double quotes that is
assigned to attributes of elements. Have it do so in a postfix fashion, in the sense that attribute information
for each child element is displayed before the attribute information for a parent element.
Furthermore, have all this information displayed on one line (i.e. without introducing carriage returns).
Test this out on my structures.xml example. Submit only the source code for your program.
(This should be easy if you understand the original program.)
Part 7. Write a DTD file to represent
the following context-free grammer which is
similar to Example 2.3 on page 95 of Sipser's book:
< EXPR > --> < TERM > | < SUM >
< SUM > --> < EXPR > < PLUS > < TERM >
< TERM > --> < FACTOR > | < PRODUCT >
< PRODUCT > --> < TERM > < TIMES > < FACTOR >
< FACTOR > --> < IDENTIFIER > | < ENCLOSED_EXPRESSION >
< ENCLOSED_EXPRESSION > --> < LEFT_PAREN> < EXPRESSION > < RIGHT_PAREN >
where < IDENTIFIER >, < PLUS >, < TIMES >, < LEFT_PAREN >, and
< RIGHT_PAREN > are regarded as terminal symbols in the grammer, and
< EXPR >, < TERM > and < FACTOR >
are the variable symbols as in Example 2.3. A significant difference here is
that we will let the < IDENTIFIER > tag represent an arbitrary
"variable" in the sense of an algebra expression.
The < IDENTIFIER > tag in your DTD file
should have a "name" attribute that will be set to the name of the
variable (in the sense of an algebra expression) it represents.
Also, the
< PLUS >, < TIMES >, < LEFT_PAREN >, and < RIGHT_PAREN > tags
should have a "symbol" attribute whose default values should respectively be:
+ , * , ( , ) .
Also, the remaining tags (< EXPR >, < SUM >,
< TERM >, < PRODUCT >, < FACTOR >, and
< ENCLOSED_EXPRESSION >) should also have a "symbol"
attribute, but here the default value should be the empty string.
    Write the DTD to capture the above situation. Then write an XML file that represents the expression
3 + X + 7 * ( Y * Z + 4 ) * X * X + ( 2 * Y + 5 * X * Z )
using the language you defined in the DTD file. Then use the program you wrote in Part 6 in order to read this
XML file and display the above algebraic expression. Once this works properly, submit your DTD and XML files.
Part 8. Change your DTD file from Part 7 so that the default symbol for PLUS, TIMES,
LEFT_PAREN, and RIGHT_PAREN
is the null string, but the default symbol for < SUM > is +,
and the default symbol for < PRODUCT > is * . Again run the program you wrote in Part 6.
Indicate the result you got after making this change, describe these results and explain why they were expected.
Submit only this, not the DTD code.