Topic: XML -- How to use the expat XML parser

Looked at expat? Wished you could figure out how to use it? So did I -- but I really had to figure it out. The result is the xmltools XML command-line utilities and these few words of wisdom.
The main parser loop
Basically, a program built on expat consists of a loop that (1) grabs a buffer full of data and (2) calls XML_Parse on it. XML_Parse takes the data plus the state saved in the parser object from the last call, and calls a handler for each "XML event." I use that terminology because the basic structure of these programs is so much like the Windows "event-driven programming" paradigm. It really sends me back, you know?

If you look at the sample program that Clark (James Clark, expat's author) has provided, you'll see that basic structure. Let's assume we're writing a program to read XML from stdin, just like the sample. The main() function will thus look like this:

int main()
{
  char buf[convenient size];
  int len;   /* len is the number of bytes in the current bufferful of data */
  int done;
  int depth = 0;  /* nothing magic about this; the sample program tracks depth to know how far to indent. */
                  /* depth is thus going to be the user data for this parser. */

  XML_Parser parser = XML_ParserCreate(NULL);
  XML_SetUserData(parser, &depth);
  XML_SetElementHandler(parser, startElement, endElement);
  do {
    get a piece of input into the buffer
    done = whether this bufferful is the last bufferful
    if (!XML_Parse(parser, buf, len, done)) {
      handle parse error
      return 1;
    }
  } while (!done);
  XML_ParserFree(parser);
  return 0;
}
This main is lifted almost verbatim from Clark's sample program. I've replaced the error handling and file I/O with plain-English placeholders to improve readability a little, as you can figure those out separately. Here are two things to note:
  • XML_SetElementHandler
    The XML_Set*Handler calls are where the behavior of the parser is defined. In this case, we only care about the elements themselves, so we only define and install element handlers. There is a corresponding call for each other kind of content, such as XML_SetCharacterDataHandler for text and XML_SetProcessingInstructionHandler for processing instructions.
  • Parse errors
    By definition an XML parser stops parsing when it encounters XML that is not well-formed; expat reports this by having XML_Parse return zero, which is what the if in the loop checks. Why stop? Primarily to avoid the situation now extant with HTML parsers: incorrect HTML may do anything in one browser or the other, with the result that you really need to test layout under at least two browsers (you know which ones). Otherwise HTML which displays perfectly under your browser may not display at all under another.

So really, this main loop is the one you'll be using for all your expat programming. (Take a look at the main() for xmltools -- it's the same, with a little decoration.) The bulk of the programming is done in the handlers. How you write your handlers depends on what you want to do. In the case of xmltools, the handlers do the work, so that parsing is done on the fly. But your application may need to parse the XML into some structure first, then do its work.

Element handlers
XML_Parse, as it's parsing along the incoming data, calls an element handler each time it encounters an element. There are two handlers for elements; one is called at the element's start tag, the other at its end tag (an empty element such as <b/> triggers both, one right after the other). Between the two calls there are handler calls for the various contents of the element (if any).






Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.