Programmer's Guide to the XMLAPI


Introduction

The XMLAPI is a DOM-like library which maintains heap structures based on XML documents. These heap structures have the advantage that they can be read from or written to XML documents, of course, but I'm finding that they're pretty nice to have around anyway. An XML element itself acts a lot like an associative array (a hash, for you Perl junkies), and that alone is a nice reason to use the XMLAPI in C. For me, anyway.

This documentation is intended to stand alone, even though it's part of the wftk documentation set. As I've ported (parts of) the XMLAPI into Perl so that I can work with the same API, and as that has nothing whatsoever to do with wftk, I think it makes sense to arrange things this way. So if you're reading this and haven't heard of the wftk, don't panic, it's just the project during which I started working on an XML API, which later became the XMLAPI. See? But if you're figuring out how to work with the wftk, don't panic when I don't talk about it much.

Anyway, besides reading and writing to files, there are a whole lot of things you can do with these XML heap structures. I've broken these down into categories so they're not completely overwhelming.

Reading and writing XML files

Reading of XML is done with the expat parser by James Clark. There are three flavors. The first, xml_read, reads a file handle and returns an XML structure; if it encounters any error, it bails, returning NULL. If you want error feedback, use xml_read_error. (I realized this need after the fact; early versions of xml_read actually wrote the error message to stdout, which wasn't very friendly behavior at all.) The xml_read_error function is a perfectly normal XML reader, but instead of returning NULL in case of error, it returns an error element of the form
   <xml-error code="200" message="some error message" line="40"/>
The third parsing function, xml_parse, simply takes a string buffer and parses that. The buffer must be null-terminated. Like xml_read_error, it returns an error element (not NULL) in case of error.

Writing is done fairly simply by walking the XML tree structure. Pretty printing (breaking lines) isn't supported; if you want pretty XML, you have to insert your own line breaks when building the XML in the first place.

Use xml_writecontent to write only the subelements of the XML passed in, without writing the enclosing element or its attributes.

Writing XML as HTML

Writing HTML is pretty much the same as writing XML, except that some HTML elements are treated specially: li, for instance, doesn't need to be closed. That sort of thing.

Creating new XML from scratch or by copying

To create a new XML element, you just give its name. Afterwards you can insert content into it, insert it as content into another element, and set attributes. Plain text (a character sequence) is represented as an element with no name. Use xml_createtext to create such a plain-text node. The xml_createtextf function works the same, except that it takes a printf-like format (at the moment it understands only %s and %d).

Use xml_delete to delete a subelement from a parent element; xml_delete calls xml_free on the deleted element. Call xml_free directly only if the element isn't a child of another element, because xml_free won't clean up the dangling child pointer in a parent.

Generating string representations of XML

The string generators build string buffers using malloc. Thus you have to free the result when you're done with it. The *content functions, like xml_writecontent, don't write the enclosing element or its attributes, just its subelements. The *html functions convert the given XML into an HTML-like form while writing. (The given XML is of course unchanged.)
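To make the "whole element versus content only" distinction and the malloc ownership rule concrete, here is a minimal self-contained sketch. It does not use the XMLAPI itself: node_t, put, emit, and node_string are hypothetical names invented for illustration, attributes are left out, and text is not escaped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an XML node; attributes are left out for brevity. */
typedef struct node {
    const char *name;          /* NULL for plain text */
    const char *text;          /* character data, used only when name is NULL */
    struct node *first_child;
    struct node *next;         /* next sibling */
} node_t;

/* Copy s to dst + pos when dst is non-NULL; always return the new position. */
static size_t put(char *dst, size_t pos, const char *s)
{
    size_t n = strlen(s);
    if (dst) memcpy(dst + pos, s, n);
    return pos + n;
}

/* Emit n; with content_only set, skip n's own tags but keep its children's. */
static size_t emit(const node_t *n, char *dst, size_t pos, int content_only)
{
    const node_t *child;

    if (!n->name) return put(dst, pos, n->text);   /* plain text has no tags */
    if (!content_only) {
        pos = put(dst, pos, "<");
        pos = put(dst, pos, n->name);
        pos = put(dst, pos, ">");
    }
    for (child = n->first_child; child; child = child->next)
        pos = emit(child, dst, pos, 0);            /* children always get tags */
    if (!content_only) {
        pos = put(dst, pos, "</");
        pos = put(dst, pos, n->name);
        pos = put(dst, pos, ">");
    }
    return pos;
}

/* Build a malloc'd string for n; the caller must free() it. */
char *node_string(const node_t *n, int content_only)
{
    size_t len;
    char *out;

    assert(n != NULL);
    len = emit(n, NULL, 0, content_only);   /* first pass: measure */
    out = malloc(len + 1);
    if (!out) return NULL;
    emit(n, out, 0, content_only);          /* second pass: fill */
    out[len] = '\0';
    return out;
}
```

Calling node_string(record, 1) on a record holding one comment child yields only the comment markup, which mirrors what the *content functions are described as doing. Either way the result comes from malloc, so the caller frees it.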

Getting and setting attributes

The xml_attrval function returns a pointer to the value of the named attribute, or a pointer to an empty string if the attribute isn't found, and yes, I know the function is named inconveniently. The pointer returned is a const pointer directly into the XML structure, so don't write to it. Use xml_attrvalnum if you just need an integer representation of the attribute's value.

The xml_set* functions set attributes, of course. If the attribute is already present, its value will be replaced; otherwise, a new attribute will be created. The xml_setf function takes a format similar to printf's format; it recognizes only %s and %d at the moment. The xml_set function makes a copy of the string value given it (and the attribute name, of course); if you don't want the copy to be made, then use xml_set_nodup. Caution: xml_set_nodup may only be called with malloc'd strings! Otherwise, when you attempt to free the XML element, Bad Things will happen.
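The copy-versus-no-copy rule is easier to see in code. The following is a sketch of the semantics described above on a stand-alone attribute list, not the XMLAPI's actual implementation: attr_t, attr_val, attr_valnum, attr_set, attr_set_nodup, and attr_free are all hypothetical names, and the real internal layout may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an element's attribute list. */
typedef struct attr {
    char *name;
    char *value;
    struct attr *next;
} attr_t;

/* Look up a value; a missing attribute yields "" rather than NULL. */
const char *attr_val(const attr_t *list, const char *name)
{
    assert(name != NULL);
    for (; list; list = list->next)
        if (strcmp(list->name, name) == 0) return list->value;
    return "";
}

/* Integer view of a value; missing or non-numeric attributes give 0. */
int attr_valnum(const attr_t *list, const char *name)
{
    return atoi(attr_val(list, name));
}

/* Copy a string into fresh heap memory. */
static char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy) strcpy(copy, s);
    return copy;
}

/* Store value WITHOUT copying: the list takes ownership, so value must
   come from malloc. Returns 0 on success, -1 if memory runs out. */
int attr_set_nodup(attr_t **list, const char *name, char *value)
{
    attr_t *a;

    assert(list != NULL && name != NULL && value != NULL);
    for (a = *list; a; a = a->next) {
        if (strcmp(a->name, name) == 0) {
            free(a->value);            /* replace the existing value */
            a->value = value;
            return 0;
        }
    }
    a = malloc(sizeof *a);
    if (!a) return -1;
    a->name = dup_string(name);        /* the name is always copied */
    if (!a->name) { free(a); return -1; }
    a->value = value;
    a->next = NULL;
    while (*list) list = &(*list)->next;   /* append at the tail */
    *list = a;
    return 0;
}

/* Store a private copy of value; the caller keeps ownership of its string. */
int attr_set(attr_t **list, const char *name, const char *value)
{
    char *copy = dup_string(value);
    if (!copy) return -1;
    if (attr_set_nodup(list, name, copy) != 0) { free(copy); return -1; }
    return 0;
}

/* Free the whole list. Every value is passed to free(), which is
   exactly why a no-copy value must have been malloc'd. */
void attr_free(attr_t *list)
{
    while (list) {
        attr_t *next = list->next;
        free(list->name);
        free(list->value);
        free(list);
        list = next;
    }
}
```

Passing a string literal or a stack buffer to attr_set_nodup would make attr_free call free() on memory that malloc never returned. That is the kind of Bad Thing the caution above is about.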

Inserting and replacing subelements

Use xml_prepend to insert a child element before any other subelement; use xml_append to insert the child after all others. The xml_replace function will replace the given element in its own parent, while xml_replacecontent first deletes the given element's children, then appends the new child (thus effectively replacing the element's children.)

The xml_copyinto function is a little weird. I've only had one use for it, really. What it does is to take all the attributes and subelements of the source XML and write/append them, respectively, into the target XML. If there are duplicate attributes, the source values overwrite the target's. Everything else already in the target is preserved.

Element location codes

The location functions work with locator strings, which is something sort of weird I came up with. A locator string uniquely identifies (at most) one element in an XML tree, either by its position in its parent or by the value of its "id" attribute (or "name" attribute if it has no "id" attribute.) This means that you don't have to manage identifiers throughout an XML document to get bits of it out. You can also do some primitive searching through an XML tree; the only limitation on this search is that you have to know which level the element is at.

The following examples all work with this XML structure:

<record>
   <field name="field1" value="4"/>
   <comment>This is a comment, maybe a memo or something.</comment>
   <field name="field2" value="3"/>
</record>
Given that structure, the locator record.field(1) will find the second field element: <field name="field2" value="3"/>. Note that we work from a base of 0, and note that the intervening "comment" element has no effect. Neither would any intervening plain text.

The locator record.comment will find the comment element. (Note that omitting the element number means the first match is returned.) You can search by ID or name, too: the locator record.field[field2] will return the second field as well.

If no matching field is found, the xml_loc function returns a NULL pointer.

The xml_locf function works the same as xml_loc except that it can build your locator for you using a formatting scheme much like the printf function. It only understands %s and %d at the moment, but that's probably all you'll need in locators anyway.
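A formatter restricted to %s and %d is small enough to sketch in full. The helper below is hypothetical (mini_format is not an XMLAPI function) and only shows one way such a restricted expansion could work, here building a locator string. Treating %% as a literal percent sign and copying unknown conversions through unchanged are assumptions of this sketch, not documented XMLAPI behavior.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: expands only %s and %d (plus %% for a literal
   percent sign) into a freshly malloc'd string. Any other conversion
   is copied through unchanged. Returns NULL if memory runs out;
   the caller frees the result. */
char *mini_format(const char *fmt, ...)
{
    size_t cap, len = 0;
    char *out;
    va_list args;

    assert(fmt != NULL);
    cap = strlen(fmt) + 32;
    out = malloc(cap);
    if (!out) return NULL;

    va_start(args, fmt);
    while (*fmt) {
        char numbuf[32];
        const char *piece = fmt;   /* default: copy one literal character */
        size_t n = 1;

        if (fmt[0] == '%' && fmt[1] == 's') {
            piece = va_arg(args, const char *);
            if (!piece) piece = "";
            n = strlen(piece);
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] == 'd') {
            n = (size_t)snprintf(numbuf, sizeof numbuf, "%d", va_arg(args, int));
            piece = numbuf;
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] == '%') {
            fmt += 2;              /* piece still points at the first '%' */
        } else {
            fmt += 1;
        }

        if (len + n + 1 > cap) {   /* grow, leaving room for the final NUL */
            char *grown;
            cap = (len + n + 1) * 2;
            grown = realloc(out, cap);
            if (!grown) { free(out); va_end(args); return NULL; }
            out = grown;
        }
        memcpy(out + len, piece, n);
        len += n;
    }
    va_end(args);
    out[len] = '\0';
    return out;
}
```

With this helper, mini_format("record.field(%d)", 1) produces the locator record.field(1) used in the examples above.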

Both loc functions can ignore the element name of the topmost element if you simply omit it; thus the locator .field will find the first data field in the example XML.
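The locator rules above can be sketched as a small resolver over a stand-in tree. This is an illustration of the described behavior only, not the XMLAPI's code: node_t, match_step, and loc are hypothetical names. The sketch also makes simplifying assumptions: steps are split naively on '.', so a key containing a dot is not supported, and the first step may only be a plain name or empty.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an XML node, reduced to what locators need. */
typedef struct node {
    const char *name;          /* NULL for plain text */
    const char *id;            /* value of the "id" attribute, or NULL */
    const char *name_attr;     /* value of the "name" attribute, or NULL */
    struct node *first_child;
    struct node *next;         /* next sibling */
} node_t;

/* Find the sibling matching one step: "field", "field(1)" or "field[key]". */
static node_t *match_step(node_t *first, const char *step, size_t len)
{
    size_t name_len = 0, key_len = 0;
    const char *key = NULL;
    long index = 0;
    node_t *n;

    while (name_len < len && step[name_len] != '(' && step[name_len] != '[')
        name_len++;
    if (name_len < len && step[name_len] == '(') {
        index = strtol(step + name_len + 1, NULL, 10);   /* 0-based position */
    } else if (name_len < len) {                         /* '[' starts a key */
        key = step + name_len + 1;
        key_len = len - name_len - 1;
        if (key_len > 0 && key[key_len - 1] == ']') key_len--;
    }

    for (n = first; n; n = n->next) {
        if (!n->name) continue;                  /* plain text never matches */
        if (strlen(n->name) != name_len || strncmp(n->name, step, name_len) != 0)
            continue;                            /* other names don't count */
        if (key) {
            const char *ident = n->id ? n->id : n->name_attr;
            if (ident && strlen(ident) == key_len &&
                strncmp(ident, key, key_len) == 0)
                return n;
        } else if (index-- == 0) {
            return n;
        }
    }
    return NULL;
}

/* Resolve a locator starting at root; returns NULL when nothing matches. */
node_t *loc(node_t *root, const char *locator)
{
    node_t *cur = NULL;        /* nothing matched yet */
    const char *step = locator;

    assert(locator != NULL);
    if (!root) return NULL;
    for (;;) {
        size_t len = strcspn(step, ".");
        if (!cur) {
            /* The first step names the root; an empty step accepts any root. */
            if (len > 0 && (!root->name || strlen(root->name) != len ||
                            strncmp(root->name, step, len) != 0))
                return NULL;
            cur = root;
        } else {
            cur = match_step(cur->first_child, step, len);
            if (!cur) return NULL;
        }
        if (step[len] == '\0') return cur;
        step += len + 1;       /* skip past the '.' */
    }
}
```

Because match_step only counts siblings whose name matches the step, the comment element and any plain text between the two fields do not disturb the index in record.field(1).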

The xml_getloc and xml_getlocbuf functions find a locator for the given XML; since each XML element knows its parent, the getloc functions can simply trace up the tree and find a full locator. The xml_getloc function requires that you supply a buffer; the locator may fill that buffer entirely, so check its length carefully upon return. The xml_getlocbuf function allocates the buffer for you, but you must either free the buffer when you're finished with it, or use xml_set_nodup to pass it back into an attribute for later cleanup with an element. The pattern xml_set_nodup (xml, "myloc", xml_getlocbuf(xml)) will probably be useful here and there.
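Tracing up the tree into a caller-supplied buffer might look like the sketch below. The names node_t, sibling_index, and get_loc are hypothetical. The output form name(n) for every step below the root and the snprintf-style return value (the length the full locator needs) are assumptions of this sketch, not the documented behavior of xml_getloc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an XML node that knows its parent. */
typedef struct node {
    const char *name;          /* NULL for plain text */
    struct node *parent;       /* NULL for the topmost element */
    struct node *first_child;
    struct node *next;         /* next sibling */
} node_t;

/* Position of n among its same-named siblings, counting from 0. */
static int sibling_index(const node_t *n)
{
    int index = 0;
    const node_t *s;

    if (!n->parent) return 0;
    for (s = n->parent->first_child; s && s != n; s = s->next)
        if (s->name && strcmp(s->name, n->name) == 0) index++;
    return index;
}

/* Write the locator of element n into buf (size bytes, including the NUL).
   Returns the length the complete locator needs; a result >= size means
   the text in buf was cut short. */
size_t get_loc(const node_t *n, char *buf, size_t size)
{
    size_t used, room;
    char *dest;
    int wrote;

    assert(n != NULL && n->name != NULL);   /* elements only, not plain text */
    used = n->parent ? get_loc(n->parent, buf, size) : 0;
    dest = used < size ? buf + used : NULL; /* NULL once the buffer is full */
    room = used < size ? size - used : 0;

    if (n->parent)
        wrote = snprintf(dest, room, ".%s(%d)", n->name, sibling_index(n));
    else
        wrote = snprintf(dest, room, "%s", n->name);
    return wrote < 0 ? used : used + (size_t)wrote;
}
```

Comparing the return value against the buffer size is one way to do the careful length check: if the result is at least the buffer size, the locator did not fit.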

Iterating through subelements

There are two sets of child element iteration functions; the ones ending in elem skip over any plain text children and return only XML elements. This is useful because plaintext elements have a NULL name, so if you get one by mistake and try to compare its name with something, you will regret it. This happened to me so often I modified the API. Ha. Anyway, you can start an iteration at the beginning of the child list or at the end, and you can move forward or backward in the list. If you get a NULL pointer back, you know you've reached the end (or beginning) of the list. A useful pattern for iteration is thus:
marker = xml_firstelem (parent);
while (marker) {
   /* do something */
   marker = xml_nextelem (marker);
}
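To show why the elem variants are safer, here is a self-contained sketch of the same loop over a stand-in sibling list. The names node_t, first_elem, next_elem, and count_named are hypothetical and only model the behavior described above: plain text has a NULL name, and the elem-style helpers step over it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a child node in a sibling list. */
typedef struct node {
    const char *name;     /* NULL marks plain text */
    struct node *next;    /* next sibling, NULL at the end */
} node_t;

/* First real element at or after n, skipping plain text. */
node_t *first_elem(node_t *n)
{
    while (n && !n->name) n = n->next;
    return n;
}

/* Next real element after n, or NULL at the end of the list. */
node_t *next_elem(node_t *n)
{
    assert(n != NULL);
    return first_elem(n->next);
}

/* Count the child elements with the given name. */
int count_named(node_t *first_child, const char *name)
{
    int count = 0;
    node_t *marker = first_elem(first_child);

    while (marker) {
        /* marker->name is never NULL here, so strcmp is safe. */
        if (strcmp(marker->name, name) == 0) count++;
        marker = next_elem(marker);
    }
    return count;
}
```

Walking the raw sibling list instead would hand strcmp a NULL name as soon as it reached a text node, which is undefined behavior in C.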

Iterating through attributes

Attribute iteration is the newest section of the XMLAPI; I realized that scanning the attributes of an element was the only place where the wftk had to know the internal structure of an element. This seemed as though it would make things harder once I started mixing and matching XMLAPI implementations for embedding in scripting languages, so I closed up the loophole. Now wftk knows (and cares) nothing at all about what's inside these mysterious XML structures.

You can only scan attributes from the front of the list to the back (they're singly linked, so you don't want to scan backwards anyway.) Given an attribute, you can retrieve its name or its value. The similarity of xml_attrvalue (which gets the value of an attribute structure) and xml_attrval (which gets an attribute value from an element, treating it as a hash) is regrettable and you have my abject apologies, but I didn't want to go back and change all my existing code. We'll all just have to live with my poor planning.
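A front-to-back scan that touches attributes only through accessor functions could be sketched as follows. The type and function names (attr_t, attr_next, attr_name, attr_value, attrs_to_string) are hypothetical stand-ins rather than the XMLAPI's own. The sketch does not escape attribute values, and its snprintf-style return value is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical attribute record; callers are meant to use the accessors. */
typedef struct attr {
    const char *name;
    const char *value;
    struct attr *next;     /* singly linked: forward only */
} attr_t;

/* Accessors: the only way the scanning code looks inside an attribute. */
const attr_t *attr_next(const attr_t *a)  { return a ? a->next : NULL; }
const char   *attr_name(const attr_t *a)  { return a->name; }
const char   *attr_value(const attr_t *a) { return a->value; }

/* Render every attribute as name="value", separated by spaces, scanning
   from the front of the list to the back. Returns the length the full
   text needs; a result >= size means buf was too small. */
size_t attrs_to_string(const attr_t *first, char *buf, size_t size)
{
    size_t used = 0;
    const attr_t *a;

    assert(buf != NULL || size == 0);
    if (size > 0) buf[0] = '\0';
    for (a = first; a; a = attr_next(a)) {
        int wrote = snprintf(used < size ? buf + used : NULL,
                             used < size ? size - used : 0,
                             "%s%s=\"%s\"",
                             used ? " " : "", attr_name(a), attr_value(a));
        if (wrote < 0) break;
        used += (size_t)wrote;
    }
    return used;
}
```

Since attrs_to_string never reads the struct fields directly, the layout behind attr_t could change without touching it, which is the point of closing the loophole described above.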


Copyright (c) 2001 Vivtek.