XML -- How to write a DTD

So ya want to write a DTD. Or (as happened to me) someone wants to pay you to write one.

Well, it's really not that hard. Assuming you know about data structure (or object) design, writing the actual DTD will prove pretty simple. And this article shows you how, I hope. At least it reflects some of my experiences writing the DTD for the process definition of wftk.

Defining an XML document type (i.e. writing a DTD) consists of the following steps, not necessarily in order:

Define elements -- <!ELEMENT name content>
The elements in your documents are the tags you write, but more importantly they're the basic objects your document is about (and their subobjects, and so forth). Part of the definition of an element is its content model, i.e. for the element 'tag', what may appear between <tag> and </tag>.
Define attributes for elements -- <!ATTLIST name specs>
Attribute lists declare the named attributes found within the tags; they're the named members of the object classes defined by the elements.
Define entities -- <!ENTITY name spec>
Entities are familiar from HTML: they're those funny & things like &nbsp;. In that context, they're effectively names for characters not otherwise expressible in printable ASCII. In XML the entity concept is logically extended to be a name for any arbitrary string, i.e. entities are a lot like macros.
Use a tool to validate the DTD
Any time a human being writes a formal language, mistakes are likely to be made. DTDs are no exception, even though they're syntactically a lot simpler than most languages. I found it very helpful to run my DTD through a parser to discover my mistakes. I used IBM's Visual DTD (part of alphaWorks) to jump-start my DTD and to validate the finished product.
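
To make the steps concrete, here is a minimal sketch of a DTD that uses each kind of declaration once. The memo element, its priority attribute, and the company entity are made-up names for illustration only; they do not come from the wftk DTD.

```xml
<!-- Hypothetical document type: a memo that holds plain text. -->

<!-- Step 1: an element and its content model (text only). -->
<!ELEMENT memo (#PCDATA)>

<!-- Step 2: an attribute list for that element. -->
<!ATTLIST memo
   priority CDATA #IMPLIED
>

<!-- Step 3: a general entity; documents refer to it by name. -->
<!ENTITY company "Example Widgets, Inc.">
```

The last step is then to run a file like this through a validating parser and see what it complains about.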

That's pretty much it!

As you can see from that, the process of defining an XML document type is that of designing a set of elements. So what kinds of things can you do with elements? An element is an object that contains data. It can contain data in two ways: first, it has content; second, it has attributes. Using the HTML link as an example, the element's name is "a", its most useful attribute is href, and its content is the stuff between the <a href="something"> and the </a>.

A definition of that much of the HTML spec could look like this:


<!ELEMENT a (#PCDATA)>
<!ATTLIST a
   href CDATA #IMPLIED
>

Whoa! What's that #PCDATA thing in there? It stands for "parsed character data" and it effectively means that this element can contain some text that the parser should pass on to the application. (A note on that phrase "pass on to the application": XML is defined in terms of the XML parser. The parser is a module which is assumed to be built into, or at least somehow called by, an application which does the actual work of whatever the application does. When you're writing a DTD, you're talking to the parser; when you write a document, you're using the parser to talk to the application.)

You'll notice that the definition of the href attribute has three parts: the name of the attribute, its type ("CDATA" is simply character data), and its default declaration ("#IMPLIED" means the attribute is optional and has no default value).
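
For comparison, here is a sketch of the other common shapes an attribute definition can take. The link element and all of its attributes are hypothetical, invented just to show one of each kind; they are not a rendering of the real HTML spec.

```xml
<!-- Hypothetical element used only to show attribute declarations. -->
<!-- href:    must always be present (#REQUIRED). -->
<!-- title:   optional, no default (#IMPLIED). -->
<!-- kind:    one of the listed values; "external" if omitted. -->
<!-- version: supplied as "1.0" if omitted, and may not be anything else (#FIXED). -->
<!ELEMENT link (#PCDATA)>
<!ATTLIST link
   href    CDATA               #REQUIRED
   title   CDATA               #IMPLIED
   kind    (internal|external) "external"
   version CDATA               #FIXED "1.0"
>
```

Under this sketch, <link href="somewhere">some text</link> would be valid, while a link with no href, or with kind="sideways", would not.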

That much is easy. But the real usefulness of XML is how well it expresses arbitrary structure. To express arbitrary structure we need to have elements within elements, and that's done in pretty much the same way. Let's define something really simple, a tree structure. We'll use the obvious name for a node: "node". Each node may be named and may contain other nodes. And in fact let's make it a binary tree, i.e. each node may have a maximum of two children. Let's define:


<!ELEMENT node (node?,node?)>
<!ATTLIST node
   name CDATA #IMPLIED
   some_string CDATA #IMPLIED
>

Note the content model of the ELEMENT definition (that's the stuff after the name, remember). I wrote it as (node?, node?) to signify that the content of a node may be an optional node followed by another optional node, and nothing else. I also tossed in some optional node data for good measure.
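
One caveat worth knowing about: the XML specification asks, for compatibility, that content models be deterministic, and (node?,node?) is not, because a parser that sees a single child node can't tell which of the two optional slots it fills. Some validating parsers accept it anyway; others may complain. A sketch of an equivalent model that sidesteps the issue while still allowing zero, one, or two children:

```xml
<!-- Same binary tree as a deterministic content model:
     either no children at all, or one node optionally followed by a second. -->
<!ELEMENT node (node,node?)?>
<!ATTLIST node
   name CDATA #IMPLIED
   some_string CDATA #IMPLIED
>
```

Both forms accept exactly the same documents, so which one to use mostly depends on how strict your tools are.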

So now we can write a little XML chunk representing a binary tree:

<node name="1">
  <node name="1a" some_string="This is a little extra data."></node>
  <node name="1b">
    <node name="1b1"></node>
  </node>
</node>

Now that's cool if you ask me!
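
To see what the content model actually buys you, here is a sketch of a document that a validating parser should reject. The DTD is tucked into the DOCTYPE's internal subset so the file stands alone; the node names are made up for the example.

```xml
<?xml version="1.0"?>
<!-- The DTD sits in the internal subset, so the file is self-contained. -->
<!DOCTYPE node [
  <!ELEMENT node (node?,node?)>
  <!ATTLIST node
     name CDATA #IMPLIED
     some_string CDATA #IMPLIED
  >
]>
<!-- Well-formed, but NOT valid: node "1" has three children. -->
<node name="1">
  <node name="1a"></node>
  <node name="1b"></node>
  <node name="1c"></node>
</node>
```

A non-validating parser will read this file without complaint, since the tags all nest properly; only a validating parser checks the children against the content model.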

What if I wanted a general tree? I'd write the content model like this: (node)*.
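
Spelled out as a full declaration, the general tree might look like the sketch below. The attribute list is carried over unchanged from the binary tree, and the sample document is invented for illustration.

```xml
<!DOCTYPE node [
  <!-- Any number of child nodes, including none. -->
  <!ELEMENT node (node)*>
  <!ATTLIST node
     name CDATA #IMPLIED
     some_string CDATA #IMPLIED
  >
]>
<node name="root">
  <node name="a"></node>
  <node name="b"></node>
  <node name="c">
    <node name="c1"></node>
  </node>
</node>
```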

Entities come in two flavors: regular (general) entities, like the &lt; kind of entities we know and love from HTML, and parameter entities, which can be used only in the DTD itself. Parameter entities are prefixed with a percent sign (%) and are dandy for sequences of element content specifications which get reused throughout a DTD. I used a parameter entity in the wftk DTD to represent the types of elements which are considered actions or action-like ... things.
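
Here is a sketch of how a parameter entity can stand in for a repeated chunk of content model. Every element and entity name below is invented for illustration; none of them is taken from the wftk DTD.

```xml
<!-- Hypothetical: a parameter entity naming the action-like elements.
     It must be declared before its first use. -->
<!ENTITY % action "send | wait | log">

<!-- Both content models reuse the same list of alternatives. -->
<!ELEMENT job   (%action;)*>
<!ELEMENT retry (%action;)*>

<!ELEMENT send (#PCDATA)>
<!ELEMENT log  (#PCDATA)>
<!ELEMENT wait EMPTY>

<!-- A general entity, by contrast, is meant for use in documents. -->
<!ENTITY owner "the process owner">
```

Note that this belongs in an external DTD file: inside an internal subset, a parameter entity reference may not appear within another declaration. The payoff is that adding a new action means editing one line instead of every content model that lists them.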

By far the best way to discover what I'm talking about at this point is to go read the wftk DTD yourself and use these little tips to figure out what you're seeing. I've still got some work to do to make this topic more informative, but this will get you started. As always, if I'm missing something, ask.