Defining an XML document type (i.e. writing a DTD) consists of the following steps, not
necessarily in order (and that's pretty much it!):
- Define elements --
<!ELEMENT name content>
The elements in your documents are the tags you write, but more importantly they're
the basic objects your document is about (and their subobjects and so forth). Part of the
definition of an element is its content model, i.e. for the element 'tag' what may
appear between <tag> and </tag>.
- Define attributes for elements --
<!ATTLIST name specs>
The attribute lists are the named attributes found within the tags; they're the
named members of the object classes defined by the elements.
- Define entities --
<!ENTITY name spec>
Entities are familiar from HTML: they're those funny & things like &nbsp;.
In that context, they're effectively names for characters not otherwise expressible in
printable ASCII. In XML the entity concept is logically extended to be a name for any
arbitrary string, i.e. entities are a lot like macros.
- Use a tool to validate the DTD
Any time a human being writes a formal language, mistakes are likely to be made. DTDs
are no exception, even though they're syntactically a lot simpler than most languages.
I found it very helpful to run my DTD through a parser to discover my mistakes.
I used IBM's Visual DTD (part of alphaWorks)
to jump-start my DTD and to validate the finished product.
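If no dedicated DTD tool is at hand, the validation step can be roughly approximated with any XML parser. The following is only a minimal sketch in Python, assuming the declarations are pasted into a document's internal DTD subset; the helper name dtd_parses is made up for this illustration. Python's standard-library parser (expat) is non-validating, so this only catches syntax mistakes in the declarations themselves, which is far less than a real DTD tool checks.

```python
import xml.etree.ElementTree as ET


def dtd_parses(declarations, root):
    """Return True if the parser accepts the declarations syntactically.

    Hypothetical helper: wraps the declarations in an internal DTD subset
    around an empty root element and tries to parse the result.
    """
    document = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, declarations, root)
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        print("DTD problem:", err)
        return False
    return True


good = '<!ELEMENT a (#PCDATA)>\n<!ATTLIST a href CDATA #IMPLIED>'
bad = '<!ELEMENT a (#PCDATA>'   # the closing parenthesis is missing

print(dtd_parses(good, "a"))
print(dtd_parses(bad, "a"))
```

The error message printed for the broken declaration includes a line and column number, which is usually enough to find the typo.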
As you can see from the steps above, the process of defining an XML document type is that of designing a set
of elements. So what kinds of things can you do with elements? An element is an object that
contains data. It can contain data in two ways: first, it has content; second, it has attributes.
Using the HTML link as an example, the element's name is "a", its most useful attribute is
href, and its content is the stuff between the > of the opening tag and the closing </a>.
A definition of that much of the HTML spec could look like this:
<!ELEMENT a (#PCDATA)>
<!ATTLIST a
  href CDATA #IMPLIED>
Whoa! What's that #PCDATA thing in there? It stands for "parsed character
data" and it effectively means that this element can contain some text that you should pass on to the
application. (A note on that phrase "pass on to the application": XML is defined in terms of
the XML parser. The parser is a module which is assumed to be built into, or at least somehow
called by, an application which does the actual work of whatever the application does. When
you're writing a DTD, you're talking to the parser; when you write a document, you're using the
parser to talk to the application.)
You'll notice that the definition of the href attribute has three parts: the
name of the attribute, what its content may be ("CDATA" is simply character data), and whether
it has to be there ("#IMPLIED" means the attribute is optional and has no default value).
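To make the parser/application split concrete, here's a rough sketch in Python, used purely for illustration. The URL and link text are placeholders. The standard-library parser reads the declarations but doesn't validate the document against them; it simply hands the element's name, attributes and character data on to the application code.

```python
import xml.etree.ElementTree as ET

# The DTD goes in the internal subset; the document proper follows it.
DOC = """<!DOCTYPE a [
  <!ELEMENT a (#PCDATA)>
  <!ATTLIST a href CDATA #IMPLIED>
]>
<a href="http://www.example.com/">an example link</a>"""

# The parser does the reading; everything below is "the application".
link = ET.fromstring(DOC)
print(link.tag)           # element name
print(link.get("href"))   # attribute value
print(link.text)          # the #PCDATA content

# #IMPLIED means the attribute may simply be left out.
bare = ET.fromstring("<a>no href here</a>")
print(bare.get("href"))   # None: no value and no default
```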
That much is easy. But the real usefulness of XML is how well it expresses arbitrary
structure. To express arbitrary structure we need to have elements within elements, and that's
done in pretty much the same way. Let's define something really simple, a tree structure.
We'll use the obvious name for a node: "node". Each node may be named and may contain other
nodes. And in fact let's make it a binary tree, i.e. each node may have a maximum
of two children. Let's define:
<!ELEMENT node (node?,node?)>
<!ATTLIST node
  name CDATA #IMPLIED
  some_string CDATA #IMPLIED>
Note the content model of the node definition (that's the stuff after the name, remember). I wrote it as
(node?,node?) to signify that the content of a node may be an optional node followed by another optional
node, and nothing else. I also tossed in some optional node data for good measure.
So now we can write a little XML chunk representing a binary tree:
<node name="1a" some_string="This is a little extra data."></node>
Now that's cool if you ask me!
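For a slightly bigger tree, here's a minimal sketch in Python of what an application might do with it. The node names and the helper functions is_binary_tree and outline are invented for this example. Because Python's standard-library parser doesn't validate against a DTD, the two-children rule is checked by hand here.

```python
import xml.etree.ElementTree as ET

# A small tree: node "1" has two children, and "1b" has one child of its own.
TREE = """<node name="1">
  <node name="1a" some_string="This is a little extra data."/>
  <node name="1b">
    <node name="1b1"/>
  </node>
</node>"""


def is_binary_tree(node):
    """Mirror the (node?,node?) content model: at most two children, all of them nodes."""
    children = list(node)
    if len(children) > 2:
        return False
    return all(child.tag == "node" and is_binary_tree(child) for child in children)


def outline(node, depth=0):
    """Return one indented line per node, in document order."""
    lines = ["  " * depth + node.get("name", "(unnamed)")]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(TREE)
print(is_binary_tree(root))
print("\n".join(outline(root)))
```

A validating parser would enforce the content model by itself; the hand-written check only stands in for that here.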
What if I wanted a general tree? I'd write the content model like this:
<!ELEMENT node (node*)>
Entities come in two flavors: regular entities (like the &lt; kind of entities we know and love
from HTML) and parameter entities, which can be used in the DTD definition itself. Parameter
entities are prefixed with a percent sign (%) and are dandy for sequences of element content specifications
which get reused throughout a DTD. I used a parameter entity in the wftk
DTD to represent the types of elements which are considered actions or action-like ... things.
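Both flavors can be sketched in a few lines of Python. All entity and element names below are invented for illustration and are not taken from the wftk DTD. The first half lets the standard-library parser expand a regular entity. The second half only imitates parameter entity expansion with plain text substitution, since references inside a declaration are allowed only in an external DTD, which this parser doesn't load.

```python
import re
import xml.etree.ElementTree as ET

# A regular entity is declared in the DTD and referenced in the document as &name;.
DOC = """<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY greeting "Hello from an entity">
]>
<note>&greeting;!</note>"""

print(ET.fromstring(DOC).text)

# A parameter entity is declared with '%' and referenced as %name; inside the DTD.
DTD = """<!ENTITY % movement "walk | run | jump">
<!ELEMENT morning (%movement;)*>
<!ELEMENT evening (%movement;)*>"""

# Matches a whole parameter entity declaration, capturing its name and value.
PE_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>\n?')


def expand_parameter_entities(dtd_text):
    """Toy textual expansion of %name; references (not a real DTD parser)."""
    values = dict(PE_DECL.findall(dtd_text))
    body = PE_DECL.sub("", dtd_text)
    return re.sub(r"%([\w.-]+);", lambda m: values[m.group(1)], body)


print(expand_parameter_entities(DTD))
```

The expanded text shows why parameter entities are handy: the list of allowed children is written once and reused in both element declarations.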
By far the best way to discover what I'm talking about at this point is to go read the wftk
DTD yourself and use these little tips to figure out what you're seeing. I've still got some work to do to make this
topic more informative, but this will get you started. As always, if I'm missing something, ask.