XML stands for Extensible Markup Language. Like HTML, it's derived from SGML. But where HTML
defines certain specific tags, and those tags mean certain specific things, XML is intended to
be totally flexible and provide a facility for defining basically arbitrary complex data
structures. In XML you can define and use any darned tag you want and worry about meanings
later. Like HTML, though, XML is intended to be easy to serve over the Internet; full SGML is
too complicated to make that practical, so XML is the compromise.
So that's what XML is. Why is everybody and his brother using it? Well, XML is an
excellent way to represent complex data structures in a form that's easily defined, easily
generated, and easily parsed. That makes it pretty darned convenient. You can use standard
tools and libraries on a wide variety of platforms to work with it, which makes it good for
mixed environments.
For me, XML's ability to represent arbitrary complex data is what sets it apart from other
technologies. It is good at precisely the things that relational databases aren't, i.e.
data that doesn't fit into tables very well. If you find yourself spreading the representation
of an object over several tables, you might well consider defining an XML document type and
saving that XML in one single field instead.
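Here's a rough sketch of that single-field idea in Python, using only the standard library (sqlite3 and xml.etree.ElementTree). The `orders` table, the `order` and `item` tags, and both function names are made up for illustration; they aren't from any real schema. It's one way to do it under those assumptions, not the definitive one.

```python
import sqlite3
import xml.etree.ElementTree as ET


def order_to_xml(customer, items):
    """Serialize a (hypothetical) order with nested line items to an XML string."""
    order = ET.Element("order", customer=customer)
    for name, qty in items:
        item = ET.SubElement(order, "item", qty=str(qty))
        item.text = name
    return ET.tostring(order, encoding="unicode")


def save_and_load(customer, items):
    """Store the whole order in one TEXT column, then read it back and parse it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: one row per order, the full XML document in `doc`.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
    conn.execute("INSERT INTO orders (doc) VALUES (?)",
                 (order_to_xml(customer, items),))
    (doc,) = conn.execute("SELECT doc FROM orders").fetchone()
    conn.close()

    root = ET.fromstring(doc)
    lines = [(el.text, int(el.get("qty"))) for el in root.findall("item")]
    return root.get("customer"), lines


if __name__ == "__main__":
    print(save_and_load("Ada", [("widget", 2), ("gadget", 1)]))
```

The trade-off is that the database can no longer query individual line items with plain SQL, so this fits best when you always load and save the object as a whole.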
When working with XML, you generally need to worry about two things. First, you have to define
all the tags you're going to use in your application. (Of course, if your application is some
general-purpose XML tool, this doesn't apply.) You do this by declaring the elements,
entities, and various other objects that you will use in a DTD. The DTD (Document
Type Definition) is a concept brought over from SGML. You have undoubtedly seen the acronym
before in connection with HTML, where it's used only to specify the definition of HTML
in case somebody wants to run an HTML page through an SGML processor. In XML, though, the DTD
does real work: it defines the structure of your documents. You can put the declarations right
in the document they apply to, or of course you can reference an external DTD document. DTD
writing is pretty easy, actually; see our article about writing DTDs for more information.
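To make the "right in the document" case concrete, here's a small sketch of a document that carries its own DTD in the DOCTYPE declaration. The `recipe`, `title`, and `ingredient` names are invented for the example. One caveat: Python's xml.etree.ElementTree, used here just to read the document back, is a non-validating parser, so it accepts the DTD but does not check the document against it.

```python
import xml.etree.ElementTree as ET

# A hypothetical document type: a recipe has one title and one or more
# ingredients, and every ingredient must carry an "amount" attribute.
DOC = """<?xml version="1.0"?>
<!DOCTYPE recipe [
  <!ELEMENT recipe (title, ingredient+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT ingredient (#PCDATA)>
  <!ATTLIST ingredient amount CDATA #REQUIRED>
]>
<recipe>
  <title>Toast</title>
  <ingredient amount="2 slices">bread</ingredient>
  <ingredient amount="1 pat">butter</ingredient>
</recipe>
"""


def ingredients(doc):
    """Return (name, amount) pairs; the DTD is read but NOT enforced here."""
    root = ET.fromstring(doc)
    return [(el.text, el.get("amount")) for el in root.iter("ingredient")]


if __name__ == "__main__":
    for name, amount in ingredients(DOC):
        print(f"{amount} of {name}")
```

If you need the DTD actually enforced, you'd want a validating parser instead; this sketch only shows where the declarations live.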
The other issue you'll face is how to actually generate and parse XML. Generation is easy, of
course; it's just like generating HTML, and we're all good at that by now. Parsing is "easy"
by design, meaning XML is easier to parse than C or some other complex language, but you're
probably going to want to avoid writing a new parser for each project. There are a number of
parsers out there, of course, both free and commercial. OpenXML is a free parser in Java, and
expat is a free parser in C (and is used in the Perl XML module, by the way). You can check
XML.com for commercial ones if GNU licensing doesn't fit your application.
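As a minimal sketch of both halves, here's some Python that generates a document with xml.etree.ElementTree and then parses it with expat, which Python exposes as xml.parsers.expat. The `playlist` and `song` tags and the two function names are hypothetical. Note that expat is event-driven: you hand it callbacks and it calls them as it runs into tags and text.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def build_playlist(titles):
    """Generate XML; the library escapes &, < and > in the text for us."""
    root = ET.Element("playlist")
    for title in titles:
        ET.SubElement(root, "song").text = title
    return ET.tostring(root, encoding="unicode")


def song_titles(xml_text):
    """Parse with expat callbacks and collect the text of every <song>."""
    titles = []
    state = {"in_song": False, "buf": []}

    def start(name, attrs):
        if name == "song":
            state["in_song"] = True
            state["buf"] = []

    def chars(data):
        # expat may deliver one text node in several chunks, so buffer them.
        if state["in_song"]:
            state["buf"].append(data)

    def end(name):
        if name == "song":
            titles.append("".join(state["buf"]))
            state["in_song"] = False

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return titles


if __name__ == "__main__":
    doc = build_playlist(["Rock & Roll", "Blue <Moon>"])
    print(doc)
    print(song_titles(doc))
```

Generating through a library rather than gluing strings together is a deliberate choice here: it's the easiest way to be sure special characters get escaped.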
Unlike HTML, the XML specification actually includes required behavior for an XML parser. This
means that if you do want to write your own, you have some checkpoints to make sure that
you'll be doing standard things. This is basically due to the ... odd ways that some
HTML browser companies who shall remain nameless decided to handle error situations in HTML.
In particular, browsers that accepted erroneous code and did something "pretty good" without
complaining encouraged a lot of very broken HTML, which then broke more correct browsers when
they came along. So as not to repeat that mess, the XML specification requires that a parser
stop passing data to the application in the normal fashion once it has encountered a
well-formedness (syntax) error. It can still keep reading in order to report further errors,
but the application must be told that XML la-la land has been entered.
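You can watch that strictness in action with a few lines of Python. This sketch uses the standard xml.etree.ElementTree module, which raises a ParseError on malformed input rather than handing back a best guess. The `parse_or_report` helper and its message format are my own invention for the example.

```python
import xml.etree.ElementTree as ET


def parse_or_report(text):
    """Return (root, None) on success or (None, message) on a syntax error."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the problem.
        line, column = err.position
        return None, f"not well-formed at line {line}, column {column}"


if __name__ == "__main__":
    print(parse_or_report("<a><b>ok</b></a>"))
    # The <b> element is never closed, so no tree comes back at all.
    print(parse_or_report("<a><b>oops</a>"))
```

Compare that with a typical browser, which would happily render the second snippet if it were HTML.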
Well, that was a little off-track. I'll conclude by saying that XML is:
- Extremely well-suited for the representation of complex data
- Easy to parse for machines
- Relatively easy to read for humans
- Worth a lot of money
Get to know it. You won't regret it.