Topic: XML


XML stands for Extensible Markup Language. Like HTML, it's related to SGML. But where HTML defines a fixed set of tags, each with a specific meaning, XML is intended to be totally flexible: it provides a facility for defining more or less arbitrarily complex data structures. In XML you can define and use any darned tag you want and worry about meanings later. Like HTML, though, XML is intended to be easy to serve over the Internet; full SGML is too complicated to make that practical, so XML is the compromise.
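To make the "any darned tag you want" point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The recipe document and its tag names are made up for illustration; nothing in XML predefines them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: none of these tag names are predefined by XML.
RECIPE = """<recipe name="pancakes">
  <ingredient amount="2" unit="cup">flour</ingredient>
  <ingredient amount="1" unit="cup">milk</ingredient>
  <step>Mix everything.</step>
  <step>Fry in a hot pan.</step>
</recipe>"""


def summarize(xml_text):
    """Return the recipe name, its ingredients, and its steps."""
    root = ET.fromstring(xml_text)
    ingredients = [
        (item.text, item.get("amount"), item.get("unit"))
        for item in root.findall("ingredient")
    ]
    steps = [step.text for step in root.findall("step")]
    return root.get("name"), ingredients, steps


name, ingredients, steps = summarize(RECIPE)
```

The parser neither knows nor cares what a recipe is; the meaning of each tag lives entirely in the application code that reads the tree.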

So that's what XML is. Why is everybody and his brother using it? Well, XML is an excellent way to represent complex data structures in a form that's easily defined, easily generated, and easily parsed. That makes it pretty darned convenient. You can work with it using standard tools and libraries on a wide variety of platforms, which makes it a good fit for mixed environments.

For me, XML's ability to represent arbitrary complex data is what sets it apart from other technologies. It is good at precisely those things that relational databases aren't, i.e. data that doesn't fit into tables very well. If you find yourself spreading the representation of an object over several tables, you might well consider defining an XML document type and saving that XML into one single field, for instance.
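The single-field idea might look something like the following sketch, which uses Python's standard sqlite3 and xml.etree.ElementTree modules. The orders table, the order/customer/item tags, and the helper names are all assumptions invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def save_order(conn, order_id, customer, items):
    """Serialize a whole order as one XML document in one column."""
    order = ET.Element("order", id=str(order_id))
    ET.SubElement(order, "customer").text = customer
    for sku, qty in items:
        ET.SubElement(order, "item", sku=sku, qty=str(qty))
    conn.execute(
        "INSERT INTO orders (id, doc) VALUES (?, ?)",
        (order_id, ET.tostring(order, encoding="unicode")),
    )


def load_items(conn, order_id):
    """Read the XML back and rebuild the list of (sku, quantity) pairs."""
    (doc,) = conn.execute(
        "SELECT doc FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    root = ET.fromstring(doc)
    return [(item.get("sku"), int(item.get("qty"))) for item in root.findall("item")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
save_order(conn, 1, "Alice", [("A-100", 2), ("B-200", 1)])
items = load_items(conn, 1)
```

The trade-off is that the database can no longer query inside the document with plain SQL, so this suits objects that are mostly read and written as a whole.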

When working with XML, you generally need to worry about two things. First, you have to define all the tags you're going to use in your application. (Of course, if your application is some general-purpose XML tool, this doesn't apply.) You do this by declaring the elements, entities, and other objects you will use in a DTD. The DTD (Document Type Definition) is a concept brought over from SGML. You have undoubtedly seen the acronym before in connection with HTML, where it's used only to specify the definition of HTML in case somebody wants to run an HTML page through an SGML processor. In XML, though, a DTD can define the structure right inside the document that uses it, or of course the document can reference an external DTD. DTD writing is pretty easy, actually; see our article about writing DTDs for more information.
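As a rough illustration of a DTD living inside the document it describes, the sketch below feeds a small made-up note document to Python's standard xml.parsers.expat module and collects the element declarations it sees. The note, to, and body elements and the sig entity are hypothetical. Note that expat is a non-validating parser: it reads the declarations and expands the entity, but it does not check the document against the DTD.

```python
import xml.parsers.expat

# Hypothetical document with an internal DTD subset.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ENTITY sig "Kind regards">
]>
<note><to>Alice</to><body>&sig;</body></note>
"""


def inspect(text):
    """Return the declared element names and the document's character data."""
    names, chunks = [], []
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text, True)
    return names, "".join(chunks)


declared, content = inspect(DOCUMENT)
```

Here declared holds the element names in declaration order, and content includes the replacement text of the sig entity in place of the reference.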

The other issue you'll face is how to actually generate and parse XML. Generation is easy, of course; it's just like generating HTML, and we're all good at that by now. Parsing is "easy" by design, meaning XML is easier to parse than C or some other complex language, but you're probably going to want to avoid rewriting a parser for each project. There are a number of parsers out there, of course, both free and commercial. OpenXML is a free parser in Java, and expat is a free parser in C (and is used in the Perl XML module, by the way). You can check XML.com for commercial ones if the free parsers' licensing doesn't fit your application.
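Both halves can be sketched in a few lines of Python. This example generates a document with xml.etree.ElementTree (which takes care of escaping, the one thing hand-rolled string output tends to get wrong) and then parses it with the expat bindings in xml.parsers.expat. The people/person/email tags and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


def generate(people):
    """Build a <people> document from (name, email) pairs."""
    root = ET.Element("people")
    for name, email in people:
        person = ET.SubElement(root, "person", name=name)
        ET.SubElement(person, "email").text = email
    return ET.tostring(root, encoding="unicode")


def parse_events(xml_text):
    """Parse with expat and record the start/end element events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda tag, attrs: events.append(("start", tag, attrs))
    parser.EndElementHandler = lambda tag: events.append(("end", tag))
    parser.Parse(xml_text, True)
    return events


xml_text = generate([("Tom & Jerry", "tj@example.com")])
events = parse_events(xml_text)
```

The ampersand in the name is written out as an entity reference during generation and handed back to the handlers as a plain ampersand during parsing, so the application never deals with escaping itself.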

Unlike HTML, the XML specification actually includes required behavior for an XML parser. This means that if you do want to write your own, you have some checkpoints to make sure that you'll be doing standard things. This is basically due to the ... odd ways that some HTML browser companies who shall remain nameless decided to handle error situations in HTML. In particular, browsers that accepted erroneous code and did something "pretty good" without complaining ended up encouraging a lot of very broken HTML, which then broke more correct browsers when they came along. So as not to repeat that mess, the XML specification requires that once a parser has detected a well-formedness error (a "fatal error" in the spec's terms), it must stop passing data to the application in the normal fashion. It may keep reading in order to report further errors, but the application must be made aware that XML la-la land has been entered.
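From the application's side, this strictness shows up as an error rather than a best-effort guess. A small sketch using Python's xml.etree.ElementTree, with a made-up helper name:

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Return (root, None) on success or (None, message) on a syntax error."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        # ParseError carries the (line, column) where parsing stopped.
        line, column = err.position
        return None, f"not well-formed at line {line}, column {column}: {err}"


good, good_error = parse_or_report("<a><b/></a>")
bad, bad_error = parse_or_report("<a><b></a>")  # <b> is never closed
```

An HTML browser of the era would have quietly rendered something for the second input; here the caller gets no tree at all and has to decide what to do about it.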

Well, that was a little off-track. I'll conclude by saying that XML is:

  • Extremely well-suited for the representation of complex data
  • Easy to parse for machines
  • Relatively easy to read for humans
  • Worth a lot of money
Get to know it. You won't regret it.
ADDITIONAL INFORMATION
  • How to write a DTD
    So somebody wants you to write a DTD? Get the basics.
  • expat topic
    My overview of expat, James Clark's XML parser for C. I'm documenting the API in a linkable fashion and I have a little sample code.
  • xmltools
    My command-line XML utility set, built on expat. Useful. And since they're written as literate programs, you can get an idea of how to use expat.
LINKS
  • XML.com
    Wow. This place has got everything. If you need anything to do with XML, it's probably here. However, the most useful part of the site is Norman Walsh's article on XML basics. Read it.
  • OpenXML
    The OpenXML project is an XML parser in Java. They have some good background information and documentation as well.
  • Free XML Tools
    A very useful and oft-updated list of free XML tools. This includes executable editors and checkers, scripts, parser libraries, the whole nine yards. If you need tools, this is a great place to start looking.





Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.