Literate Programming

Published 1999-02-12


Synopsis: A brief history of the literate programming paradigm, recent work, and some ideas that we're working on, with apologies to the established players.

Historical sketch

The term literate programming, and the original literate programming system (WEB) which implemented the concept, were both the creation of Donald Knuth, one of the most literate programmers the world has ever known. Knuth, of course, is the author of The Art of Computer Programming, the TeX typesetting system, and other works of the programming art. It was Knuth's intention to provide a system of programming by which the programmer could typeset his or her work in book or article form, so that each choice of implementation, each algorithm, was clearly explained and justified. The resulting work of art would then stand as the quintessential definition of a solution for the problem it addressed.

Knuth used and developed this system while writing TeX and Metafont, and the resulting two books of code/documentation remain the most readable and usable collections of code I have ever seen. TeX, of course, is the standard of typesetting software in the academic world (usually in its LaTeX incarnation, which runs as a macro package on basic TeX), and has been for nearly twenty years. Twenty years for a software package! Only Unix has comparable staying power.

Literate programming, however, is not a mainstream technique. Those who use literate programming tools often wonder why not. I am aware of no studies of the question, but the basic shortcoming of literate programming is that it is difficult to write a literate program quickly. Yes, once it is written, it is impeccably documented, easily debugged (in those cases where it isn't already provably correct), easily maintained by the original author and others, and in general of far higher quality in every respect than an "illiterate" program. But it takes longer to see results. As we all know, the software industry is an impatient one. And corporate IT in industries other than our beloved software is even less patient and less likely to understand the benefits of good coding style.

[update 4/20/2000] Having completed the first phase of a large project in literate style (the task list manager for wftk, around 1200 lines of Tcl code) I can say that I seem to be coding somewhat more slowly in literate style, but that being able to find relevant parts of a rather large program appears to make up the difference.

Recent work

Since Knuth, literate programming has seen a few innovations. There are a number of literate programming systems available, and probably the most active is Norman Ramsey's noweb, which you may explore at Ramsey's own page. He also maintains links to many of the projects which use his system, so you can see a number of examples of literate programming in action.

I have yet to study noweb. Its chief innovation seems to be that it keeps up with programming technology: it is (relatively) language-independent, it is extensible, and it can work with programs which consist of more than one code file (unlike Knuth's original Pascal tool, which assumes a single code output!). Watch this space for further details, or go get noweb and try it out yourself.

<shameless plug>
I'm also both developing and using my own literate programming tool, as any literate programmer must. Go check it out. Use it, even; you'd make my day.
</shameless plug>

Analysis

When writing a literate program using existing technology, the programmer is simultaneously writing the program code and writing the typesetting code which will be used to typeset the final documentation. Since Knuth created TeX, most of these systems use TeX as their typesetting language. Some use other typesetting languages. The resulting documentation consists of paragraphs of actual English text (possibly containing illustrations or mathematical formulae) interspersed with program text. The program text is broken into small, understandable chunks. Each chunk may include other chunks by means of a marker which identifies the code to be included. The included code is inserted only into the compiler-ready code output; it does not appear explicitly in the program documentation, where it would only obscure the structure of the chunk being explained.

The process of creating the typeset documentation is called weaving; the process of creating the program code for compilation is called tangling. The two are separate steps because Knuth particularly disliked the way compilers constrain the order of presentation of program parts: tangling rearranges the chunks into the order the compiler needs, which leaves the author free to present them in whatever order is easiest to follow. He stresses an easily understandable order of presentation as something crucial for the program author's explanation of program logic.
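As a rough illustration of tangling, here is a minimal sketch in Python. It is not the algorithm of WEB or any other existing tool. The chunk table, the chunk names, and the `tangle` function are all hypothetical, and the `<<name>>` reference marker is borrowed from noweb purely for familiarity.

```python
import re

# Hypothetical chunk table: chunk name -> lines of code.
# A line consisting only of <<name>> includes another chunk.
CHUNKS = {
    "main program": [
        "def main():",
        "    <<read input>>",
        "    <<report result>>",
    ],
    "read input": ["values = [3, 1, 2]"],
    "report result": [
        "total = sum(values)",
        "print(total)",
    ],
}

REFERENCE = re.compile(r"^(\s*)<<(.+?)>>\s*$")


def tangle(name, chunks, indent="", active=()):
    """Return the compiler-ready lines for chunk `name`, expanding references."""
    if name in active:
        raise ValueError(f"chunk {name!r} includes itself")
    if name not in chunks:
        raise KeyError(f"undefined chunk {name!r}")
    out = []
    for line in chunks[name]:
        match = REFERENCE.match(line)
        if match:
            # Splice in the referenced chunk, keeping the marker's indentation.
            out.extend(tangle(match.group(2), chunks,
                              indent + match.group(1), active + (name,)))
        else:
            out.append(indent + line)
    return out


tangled = "\n".join(tangle("main program", CHUNKS))
```

A weave step over the same table would do the opposite: print each chunk in the author's order next to its explanation and leave the `<<read input>>` lines unexpanded.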

The final program documentation may also contain various lists of identifiers used in the program, where code segments are used and where they are defined, and so forth. This provides an excellent means of understanding the program. For a good introduction, find one of Knuth's published programs and read it. You will immediately see that this is a good way of documenting program development.
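The "where used" list for code segments can be computed mechanically from the chunk references. The following Python sketch assumes a hypothetical chunk table with noweb-style `<<name>>` markers; the function names are made up for illustration and do not come from any existing tool.

```python
import re
from collections import defaultdict

REFERENCE = re.compile(r"<<(.+?)>>")

# Hypothetical chunks, listed in the order the author presents them.
CHUNKS = {
    "main program": ["<<read input>>", "<<report result>>"],
    "read input": ["values = [3, 1, 2]"],
    "report result": ["<<format total>>", "print(text)"],
    "format total": ["text = str(sum(values))"],
}


def cross_reference(chunks):
    """Map each chunk name to the sorted names of the chunks that use it."""
    used_in = defaultdict(set)
    for name, lines in chunks.items():
        for line in lines:
            for target in REFERENCE.findall(line):
                used_in[target].add(name)
    return {name: sorted(used_in[name]) for name in chunks}


def format_index(chunks):
    """Render the 'used in' index as plain text lines for the woven output."""
    lines = []
    for name, users in cross_reference(chunks).items():
        where = ", ".join(users) if users else "(root chunk)"
        lines.append(f"{name}: used in {where}")
    return lines
```

Calling `format_index(CHUNKS)` yields one line per chunk, which a weave step could append to the documentation as an index.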

In my view, there are three crucial functional aspects of literate programming:

Template processing allows code segments to be broken into understandable chunks and reassembled prior to compilation.
Single-source maintenance keeps the program in sync with its documentation, since one source produces both the code and the documentation. Seeing a clear exposition alongside the program code may also induce the maintenance programmer to correct the exposition when a mistake is found, instead of merely correcting the program code and leaving the specification in its original erroneous state.
Automatic program analysis allows the printed documentation to include information about identifiers and other program constructs with no human intervention. If well-designed, these indices can provide an incredible boost to the quality of the documentation.

With these fundamental requirements in mind, we are trying to generalize these points in preparation for development of our two chief vaporware offerings. (See this description.) In general, we want to take literate programming and make of it a system that we will actually use in the context of paid programming work. (Here's a take on that point of view that's worth a laugh or two.)

For that to be possible, the effort required to produce a literate program must be of the same order of magnitude as the effort required simply to code up a product and to debug it. This means that it must be simple to take existing test code and reverse-engineer it to produce an explanation of its workings in literate style. The editor used to create the program must be easy to use and must understand literate style. The code must be accessible in its tangled form and changes to it in tangled form must propagate back into the literate source.
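One way to let edits to tangled code flow back into the literate source is to have the tangle step bracket every chunk with marker comments and then read those markers back. This Python sketch is only an assumption about how that could work. The `# {begin: ...}` and `# {end: ...}` marker format and both function names are invented here, and it assumes each chunk is used once and that edits keep each chunk's indentation.

```python
import re

REFERENCE = re.compile(r"^(\s*)<<(.+?)>>\s*$")
BEGIN = re.compile(r"^(\s*)# \{begin: (.+)\}$")
END = re.compile(r"^\s*# \{end: (.+)\}$")


def tangle_marked(name, chunks, indent=""):
    """Expand chunk `name`, bracketing each chunk with marker comments."""
    out = [f"{indent}# {{begin: {name}}}"]
    for line in chunks[name]:
        match = REFERENCE.match(line)
        if match:
            out.extend(tangle_marked(match.group(2), chunks,
                                     indent + match.group(1)))
        else:
            out.append(indent + line)
    out.append(f"{indent}# {{end: {name}}}")
    return out


def untangle(lines):
    """Rebuild the chunk table from marked, possibly hand-edited, tangled lines."""
    chunks = {}
    stack = []  # (chunk name, absolute indentation) for each open chunk
    for line in lines:
        begin = BEGIN.match(line)
        if begin:
            indent, name = begin.group(1), begin.group(2)
            if stack:
                # Put a reference back where the child chunk was spliced in.
                parent, parent_indent = stack[-1]
                chunks[parent].append(indent[len(parent_indent):] + f"<<{name}>>")
            chunks[name] = []
            stack.append((name, indent))
        elif END.match(line):
            stack.pop()
        else:
            name, indent = stack[-1]
            chunks[name].append(line[len(indent):])
    return chunks
```

Running `untangle(tangle_marked(root, chunks))` on an unedited file gives back the original chunk table; after a hand edit between two markers, only the chunk that owns those lines changes.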

For each of the three points above, we have identified the processes that we envision. The finished literate product will be a far cry from current literate programming technology, and we think that this will have the potential to bring literate programming into the mainstream, simply via evolutionary forces if in no other way. Literate programming should increase an individual's productivity if applied correctly.

Here's how we see the eventual programming environment:

Product description language
This generalizes the first of LP's three components. In addition to allowing code segments to be inserted into other segments, defining the files to be created, and doing simple macro processing, it should be possible to generate code on the basis of database queries, for example, or perform other generalized data manipulation to define a program. For instance, if a parser is a component of the program, then the yacc and lex code for the parser should be the source used to generate the program. During tangle, yacc and lex are called to produce C code, and that resulting code segment is then inserted into the final program. It isn't necessary ever to see that code, but it should be acknowledged and documented that it exists. I'm sure you can think of a few applications of this, too.
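A minimal Python sketch of a generated chunk follows. It assumes a chunk body may be either literal lines or a callable that is run at tangle time. The in-memory SQLite table stands in for a real database, and `error_codes_chunk`, the table layout, and the chunk names are all hypothetical. A yacc or lex step would presumably shell out to the tool at the same point, which is not shown.

```python
import re
import sqlite3

REFERENCE = re.compile(r"^(\s*)<<(.+?)>>\s*$")


def error_codes_chunk():
    """Hypothetical generator: build constant definitions from a database query."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE errors (code INTEGER, name TEXT)")
    db.executemany("INSERT INTO errors VALUES (?, ?)",
                   [(1, "NOT_FOUND"), (2, "TIMEOUT")])
    rows = db.execute("SELECT name, code FROM errors ORDER BY code").fetchall()
    db.close()
    return [f"{name} = {code}" for name, code in rows]


# A chunk body is either a list of lines or a callable that produces them.
CHUNKS = {
    "module": [
        "<<error codes>>",
        "",
        "def is_timeout(code):",
        "    return code == TIMEOUT",
    ],
    "error codes": error_codes_chunk,
}


def tangle(name, chunks, indent=""):
    """Expand chunk `name`, running generator chunks as they are reached."""
    body = chunks[name]
    lines = body() if callable(body) else body
    out = []
    for line in lines:
        match = REFERENCE.match(line)
        if match:
            out.extend(tangle(match.group(2), chunks, indent + match.group(1)))
        else:
            out.append(indent + line)
    return out
```

The generated lines never appear in the literate source, but the generator itself does, so its existence is acknowledged and documented as the paragraph above asks.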

Annotation
Code segments are associated rigidly with text in current LP technology. However, we would like the ability to annotate any action taken. If a change is made to existing code, that change should be tracked, annotated, and incorporated into the final documentation, so that the thoughts of each maintenance programmer can be followed in making later changes. This whole idea is still somewhat vague; as usual, please watch this space for further details.

Code analysis
Ideally, each language used in a program would be understood and parsed during editing. The results of this parsing would be available for code analysis procedures. This corresponds to current LP technology's inclusion of information about identifiers into the final documentation; I think that this information is crucial during composition and maintenance of the code. And I'm starting to take a more active interest in the whole arena of analysis and measurement -- check out my topic on code analysis and software measurement for some of the thoughts I'm turning over. If a general code analysis framework were combined with a literate programming documentation facility, then, well, words fail me. It would be a Good Thing.
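As a small taste of what parsing during editing could provide, this Python sketch builds an identifier index with the standard library's `ast` module. It handles Python source only, and the `identifier_index` function and its output shape are assumptions made for illustration, not part of any existing literate programming tool.

```python
import ast
from collections import defaultdict


def identifier_index(source):
    """Return {identifier: {"defined": [lines], "used": [lines]}} for Python source."""
    index = defaultdict(lambda: {"defined": [], "used": []})
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            index[node.name]["defined"].append(node.lineno)
        elif isinstance(node, ast.arg):
            # Function parameters count as definitions.
            index[node.arg]["defined"].append(node.lineno)
        elif isinstance(node, ast.Name):
            # Assignment targets are definitions; every other context is a use.
            kind = "defined" if isinstance(node.ctx, ast.Store) else "used"
            index[node.id][kind].append(node.lineno)
    for entry in index.values():
        entry["defined"].sort()
        entry["used"].sort()
    return dict(index)


SOURCE = """\
def total(values):
    result = 0
    for v in values:
        result = result + v
    return result

print(total([3, 1, 2]))
"""

INDEX = identifier_index(SOURCE)
```

An editor could recompute such an index after each change and show it beside the chunk being edited, rather than only in the final woven document.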

More information at vivtek.com

lpml

LPML (Literate Programming Markup Language) is Vivtek's first stab at XML literate programming. It's free. And the source is, of course, programmed literately (it tangles and weaves itself). Here's the programming documentation for the prototype Perl script. Take a look and tell me what you think.
wftk task list manager code
The task list manager component of the wftk is done (4/20/2000). This is an example of what 1200 lines of code can look like when documented in literate style. Personally, I find it much easier to locate things when I need to make a change.