When writing a literate program using existing technology, the programmer simultaneously writes
the program code and the typesetting code that will produce the final documentation.
Because Knuth created TeX, most of these systems use TeX as their typesetting language, though
some use other typesetting languages. The resulting documentation consists of paragraphs of actual English
text (possibly containing illustrations or mathematical formulae) interspersed with program text.
The program text is broken into small, understandable chunks. Each chunk may include other chunks by
means of a marker which identifies the code to be included. The included code is inserted only into
the compiler-ready code output; it does not appear explicitly in the program documentation, where
it would only obscure the structure of the chunk being explained.
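As a concrete sketch of chunk inclusion (the chunk names, bodies, and the noweb-style `<<name>>` marker syntax below are illustrative assumptions, not any particular tool's format), tangling can be modeled in a few lines of Python:

```python
import re

# Named chunks; a <<name>> marker inside a body references another chunk.
chunks = {
    "main": "int main(void) {\n<<read input>>\n<<report result>>\n}",
    "read input": "int n = read_int();",
    "report result": "printf(\"%d\\n\", n);",
}

def tangle(name, chunks):
    """Recursively replace each <<chunk>> marker with that chunk's body."""
    def expand(match):
        return tangle(match.group(1), chunks)
    return re.sub(r"<<([^>]+)>>", expand, chunks[name])

print(tangle("main", chunks))
```

In the woven documentation, each chunk would be typeset on its own with its explanatory text; only the tangled output above contains the fully expanded code.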
The process of creating the typeset documentation is called weaving; the process of creating
the program code for compilation is called tangling. Knuth chose these names because he
particularly disliked the way compilers constrain the order in which program parts can be
presented: he stresses an easily understandable order of presentation as crucial for the program
author's explanation of the program logic.
The final program documentation may also contain various indices: the identifiers used in the program,
where each code segment is defined and where it is used, and so forth. These indices provide an excellent
means of understanding the program. For a good introduction, find one of Knuth's published programs
and read it. You will immediately see that this is a good way of documenting program development.
In my view, there are three crucial functional aspects of literate programming:
- Template processing allows code segments to be broken into understandable chunks and reassembled
prior to compilation.
- Since a single source is used for both code and documentation, the program is always
guaranteed to be in sync with its documentation. A clear exposition of the program
code may also induce the maintenance programmer to correct the exposition when a mistake is found, instead of
merely correcting the program code and leaving the specification in its original erroneous state.
- Automatic program analysis allows the printed documentation to include information about
identifiers and other program constructs with no human intervention. If well-designed, these
indices can provide an incredible boost to the quality of the documentation.
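The third point can be pictured with a small sketch: build an index mapping each identifier to the chunks that mention it. The chunk names and the regex-based scan are illustrative assumptions; a real tool would use a proper parser and filter out language keywords.

```python
import re

# Hypothetical chunk bodies keyed by chunk name.
chunks = {
    "globals": "int total;",
    "accumulate": "total = total + next_value();",
}

def identifier_index(chunks):
    """Map each identifier to the set of chunks in which it appears."""
    index = {}
    for name, body in chunks.items():
        for ident in set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", body)):
            index.setdefault(ident, set()).add(name)
    return index

index = identifier_index(chunks)
print(sorted(index["total"]))
```

The woven documentation would render this index as, say, a table of identifiers with links to the sections defining and using each one.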
With these fundamental requirements in mind, we are trying to generalize these points in preparation
for development of our two chief vaporware offerings.
In general, we want to take literate programming and make of it a system that we will actually
use in the context of paid programming work.
For that to be possible, the effort required to
produce a literate program must be of the same order of magnitude as the effort required simply to
code up a product and to debug it. This means that it must be simple to take existing test
code and reverse-engineer it to produce an explanation of its workings in literate style. The editor
used to create the program must be easy to use and must understand literate style. The code must
be accessible in its tangled form, and changes made in tangled form must propagate back into the literate source.
For each of the three points above, we have identified the processes that we envision. The finished
literate product will be a far cry from current literate programming technology, and we think it has
the potential to bring literate programming into the mainstream, if only through evolutionary
forces. Literate programming should increase an individual's productivity
if applied correctly.
Here's how we see the eventual programming environment:
Product description language
This generalizes the first of LP's three components. In addition to allowing code segments to be
inserted into other segments, defining the files to be created, and doing simple macro processing,
it should be possible to generate code on the basis of database queries, for example, or perform
other generalized data manipulation to define a program. For instance, if a parser is a component
of the program, then the yacc and lex code for the parser should be the source used to generate
the program. During tangle, yacc and lex are called to produce C code, and that resulting code
segment is then inserted into the final program. It isn't necessary ever to see that code, but
its existence should be acknowledged and documented. I'm sure you can think of a few other
applications of this, too.
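One way to picture this generalized tangle: let a chunk's body be produced by running a generator at tangle time. In this sketch the generator is just a Python function standing in for an external tool such as yacc; the chunk names and the `<<name>>` marker syntax are illustrative assumptions.

```python
import re

def fake_parser_generator():
    # Stands in for invoking an external tool (e.g. yacc producing C
    # from a grammar file) and returning the generated code.
    return "/* generated parser code */"

# A chunk body is either literal text or a callable generator.
chunks = {
    "program": "<<parser>>\nint main(void) { return yyparse(); }",
    "parser": fake_parser_generator,
}

def tangle(name, chunks):
    body = chunks[name]
    if callable(body):           # generated chunk: run the generator now
        body = body()
    return re.sub(r"<<([^>]+)>>", lambda m: tangle(m.group(1), chunks), body)

print(tangle("program", chunks))
```

The woven documentation would show only the grammar source and a note that the parser chunk is generated, never the generated code itself.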
Annotation of changes
Code segments are associated rigidly with text in current LP technology. However, we would like
the ability to annotate any action taken. If a change is made to existing code, that change
should be tracked, annotated, and incorporated into the final documentation, so that the thoughts
of each maintenance programmer can be followed in making later changes. This whole idea is still
somewhat vague; as usual, please watch this space for further details.
Code analysis
Ideally, each language used in a program would be understood and parsed during editing. The results
of this parsing would be available for code analysis procedures. This corresponds to current LP
technology's inclusion of information about identifiers into the final documentation; I think that
this information is crucial during composition and maintenance of the code. And I'm starting to
take a more active interest in the whole arena of analysis and measurement -- check out my topic
on code analysis and software measurement for some of the thoughts I'm turning over. If a general
code analysis framework were combined with a literate programming documentation facility, then, well,
words fail me. It would be a Good Thing.
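For a language with an accessible parser, the identifier information described above is cheap to compute. Here is a sketch using Python's own ast module purely as an illustration (the sample source is made up, and this is nothing like a full analysis framework):

```python
import ast

source = """
def area(radius):
    pi = 3.14159
    return pi * radius * radius
"""

tree = ast.parse(source)
defined, used = set(), set()
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        defined.add(node.name)          # function definitions
    elif isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            defined.add(node.id)        # names being assigned
        else:
            used.add(node.id)           # names being read

print("defined:", sorted(defined))
print("used:   ", sorted(used))
```

Feeding results like these into the weaver is exactly the kind of coupling between analysis and documentation that current LP tools only hint at.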