Perl/CGI implementation of RFC1867 file upload


This is an implementation of RFC1867 in a form which makes it a generally useful procedure for getting form input in a CGI program. That is, you can call this procedure whether you're expecting file uploads or not; there's very little overhead involved if no files are actually provided. So it makes for convenient site design.

Since this code will execute in a CGI environment, there are two sources of input data. The QUERY_STRING environment variable may contain &-delimited URL-encoded data, and there may be input data on stdin as well. If a normal form is used, then stdin-derived data is also &-delimited URL-encoded data, but if an RFC1867 file upload form is in use, then the format is as I describe on my RFC1867 page. This routine doesn't care; it'll take anything it can get, and will even combine data from both possible sources.

You can get the finished form of the code here. To use it with your CGI code, just put it in your use page and do this:
use "_getargs.txt";
(Or you can rename the file to something else, of course.)

This routine returns input values into a hash, and returns a reference to the hash (often called a hashref.) And we use a hashref to pass in information about how to handle uploads, as well. So you use the function and the returned data like this:
$uploads{base} = "/usr/local/my/directory";
$uploads{file} = "upload_[(field)]";
$uploads{mime}{image/jpeg} = "jpg";
$input = getargs(\%uploads);
$myvalue = $$input{name};
(If you haven't seen $$ to retrieve hashref values, then go buy the Camel book right now.)

Here's what the main function looks like. All the underlined italic phrases are code chunks which are elaborated down below. You can follow the link, or you can just scroll down the page; this whole file is presented on one page. Excuse the disclaimer there, but it kind of has to be there these days.
 
# This is _getargs, a Perl/CGI argument reader capable of retrieving RFC1867 file uploads
# as well as "normal" URL-encoded input.
# (c) Vivtek 2000.  Licensed under the terms of the GNU Public License.
# Documentation at http://www.vivtek.com/cgi/getargs.html
#
# You may freely use and copy this code for any purpose, as long as this comment block
# remains attached exactly as it is.  Modified forms of this code must clearly state the
# fact that they're modified.  This code is distributed with no warranty at all -- if it
# breaks, it's not my problem.  If it breaks your system, it's still not my problem.

sub getargs {
   See Collect all the URL-encoded input into $input
   See Process the URL-encoded input into hash
   See Return the hash right now if we don't have file uploads

   See Do the RFC1867 thing

   return \%tagset;
}
We've got one helper function which builds a filename from values already encountered in the form. Very useful for attaching files to things.
 
See Helper function to build filename
And then we have to return 1 so that use won't get upset.
 
1;


Collect all the URL-encoded input into $input
The contents of the QUERY_STRING environment variable are always URL-encoded. And if we're not getting our input from an RFC1867 multipart-data form, then stdin also contains URL-encoded data, so we simply get it line by line and tack it onto $input.
 
my $input = $ENV{QUERY_STRING};
if (lc($ENV{CONTENT_TYPE}) eq 'application/x-www-form-urlencoded') {
   $input .= "&" if $input ne '';
   while (<>) {
     chomp;
     $input .= $_;
   }
}


Process the URL-encoded input into hash
Each chunk of URL-encoded input is split into a name and value, delimited by the '='. So we split on the '=', then we URL-decode the value and write it into our output hash keyed by the name. This has some side effects, notable among which is the fact that a URL may easily contain two values with the same name (some of my own code uses this sort of trick, for instance, to represent a list of checkboxes of items to delete from a list.) If this happens, the hash will not reflect this -- the last occurrence of a name will be the one that survives in the hash.

So you've been warned. For most applications this code will work fine.
 
my $i;
my @t;
my %tagset;
foreach $i (split /&/, $input) {
   @t = split /=/, $i, 2;
      
   $t[1] =~ tr/+/ /;
   $t[1]=~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
   $tagset{$t[0]} = $t[1];
}


Return the hash right now if we don't have file uploads
To check whether there is mulitpart data on stdin, we just check our content type. If it's not multipart, we return a reference to our tagset hash.
 
return \%tagset if (lc($ENV{CONTENT_TYPE}) !~ m'multipart/form-data;');
If we do have multi-part data, then it's still going to get into the hash, but it will be considerably more involved to get it.

Do the RFC1867 thing
So OK, let's get down to business actually reading multipart/form-data input. First, let's reserve our variables, so as not to clutter up the namespace.
 
my $line;
my $lines;
my $name;
my $type;
my $filename;
my $file = 0;
my $localname;
Where we actually put files, and how we name them, is governed by the uploads hashref passed in as a parameter to the function.
 
my $uploads = shift @_;
The content-type of a multipart/form-data header includes the boundary string. The boundary string is a string guaranteed not to appear in the data itself. So it's safe to separate fields within the input. Each field will thus consist of a number of lines on stdin, separated by the boundary value on its own line. Let's get the boundary value now.
 
my ($junk,$boundary) = split /=/, $ENV{CONTENT_TYPE}, 2;
$boundary =~ s/\n//;
$boundary = "--$boundary";
Now we enter the main loop of the reader. If the line is a boundary line, then we read the header immediately following in order to know how to process subsequent lines; otherwise, we handle the current line appropriately depending on whether we're in the middle of a file, or just a plain data field. (If we're in the middle of a file, then the line is decoded and written to the file we have open while we're reading.)
 
LINE: while (<>) {
      last if ($_ eq "${boundary}--\r\n");
      if ($_ eq "$boundary\r\n") {
         See Handle transition to new field
      }

      if ($file) {
         # Write the line to the temp file.
         print _GETARGS_TEMP $_;
      } else {
         s/[\r\n]*$//;
         if ($lines > 0) {$tagset{$name} .= "\n"; }
         $tagset{$name} .= $_;
         $lines ++;
      }
   }
   close _GETARGS_TEMP;


Handle transition to new field
At the beginning of each field, we have several lines of headers, followed by a blank line, followed by content. When a boundary comes up, then, we have to interpret the headers to see what's coming next. The first header is the Content-Disposition, which gives us the name of the field, and if it's a file then it also gives us the client-side filename.
Content-Disposition: form-data; name="myfile"; filename="c:\myfile.txt"
So the first thing we do at the beginning of a field is to decipher the contents of this first line of headers.
 
$line=<>; # Get first line of headers.
$line =~ s/.*?; //; # Chop off the Content-Disposition part, we don't need it.
($name, $filename) = split /; /, $line, 2;
($junk, $name) = split /"/, $name;
Now we have the field name in $name and the filename (if a file upload) in $filename. We do a little housework:
 
close _GETARGS_TEMP;
$file = 0;
$lines = 0;
(The _GETARGS_TEMP file handle may be open from the last field -- that's the handle we use to write file upload data to.)

If the filename is blank, this field is just a plain old field and the $file flag will thus remain zero. But if there is a filename, then we have to process it and then get the next line so as to interpret the content type.

For each upload field, this code will create a "virtual field" in the output hash listing the details of the upload (like the client-side local filename). If the field is named myfield, then the details of the upload will end up in _details_myfield.
 
if ($filename ne "") {
   $file = 1;
   ($junk, $filename) = split /"/, $filename;
   $tagset{"_details_$name"} = $filename;
OK, so let's get the next line, which is the Content-Type. This is the MIME type of the incoming file.
 
   $line=<>;
   chomp $line;
   ($junk, $type) = split ": ", $line;
   $type =~ s/\r*//g;
   $tagset{"_details_$name"} .= "|" . $type;
Our next step will be to determine a filename in which to store the incoming file data. First, the extension. If you have extensions preset in the $uploads hashref, then we'll use that. (To do this, when building your hashref, do something like the following.)
my %uploads;
$uploads{mime}{image/jpeg} = "jpg";
$input = getargs(\%uploads);
The reason this is important is that many uploads will be coming from Windows machines, where extensions may well be in capital letters. If you're running on Unix, and using this for upload of files to be served from a website, then the extension ".JPG" probably won't be served with the proper mimetype of "image/jpeg", simply because the extension ".jpg" is the only one registered. Add that to the fact that JPEG files are often stored with a four-character extensions, and chaos ensues.

So let's check $uploads for an extension for the current MIME type.
 
   $ext = $$uploads{mime}{$type};
   if ($ext eq '') {
      $ext = $filename;
      $ext =~ s/^.*\.//;
   }
And now let's call getargs_makefilename to format us a filename based on fields already read. (See below for an explanation of this function.)
 
   $tagset{$name} = getargs_makefilename ($$uploads{file}, \%tagset, $name);
   if ($ext ne '') { $tagset{$name} .= ".$ext"; }
Then, using the $uploads parameter again to determine the base directory for the file, build a filename and open the file in _GETARGS_TEMP.
 
   $localname = "$$uploads{base}/$tagset{$name}";
   if ($localname ne '') {
      if (open _GETARGS_TEMP, ">$localname") {
         $tagset{"_details_$name"} .= "|" . $localname;
      }
   }
}
Finally, toss any remaining header lines (whether this is an upload field or not.) This is probably a bad idea, but so far it seems to be working.
 
while ($line=<>) { next LINE if $line == "\n"; }
And that was it. That's one way of parsing RFC1867 file uploads, and you could do substantially the same thing in any language (although those Perl regexps sure make things easy.)

Helper function to build filename
The only piece still missing is that getargs_makefilename function used above. It takes a name specification string, the hash currently being read as a hashref, and the name of the field actually being worked on.

It scans the namespec for things in square brackets, and replaces them with the corresponding value from the hashref. It has one special value: any occurrence of [(field)] will be replaced with the field name of the file being uploaded. This allows multiple files to be uploaded on a single form.

So if I have a form with a username field, say, and two upload fields, call them "file1" and "file2", then I could specify a namespec like "[username]_[(field)]". If my username is joe, then the first file will end up in a file named joe_file1 (plus the MIME-specified extension) and the second in joe_file2. Clear? I thought so.

Just to be on the safe side, we'll also remove "dangerous" characters from the filename. Dangerous characters are basically any kind of punctuation. And I like to replace spaces with underscores as well, just to make coding easier elsewhere. But hey -- if it doesn't work for you, by all means take that part out.
 
sub getargs_makefilename {
   my $spec = shift(@_);
   my $object = shift(@_);
   my $field = shift(@_);

   while ($spec =~ /\[(.*?)\]/) {
      $tag = $1;
      if ($tag eq '(field)') {
         $val = $field;
      } else {
         $val = $$object{$tag};
      }
      $val =~ tr/ /_/;
      $val =~ s/[&!"'*;]//g;
      $tag =~ s/\(/\\(/g;
      $tag =~ s/\)/\\)/g;
      $spec =~ s/\[$tag\]/$val/g;
   }

   return $spec;
}
This kind of thing is just so easy in Perl. I love Perl.

Conclusion
So all in all, RFC1867 may not be for the faint-hearted, because there is a lot to keep track of in comparison to plain text values -- but it's not brain surgery, either. I use the above code daily for a number of my customers' needs, and you can, too. If you do find it useful, tell me -- I'd like to know what's good and what's bad about this presentation. And if you make any changes, I'd love to know what and why.

This code and documentation are released under the terms of the GNU license. They are additionally copyright (c) 2000, Vivtek. All rights reserved except those explicitly granted under the terms of the GNU license. The code is presented in a literate programming style using the LPML tool, available free of charge from Vivtek. Try it, you'll like it.