
Accessing web form data

A. Nature of the data

Consider the web page sample_form.html discussed in a previous handout.

The HTML for the page includes the usual opening and closing tags and also a <FORM ACTION="get_info.cgi" ...> ... </FORM> block, which specifies a web form. In this block are various tags for text fields, boxes, buttons, etc.

Notice that each field has a field name, given by its NAME= attribute. When the user fills out the form and clicks on the ``Submit'' button, the browser sends the information to the CGI program get_info.cgi as ``name=value pairs''.

For example, one field is named email and whatever the user fills in is its value. If the user fills in emilyts@ucla.edu then the information comes to your CGI program as a string

email=emilyts@ucla.edu

This string is actually part of a longer string holding all the data from the form. Your CGI program needs to take this longer string apart to get the data.

Point your browser to the sample form at

http://www.math.ucla.edu/~baker/40/sample_form.html

Then fill in some information and click on the ``Submit'' button. The information is sent to get_info.cgi, which summarizes the name=value pairs it found. Back up, change the information, and try again.

B. How does the browser put together the data to send?

The browser follows these steps:

  1. Make all the ``name=value'' strings described above.
  2. Replace most ``special characters'' in the strings by a percent sign and a two-digit hexadecimal code, so that a comma becomes %2C and so on. (You don't have to know the codes.) A current version of Netscape replaces all special characters except underbar, hyphen, asterisk, period, and space.
  3. Replace space characters by + (in browsers that have not already encoded the space characters)
  4. ``Join'' the ``name=value'' strings together using an ampersand & as the separator, to make one longer combined data string.

The browser gives the combined data string to the web server, which starts the CGI program and passes the string to it. The CGI program has to undo the steps just listed in order to get the information out.
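
The browser's steps above can be imitated in a few lines of Perl. This is only a sketch for experimenting, not what any particular browser does internally: the subroutine names encode_part and build_query_string and the field names city and comment are made up for illustration, the set of characters left unencoded follows the Netscape description in step 2, and the text is assumed to be plain ASCII.

```perl
use strict;
use warnings;

# Encode one name or one value (steps 2 and 3).
# Assumption: letters, digits, underscore, hyphen, asterisk, and period are
# left alone, a space becomes "+", and any other character becomes %XX.
sub encode_part {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_\-*. ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Build the combined data string from a list of [name, value] pairs
# (steps 1 and 4).
sub build_query_string {
    my @fields = @_;
    my @pairs  = map { encode_part($_->[0]) . "=" . encode_part($_->[1]) } @fields;
    return join "&", @pairs;
}

my $combined = build_query_string(
    [ "city",    "Los Angeles" ],
    [ "comment", "Hi, there!"  ],
);
print "$combined\n";    # city=Los+Angeles&comment=Hi%2C+there%21
```

The comma and the exclamation point come out as %2C and %21, matching their hex codes in the ASCII table in section F.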

C. What are the name=value pairs for the different kinds of fields?

In every case, the NAME= attribute gives the field name, so it's just the value that's in question.

Some points not to get confused about:

D. How is the combined data string sent?

There are two methods: ``GET'' and ``POST''. The method is specified as an attribute of the FORM tag, for example,

<FORM ACTION="get_info.cgi" METHOD="GET">

and a CGI Perl script is told the method via $ENV{"REQUEST_METHOD"}.

For ``GET'', the combined data string is appended to the URL, separated from it by a question mark. The web server gives it to a CGI Perl script as $ENV{"QUERY_STRING"}.

For ``POST'', the combined data string is given to the Perl CGI script as standard input of a certain length. The Perl script can tell the length from $ENV{"CONTENT_LENGTH"}. There is a Perl command to read a string of a fixed length into a variable, say $query_string:

read (STDIN, $query_string, $ENV{"CONTENT_LENGTH"});
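
The two cases can be put side by side in one subroutine. What follows is a sketch under stated assumptions rather than a finished solution: the name get_query_string is made up, and the environment and the input filehandle are passed in as parameters (instead of using %ENV and STDIN directly) only so that the code can be tried outside a web server.

```perl
use strict;
use warnings;

# Return the combined data string for either method.
# $env is a reference to a hash shaped like %ENV; $in is the input
# filehandle (STDIN in a real CGI run). Both are hypothetical parameters.
sub get_query_string {
    my ($env, $in) = @_;
    my $method = $env->{"REQUEST_METHOD"} || "";
    if ($method eq "GET") {
        my $from_url = $env->{"QUERY_STRING"};
        return defined $from_url ? $from_url : "";
    }
    if ($method eq "POST") {
        my $query_string = "";
        my $length = $env->{"CONTENT_LENGTH"} || 0;
        read($in, $query_string, $length);
        return $query_string;
    }
    return "";    # some other method: nothing to take apart
}

# Try the POST case without a server by faking the environment and
# reading from an in-memory "file" instead of the real standard input.
my $fake_body = "XX=hi&YY=there";
open(my $fake_stdin, "<", \$fake_body) or die "cannot open in-memory file: $!";
my %fake_env = (REQUEST_METHOD => "POST", CONTENT_LENGTH => length($fake_body));
print get_query_string(\%fake_env, $fake_stdin), "\n";    # XX=hi&YY=there
```

In a real CGI script the call would be get_query_string(\%ENV, \*STDIN).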

Which method is better? That depends on the application. The ``GET'' method is easier to test, since you can just make up data to append directly to the URL and call the CGI program directly. The ``POST'' method is used for larger amounts of data and for data containing passwords, where you don't want them visible in the browser's URL window.

Try this: In your browser, go directly to the URL

http://www.math.ucla.edu/~baker/40/get_info.cgi?XX=hi

instead of filling out the sample form. Does this do what you expect?

In the next assignment, you'll write a Perl subroutine to read the form either way, so you can call the subroutine to get the data without worrying about which way the data was sent. Of course, since you're designing the HTML form page as well, you do control which way the data is sent in the first place.

E. Summary of what your CGI program does

  1. See if $ENV{"REQUEST_METHOD"} is GET or POST. (Use eq.)
  2. If it's GET, use $query_string = $ENV{"QUERY_STRING"};

    If it's POST, use

    read (STDIN, $query_string, $ENV{"CONTENT_LENGTH"});

  3. It would be good to check here to make sure the string doesn't contain any special characters it shouldn't. (How could a hostile user have put them in?)
  4. Get the name-value pairs out by @pairs = split /\&/, $query_string;
  5. Loop through the pairs, splitting the name and value. One way:

    ($name, $value) = split /\=/, $pair, 2;

  6. In this loop, replace any + signs by spaces and then any %XX codes by the proper special characters.

    One way: Use $value =~ tr/+/ /; (which turns each + into a space) and then $value =~ s/\%([\dA-Fa-f]{2})/chr(hex($1))/ge;

    Here [\dA-Fa-f] describes possible hex digits, {2} means ``exactly two'' of them, the parentheses () capture those two digits as $1, hex() converts hex to an integer, chr finds the character with that integer as its code, g means global as usual, and e means ``evaluate the expression'' instead of putting in chr(hex...) literally.

  7. In the loop, also save the value appropriately; we'll discuss that.
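
Steps 4 through 7 can be sketched as one subroutine that takes the combined data string and hands back the decoded pairs. The name parse_query_string is made up, and saving the values in a hash is only one possible choice for step 7; with this choice a repeated field name simply overwrites the earlier value. The check in step 3 is left out here.

```perl
use strict;
use warnings;

# Turn a combined data string into a hash of name => value.
sub parse_query_string {
    my ($query_string) = @_;
    my %form;
    my @pairs = split /&/, $query_string;            # step 4
    foreach my $pair (@pairs) {
        my ($name, $value) = split /=/, $pair, 2;    # step 5
        next unless defined $name && length $name;   # skip empty names
        $value = "" unless defined $value;           # "name" with no "="
        foreach my $part ($name, $value) {           # step 6, on both parts
            $part =~ tr/+/ /;
            $part =~ s/%([\dA-Fa-f]{2})/chr(hex($1))/ge;
        }
        $form{$name} = $value;                       # step 7 (one choice)
    }
    return %form;
}

my %form = parse_query_string("city=Los+Angeles&comment=Hi%2C+there%21");
print "$form{city}\n";       # Los Angeles
print "$form{comment}\n";    # Hi, there!
```

The + signs are replaced before the %XX codes are decoded, so a real plus sign sent as %2B is not mistaken for a space.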

F. The ASCII character set in hexadecimal.

You don't have to learn these codes; just get the flavor.

The first group consists of nonprinting characters, with antique names that come from teletype codes decades ago. The ones still occasionally relevant for you are BS=backspace, BEL=bell, HT=tab (i.e., horizontal tab), NL=newline, NP=new page, CR=carriage return, and ESC=escape. The end of line is signaled by NL in UNIX, CR NL in DOS and Windows, and CR on Macintoshes.

Some of these nonprinting characters are produced directly by keys, for example tab, backspace, and escape. All the nonprinting characters can be produced using the ``control key'' CTRL, which sets the first three bits to 0. For example, backspace can also be produced by CTRL h , since h is hex 68 = 01101000, which is stripped to 00001000 = hex 08 = backspace.
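
The bit-stripping just described can be checked with a short Perl sketch. The subroutine name ctrl_code is made up for illustration; it keeps only the last five bits of the character's 8-bit code, which is the same as setting the first three bits to 0.

```perl
use strict;
use warnings;

# Code produced by holding CTRL with the given key, following the
# description above: keep the low five bits (mask with hex 1F).
sub ctrl_code {
    my ($char) = @_;
    return ord($char) & 0x1F;
}

foreach my $key ("h", "i", "j", "m", "[") {
    printf "CTRL %s -> hex %02X\n", $key, ctrl_code($key);
}
```

This prints hex 08, 09, 0A, 0D, and 1B, which are BS, HT, NL, CR, and ESC in the table below.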

00 NUL   01 SOH   02 STX   03 ETX   04 EOT   05 ENQ   06 ACK   07 BEL  
08 BS    09 HT    0A NL    0B VT    0C NP    0D CR    0E SO    0F SI   
10 DLE   11 DC1   12 DC2   13 DC3   14 DC4   15 NAK   16 SYN   17 ETB  
18 CAN   19 EM    1A SUB   1B ESC   1C FS    1D GS    1E RS    1F US   
20 SP    21  !    22  "    23  #    24  $    25  %    26  &    27  '   
28  (    29  )    2A  *    2B  +    2C  ,    2D  -    2E  .    2F  /   
30  0    31  1    32  2    33  3    34  4    35  5    36  6    37  7   
38  8    39  9    3A  :    3B  ;    3C  <    3D  =    3E  >    3F  ?   
40  @    41  A    42  B    43  C    44  D    45  E    46  F    47  G   
48  H    49  I    4A  J    4B  K    4C  L    4D  M    4E  N    4F  O   
50  P    51  Q    52  R    53  S    54  T    55  U    56  V    57  W   
58  X    59  Y    5A  Z    5B  [    5C  \    5D  ]    5E  ^    5F  _   
60  `    61  a    62  b    63  c    64  d    65  e    66  f    67  g   
68  h    69  i    6A  j    6B  k    6C  l    6D  m    6E  n    6F  o   
70  p    71  q    72  r    73  s    74  t    75  u    76  v    77  w   
78  x    79  y    7A  z    7B  {    7C  |    7D  }    7E  ~    7F DEL

In this table of 128 codes the first bit is always 0. In Extended ASCII there are 128 more codes for other symbols, with the first bit 1.

A more modern character set is Unicode, which was originally designed around 16-bit codes (its code space now extends to hex 10FFFF, which needs 21 bits) and encodes the characters of many more languages.



Kirby A. Baker
Wed Feb 17 14:49:06 PST 1999