
Accessing web form data

A. Nature of the data

Consider the web page sample_form.html discussed in a previous handout.

The HTML for the page includes the usual opening and closing tags and also a <FORM ACTION="get_info.cgi" ...> ... </FORM> block, which specifies a web form. In this block are various tags for text fields, checkboxes, buttons, etc.

Notice that each field has a field name, given by the attribute NAME=. When the user fills out the form and clicks on the "Submit" button, the browser sends the information to the CGI program get_info.cgi as "name=value pairs".

For example, one field is named email and whatever the user fills in is its value. If the user fills in emilyts@ucla.edu then the information comes to your CGI program as a string

email=emilyts@ucla.edu

This string is actually part of a longer string holding all the data from the form. Your CGI program needs to take this longer string apart to get the data.

Point your browser to the sample form at

http://www.math.ucla.edu/~baker/40/sample_form.html

Then fill in some information and click on the "Submit" button. The information is sent to get_info.cgi, which summarizes the name=value pairs it found. Go back, change the information, and try again.

B. How does the browser put together the data to send?

The browser follows these steps:

Make all the "name=value" strings described above.
Replace most "special characters" in the strings by a percent sign and a two-digit hexadecimal code, so that a comma becomes %2C and so on. (You don't have to know the codes.) A current version of Netscape replaces all special characters except underscore, hyphen, asterisk, period, and space.
Replace space characters by + (in browsers that have not already encoded the space characters).
Join the "name=value" strings together using an ampersand & as the separator, to make one longer combined data string.

The browser gives the combined data string to the web server, which starts the CGI program and passes the string to it. The CGI program has to undo the steps just listed in order to get the information out.
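
The steps listed above can be imitated in Perl to see what the combined data string looks like. This is only a sketch of the browser's side, not code your CGI program needs: the subroutine names encode_piece and build_query_string and the sample field values are made up for illustration, and the set of characters left unencoded follows the Netscape description above.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one name or value as described in the steps.
sub encode_piece {
    my ($text) = @_;
    # Percent-encode everything except letters, digits, underscore,
    # hyphen, asterisk, period, and space.
    $text =~ s/([^A-Za-z0-9_\-*. ])/sprintf("%%%02X", ord($1))/ge;
    # Replace the remaining spaces by plus signs.
    $text =~ tr/ /+/;
    return $text;
}

# Hypothetical helper: build the combined data string from a list of
# [name, value] pairs, keeping the fields in order.
sub build_query_string {
    my @fields = @_;
    my @pairs = map { encode_piece($_->[0]) . "=" . encode_piece($_->[1]) } @fields;
    return join "&", @pairs;
}

# Made-up sample data.
my $combined = build_query_string(
    [ "name",  "Emily T. Smith" ],
    [ "email", 'emilyts@ucla.edu' ],
    [ "note",  "red, white & blue" ],
);
print "$combined\n";
```

With this sample data the script prints name=Emily+T.+Smith&email=emilyts%40ucla.edu&note=red%2C+white+%26+blue. A plus sign typed by the user would come out as %2B, so it cannot be mistaken for an encoded space.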

C. What are the name=value pairs for the different kinds of fields?

In every case, the NAME= attribute gives the field name, so it's just the value that's in question.

For a text field the value is whatever the user types in.
For a multiline text field (text area) the value is whatever the user types in.
For a checkbox such as <INPUT TYPE="checkbox" NAME="student"> the value is on if the box is checked; if the box is not checked, no name=value pair is sent for it at all. You can use the attribute VALUE= to change on to something else if you wish, for example, VALUE="good".
Different checkboxes should have different names, so your program can tell them apart when it gets the data, or if they have the same name, then they should have different values. Your program will need to see whether any value was sent or not.
Radio buttons are in groups, where the user is allowed to check only one button in the group. All the radio buttons in the group must have the same name but different values, set with the attribute VALUE= . Having a name in common is how the browser knows they are in the same group. The values have to be different so your CGI program can tell which one was checked.
A menu or scrolled list can be set to allow just one choice or to allow multiple choices (with the attribute MULTIPLE). Each value selected results in a separate name=value pair. For example, if a form asks the user to select US states and allows multiple choices and if the user selects both CA and TX, the CGI program will get both
state=CA and state=TX .
For a password field, the value is whatever the user types. The user sees only asterisks while typing.
For a submit field such as <INPUT TYPE="submit" VALUE="Submit the form"> the value is given by the VALUE= attribute, which is also what is printed on the button. The pair is sent only if the button also has a NAME= attribute.
For a reset field, the VALUE= attribute sets what is printed on the button, but no name=value pair is sent, since clicking the button only resets the form to its default values.
For a hidden field such as <INPUT TYPE="hidden" NAME="info" VALUE="student"> the value is whatever is given by the VALUE attribute, since the user doesn't have any opportunity to change it!
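
Two of the cases above are worth seeing from the CGI program's side: a checkbox that may send nothing, and a menu that may send the same name more than once. The following is a rough sketch, not the handout's own method. The combined data string and the field name mailing_list are made up, and decoding of special characters is left out to keep it short.

```perl
use strict;
use warnings;

# Made-up combined data string: the "student" checkbox was checked,
# two states were selected, and a hidden field came along.
my $query_string = "student=on&state=CA&state=TX&info=student";

# Collect every value under its field name. A name that occurs more than
# once (here "state") simply gets more than one entry in its list.
my %values_for;
for my $pair (split /&/, $query_string) {
    my ($name, $value) = split /=/, $pair, 2;
    push @{ $values_for{$name} }, $value;
}

# An unchecked checkbox sends nothing, so test whether the name exists.
my $is_student = exists($values_for{student})      ? "yes" : "no";
my $wants_mail = exists($values_for{mailing_list}) ? "yes" : "no";  # hypothetical field

print "student checked: $is_student\n";
print "mailing_list checked: $wants_mail\n";
print "states: @{ $values_for{state} }\n";
```

Keeping a list of values per name is one possible design; it avoids losing TX when CA arrives under the same name.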

Some points not to get confused about:

For most fields you can specify a VALUE= attribute, which becomes the default value. So it's returned as part of a name=value pair only if the user doesn't change it to something else.
A field asking for the user's name may have "name" as its name, as in the example, which has NAME=name.

D. How is the combined data string sent?

There are two methods: "GET" and "POST". The method is specified as an attribute of the FORM tag, for example,

<FORM ACTION="get_info.cgi" METHOD="GET">

and a CGI Perl script is told the method via $ENV{"REQUEST_METHOD"}

For "GET", the combined data string is appended to the URL, separated by a question mark. The web server gives it to a CGI Perl script as $ENV{"QUERY_STRING"}.

For "POST", the combined data string is given to the Perl CGI script as standard input of a certain length. The Perl script can tell the length from $ENV{"CONTENT_LENGTH"}. There is a Perl command to read a string of a fixed length into your variable, say $query_string:

read (STDIN, $query_string, $ENV{"CONTENT_LENGTH"});

Which method is better? That depends on the application. The "GET" method is easier to test, since you can just make up data to append directly to the URL and call the CGI program directly. The "POST" method is used for larger amounts of data and for data containing passwords, where you don't want them visible in the browser's URL window.

Try this: In your browser, go directly to the URL

http://www.math.ucla.edu/~baker/40/get_info.cgi?XX=hi

instead of filling out the sample form. Does this do what you expect?

In the next assignment, you'll write a Perl subroutine to read the form either way, so you can call the subroutine to get the data without worrying about which way the data was sent. Of course, since you're designing the HTML form page as well, you do control which way the data is sent in the first place.
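
As a starting point, here is one possible shape for such a subroutine. It is a sketch under assumptions, not the assigned solution: the name read_query_string is made up, and the environment and input are passed in as arguments so the subroutine can be tried without a web server. A real CGI script would call it as read_query_string(\%ENV, \*STDIN).

```perl
use strict;
use warnings;

# Hypothetical subroutine: return the combined data string for either method.
# $env is a reference to a hash like %ENV; $in is a filehandle like STDIN.
sub read_query_string {
    my ($env, $in) = @_;
    my $method = $env->{REQUEST_METHOD} || "";
    my $query_string = "";
    if ($method eq "GET") {
        $query_string = $env->{QUERY_STRING};
        $query_string = "" unless defined $query_string;
    }
    elsif ($method eq "POST") {
        my $length = $env->{CONTENT_LENGTH} || 0;
        read($in, $query_string, $length);
    }
    return $query_string;
}

# Try the GET case with made-up data.
my %fake_get = (REQUEST_METHOD => "GET", QUERY_STRING => "XX=hi");
print read_query_string(\%fake_get, \*STDIN), "\n";

# Try the POST case, reading from an in-memory string instead of STDIN.
my $body = "name=Emily&student=on";
open(my $fake_in, "<", \$body) or die "cannot open in-memory file: $!";
my %fake_post = (REQUEST_METHOD => "POST", CONTENT_LENGTH => length($body));
print read_query_string(\%fake_post, $fake_in), "\n";
```

For any other method the sketch returns an empty string; whether that is the right behavior is a design decision left open here.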

E. Summary of what your CGI program does

See if $ENV{"REQUEST_METHOD"} is GET or POST . (Use eq .)
If it's GET, use $query_string = $ENV{"QUERY_STRING"};
If it's POST, use
read (STDIN, $query_string, $ENV{"CONTENT_LENGTH"});
It would be good to check here to make sure the string doesn't contain any special characters it shouldn't. (How could a hostile user have put them in?)
Get the name-value pairs out by @pairs = split /\&/, $query_string;
Loop through the pairs, splitting each into name and value at the first = sign. One way:
($name,$value) = split /\=/, $pair, 2;
In this loop, replace any + signs by spaces and then any codes by the proper special characters.
One way: Use $value =~ tr/+/ /; and then $value =~ s/\%([\dA-Fa-f]{2})/chr(hex($1))/ge;
In the second command, [\dA-Fa-f] describes possible hex digits, {2} means "exactly two" of them, () results in $1, hex() converts hex to an integer, chr finds the character with that integer as its code, g means global as usual, and e means "evaluate the expression" instead of putting in chr(hex...) literally.
In the loop, also save the value appropriately; we'll discuss that.
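
The splitting and decoding steps of this summary can be tried on a fixed string before any web server is involved. The sketch below makes several assumptions: the subroutine names decode_piece and parse_query_string are made up, the sample string is invented, and saving into a plain hash is only one way to store the values (with it, a repeated name keeps just its last value).

```perl
use strict;
use warnings;

# Hypothetical helper: undo the browser's encoding of one name or value.
sub decode_piece {
    my ($text) = @_;
    $text = "" unless defined $text;
    $text =~ tr/+/ /;                               # plus signs back to spaces
    $text =~ s/%([\dA-Fa-f]{2})/chr(hex($1))/ge;    # %XX codes back to characters
    return $text;
}

# Hypothetical helper: split a combined data string into a hash.
sub parse_query_string {
    my ($query_string) = @_;
    my %form;
    for my $pair (split /&/, $query_string) {
        my ($name, $value) = split /=/, $pair, 2;
        $form{ decode_piece($name) } = decode_piece($value);
    }
    return %form;
}

# Made-up sample data.
my %form = parse_query_string(
    'name=Emily+T.+Smith&email=emilyts%40ucla.edu&note=red%2C+white+%26+blue'
);
for my $name (sort keys %form) {
    print "$name: $form{$name}\n";
}
```

The plus signs are handled before the percent codes on purpose: a %2B in the data then becomes a real plus sign instead of being turned into a space.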

F. The ASCII character set in hexadecimal.

You don't have to learn these codes; just get the flavor.

The first group consists of nonprinting characters, with antique names that come from teletype codes of decades ago. The ones still occasionally relevant for you are BS=backspace, BEL=bell, HT=tab (i.e., horizontal tab), NL=newline, NP=new page, CR=carriage return, and ESC=escape. The end of line is signaled by NL in UNIX, CR NL in DOS and Windows, and CR on Macintoshes.

Some of these nonprinting characters are produced directly by keys, for example tab, backspace, and escape. All the nonprinting characters in this first group can be produced using the "control key" CTRL, which sets the first three bits to 0. For example, backspace can also be produced by CTRL h, since h is hex 68 = 01101000, which is stripped to 00001000 = hex 08 = backspace.
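
The bit-stripping can be checked with a few lines of Perl. This is just an illustration of the arithmetic; the subroutine name control_code is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: the code that CTRL plus the given key produces,
# found by clearing the first three bits (keeping only the last five).
sub control_code {
    my ($key) = @_;
    return ord($key) & 0x1F;
}

for my $key ("h", "i", "g", "[") {
    printf "CTRL %s = hex %02X\n", $key, control_code($key);
}
```

This prints hex 08, 09, 07, and 1B, which the table below lists as BS, HT, BEL, and ESC.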

00 NUL   01 SOH   02 STX   03 ETX   04 EOT   05 ENQ   06 ACK   07 BEL  
08 BS    09 HT    0A NL    0B VT    0C NP    0D CR    0E SO    0F SI   
10 DLE   11 DC1   12 DC2   13 DC3   14 DC4   15 NAK   16 SYN   17 ETB  
18 CAN   19 EM    1A SUB   1B ESC   1C FS    1D GS    1E RS    1F US   
20 SP    21  !    22  "    23  #    24  $    25  %    26  &    27  '   
28  (    29  )    2A  *    2B  +    2C  ,    2D  -    2E  .    2F  /   
30  0    31  1    32  2    33  3    34  4    35  5    36  6    37  7   
38  8    39  9    3A  :    3B  ;    3C  <    3D  =    3E  >    3F  ?   
40  @    41  A    42  B    43  C    44  D    45  E    46  F    47  G   
48  H    49  I    4A  J    4B  K    4C  L    4D  M    4E  N    4F  O   
50  P    51  Q    52  R    53  S    54  T    55  U    56  V    57  W   
58  X    59  Y    5A  Z    5B  [    5C  \    5D  ]    5E  ^    5F  _   
60  `    61  a    62  b    63  c    64  d    65  e    66  f    67  g   
68  h    69  i    6A  j    6B  k    6C  l    6D  m    6E  n    6F  o   
70  p    71  q    72  r    73  s    74  t    75  u    76  v    77  w   
78  x    79  y    7A  z    7B  {    7C  |    7D  }    7E  ~    7F DEL

In this table of 128 codes the first bit is always 0. In Extended ASCII there are 128 more codes for other symbols, with the first bit 1.

A more modern character set is Unicode, which was originally designed around 16-bit codes and now defines code points up to hex 10FFFF, covering the characters of many more languages.


Kirby A. Baker
Wed Feb 17 14:49:06 PST 1999