The HTTP Conversation

In Hour 21, "Introduction to CGI," you learned about the basic conversation between the web browser (Netscape, Internet Explorer, and so on) and the web server (Apache, IIS, and so on). The discussion in that hour was somewhat oversimplified. Now that you're more comfortable with CGI, it's probably a good time to take a closer look at this subject. Later in this hour, you'll learn some techniques for manipulating this conversation to perform some interesting tasks.

The conversation is described by a protocol known as the Hypertext Transfer Protocol (HTTP). The two current versions of this standard are HTTP 1.0 and HTTP 1.1. For purposes of this discussion, either one is applicable.

By the Way

The Internet standards documents that describe the protocols used on the Internet are called Requests for Comments, or RFCs. The RFCs are maintained by the Internet Engineering Task Force (IETF) and can be viewed on the Web at http://www.ietf.org. The specific documents that describe HTTP are RFC 1945 (HTTP 1.0) and RFC 2616 (HTTP 1.1). Be forewarned: These documents are highly technical in nature.

When your web browser makes a connection to the web server, the browser sends an initial message to the server: a GET line followed by a series of header lines such as Connection, Accept, Host, and User-Agent.

The GET line indicates the path part of the URL you're trying to retrieve and what version of the protocol you're accepting. In this case, you're accepting version 1.0 of HTTP.

The Connection line indicates that you would like this connection kept open for multiple page fetches. By default, an HTTP 1.0 browser makes a separate connection for each frame, page, and image on a web page. The directive Keep-Alive asks the server to keep the connection open so that multiple items can be fetched using the same connection.

The Accept lines indicate what sorts of data you're willing to accept on the connection. The */* at the end of the first Accept line indicates that you're willing to accept any kind of data. The next line, Accept-Charset (iso-8859-1 and so on), indicates what character sets can be used for the document. The Accept-Encoding line, in this case, says that the browser can accept content compressed with gzip (GNU Zip) for a faster transfer. Finally, Accept-Language indicates what languages are acceptable to this browser: English, British English, German, French, and so on.

Host is the hostname from the URL that you're retrieving. Because of virtual hosting, it might be different from the main hostname of the server.

Finally, the browser identifies itself to the web server as Mozilla/4.51 C-c32f404p (WinNT; U). In web terminology, the browser is called a user agent.
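To make the pieces of the request concrete, here is a minimal sketch, written in Python purely for illustration, that assembles a request of the kind described above. The path /index.html, the host www.webserver.com, and the exact Accept, Accept-Charset, and Accept-Language values are assumptions made for the example; only the Keep-Alive directive, the gzip encoding, and the user-agent string come from the discussion.

```python
# Header names follow the discussion above; most values are illustrative
# assumptions, not a capture of a real browser's request.
REQUEST_HEADERS = [
    ("Connection", "Keep-Alive"),
    ("User-Agent", "Mozilla/4.51 C-c32f404p (WinNT; U)"),
    ("Host", "www.webserver.com"),                 # hypothetical host
    ("Accept", "image/gif, image/jpeg, */*"),      # assumed media types
    ("Accept-Encoding", "gzip"),
    ("Accept-Language", "en, en-GB, de, fr"),      # assumed language tags
    ("Accept-Charset", "iso-8859-1,*,utf-8"),      # assumed character sets
]


def format_request(path, headers):
    """Return the text a browser would send: a GET line, then one line
    per header, then a blank line that marks the end of the request."""
    lines = [f"GET {path} HTTP/1.0"]
    lines.extend(f"{name}: {value}" for name, value in headers)
    # Each line ends with CR LF; the empty line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = format_request("/index.html", REQUEST_HEADERS)
```

Printing request_text shows the GET line first and each header on its own line, which is the shape of the message the server receives.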

The server replies with a header of its own, followed by the requested content. In that header, the status 200 indicates that everything went fine. The server also identifies itself on the Server line; in this case, the server is a Netscape-Enterprise/3.5.1G web server.

The Content-Length line indicates that 2,222 bytes of content will be sent back to the browser. Using this information, your browser knows that a page is 50 percent complete, 60 percent complete, and so on. The Content-Type is the kind of page that is being sent back. For HTML pages, this line is set to text/html. For an image, it might be set to image/jpeg.

The Last-Modified date tells the browser when the page last changed. Most web browsers cache pages, so when you look at a web page a second time, the browser can compare this date with the date of the copy it has already saved. If the page on the server hasn't changed, downloading the entire page again might not be necessary.
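The way a browser might use these reply lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample header reuses the status, Server, Content-Length, and Content-Type values from the discussion, but the Last-Modified date is invented, and the helper names (parse_header, percent_complete, changed_since) are hypothetical.

```python
from email.utils import parsedate_to_datetime

# Sample reply header. The Last-Modified date is a made-up example value.
SAMPLE_HEADER = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Netscape-Enterprise/3.5.1G\r\n"
    "Content-Length: 2222\r\n"
    "Content-Type: text/html\r\n"
    "Last-Modified: Sun, 06 Nov 1994 08:49:37 GMT\r\n"
    "\r\n"
)


def parse_header(raw):
    """Split a reply header into its numeric status and a dict of fields."""
    head = raw.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def percent_complete(bytes_received, content_length):
    """Progress figure a browser can show once it knows Content-Length."""
    return bytes_received * 100 // content_length


def changed_since(cached_last_modified, server_last_modified):
    """True if the server's copy is newer than the cached copy."""
    return (parsedate_to_datetime(server_last_modified)
            > parsedate_to_datetime(cached_last_modified))


status, headers = parse_header(SAMPLE_HEADER)
```

With the sample header, status is 200 and headers["content-type"] is text/html; after 1,111 of the 2,222 bytes arrive, percent_complete reports 50.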

Example: Fetching a Page Manually

You can fetch a web page manually. This capability is often useful when you're just testing and want to make sure that a web server is sending correct replies.

To follow this example, you need a program called a Telnet client. The Telnet client is a remote-terminal access program used to log in to Unix workstations remotely; however, it's often useful for tasks such as debugging HTTP.

If you have a Unix machine, it's very likely that you have Telnet installed already. If you have a Microsoft Windows machine, Telnet may already be installed as part of your networking utilities. Simply use the Run option from the Start menu to run the Telnet client.

To connect, run a command such as telnet www.webserver.com 80. Here, www.webserver.com is the name of the web server, and 80 is the port number you want to connect to (port 80 is typically where web servers are listening). If your Telnet client is a graphical one, you might need to set these values in a dialog box.

When Telnet connects, you might not receive a prompt or a connection message. Don't worry; that's normal. HTTP expects the client to talk first; the server isn't expected to prompt. Under Unix, Telnet typically reports that it has connected and tells you what its escape character is.

Type a request line such as GET / HTTP/1.0, and then press the Enter key twice. The web server should respond with a normal HTTP header and the top-level page for the web site and then disconnect.
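If no Telnet client is handy, the same manual fetch can be approximated with a short script. The following is a minimal sketch in Python, not a replacement for a real HTTP library: it opens a plain TCP connection, sends a GET line and a Host line followed by a blank line, and reads until the server disconnects. The function names build_request and fetch_raw are hypothetical, and www.webserver.com stands in for whatever server you are testing.

```python
import socket


def build_request(host, path="/"):
    """Return the bytes you would otherwise type into Telnet."""
    # The two trailing empty strings produce the blank line that ends
    # the request (the equivalent of pressing Enter twice).
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, port=80, path="/", timeout=10.0):
    """Send a bare HTTP 1.0 request and return the raw reply,
    header and body together, exactly as the server sent it."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an empty read means the server disconnected
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_raw("www.webserver.com") and printing the decoded result shows the status line and header first, then a blank line, then the page itself.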

Redirection

One useful trick to use in CGI programs is called HTTP redirection. Use redirection when you want a CGI program to load another page based on some computed value.

If you have a series of pages specific to a browser—for example, they contain a plug-in that's available only to Mozilla browsers under Microsoft Windows—you can send all the web site visitors to the same URL and have a CGI program redirect them to the correct page, as illustrated in Figure 24.1.

Another example is a CGI program that has produced a downloadable file (a PDF file, a ZIP file, an Excel spreadsheet, and so on) and placed it somewhere else on the web server. You can then redirect the browser to pick up the file so that the CGI program doesn't have to be active during the download itself.

To implement a redirection, you need to use the CGI module's redirect function. The redirect function manipulates the HTTP conversation discussed earlier and causes the browser to load a new page.

Listing 24.1 contains a short program to redirect users of Mozilla under Windows to one page and all other browsers to a different page.

Watch Out!

The redirect header has to be printed before anything else is emitted from the CGI script. Don't print the output of the header() function or anything else before calling redirect().

Listing 24.1. Redirection Based on Browser

Redirection through CGI is seamless, whereas other techniques such as using JavaScript and HTML extensions have problems. JavaScript is not supported on all platforms, and using a window.location.href assignment in JavaScript might not produce the proper results. Using an HTML <META HTTP-EQUIV="refresh"> tag for redirection causes a noticeable delay because the browser has to load the page completely before the redirection can take place. JavaScript shares this problem. HTTP redirection happens before any HTML is transmitted and is nearly instantaneous.
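To show what a redirect amounts to in the HTTP conversation, here is a minimal sketch in Python of a CGI program that picks a page based on the browser. It is a hand-written stand-in, not the CGI module's redirect function: it simply prints a Status line and a Location line, followed by a blank line, before any other output. The two URLs and the function names are hypothetical, and the browser test is the same naive substring check the discussion describes.

```python
import os
import sys

# Hypothetical destination pages.
MOZILLA_WINDOWS_URL = "http://www.webserver.com/mozilla-windows.html"
DEFAULT_URL = "http://www.webserver.com/generic.html"


def choose_url(user_agent):
    """Pick a page from the user-agent string (a naive check; browsers
    can and do misreport themselves)."""
    if "Mozilla" in user_agent and "Win" in user_agent:
        return MOZILLA_WINDOWS_URL
    return DEFAULT_URL


def redirect_header(url):
    """Build the header block that tells the browser to load another page.
    Nothing may be printed before it, and the blank line ends the header."""
    return f"Status: 302 Found\nLocation: {url}\n\n"


def main():
    # A web server running CGI passes the User-Agent header to the
    # program in the HTTP_USER_AGENT environment variable.
    user_agent = os.environ.get("HTTP_USER_AGENT", "")
    sys.stdout.write(redirect_header(choose_url(user_agent)))


if __name__ == "__main__":
    main()
```

Because the reply consists of a header and nothing else, the browser can act on the Location line without waiting for a page to load.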

By the Way

The user_agent name returned by a typical Windows XP Firefox browser is similar to Firefox/1.0.1 (Windows NT 5.1; U; pl-PL). Microsoft Internet Explorer 6 identifies itself as Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). Coding a script that guesses the browser in a foolproof manner is almost impossible, though, because a browser can lie about its user agent type, and often will.
