Using the gather Directories

The gather directories are a place where web applications can save information to be collected by a batch process. Note that this does not allow real-time updates; changes from all the web servers can only be seen after the batch process runs.

The host on which the batch process runs has access to all of the gather directories from the different web servers, but each web server can only view its own gather directory. You need to send mail to www-mgmt to get the batch process executed on a regular schedule. While developing the batch process, you can write a short CGI which calls the program on the host wwwdev.cac.washington.edu or wwwudev.cac.washington.edu, which also have access to all gather directories.

An example of how the gather directories can be used is in the gather Directory Case Study.

Uses for the gather Directories

The gather directories are used when an application receives data from users and either needs to publish the data or process the data and publish the results.

Surveys
The CGI can gather information from a form and save it in a file with one line per entry. The batch process then collects the data and either saves a data file or generates HTML files.
Forms with blocks of text
If you have a form which has text areas, then it may be better to save data in separate files. The batch process could either generate a single HTML file or several files, depending on the requirements.

Writing to the gather Directories

The web server passes the SERVER_GATHER environment variable to all CGIs. Use this directory to write information you wish to later retrieve from the batch process. Note that there are other applications which also use this directory, so be sure to uniquely name your file. If you have several files, you should create a subdirectory in which you can save them.

You should not assume that the files or directories you use have already been created; your CGI needs to create them if they do not already exist.
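
As a minimal sketch of the points above, a CGI might create its subdirectory on first use and then append an entry. The subdirectory name myapp, the file name entries.txt, and the helper gather_file are hypothetical, and the fallback to a temporary directory is an assumption made only so the sketch can run outside the web server.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Return the path of a file inside our gather subdirectory, creating
# the subdirectory first if it does not exist yet.
# "myapp" is a hypothetical name; pick one unique to your application.
sub gather_file {
    my ($name) = @_;
    # SERVER_GATHER is set by the web server; the tmpdir fallback is an
    # assumption so that this sketch can be tried from a shell.
    my $base = $ENV{'SERVER_GATHER'} || File::Spec->tmpdir();
    my $dir  = "$base/myapp";
    make_path($dir) unless -d $dir;
    return "$dir/$name";
}

# Append one entry; opening with ">>" creates the file if needed.
my $path = gather_file('entries.txt');
open(my $fh, '>>', $path) or die "Cannot open $path: $!";
print $fh "example entry\n";
close($fh) or die "Cannot close $path: $!";
```

This sketch does no locking; see Locking Issues before using anything like it where several CGIs may write at once.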

An example of a path you can use (using Perl syntax) is:

"$ENV{'SERVER_GATHER'}/myapp"

Format of Data in gather Directories

Choose a format for the gather file data that makes saving it fast. If you are gathering short data (perhaps one line per entry), then one possibility is to append each line to the gather file. If you are collecting a large amount of data, or data which is very unstructured, then you can either save the data in a single file using unique separator lines, or you can write each entry into an individual file; you should save these files in a gather subdirectory.
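
The two layouts can be sketched as follows. This is only an illustration: the tab-separated encoding, the escaping rules, and the helper names encode_entry and entry_file_name are assumptions, not a format the gather directories require.

```perl
use strict;
use warnings;

# Encode a list of fields as one tab-separated line. Backslashes, tabs,
# and newlines inside a field are escaped so an entry never spans lines.
sub encode_entry {
    my @fields = @_;
    for my $field (@fields) {
        $field =~ s/\\/\\\\/g;
        $field =~ s/\t/\\t/g;
        $field =~ s/\r?\n/\\n/g;
    }
    return join("\t", @fields) . "\n";
}

# Build a file name for the one-file-per-entry layout. Combining the
# time, the process ID, and a counter is an assumption about what is
# unique enough within one server's gather subdirectory.
my $counter = 0;
sub entry_file_name {
    return sprintf("entry-%d-%d-%d.txt", time(), $$, $counter++);
}

my $line = encode_entry("Jane Doe", "likes\ttabs", "two\nlines");
```

Keeping each entry on a single line means the batch process can split the file on newlines without worrying about text areas that contain line breaks.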

Locking Issues

You need to be sure to use file locking for any files you write in the gather directories, since it's possible to have multiple CGIs and even the batch process running at the same time. If the CGI is a Perl script, then you can use the lockfile library. If you created a subdirectory and need to access the whole directory, you can lock the whole directory with the lockfile library.
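
The lockfile library mentioned above is not shown here. As a stand-in, the following sketch uses Perl's built-in flock to append under an exclusive advisory lock. It assumes that every program touching the file (including the batch process) uses the same locking scheme and that flock behaves correctly on the filesystem holding the gather directories, which is not guaranteed on network filesystems. The file name myapp-locked.txt and the temporary-directory fallback are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);
use File::Spec;

# Append a line while holding an exclusive advisory lock on the file.
sub locked_append {
    my ($path, $line) = @_;
    open(my $fh, '>>', $path) or die "Cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $path: $!";
    # Another process may have appended while we waited for the lock.
    seek($fh, 0, SEEK_END)    or die "Cannot seek in $path: $!";
    print $fh $line;
    # Closing the handle flushes the data and releases the lock.
    close($fh) or die "Cannot close $path: $!";
}

# Hypothetical file name; the tmpdir fallback only lets the sketch run
# outside the web server.
my $dir  = $ENV{'SERVER_GATHER'} || File::Spec->tmpdir();
my $file = "$dir/myapp-locked.txt";
locked_append($file, "entry one\n");
```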

Batch Processes

The batch process collects data from all the gather directories and writes data to the web directories, which can then get pushed out the next evening. Another possibility is that the batch process summarizes data and then sends a report via email.

Access Restrictions

The design of the www.washington.edu cluster protects the integrity of its web filesystem by ensuring that the server runs as a user different from the user which owns the files. If you have a batch process which needs to modify files in the web directories, then that process must execute as the user which owns those files. However, since that user is different from the user which owns the files in the gather directories, your batch process will not be able to modify or remove the files in the gather directories.

Emptying a single file
If you wish to empty the files in the gather directories after processing the data, you need to have the CGI perform that function. However, the CGI needs to know when the batch process has run so it knows when it is necessary to empty the file in the gather directory. One way to do this is with a timestamp file created by the batch process. The logic used by the CGI is:
- If the timestamp file does not exist
  The file has never been read, so do not empty it.
- Else, if the timestamp file exists and is newer than the current data file
  The file has been read more recently than the last time we wrote to it. Need to empty the data file.
- Else, if the timestamp file exists and is older than the current data file
  The file has already been cleared after it was last read. Do not empty it.
The logic used by the batch process is:
- If the timestamp file does not exist
  The file has never been read. Do so and create a new timestamp file.
- Else, if the timestamp file exists and is newer than the current data file
  The file has not changed since last read. Updating the timestamp file is optional.
- Else, if the timestamp file exists and is older than the current file
  The file has changed since last read. Read the file and update timestamp file.
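
The two decision lists above can be sketched as a pair of functions that compare modification times. The function names are hypothetical. The lists do not say what to do when the two timestamps are equal; this sketch treats equal times as "not newer", which is an assumption.

```perl
use strict;
use warnings;

# Modification time of a file, or undef if it does not exist.
sub mtime {
    my ($path) = @_;
    my @st = stat($path);
    return @st ? $st[9] : undef;
}

# CGI side: should the data file be emptied before the next write?
sub cgi_should_empty {
    my ($data, $stamp) = @_;
    my $stamp_time = mtime($stamp);
    return 0 unless defined $stamp_time;   # never read, keep the data
    my $data_time = mtime($data);
    return 0 unless defined $data_time;    # no data file, nothing to empty
    return $stamp_time > $data_time ? 1 : 0;
}

# Batch side: should the data file be read on this run?
sub batch_should_read {
    my ($data, $stamp) = @_;
    my $data_time = mtime($data);
    return 0 unless defined $data_time;    # no data file yet
    my $stamp_time = mtime($stamp);
    return 1 unless defined $stamp_time;   # never read before
    return $stamp_time < $data_time ? 1 : 0;
}
```

After reading, the batch process would create or update the timestamp file; the CGI would truncate the data file when cgi_should_empty returns true, while holding whatever lock it normally uses for writing.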
Emptying a directory
A similar method can be used if the CGI writes several files into a subdirectory. It's possible to just use the timestamp on the subdirectory, but it's best to use a separate timestamp file for the CGI. This is because modifying an existing file in the subdirectory does not change the timestamp of the subdirectory itself.

Writing into the Web Directories

Because the data changes automatically, it's best to have the batch process modify files directly in the production web directories (such as /www/world). Be sure that the directory in /usr/local/wwwdev or /usr/local/wwwudev uses a .wwwinstrc file to make sure the modified data files do not get removed by a wwwdinst or wwwuinst command. Information about these files can be found in the wwwdinst and wwwuinst documentation.

You should have the batch process write into the production directories because there is then no need to run wwwdinst or wwwuinst to install those files into production. The files will automatically be pushed out with the nightly push, assuming the batch process completes its work before the nightly push begins.

Reading from gather Directories

The paths to use in your batch process to read information from the gather directories depend on which server you are using. For www.washington.edu, you should use a path such as:

/gather/*/www.cac/myapp

A sample Perl code fragment to process all the files:

foreach my $file (glob "/gather/*/www.cac/myapp") {
    # do work on $file
}
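
A slightly fuller sketch of the same loop, assuming each matching file holds one entry per line. The helper name collect_entries is hypothetical, and the sketch does no locking, which a real batch process would need.

```perl
use strict;
use warnings;

# Read every file matching a glob pattern and return all non-empty
# lines, in the order glob returns the files.
sub collect_entries {
    my ($pattern) = @_;
    my @entries;
    foreach my $file (glob $pattern) {
        next unless -f $file;              # skip directories and the like
        my $fh;
        unless (open($fh, '<', $file)) {
            warn "Skipping $file: $!";
            next;
        }
        while (my $line = <$fh>) {
            chomp $line;
            push @entries, $line if length $line;
        }
        close($fh);
    }
    return @entries;
}

my @entries = collect_entries("/gather/*/www.cac/myapp");
printf "%d entries collected\n", scalar @entries;
```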

More detailed code examples are in the gather case study.