Nightly Batch Process Annotated Source

This script is run nightly to collect information from the gather files written by gather.cgi. The data file used by the CGI is then updated.

Main Definitions

#!/usr/local/bin/perl5
 
use strict;
use File::stat;
use lib '/www/lib', '/usr/local/www/lib';
require 'lockfile.pl';
my $blib = $0;
$blib =~ s{[^/]*$}{browserlib.pl};
require $blib;
 
sub addto ($$);

As with the CGI script, we use our own browser library to figure out what browser and operating system the client is using. We also use the lockfile library to make sure we don't interfere with a CGI script when reading the data file.
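The substitution applied to $0 above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named sibling_path and made-up script paths; it only illustrates how the last path component is swapped for the library name.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the final path component of a script path
# with another file name, mirroring the s{[^/]*$}{browserlib.pl} idiom.
sub sibling_path {
    my ($script, $name) = @_;
    (my $path = $script) =~ s{[^/]*$}{$name};
    return $path;
}

# The script paths below are made-up examples, not the real install paths.
print sibling_path('/www/cgi-bin/nightly.pl', 'browserlib.pl'), "\n";
print sibling_path('nightly.pl', 'browserlib.pl'), "\n";
```

When the script path has no directory part, the result is a bare file name, which require then looks up through @INC.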

Reading Gather Files

my $datafile = '/www/world/webinfo/case/gather/gather.data';
 
my (%browsers, %platforms) = ();
foreach my $file (glob "/gather/*/www/webinfo_case_gather") {
    next unless lockfile ($file);

After defining where the data file is, we start looking at all the files in the different gather directories. Note the path in gather.cgi was /wwwgather/webinfo_case_gather. We only work with files for which we can get a lock.

Test for new data

    my $stamp = "$file.READ";
    if (-f $stamp) {
        my $fst = stat $file;
        my $sst = stat $stamp;
        unlockfile ($file), next if $fst->mtime < $sst->mtime;
    }

This is where we decide whether the file needs to be read. The gather.cgi script will make sure that the gather file is emptied out if this batch script has read the contents. Every time this batch process reads a gather file, it creates a timestamp file, so the first thing we do is test whether the timestamp file exists. If it doesn't, then we know we've never read the file. However, if the timestamp does exist and is more recent than the gather file, we know that we've read the contents more recently than it has been written; in this case we do not need to read the file and we go to the next. We need to be sure to unlock the file, however.
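The three cases described above (never read, already read, rewritten since the last read) can be exercised with temporary files. This is a sketch only: needs_read and touch_at are hypothetical helpers, and the file name is a stand-in for a real gather file.

```perl
use strict;
use warnings;
use File::stat;
use File::Temp qw(tempdir);

# Hypothetical helper: true when the gather file should be read, that is,
# when no timestamp file exists or the gather file is at least as new as it.
sub needs_read {
    my ($file) = @_;
    my $stamp = "$file.READ";
    return 1 unless -f $stamp;
    return stat($file)->mtime >= stat($stamp)->mtime ? 1 : 0;
}

# Hypothetical helper: create an empty file and force its modification time.
sub touch_at {
    my ($path, $mtime) = @_;
    open my $fh, '>', $path or die "cannot create $path: $!";
    close $fh;
    utime $mtime, $mtime, $path or die "cannot set time on $path: $!";
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/webinfo_case_gather";   # stand-in for a real gather file
my $now  = time;

touch_at($file, $now - 100);
my $never_read = needs_read($file);      # no timestamp file yet
touch_at("$file.READ", $now - 50);
my $already_read = needs_read($file);    # timestamp is newer than the data
touch_at($file, $now);
my $rewritten = needs_read($file);       # data is newer than the timestamp

print "never read: $never_read, already read: $already_read, rewritten: $rewritten\n";
```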

Read in gather file's data

    if (open F, "<$file") {
        while (<F>) {
            chomp;
            my ($b, $p) = whatbrowser ($_);
            ++$browsers{$b};
            ++$platforms{$p};
        }
        close F;

Now we're ready to read the file. The contents are browser User-Agent strings, one per line. We read each line, and use our library file to break the string into individual browser and platform pairs, keeping a running total.
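To make the running totals concrete, here is a small sketch of the same tallying loop. The real whatbrowser rules live in browserlib.pl and are not shown in this document, so classify_agent below is a toy stand-in and the User-Agent strings are illustrative.

```perl
use strict;
use warnings;

# Toy stand-in for whatbrowser; the real classification rules are assumed
# to be far more thorough than these two pattern checks.
sub classify_agent {
    my ($agent) = @_;
    my $browser  = $agent =~ /MSIE/ ? 'Internet Explorer' : 'Netscape Navigator';
    my $platform = $agent =~ /Win/  ? 'Windows'
                 : $agent =~ /Mac/  ? 'Macintosh'
                 :                    'Other';
    return ($browser, $platform);
}

# Illustrative gather file contents, one User-Agent string per line.
my @lines = (
    'Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)',
    'Mozilla/4.04 [en] (Win95; I)',
    'Mozilla/3.01 (Macintosh; I; PPC)',
);

my (%browsers, %platforms);
foreach my $line (@lines) {
    my ($browser, $platform) = classify_agent($line);
    ++$browsers{$browser};
    ++$platforms{$platform};
}

print "$_: $browsers{$_}\n"  for sort keys %browsers;
print "$_: $platforms{$_}\n" for sort keys %platforms;
```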

        #
        # Signal the file has been read
        #
        open F, ">$stamp";
        close F;
    }
    unlockfile ($file);
}

After reading the file, be sure to update our timestamp and then release the lock.


Read in Data File

The data file consists of 18 lines:

  1. Totals of browser type.
  2. Totals of client platforms.
  3. Totals for past 7 days of browser type.
  4. Totals for past 7 days of client platforms.
  5. Previous day's totals of browser type.
  6. Day before previous day's totals of browser type.
    ...
  11. Totals of browser type for 7 days ago.
  12. Previous day's totals of client platforms.
  13. Day before previous day's totals of client platforms.
    ...
  18. Totals of client platforms for 7 days ago.

The format of each line is

entry 1 | count | entry 2 | count | entry 3 | count

For example, a line could contain:

Netscape Navigator|3|Internet Explorer|5

The data file is not very robust, since things can easily get out of sync if the file is corrupted. However, it is sufficient for the purposes of this example.
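As an illustration of the line format, the example line can be turned into a hash with a single split. The sketch below also shows how a damaged line with an odd number of fields could be detected; that check is an addition for illustration and is not part of the batch script.

```perl
use strict;
use warnings;

# The example line from the text, split into entry/count pairs.
my $line  = 'Netscape Navigator|3|Internet Explorer|5';
my %count = split /\|/, $line;
print "$_: $count{$_}\n" for sort keys %count;

# A line that lost its last count has an odd number of fields, so entries
# and counts no longer pair up.
my @fields  = split /\|/, 'Netscape Navigator|3|Internet Explorer';
my $in_sync = (@fields % 2 == 0) ? 1 : 0;
print "damaged line in sync: $in_sync\n";
```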

Reading the data

if (open F, "<$datafile") {
    my $ab = <F>;
    chomp $ab;
    my $ap = <F>;
    chomp $ap;
    my @wba = ();
    $_ = <F>;        # We don't need these weekly summaries;
    $_ = <F>;
    while (<F>) {
        chomp;
        push @wba, $_;
    }
    close F;

The first thing to do is read in the data so we can manipulate it. The first two lines are saved as strings, since we will update them later. The next two lines are discarded, since they are weekly counts which we need to recompute anyway. Finally, we create an array which contains all the remaining lines, which are the daily counts.
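The reading order can be tried without touching the real data file by opening a string as an in-memory file. The contents below are made up and represent a data file after only two nightly runs, so there are two daily lines per category instead of seven.

```perl
use strict;
use warnings;

# Made-up data file contents: 4 summary lines, then 2 daily browser lines
# followed by 2 daily platform lines.
my $text = join "\n",
    'Netscape Navigator|7|Internet Explorer|5',   # all-time browsers
    'Windows|9|Macintosh|3',                      # all-time platforms
    'Netscape Navigator|7|Internet Explorer|5',   # weekly browsers
    'Windows|9|Macintosh|3',                      # weekly platforms
    'Netscape Navigator|3|Internet Explorer|5',   # yesterday, browsers
    'Netscape Navigator|4',                       # two days ago, browsers
    'Windows|6|Macintosh|2',                      # yesterday, platforms
    'Windows|3|Macintosh|1';                      # two days ago, platforms

open my $fh, '<', \$text or die "cannot open in-memory file: $!";
chomp(my $ab = <$fh>);          # grand total of browsers
chomp(my $ap = <$fh>);          # grand total of platforms
my $old_weekly_b = <$fh>;       # read and ignored: recomputed later
my $old_weekly_p = <$fh>;       # read and ignored: recomputed later
chomp(my @daily = <$fh>);       # everything else is daily counts
close $fh;

print "browsers: $ab\nplatforms: $ap\n";
print scalar(@daily), " daily lines\n";
```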

Splitting daily counts

    my $len = int 0.5 * scalar @wba;
    my @wpa = splice @wba, $len;
    if ($len > 6) {
        splice @wba, 6;
        splice @wpa, 6;
    }

As noted above, the daily counts consist of all the browser counts followed by all the platform counts. However, this is not guaranteed to be 14 lines: during the first few days that the batch script runs, there will be fewer than the full 7 days' worth of counts. Therefore, we compute half the length of the array of daily counts and split it into two parts. After that, we keep only the first 6 entries of each part, since the 7th entry drops out of the weekly window once the new day's counts are added.
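The split can be sketched on placeholder data. The helper name split_daily and the b1/p1 labels are hypothetical; they stand in for real count strings so the effect of the two splice calls is easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the splice logic: split the combined daily
# lines into browser and platform halves, keeping at most $keep days each.
sub split_daily {
    my ($keep, @lines) = @_;
    my $len = int(@lines / 2);
    my @platform = splice @lines, $len;   # second half; @lines keeps the first
    if ($len > $keep) {
        splice @lines, $keep;             # drop everything past $keep entries
        splice @platform, $keep;
    }
    return (\@lines, \@platform);
}

# Three days of history: nothing is dropped.
my ($few_b, $few_p) = split_daily(6, (map { "b$_" } 1 .. 3), (map { "p$_" } 1 .. 3));
print "@$few_b / @$few_p\n";

# A full week of history: the 7th (oldest) entry of each half is dropped.
my ($full_b, $full_p) = split_daily(6, (map { "b$_" } 1 .. 7), (map { "p$_" } 1 .. 7));
print "@$full_b / @$full_p\n";
```

Note that an odd number of daily lines would leave the two halves uneven, which is one way the file can get out of sync.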

Compute New Totals

    $ab = addto ($ab, \%browsers);
    $ap = addto ($ap, \%platforms);

The addto function takes a count string and a hash array of entry/count pairs. Not coincidentally, the count string is the same format as in the data file, and the entry/count hash is the same as we generated when we read in the gather files. The addto function then returns a new count string. This pair of lines, therefore, computes a new grand total.
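A worked example may help here. The sketch below uses add_counts, a hypothetical variant of addto that sorts the keys so the output order is predictable (the sort is an addition; addto itself returns entries in hash order). The counts are made up.

```perl
use strict;
use warnings;

# Hypothetical variant of addto: same arithmetic, but keys are sorted so
# the resulting count string is deterministic.
sub add_counts {
    my ($vals, $new) = @_;
    my %total = split /\|/, (defined($vals) ? $vals : '');
    $total{$_} += $new->{$_} for keys %$new;
    return join '|', map { ($_, $total{$_}) } sort keys %total;
}

# Existing grand total plus tonight's made-up counts.
my %tonight = ('Internet Explorer' => 2, 'Opera' => 1);
my $grand   = add_counts('Netscape Navigator|3|Internet Explorer|5', \%tonight);
print "$grand\n";   # Internet Explorer|7|Netscape Navigator|3|Opera|1
```

Entries that appear only in the hash, such as Opera here, are simply added to the string with their count.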

Computing weekly totals

    unshift @wba, addto (undef, \%browsers);
    unshift @wpa, addto (undef, \%platforms);

With these lines, we create new count strings for the day's totals. Since we pass undef in place of a count string, addto starts from an empty set of counts, so the result is simply the contents of the hash which we pass. We then prepend that string to the appropriate array for the weekly totals. After these two lines, we have two arrays, one with up to seven entries containing count strings for the past days' browser counts, and the other for platform counts.

    %browsers = ();
    my $wb;
    foreach my $item (@wba) {
        $wb = addto ($item, \%browsers);
        %browsers = split /\|/, $wb;
    }

Now that we know that the @wba array contains the count strings for the previous week, we reset our browser hash array to be empty. Then, for each of the count strings, we create a new string which is the sum of the current count string and our hash array. Once that's done, we update the hash array with the values of the new count string.

    %platforms = ();
    my $wp;
    foreach my $item (@wpa) {
        $wp = addto ($item, \%platforms);
        %platforms = split /\|/, $wp;
    }

The same process is used to count the past week's data for the platform strings.
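The weekly computation amounts to folding the daily count strings into one hash. Here is a sketch under the same assumptions as before: add_counts is a hypothetical sorted-key variant of addto, and the daily strings are made up.

```perl
use strict;
use warnings;

# Hypothetical variant of addto with sorted keys for predictable output.
sub add_counts {
    my ($vals, $new) = @_;
    my %total = split /\|/, (defined($vals) ? $vals : '');
    $total{$_} += $new->{$_} for keys %$new;
    return join '|', map { ($_, $total{$_}) } sort keys %total;
}

# Made-up history, newest first, and today's counts as a hash.
my @daily = ('Internet Explorer|5|Netscape Navigator|3', 'Netscape Navigator|4');
my %today = ('Internet Explorer' => 1, 'Opera' => 2);

# Passing undef serializes the hash as today's count string.
unshift @daily, add_counts(undef, \%today);

# Fold every daily string into a running hash to get the weekly total.
my %week;
my $weekly = '';
foreach my $day (@daily) {
    $weekly = add_counts($day, \%week);
    %week   = split /\|/, $weekly;
}
print "$weekly\n";   # Internet Explorer|6|Netscape Navigator|7|Opera|2
```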

    if (open F, ">$datafile") {
        print F join "\n", $ab, $ap, $wb, $wp, @wba, @wpa;
        close F;
    }
}

Once we've added everything together, we create a new data file. Note we do not lock this file, since we are never writing to a live web server due to the architecture of the www.washington.edu cluster.
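To check the resulting layout, the same join can be written to a temporary file and read back. The count strings below are placeholders; with a full week of history the file comes out at 18 lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Placeholder count strings; real ones would come from the steps above.
my ($ab, $ap, $wb, $wp) = ('A|10', 'P|10', 'A|7', 'P|7');
my @wba = ('A|1') x 7;   # seven days of browser counts
my @wpa = ('P|1') x 7;   # seven days of platform counts

# Write the file the same way the batch script does.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} join "\n", $ab, $ap, $wb, $wp, @wba, @wpa;
close $fh or die "cannot close $path: $!";

# Read it back and count the lines.
open my $in, '<', $path or die "cannot open $path: $!";
chomp(my @lines = <$in>);
close $in;
print scalar(@lines), " lines\n";   # 18 lines
```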

Adding Count Strings and Hash Arrays

sub addto ($$) {
    my ($vals, $new) = @_;
    my %valarr = split /\|/, $vals;

When this function is called, it is passed a count string and a reference to a hash array. The first thing we do is create a new hash array which is the equivalent of the count string.

    foreach my $key (keys %$new) {
        $valarr{$key} += $new->{$key};
    }

Next, we iterate over all key and value pairs of the passed hash array and add each value to the matching entry in our new hash array.

    my @newvals = ();
    foreach my $key (keys %valarr) {
        push @newvals, $key, $valarr{$key};
    }
    join '|', @newvals;
}

Once everything is added up, we can create a new count string to return.