wwwscan - Scan www.washington.edu Log Files

The wwwscan command allows developers to view log files created by the HTTP daemons. These logs cannot be directly viewed because they are on hosts on which developers do not have accounts.

Note: In most cases, Google Analytics will give you better and more detailed information, and it does not require manual processing of log files.

Usage for the wwwscan command is:

wwwscan -[a|e] -[p|t|d|v|u] [-s] [-h virtual_host] regexp|-nlines

The option flags are:

-a	Get data from the access log (default)
-e	Get data from the error log

-p	Get data for the production (www) environment (default)
-t	Get data for the test (wwwtest) environment
-d	Get data for the development (wwwdev) environment
-u	Get data for the u-development (wwwudev) environment

-s	Get data for SSL connections
-h virtual_host	Get data for virtual host www.HOST.org or www.HOST.net

-c	Return access logs in common log format
-C	Return access logs in combined log format

-D	Return access logs sorted by date (may take a long time)

You must specify one of the following:

regexp	Regular expression for which to search
-nlines	Show last nlines lines for each file

Warning: Be sure that you do not save any wwwscan output to the web directories. Anything you put into the web directories becomes accessible via the web. Due to the content of the log files, there may be privacy concerns if the raw data were viewed by people not involved with maintaining content for www.washington.edu. For this reason, please be sure all output goes into your home directory or into a group directory.

If you cannot save the output of the wwwscan command because you get the error message "write stdout: Permission denied", you may be in a directory for which you do not have write permission. If that's not the case, you may be running into a bug in some filesystem implementations that does not let you write the output directly. In these cases, you need to put another command into the command line before saving the output. The examples below use the grep command for this. If you aren't using grep, you can substitute cat, such as:

% wwwscan ' /webinfo/' | cat >webinfo.scan

Format of wwwscan Output

Each line of output is prepended with the host on which the particular log file shown originated and the path to the environment. For example:

% wwwscan -1 world
www1:www/world/access: green.alexa.com - - [27/Jan/2000:12:13:14 -0800] "GET /students/timeschd/sln.cgi?QTRYR=AUT+1999&SLN=6327 HTTP/1.0 User-Agent='ia_archiver' Referer='-'" 11918 200 2203 0
www2:www/world/access: 128.220.12.65 - - [27/Jan/2000:12:13:21 -0800] "GET /admin/eoo/ads/index.html HTTP/1.0 User-Agent='Mozilla/4.61 [en] (Win95; I)' Referer='http://www.washington.edu/admin/eoo/ads/'" 17549 200 8214 0
www3:www/world/access: house11.studaff.calpoly.edu - - [27/Jan/2000:12:13:23 -0800] "GET /students/uga/css/images/bg_navDesc_yTitle.jpg HTTP/1.0 User-Agent='Mozilla/4.61 [en] (Win98; I)' Referer='http://www.washington.edu/students/uga/tr/'" 11740 200 1702 0
www4:www/world/access: host-216-79-211-24.shv.bellsouth.net - - [27/Jan/2000:12:13:23 -0800] "GET /home/graphics/mo/arrow.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 4.01; MSN 2.5; Windows 98)' Referer='http://www.washington.edu/'" 21403 200 62 0

Note that the access logs give extra fields which are passed from the browser. The field User-Agent is the same string passed by the client, if any, as is the Referer field. The root field shows the root used to find a document if different than the default document root.

The last four numbers are the process ID of the server handling the request, the return code (200 in this case means a successful transfer), the number of data bytes transferred, and how many seconds the transfer took.
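The field layout described above can be checked with a short awk sketch. It assumes the whitespace-separated layout seen in the sample output (request path in field 8, the four numbers at the end of the line). The function name show_fields is made up for this illustration, and the sample line is copied from the output above.

```shell
# Print the request path and the four trailing numbers of each log line.
# The trailing fields are counted back from NF because the User-Agent
# string may itself contain spaces.
show_fields() {
    awk '{
        printf "path=%s pid=%s status=%s bytes=%s seconds=%s\n",
               $8, $(NF-3), $(NF-2), $(NF-1), $NF
    }'
}

show_fields <<'EOF'
www4:www/world/access: host-216-79-211-24.shv.bellsouth.net - - [27/Jan/2000:12:13:23 -0800] "GET /home/graphics/mo/arrow.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 4.01; MSN 2.5; Windows 98)' Referer='http://www.washington.edu/'" 21403 200 62 0
EOF
```

Counting the numeric fields from the end of the line rather than from the start is the same approach the awk commands in the Advanced Use section take.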

If you are searching on a virtual host, note that you still must specify a search string or a number of lines. However, you can search on a string which will appear in all requests:

% wwwscan -h kcmu /
www3t:kcmu/prod/access: shiva1.cac.washington.edu - - [07/Jan/2000:13:02:40 -0800] "GET / HTTP/1.0 User-Agent='-' Referer='-'" 21921 200 3577 4
www3t:kcmu/prod/access: shiva1.cac.washington.edu - - [07/Jan/2000:13:02:46 -0800] "GET / HTTP/1.0 User-Agent='-' Referer='-'" 21922 200 3577 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:41 -0800] "GET /ra/wp991019.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21923 200 42 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:44 -0800] "GET /ra/swdr991014.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21926 200 44 1
www3t:kcmu/prod/access: banshee.kerbango.com - - [07/Jan/2000:13:32:45 -0800] "GET /ra/sts991022.ram HTTP/1.1 User-Agent='CheckURL/1.0 libwww/5.2.6' Referer='-'" 21921 200 43 0

Searching for Strings

Note that searching for strings can take a very long time, but sometimes is exactly what is needed. Beware that this searching also affects the servers themselves (because they're searching large log files), so please try to limit usage. String searching is automatically given the lowest priority on the web servers, so be prepared to wait for a very long time for the results.

If you wish to search for all hits on a certain set of pages, the best way is with a string that's as specific as possible, and begins with a space (so it won't match the Referer information). For example:

% wwwscan ' /cambots/archive'
www1:www/world/access: 205.68.79.66 - - [01/Jan/2000:02:26:15 -0800] "GET /cambots/archive.html HTTP/1.0 User-Agent='Mozilla/4.6 [en] (WinNT; U)' Referer='http://www.washington.edu/cambots/'" 21844 200 8296 0
www1:www/world/access: 205.68.79.66 - - [01/Jan/2000:02:26:21 -0800] "GET /cambots/archive/june.mpg HTTP/1.0 User-Agent='Mozilla/4.6 [en] (WinNT; U)' Referer='http://www.washington.edu/cambots/archive.html'" 21844 200 57344 3
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:45:59 -0800] "GET /cambots/archive.html HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/'" 30085 200 8296 0
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:46:42 -0800] "GET /cambots/archive/april97/0545.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/archive.html'" 2516 200 130013 17
www1:www/world/access: 1cust175.tnt5.bos2.da.uu.net - - [01/Jan/2000:15:47:08 -0800] "GET /cambots/archive/april96/0550.gif HTTP/1.1 User-Agent='Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)' Referer='http://www.washington.edu/cambots/archive.html'" 24409 200 124040 16
etc.

If you know you will want to do searches on many different pages in the same directory, it's best to do a general search in the top-level directory and save results in a file. The Advanced Use section shows how this is done.

Log Retention

Logs are kept for the current and previous months. For example, on 31-Oct you will be able to see logs from 1-Sep through 31-Oct. However, on 1-Nov you will only be able to see logs from 1-Oct through 1-Nov.

Using with Log Analysis Tools

If you have access to tools to do web log analysis (such as webalizer or analog), you will probably need to use the -c or -C flag to convert the logs to the common or combined log format, and the -D flag to sort the output by date.

Advanced Use

A project is currently under way to provide a web interface for viewing wwwscan logs, as well as summaries of usage. Until then, however, there are several things you can do at the Unix prompt to provide basic count information.

Note that the documentation below assumes that the sections are used in order, and that some of the commands rely on the results of previous ones. For example, Computing Total Hits relies on the file generated by Saving wwwscan output for later use.

Saving wwwscan output for later use
Computing total hits
Computing hits on specific files
Computing hits for all files
Computing hits on a per-domain basis
Computing hit references
Computing reverse references
Browser types
Browser types by Vendor
Browser types by Vendor and Version
Browser type by Platform

Warning: As noted above, do not save any wwwscan output into the web directories. All the examples below assume the log files are written to one's home directory.

Saving wwwscan output for later use

As an example, let's figure out how often the Webinfo documentation was read in the previous month (December 1999). The first step is to retrieve the stats for the whole tree and save it in a separate file so we only have to run wwwscan once:

% wwwscan ' /webinfo/' | grep Dec/1999 >webinfo.scan

Note that the grep is run separately so that the pattern wwwscan has to search for stays simple and efficient, which reduces the effect on the production web servers.

Computing total hits

To get a count of all the hits, using wc gives that information:

% wc -l webinfo.scan
    3214 webinfo.scan

This tells us there are 3,214 entries. However, it doesn't tell us which files were requested. It also counts errors as well as successful requests.
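To see how much of a raw total is made up of errors, the return code can be tallied in one pass. This is a minimal sketch, assuming the return code is the third field from the end as in the sample output. The function name count_by_outcome and the three sample lines are invented for illustration and are not real log data.

```shell
# Tally successful (return code < 400) and failed (>= 400) requests.
count_by_outcome() {
    awk '$(NF-2) < 400  { ok++ }
         $(NF-2) >= 400 { err++ }
         END { printf "ok=%d errors=%d total=%d\n", ok, err, NR }'
}

# Hypothetical sample lines in the wwwscan output layout.
count_by_outcome <<'EOF'
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:00:00 -0800] "GET /webinfo/ HTTP/1.0 User-Agent='-' Referer='-'" 100 200 3577 0
www1:www/world/access: 192.0.2.2 - - [01/Dec/1999:10:00:05 -0800] "GET /webinfo/missing.html HTTP/1.0 User-Agent='-' Referer='-'" 101 404 210 0
www2:www/world/access: 192.0.2.3 - - [02/Dec/1999:11:30:00 -0800] "GET /webinfo/ssl.html HTTP/1.0 User-Agent='-' Referer='-'" 102 304 0 0
EOF
```

Against a saved file the same function would be run as: count_by_outcome <webinfo.scan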

Computing hits on specific files

If we wish to see how many times the files wwwscan.html and wwwinst.html were accessed, we can use grep with the -c option:

% grep -c ' /webinfo/wwwscan.html' webinfo.scan
49
% grep -c ' /webinfo/wwwinst.html' webinfo.scan
36

If we just did a grep for the string wwwscan.html, not only would we get all hits, but we'd also get results for entries where wwwscan.html was the referring page. Specifying the grep string as shown above prevents this from happening.

The -c flag for the grep command is what generates a count. If you wish to generate the actual list of files, do not use the -c.
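The effect of the leading space can be seen on a two-line sample. This sketch uses invented log lines (not real data): the first is a request for wwwscan.html, the second is a request for a different page that merely has wwwscan.html as its referring page.

```shell
# Hypothetical sample: one real hit on wwwscan.html, one hit elsewhere
# whose Referer happens to be wwwscan.html.
sample() {
cat <<'EOF'
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:00:00 -0800] "GET /webinfo/wwwscan.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/'" 100 200 9000 0
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:01:00 -0800] "GET /webinfo/wwwinst.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/wwwscan.html'" 100 200 7000 0
EOF
}

# Without the leading space the Referer field also matches.
loose=$(sample | grep -c 'wwwscan.html')
# With the leading space only the requested path matches.
strict=$(sample | grep -c ' /webinfo/wwwscan.html')
echo "loose=$loose strict=$strict"
```

The stricter pattern works because the requested path always follows "GET " in the log line, while the path inside a Referer URL follows the host name.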

Computing hits for all files

If we wish to see how many hits we got for all the files, the following commands can be used. They are quite complex, and each should be typed all on one line:

% awk '$(NF-2) < 400 {print}' webinfo.scan | sed 's,/index.html,/,g' >webinfo.scan.valid
% awk '{++u[$8]} END {for (x in u) print u[x], x}' webinfo.scan.valid | sort -rn
228 /webinfo/graphics/1pix.gif
221 /webinfo/webinfo.css
153 /webinfo/
62 /webinfo/tidy.html
49 /webinfo/wwwscan.html
45 /webinfo/ssl.html
41 /webinfo/weblint.cgi
36 /webinfo/wwwinst.html
35 /webinfo/mailto/
33 /webinfo/chtml/
33 /webinfo/announcetech.html
32 /webinfo/env.html
31 /webinfo/htaccess.html
etc.

The first line creates a file which eliminates requests which resulted in errors (the 3rd from the last field of every log entry is the HTTP return code, and codes greater than or equal to 400 are considered errors). This intermediate file will be used for all other computations, since we will not want to count error requests.

The first line also makes sure that references to index.html turn into references to just the directory. For example, any accesses to "/webinfo/index.html" would be converted into "/webinfo/" since they are both for the webinfo main page. This way, the accesses will be grouped together. Note that if you have another name for your index file (such as index.cgi) you'll want to use that either instead of index.html in the command, or add another sed command to filter for both.
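Filtering for both index file names could look like the following sketch. The function name normalize_index is made up for illustration. Unlike the command above, the dot is escaped here so that it matches only a literal period.

```shell
# Rewrite requests for either index file name to the bare directory.
normalize_index() {
    sed -e 's,/index\.html,/,g' -e 's,/index\.cgi,/,g'
}

printf '%s\n' 'GET /webinfo/index.html HTTP/1.0' \
              'GET /webinfo/mailto/index.cgi HTTP/1.0' | normalize_index
```

In the pipeline above, this function would take the place of the single sed command that follows the awk filter.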

Computing hits on a per-domain basis

It is sometimes useful to see which users are accessing your files. You can count accesses on a per-domain basis to do this. Note, however, that many accesses will probably be from hosts whose addresses did not resolve to a hostname. Those entries are all listed as IP addresses and cannot be separated into domains. To show how many hits are from addresses which did not get resolved to hostnames, we first isolate the hostname from the log file and save it to another intermediate file. We then count the entries which have no alphabetic letter in them:

% cut -f2 -d' ' webinfo.scan.valid >webinfo.scan.hosts
% grep -c -v '[a-z]' webinfo.scan.hosts
340

Next, to compute hits on a per-domain basis:

% grep '[a-z]' webinfo.scan.hosts | sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' | sort | uniq -c | sort -rn
1352 washington.edu
 415 alltheweb.com
 208 inktomi.com
 132 edu.tw
  62 home.com
  50 ziplink.net
  35 sanpaolo.net
  32 earthlink.net
  26 oz.net
  24 idt.net
  22 uswest.net
  22 stsi.net
  22 aol.com
etc.

The sed command reduces each hostname to just its domain (the last two components). The first sort groups identical domains together, uniq -c counts how many times each domain occurs, and the final sort -rn lists the most frequent domains first.
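The hostname reduction can be tried on its own with a few of the hostnames that appear in the sample output earlier in this document. This is only a sketch, and the function name to_domain is made up for illustration.

```shell
# Drop unresolved IP addresses, then keep the last two dot-separated
# components of each hostname.
to_domain() {
    grep '[a-z]' | sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/'
}

printf '%s\n' green.alexa.com house11.studaff.calpoly.edu 128.220.12.65 \
              shiva1.cac.washington.edu banshee.kerbango.com |
    to_domain | sort | uniq -c | sort -rn
```

Because only the last two components are kept, hosts under country-code domains are grouped by their second-level label, which is why an entry such as edu.tw shows up in the listing above.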

Computing hit references

The Referer field allows you to see how people got to your pages. For example, to compute which links were followed to get to the page wwwscan.html:

% sed -n "/ \/webinfo\/wwwscan.html/s/^.*Referer='\([^']*\).*/\1/p" webinfo.scan.valid | sort | uniq -c | sort -rn
  25 -
  21 http://www.washington.edu/webinfo/
   1 http://www.webtop.com/
   1 http://www.washington.edu/cgi-bin/search/webinfo/?Kind=Results&Key=wwwscan&Phrase=&CaseSensitive=&PartialWords=
   1 http://huskysearch.cs.washington.edu/results/99120917-0/30007-zhadum-0/rmain.html

This shows that 25 of the accesses did not have a referer field (either because the URL was typed by the user or because the browser did not forward the information), and 21 of the accesses came from the webinfo main page. Of interest is the access from http://www.washington.edu/cgi-bin/search/webinfo/, which is the webinfo search function, and the access from http://huskysearch.cs.washington.edu/results/, which is the huskysearch search engine.

In the command shown, the arguments to sed are very complex. The sed documentation can be used to explain their use for those who are interested.
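One way to make the sed step easier to follow is to split the work in two: select the lines with grep, then let sed do only the extraction. The sketch below is a variation on the command shown, not a replacement for it. The function name referers_for and the sample line are invented for illustration.

```shell
# Print the Referer value of every request for the path given as $1.
#   grep -F " $1"      selects lines requesting that path (leading space)
#   ^.*Referer='       matches everything up to the opening quote
#   \([^']*\)          captures the text up to the closing quote
#   .*                 matches the rest of the line, which is discarded
referers_for() {
    grep -F " $1" | sed -n "s/^.*Referer='\([^']*\).*/\1/p"
}

# Hypothetical sample line in the wwwscan output layout.
referers_for /webinfo/wwwscan.html <<'EOF'
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:00:00 -0800] "GET /webinfo/wwwscan.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/'" 100 200 9000 0
EOF
```

The output can be piped into sort | uniq -c | sort -rn exactly as in the command above.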

Computing reverse references

Another use of the Referer field is the reverse of the above: finding out which pages in a directory visitors reach from a particular page. Note that this information is much less complete, because someone may follow a link to another system, and you won't be able to detect that. However, suppose we wish to know how many people followed a link on the top-level page to another page in the same directory:

% grep "Referer='[^']*/webinfo/'" webinfo.scan.valid | awk '{++u[$8]} END {for (x in u) print u[x], x}' | sort -rn
103 /webinfo/graphics/1pix.gif
60 /webinfo/webinfo.css
21 /webinfo/wwwscan.html
18 /webinfo/wwwinst.html
12 /webinfo/wwwauth.html
12 /webinfo/env.html

Note there are some files which are usually never directly accessed by users, such as the first two files in the list (the first is a placeholder image used by some of the tables, and the second is the style sheet). However, we see that wwwscan.html is the page most often reached from the webinfo main page.
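If the supporting files get in the way, they can be dropped inside the awk step. The following is a sketch under the assumption that images and style sheets can be recognized by their extension. The function names and the sample lines are invented for illustration.

```shell
# Count pages reached from the webinfo main page, skipping images and
# style sheets (recognized here only by file extension).
followed_links() {
    grep "Referer='[^']*/webinfo/'" |
        awk '$8 !~ /\.(gif|jpg|css)$/ { ++u[$8] }
             END { for (x in u) print u[x], x }' |
        sort -rn
}

# Hypothetical sample lines in the wwwscan output layout.
sample_log() {
cat <<'EOF'
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:00:00 -0800] "GET /webinfo/webinfo.css HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/'" 100 200 900 0
www1:www/world/access: 192.0.2.1 - - [01/Dec/1999:10:00:09 -0800] "GET /webinfo/wwwscan.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/'" 100 200 9000 0
www2:www/world/access: 192.0.2.2 - - [02/Dec/1999:09:15:00 -0800] "GET /webinfo/wwwscan.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/'" 101 200 9000 0
www2:www/world/access: 192.0.2.2 - - [02/Dec/1999:09:16:00 -0800] "GET /webinfo/env.html HTTP/1.0 User-Agent='-' Referer='http://www.washington.edu/webinfo/wwwscan.html'" 101 200 4000 0
EOF
}

sample_log | followed_links
```

The last sample line is not counted because its referring page is wwwscan.html rather than the main page.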

Browser types

The log files also have information about the type of browser the client used to access your files. Computing usage based on browser type is just like computing references, but the User-Agent field is used instead of the Referer field.

To see what browser types have accessed the main webinfo page:

% sed -n "/ \/webinfo\/ /s/^.*User-Agent='\([^']*\).*/\1/p" webinfo.scan.valid | sort | uniq -c | sort -rn
  13 Mozilla/4.61 [en] (X11; U; HP-UX B.10.20 9000/715)
  10 Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
   9 Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)
   8 Mozilla/4.0 (compatible; MSIE 4.5; Mac_PowerPC)
   8 Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)
   6 Mozilla/4.7 [en] (Win98; U)
   6 Mozilla/4.7 [en] (Win98; I)
   6 Mozilla/4.5 [en] (WinNT; I)
   6 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
   5 Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
(etc.)

We see that Netscape (which has a User-Agent field of "Mozilla") version 4.61 for X accessed the webinfo pages 13 times, but the browser which accessed the page 10 times was really Microsoft Internet Explorer 5.0. However, this probably gives too much information, because of all the version numbers and platforms.

Browser types by Vendor

To break down the browser types by vendor:

% sed -n "/ \/webinfo\/ /s/^.*User-Agent='\([^']*\).*/\1/p" webinfo.scan.valid >webinfo.scan.browsers
% sed -e 's,^.*MSIE ,MSIE/,' -e 's,/.*,,' webinfo.scan.browsers | sort | uniq -c | sort -rn
  70 Mozilla
  61 MSIE
   4 Slurp
   3 libwww-perl
   3 Teleport Pro
   2 Slurp.so
   2 FAST-WebCrawler
   2 CCU_GAIS Robot
   2 ArchitextSpider
   1 www.WebWombat.com.au
   1 WebCopier Session 6
   1 Spider
   1 EliteSys SuperBot

Browser types by Vendor and Version

To include version numbers along with the vendor:

% sed -e 's,^.*MSIE ,MSIE/,' -e 's,\(/[0-9a-z.]*\).*,\1,' webinfo.scan.browsers | sort | uniq -c | sort -rn
  28 MSIE/4.01
  24 Mozilla/4.7
  18 MSIE/5.0
  15 Mozilla/4.61
   8 Mozilla/4.5
   8 MSIE/4.5
   7 Mozilla/4.04
   6 MSIE/5.01
etc.

Note there is extra code which manipulates text with MSIE in it. This is needed to properly identify Internet Explorer browser strings, which at first glance look like Netscape strings.
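The two sed expressions can be watched at work on two of the User-Agent strings listed under Browser types. The function name vendor_version is made up for this sketch; the sed expressions are the ones from the command above.

```shell
# First expression: replace everything up to "MSIE " with "MSIE/", so
# Internet Explorer strings no longer begin with "Mozilla".
# Second expression: keep the name plus "/version" and drop the rest.
vendor_version() {
    sed -e 's,^.*MSIE ,MSIE/,' -e 's,\(/[0-9a-z.]*\).*,\1,'
}

printf '%s\n' 'Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)' \
              'Mozilla/4.61 [en] (X11; U; HP-UX B.10.20 9000/715)' |
    vendor_version
```

The first string comes out as MSIE/5.0 and the second as Mozilla/4.61. Without the first expression both would be reported as Mozilla.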

Browser type by Platform

If you wish to see what operating system your users are using, that information is more difficult because the number of variations in the User-Agent strings is even greater than for the browser vendors. However, we can get close:

% grep -c Mac webinfo.scan.browsers
19
% grep -c Win webinfo.scan.browsers
80
% grep -c X11 webinfo.scan.browsers
28
% wc -l webinfo.scan.browsers
     153 webinfo.scan.browsers

Subtracting the three platform counts from the total number of lines (153 - 19 - 80 - 28 = 26), we find that there are 26 entries which didn't fall into one of our categories. These can be either search engine robots (programs which crawl the web for content to put into search engines such as AltaVista or Google) or programs that people wrote which use a generic User-Agent field.
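The same breakdown can be produced in a single pass, which also reports the leftover entries directly. This is a sketch that assumes the same three substrings identify the platforms. The function name count_platforms is made up, and the sample User-Agent strings are taken from output shown earlier in this document.

```shell
# Classify each User-Agent string into at most one platform; anything
# that matches none of the patterns is counted as "other".
count_platforms() {
    awk '
        /Mac/ { n["Mac"]++; next }
        /Win/ { n["Win"]++; next }
        /X11/ { n["X11"]++; next }
              { n["other"]++ }
        END {
            printf "Mac=%d Win=%d X11=%d other=%d\n",
                   n["Mac"], n["Win"], n["X11"], n["other"]
        }'
}

count_platforms <<'EOF'
Mozilla/4.61 [en] (X11; U; HP-UX B.10.20 9000/715)
Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
Mozilla/4.0 (compatible; MSIE 4.5; Mac_PowerPC)
Mozilla/4.7 [en] (Win98; U)
ia_archiver
CheckURL/1.0 libwww/5.2.6
EOF
```

Because each line is counted once, the four numbers always add up to the number of lines, which is not guaranteed when separate grep -c commands are used. Against the saved file it would be run as: count_platforms <webinfo.scan.browsers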