Pushing Files

The info cluster synchronizes files between the different web servers with the wwwpush command, which is also called when the -push flag is given to wwwuinst or wwwdinst, the tools used to install files. The same mechanism is used to synchronize FTP files, but for simplicity this document refers only to the web servers.

The main program used to synchronize files is rdist, a UNIX program. The host info.cac.washington.edu ("info" in this document) is the rdist client and the web servers are the rdist servers (because the rdist command is initiated on the host info). Much terminology specific to rdist is used in this document; the terms are explained in the rdist documentation.

Usually files and directories are pushed by content developers, but all directories are automatically pushed every morning, forcing everything to be in sync.

The steps involved in using the wwwpush command are:

  1. The user runs wwwpush on a compute server. After rudimentary checks that the command-line arguments are correct, wwwpush asks the user for his or her password, which causes the effective user ID to change to that of a central user.
  2. As the central user, wwwpush sends to info's push daemon a list of the file paths to push.
  3. Once the push daemon sees a connection as the central user, it makes sure the other host is on its list of approved hosts, rejecting the connection if not.
  4. The push daemon sets a lock on the files and/or directories which will be pushed. If any of them are already locked, the daemon waits until they are available.
  5. The daemon checks the files which will be pushed to make sure that they are at least world readable. This is necessary because some versions of rdist become confused if they do not have read access to the files on the local (info's) filesystem.
  6. Next, the push daemon builds distfiles which will tell rdist what files to send to what hosts.
  7. The push daemon obtains an rdist lock. This is not an exclusive lock; instead, a limited number of slots is available. If all slots are taken, the daemon waits until one is free.
  8. The rdist processes are started in parallel. For pushing to the web servers, dedicated push hosts do the actual rdist commands, which increases throughput.
  9. Output from rdist is sent back to the user (each line prepended with a timestamp and the destination host).
  10. The rdist and file locks are released.
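
The world-readable check in step 5 can be illustrated with a short sketch. This is not the push daemon's actual code; the function name not_world_readable is hypothetical, and the sketch assumes that "world readable" means the read bit for "other" is set in the file mode.

```python
import os
import stat


def not_world_readable(paths):
    """Return the paths whose mode lacks the read bit for 'other' users.

    Hypothetical helper: the real daemon's check may differ, for example
    in how it treats directories.
    """
    bad = []
    for path in paths:
        mode = os.stat(path).st_mode
        if not mode & stat.S_IROTH:
            bad.append(path)
    return bad
```

A daemon built along these lines could report the returned paths to the user, or refuse to push them, before rdist is started.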

The major steps are described in detail below.

Locking Files

The locking mechanism on info affects both files and directories. If a directory is locked, then all files under that directory are implicitly locked too. If a file is locked, then any of its parent directories are also implicitly locked.

For example, if the path world/webinfo/behind/ is locked, then a request to lock world/webinfo/behind/push.html will need to wait until the directory is unlocked. Likewise, if world/webinfo/behind/push.html is locked, a request to lock world/webinfo/behind/ will not be granted until the file is unlocked.
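
The rule above amounts to this: two lock requests conflict when the paths are equal or when one path is an ancestor of the other. The following is a minimal sketch of that test, assuming paths are plain relative POSIX paths; the function names conflicts and can_lock are hypothetical and are not taken from the push daemon.

```python
from pathlib import PurePosixPath


def conflicts(requested, locked):
    """True if the paths are equal or one is an ancestor of the other."""
    a = PurePosixPath(requested)  # a trailing slash is normalized away
    b = PurePosixPath(locked)
    return a == b or a in b.parents or b in a.parents


def can_lock(requested, held_locks):
    """True if the requested path conflicts with none of the held locks."""
    return not any(conflicts(requested, held) for held in held_locks)
```

With world/webinfo/behind/ held, can_lock("world/webinfo/behind/push.html", ...) is false, while a sibling path such as world/webinfo/other.html can still be locked. Comparing path components rather than raw strings keeps a path such as world/webinfo/behind2 from being mistaken for something under world/webinfo/behind.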

Building distfiles

To tell rdist what files to send to the web servers, the push mechanism builds one or more distfiles. As a basis for these distfiles, info contains a skeleton distfile named distfile.skel. For the web servers, there are several distfile.skel files, which are used to build distfiles for the different kinds of web servers; if a push command involves only files which are not used by a particular distfile.skel, that distfile.skel is skipped.

The distfile.skel has a list of files which should not be sent to that particular server. When distfiles are built, files to push which are in the exception list (or match the exception patterns) are not added as source files. Target hostnames are also not added, since those will be defined on the command line. The generated distfiles are stored in a temporary area with unique names (so multiple push processes don't overwrite each other's distfiles).
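
The filtering step can be sketched as follows. This is an illustration only: the layout of distfile.skel is not shown in this document, so the exception list and exception patterns are passed in as plain Python lists, the patterns are assumed to be shell-style globs (matched here with fnmatch, where "*" also matches "/"), and the function names are hypothetical. The FILES = ( ... ) line follows the usual rdist variable syntax, but the variable name is an assumption.

```python
from fnmatch import fnmatchcase


def select_sources(paths, except_files, except_patterns):
    """Drop paths in the exception list or matching an exception pattern."""
    excluded = set(except_files)
    return [
        p for p in paths
        if p not in excluded
        and not any(fnmatchcase(p, pat) for pat in except_patterns)
    ]


def files_line(sources):
    """Format the surviving paths as an rdist variable definition.

    Returns None when nothing is left, so the caller can skip this skeleton.
    """
    if not sources:
        return None
    return "FILES = ( " + " ".join(sources) + " )"
```

Each generated distfile would then be written under a unique temporary name, for example with tempfile.mkstemp, so that concurrent pushes do not collide.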

Locking rdist slots

To prevent info from doing too many rdists at one time, info has an rdist locking mechanism which limits the number of push processes. Once a process is given an rdist slot, it can actually start as many rdist processes as it wants, so the maximum number of slots needs to be chosen to take this into account.
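
The slot mechanism behaves like a counting semaphore. The sketch below shows the idea within a single process using threads; the real mechanism on info coordinates separate push processes and is presumably implemented differently (for instance with lock files). The names make_slots and push_with_slot are hypothetical.

```python
import threading


def make_slots(max_slots):
    """Create a pool of rdist slots; at most max_slots can be held at once."""
    return threading.BoundedSemaphore(max_slots)


def push_with_slot(slots, do_rdists):
    """Wait for a free slot, run the caller's rdists, then free the slot.

    do_rdists is any callable; it may itself start several rdist processes,
    which is why the slot count must be chosen conservatively.
    """
    with slots:  # blocks while all slots are taken
        return do_rdists()
```

Because one slot holder may start several rdists, the peak number of rdist processes is roughly the slot count multiplied by the number of rdists per push, not the slot count itself.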

Parallel rdists

While the rdists can be done serially, doing them in parallel greatly increases performance, since much of the time rdist on info will be waiting for a response from the remote web server (which is either reading its directory and comparing files, or writing a new or updated file).

To further improve performance, multiple hosts are used to update the web servers, and each of these push servers runs parallel rdists.
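
The benefit of running the rdists in parallel can be sketched with a thread pool. This is a minimal illustration, not the push daemon's code: run_rdist is a caller-supplied function standing in for the real rdist invocation (its command line is not reproduced here), the host names in the usage note are made up, and the timestamp format is an assumption. Output is tagged when it is collected, not as each line arrives.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def tag_output(host, lines, now=datetime.now):
    """Prefix each output line with a timestamp and the destination host."""
    return [f"{now():%H:%M:%S} {host}: {line}" for line in lines]


def push_in_parallel(hosts, run_rdist, max_workers=8):
    """Call run_rdist(host) for every host concurrently.

    run_rdist must return a list of output lines for its host. Most of
    each call is spent waiting on the remote server, so threads overlap well.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_rdist, hosts)  # results keep the host order
        return {h: tag_output(h, lines) for h, lines in zip(hosts, results)}
```

For example, push_in_parallel(["www1", "www2"], run_rdist) returns a dictionary mapping each host to its tagged output lines, which could then be sent back to the user.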