How To Set Up a Mirror of the EPICS Web Site

From EPICSWIKI

This page describes the techniques BESSY is using to maintain a mirror of the EPICS web site (mainly the documentation parts). This is a Wiki: please add to or correct things you find wrong, misleading or out of date.

Introduction

Idea

Even nowadays, internet access to the EPICS web site at the APS may be slow. For an EPICS installation, it might be useful to retain a local mirror of the EPICS documentation. Such a mirror minimises download times (some of the often-used documents are quite large) and reduces network traffic through your institute's link to the internet.

The EPICS web site changes, but not too often. Doing an automated synchronisation once per night should be good enough. Doing the sync automatically also keeps the maintenance effort down.

What you need

  1. Disk space. Not too much, though: our mirror uses about 600 MB.
  2. The wget utility on the host that does the synchronisation.
  3. A web server (we use apache) to serve the mirror.

How the synchronisation works

A cron job script starts wget, which crawls through the APS web server checking file dates and sizes. It downloads all files that are newer than the local copy or have a different size. After downloading, wget searches through the HTML code and replaces all links that point to the remote (original) web space with links that point to the local (mirror) web space. That way, browsing stays inside the mirror as long as possible.

After this synchronisation process, the local mirror contains a web space, i.e. a set of files containing HTML code, that can be served by a web server.

Caveat: wget only sees what a browser would see. If the web server uses a script to generate HTML code, wget will see the HTML code and save it under the name of the URL. (Usually the name of the script plus arguments.)
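To illustrate the caveat: with wget's default settings on a Unix host, the local file name is the URL without the scheme, including any query string. The helper below is only a sketch of that mapping and does not call wget. The function name url_to_local and the second sample URL (with its query string) are made up for illustration.

```shell
#!/bin/sh
# Sketch: where "wget -m" stores a page, assuming default settings on Unix.
# url_to_local is a hypothetical helper, not part of wget.
url_to_local() {
    # Strip the scheme ("http://"); host, path and query string remain.
    printf '%s\n' "${1#*://}"
}

url_to_local 'http://www.aps.anl.gov/epics/index.php'
# Hypothetical script URL: every distinct query string yields its own file.
url_to_local 'http://www.aps.anl.gov/epics/view.php?thread=1&expand=2'
```

Each distinct combination of script arguments therefore ends up as a separate file in the mirror.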

This brings us to the limits of the approach.

Limits

Tech-Talk and Core-Talk archives

There are two limitations regarding the Tech-Talk and Core-Talk mail exploder archives:

  • The exploder pages are a large web of preprocessed mail messages. The 'by thread' and 'by date' views are two php scripts, the thread script generating thousands of different views depending on which threads are shown expanded.
    As explained in the caveat above, wget would save every single view in a separate file, multiplying the needed disk space beyond any reason.
  • The archive pages contain a search engine, which is one of the most important tools to access the mail archives. There is no way to mirror a search engine.

For these reasons, we exclude the Tech-Talk and Core-Talk areas from our mirror.

I give some hints at a possible "right" way of doing it in the "Possible Improvements" section.

Wiki area

The Wiki principle is based on the fact that pages on a central server are editable through a web browser. This doesn't fit well with mirroring a web space:

  • either (without a local Wiki) you may get some read-only pages, but lots of links that point into nirvana,
  • or (with a local Wiki) you get a system where you can change things, but these local changes are overwritten.

For these reasons, we exclude the Wiki area from our mirror.

This is taken up again in the "Possible Improvements" section.

Error tracking system

The error tracking system (Mantis) is another feature of the EPICS web site that is based on a central server. Error reports are stored in a database and can be changed and updated by logged-in users. It needs the database, a lot of scripts and an account system, none of which can be mirrored easily.

For these reasons, we exclude the Mantis area from our mirror.


Installation

Prepare your machine

  • Identify the machine(s) you will be using for the synchronisation and the web server.
  • Choose the location for the mirror and make sure you have enough disk space accessible and writeable.
  • On the machine that does the synchronisation:
    • Make sure you have the wget utility available.
    • Make sure you are allowed to create a cron job.
  • Set up the web server (if not running already). Make sure you have the necessary rights to change its configuration, restart it, and put things into its root directory.
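The checks in the list above can be sketched as a small preflight script. This is only an illustration: the MIRRORS environment variable and the helper names are assumptions, the default path is the mirror location used on this page, and the 600 MB threshold is taken from the disk usage quoted earlier.

```shell
#!/bin/sh
# Preflight sketch: check the prerequisites for the synchronisation host.
# MIRRORS is a hypothetical override; the default is the path used on this page.
mirrors=${MIRRORS:-/opt/apache/Mirrors}
need_kb=614400        # about 600 MB, expressed in kilobytes

have_tool() {
    # Succeeds if the named program is found in PATH.
    command -v "$1" >/dev/null 2>&1
}

free_kb() {
    # Available space (in KB) on the file system holding directory $1.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

for tool in wget crontab; do
    have_tool "$tool" || echo "missing tool: $tool"
done

if [ -d "$mirrors" ] && [ -w "$mirrors" ]; then
    [ "$(free_kb "$mirrors")" -ge "$need_kb" ] || echo "less than 600 MB free in $mirrors"
else
    echo "mirror location $mirrors is missing or not writeable"
fi
```

The script only prints warnings; whether you are actually allowed to create a cron job is a site policy question that it cannot answer.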

Install and configure the synchronisation script

We are using the following script called 'fetch_epics' in the mirror location to do the synchronisation:


#!/bin/sh
#                                                    -*-Mode: shell-script;-*-
# fetch_epics
# Script to fetch mirrored Web pages using GNU wget
#
# Author: Ralph Lange      Date: 26 Aug 1997
#
# To make the relative links work, one has to set up links
# in the root directory of the mirror web server for all the
# first level directories like this:
# asd    ->  <whereever_you_keep_the_mirrors>/www.aps.anl.gov/asd
# aod    ->  ...../www.aps.anl.gov/aod
# xfd    ->  ...../www.aps.anl.gov/xfd
# epics  ->  ...../www.aps.anl.gov/epics
# Accelerator_Systems_Division  
#        ->  ...../www.aps.anl.gov/Accelerator_Systems_Division

# LOCAL CONFIGURATION (mirror site dependent)

# <whereever_you_keep_the_mirrors>
mirrors=/opt/apache/Mirrors

# Paths to some tools (cron sets just a minimal environment)
# We need: gzip wget
gzip=/usr/contrib/bin/gzip
wget=/opt/wget/bin/wget

# Log file name extension (expands to yy-mm-dd-hh:mm)
datecode=`date +%y-%m-%d-%R`

# How many log files to keep
keeplogs=32

# MIRROR CONFIGURATION (original server site dependent)
# Directories to include / exclude

entry_point='http://www.aps.anl.gov/epics/index.php'

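# The lists are built up piece by piece: inside single quotes a
# backslash-newline is kept literally and would break the list.
included_dirs='-I /epics,/asd/controls/epics/images'
included_dirs="$included_dirs,/asd/controls/epics/EpicsDocumentation,/asd/controls/hideos"
included_dirs="$included_dirs,/asd/oag,/asd/people/mrk,/asd/people/anj,/aod/bcda/epicsgettingstarted"
included_dirs="$included_dirs,/xfd"

excluded_dirs='-X /epics/tech-talk,/epics/core-talk,/epics/mantis'
excluded_dirs="$excluded_dirs,/epics/wiki,/asd/oag/cgi-bin,/xfd/SoftDist/Distribution"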

# Now let's do it ...

# Stop here if the mirror location is missing, so nothing lands elsewhere
cd "$mirrors" || exit 1
umask 022

# Make sure log directory exists
[ -d log ] || mkdir log

# From APS's EPICS server:
# All Documentation pages

$wget -m -nv -k -o log/epics.$datecode \
$included_dirs $excluded_dirs -i epics.url_list $entry_point

# Gzip logfile
$gzip log/epics.$datecode

# Clean up log directory
cd log && rm -f dummy `ls -t * | sed -e 1,${keeplogs}d`

This should be pretty much self-explanatory. Change the configuration part to whatever you need for your mirror.

This also needs a file called 'epics.url_list' in the $mirrors directory, which contains URLs that are needed, but do not have web links pointing to them:


http://www.aps.anl.gov/epics/epics.css
http://www.aps.anl.gov/epics/favicon.ico
http://www.aps.anl.gov/Accelerator_Systems_Division/Operations_Analysis/manuals/EPICStoolkit/EPICStoolkit.css
http://www.aps.anl.gov/Accelerator_Systems_Division/Operations_Analysis/manuals/SDDStoolkit/SDDStoolkit.css
http://www.aps.anl.gov/Accelerator_Systems_Division/Operations_Analysis/manuals/spiffe_V2.0/spiffe.css
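The last line of fetch_epics prunes the log directory so that only the newest $keeplogs files survive. The sketch below replays that idiom in a scratch directory so you can see what it does before trusting it with real logs. The file names and the value keeplogs=3 are made up for the demo.

```shell
#!/bin/sh
# Demo of the log clean-up idiom from fetch_epics, run in a scratch directory.
keeplogs=3
scratch=$(mktemp -d)

# Create five fake log files (names are made up for this demo).
for n in 1 2 3 4 5; do
    : > "$scratch/epics.log$n"
done

# Same idiom as in fetch_epics: "ls -t" lists the newest files first, sed
# drops the first $keeplogs names from that list, rm deletes what is left.
# The extra "dummy" argument presumably keeps older rm versions from
# complaining when the list is empty.
( cd "$scratch" && rm -f dummy `ls -t * | sed -e 1,${keeplogs}d` )

remaining=$(ls "$scratch" | wc -l | tr -d ' ')
echo "log files left: $remaining"
```

Note that the idiom relies on log file names without blanks, which holds for the names fetch_epics generates.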

Test the synchronisation script

At this point you should be able to test-drive the script. If everything works well, it should silently grab the complete EPICS documentation from the APS server and recreate it in your mirror location.

If there are problems, check the log file (in the log subdirectory) to find out what happened. You may have to define a proxy or adjust the timeouts for wget; for me it just works this way.

Check the wget man page or the GNU Wget Manual for the knobs you can tweak.
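As a starting point, a proxy and some more patient timeouts could be set up roughly like this. All values are assumptions for illustration: proxy.example.org is a placeholder host, the numbers are arbitrary, and the variable name wget_opts does not exist in fetch_epics as shown above.

```shell
#!/bin/sh
# Sketch of extra settings for fetch_epics; every value here is an assumption.

# wget honours the http_proxy environment variable.
# proxy.example.org:3128 is a placeholder, not a real proxy.
http_proxy='http://proxy.example.org:3128/'
export http_proxy

# --timeout: network timeout in seconds
# --tries:   number of attempts per file
# --wait:    seconds to pause between retrievals
wget_opts='--timeout=60 --tries=3 --wait=1'

# In fetch_epics the options would go right after $wget, for example:
#   $wget $wget_opts -m -nv -k -o log/epics.$datecode ...
echo "extra wget options: $wget_opts"
```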

Make it run automatically

Choose the right moment

Spend a little time to figure out when to do the synchronisation.

The internet load is considerably higher during the day, both in the US (where the APS server is) and where you are - so nightly synchronisation sounds like a good idea. Most changes to the web will be made from the US, i.e. the EPICS web should be settled by 1 a.m. UTC and be stable until noon UTC.

If you find a time between 01:00 and 12:00 UTC that is outside your local business hours, that's the time to do the synchronisation.
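To translate the 01:00 to 12:00 UTC window into your local time, a one-line calculation is enough. The helper below is a hypothetical sketch that takes the hour in UTC and your offset from UTC in whole hours; the offset of +2 is just an assumed example (central European summer time).

```shell
#!/bin/sh
# Convert an hour given in UTC to local time for a fixed UTC offset.
# utc_to_local is a hypothetical helper; pass hours without leading zeros.
utc_to_local() {
    echo $(( ($1 + $2 + 24) % 24 ))
}

offset=2    # assumed example: UTC+2
echo "stable window: $(utc_to_local 1 $offset):00 to $(utc_to_local 12 $offset):00 local time"
```

With an offset of +2 the window runs from 03:00 to 14:00 local time, so any slot before the working day starts would do.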

Create a cron job

Create a cron job that calls the fetch_epics script at the time you chose. Your crontab entry should look similar to this:

# Min  Hour Monthday Month Weekday Command
# ---------------------------------------------------------------------
   0     6     *       *     1-5   /opt/apache/Mirrors/fetch_epics

As you can see, we synchronise at six in the morning, Monday through Friday, and our mirror location is /opt/apache/Mirrors.
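If you prefer to add the entry from a script rather than through "crontab -e", something along these lines should work. It is a sketch only: add_cron_entry is a hypothetical helper that reads a crontab on standard input and writes it back with the entry appended, unless the entry is already present.

```shell
#!/bin/sh
# Append the fetch_epics entry to a crontab text unless it is already there.
# add_cron_entry is a hypothetical helper: crontab text in, crontab text out.
entry='0 6 * * 1-5 /opt/apache/Mirrors/fetch_epics'

add_cron_entry() {
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    # -F: match the entry as a fixed string, not as a regular expression
    if ! printf '%s\n' "$current" | grep -qF -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Real use (not run here, because it changes your crontab):
#   crontab -l | add_cron_entry | crontab -
printf '# existing jobs\n' | add_cron_entry
```

Running it twice does not duplicate the entry, which makes it safe to call from an installation script.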

Configure the web server

There's a list of things that have to be set for the web server. I'm showing the lines for configuring apache, adding some comments so you can find out what to do if you use a different server.

Allow php file suffix

Most EPICS web pages are generated by php scripts. As mentioned above, wget saves the HTML code under the script name, so we have to allow the suffix .php to be used for HTML code.

In our case, in the configuration file srm.conf, add a line like

# To use php files as (HTML-file)
AddType text/html .php
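After restarting the server, you can check that a mirrored .php file really is delivered as HTML by looking at the response headers. The helper below is a hypothetical sketch that reads headers on standard input; mirror.example.org stands in for your mirror's host name.

```shell
#!/bin/sh
# Check HTTP response headers (read from stdin) for an HTML content type.
# is_html_type is a hypothetical helper.
is_html_type() {
    # Drop carriage returns, then look for a Content-Type header naming
    # text/html; leading blanks are allowed because "wget -S" indents headers.
    tr -d '\r' | grep -qi '^[[:space:]]*content-type:[[:space:]]*text/html'
}

# Real use, with your mirror's host name in place of the placeholder:
#   wget -S --spider http://mirror.example.org/epics/index.php 2>&1 | is_html_type

# Offline demonstration with canned headers:
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n' | is_html_type && echo "served as HTML"
```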

Redirect missing parts to the original server

For all parts that are not mirrored, configure redirections that point back to the original server at the APS. That way users can browse through your mirror and automagically see the APS pages when they click a link to something that is not mirrored.

In the main configuration file httpd.conf, add the following lines in the section for the virtual host that does the EPICS mirror:

Redirect /epics/core-talk               http://www.aps.anl.gov/epics/core-talk
Redirect /epics/tech-talk               http://www.aps.anl.gov/epics/tech-talk
Redirect /epics/mantis                  http://www.aps.anl.gov/epics/mantis
Redirect /epics/wiki                    http://www.aps.anl.gov/epics/wiki
Redirect /asd/oag/cgi-bin               http://www.aps.anl.gov/asd/oag/cgi-bin
Redirect /xfd/SoftDist/Distribution     http://www.aps.anl.gov/xfd/SoftDist/Distribution

These should match the excluded_dirs from the synchronisation script.
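One way to keep the two lists in step is to generate the Redirect lines from the directory list used in fetch_epics. The following is a sketch under that assumption; make_redirects is a hypothetical helper and its output still has to be pasted into httpd.conf by hand.

```shell
#!/bin/sh
# Print one Redirect line per excluded directory.
# make_redirects is a hypothetical helper.
origin='http://www.aps.anl.gov'

make_redirects() {
    # $1: comma-separated directory list, as used after -X in fetch_epics
    printf '%s\n' "$1" | tr ',' '\n' | while read -r dir; do
        if [ -n "$dir" ]; then
            printf 'Redirect %s %s%s\n' "$dir" "$origin" "$dir"
        fi
    done
}

make_redirects '/epics/tech-talk,/epics/core-talk,/epics/mantis,/epics/wiki,/asd/oag/cgi-bin,/xfd/SoftDist/Distribution'
```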

Set up links in the server's root directory

For the internal links to work, you have to set up soft links in the web server's root directory for each mirrored first-level directory (asd, aod, xfd, epics and Accelerator_Systems_Division), pointing to your mirror.

In /opt/apache/htdocs (or wherever your server keeps the documents), add links like these:

asd   -> /opt/apache/Mirrors/www.aps.anl.gov/asd
aod   -> /opt/apache/Mirrors/www.aps.anl.gov/aod
xfd   -> /opt/apache/Mirrors/www.aps.anl.gov/xfd
epics -> /opt/apache/Mirrors/www.aps.anl.gov/epics
Accelerator_Systems_Division
      -> /opt/apache/Mirrors/www.aps.anl.gov/Accelerator_Systems_Division
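The links can be created in one go with a short loop. This is a sketch, not the script we actually use; link_mirror_dirs is a hypothetical helper that takes the document root and the mirrored host directory as arguments and leaves existing entries alone.

```shell
#!/bin/sh
# Create the soft links for all first-level directories of the mirror.
# link_mirror_dirs is a hypothetical helper.
#   $1: the web server's document root
#   $2: the mirrored host directory (e.g. .../Mirrors/www.aps.anl.gov)
link_mirror_dirs() {
    for dir in asd aod xfd epics Accelerator_Systems_Division; do
        # Skip names that already exist (as a link or as anything else).
        [ -e "$1/$dir" ] || [ -L "$1/$dir" ] || ln -s "$2/$dir" "$1/$dir"
    done
}

# Real use with the paths from this page (not run here):
#   link_mirror_dirs /opt/apache/htdocs /opt/apache/Mirrors/www.aps.anl.gov
```

Because existing names are skipped, the function can be run again after adding a directory to the list.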


Make your mirror available to the outside world

If you want to publish your mirror, so that others in your region can profit, I would strongly suggest the following steps.

Talk to your network guys

Publishing the mirror may cause additional network traffic, especially on the outgoing line. Depending on the type of your internet connection and how you pay for it (by volume?), someone in your institution might not be happy with that. Please make sure you have the approval of the responsible people before going public.

Change robots.txt to exclude the mirror

We don't want search engines to crawl every EPICS doc mirror in the world, because that makes search results annoyingly repetitive.

Enter lines like these into the robots.txt file at the server's root directory (inside a record that starts with a User-agent line, e.g. "User-agent: *") to keep search engines out:

Disallow:   /asd
Disallow:   /aod
Disallow:   /xfd
Disallow:   /epics
Disallow:   /Accelerator_Systems_Division

These should match the links to the mirror areas that you set into the server's root directory.

Please try to make sure that this works before the next googlebot crawls by.
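A complete robots.txt record for the mirrored areas can be generated from the link names. The sketch below assumes you want to keep out all crawlers; make_robots is a hypothetical helper.

```shell
#!/bin/sh
# Print a robots.txt record that keeps all crawlers out of the mirrored areas.
# make_robots is a hypothetical helper; pass the link names as arguments.
make_robots() {
    echo 'User-agent: *'
    for dir in "$@"; do
        echo "Disallow: /$dir"
    done
}

make_robots asd aod xfd epics Accelerator_Systems_Division
```

If your robots.txt already contains a "User-agent: *" record, merge the Disallow lines into that record instead of appending a second one.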

Tell Andrew to include a link in the original server

Send mail to Andrew Johnson to have him add a link to your mirror at the original site, so that others know where to find it.

Add statistics

If you are interested in who's browsing your mirror, you can get a web statistics tool and have it publish the stats as a web page. We use wwwstat to do this.

There are two tiny scripts in the mirror directory to do the job. One is called update_stats:


#!/bin/sh

acclog=/opt/apache/logs/www-csr.access_log
mirloc=/opt/apache/Mirrors
outloc=/opt/apache/www-csr/htdocs/control/Soft
wwwstat='/opt/perl/bin/perl /opt/wwwtools/bin/wwwstat'

umask 22
export PERLLIB=$PERLLIB${PERLLIB:+:}/opt/wwwtools/lib

$wwwstat -ident \
-A "(bessy.de|193.149.8|193.149.9|193.149.12|193.149.14|193.149.15)" \
-n "(asd)" -H "External Access Statistics for EPICS Doc Mirror" \
-X "last.wwwstats.html" -dns -cache $mirloc/wwwstat.dnscache $acclog \
> /tmp/wwwstats.html.$$

mv -f /tmp/wwwstats.html.$$ $outloc/wwwstats.html

where you will have to adapt the -A directive to exclude all internal accesses, so that only visitors from the outside appear in the statistics.

The other script is doing the monthly rotation and is called update_monthly:


#!/bin/sh

loc=/opt/apache/www-csr/htdocs/control/Soft

umask 22

/opt/apache/Mirrors/update_stats
cp -f $loc/wwwstats.html $loc/last.wwwstats.html

The crontab entries for those scripts look like this:

# Min  Hour Monthday Month Weekday Command
# ---------------------------------------------------------------------
   0     0     1       *      *    /opt/apache/Mirrors/update_monthly
   0     0    2-31     *      *    /opt/apache/Mirrors/update_stats

This will give you an idea of who else is taking advantage of your mirror and provide you with some numbers about the network traffic it causes.

Possible Improvements

Mirror mail exploder archives

A way to mirror the Tech-Talk and Core-Talk archives would be replicating the complete system:

  • install the needed tools (PHP, perl, ...)
  • install the scripts to generate HTML from mail messages
  • install the scripts for the date and thread views
  • install the indexing and search engines
  • copy all mail archives for old mail from the APS web site
  • create a mail address
  • subscribe that address to Tech-Talk/Core-Talk and feed incoming mail into the web space as Andrew is doing at the APS.

I don't know if all that effort is worth it. Maybe it is. Make sure you take notes when you're doing it so others can do it, too. (I'd be quite interested...)


Mirror Wiki areas read-only

It would be interesting to have a read-only version of the Wiki pages on the mirror. I have no idea how to do such a thing.

Tell me if you find out...


Good Luck!

Ralph Lange (BESSY)

--Ralph 04:36, 18 Aug 2006 (CDT)