How To Set Up a Mirror of the EPICS Web Site

From EPICSWIKI
Revision as of 15:14, 15 August 2006 by RalphLange

This page describes the techniques BESSY uses to maintain a mirror of the EPICS web site (mainly the documentation parts). This is a Wiki: please add to or correct anything you find wrong, misleading or out of date.

Introduction

Idea

Even nowadays, internet access to the EPICS web site at the APS may be slow. For an EPICS installation, it can be useful to keep a local mirror of the EPICS documentation. Such a mirror minimises download times (some of the frequently used documents are quite large) and reduces network traffic through your institute's link to the internet.

The EPICS web site changes, but not too often. Doing an automated synchronisation once per night should be good enough. Doing the sync automatically also keeps the maintenance effort down.

What you need

  1. Disk space. Not too much, though: our mirror uses about 400 MB.
  2. The wget utility on the host that does the synchronisation.
  3. A web server (we use Apache) to serve the mirror.

How the synchronisation works

A cron job script starts wget, which crawls through the APS web server checking file dates and sizes. It downloads all files that are newer than the local copy or have a different size. After downloading, wget searches through the HTML code and replaces all links that point to the remote (original) web space with links that point to the local (mirror) web space. That way, navigation stays inside the mirror as long as possible.
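
The update rule described above can be sketched as follows. This is only a minimal illustration of the decision (fetch when there is no local copy, when the sizes differ, or when the remote file is newer), not wget's actual implementation. The function name `needs_download` and its parameters are made up for this example.

```python
import os


def needs_download(local_path, remote_mtime, remote_size):
    """Return True if the remote file should be fetched.

    remote_mtime is a Unix timestamp and remote_size a byte count, as a
    web server might report them (hypothetical inputs for this sketch).
    """
    if not os.path.exists(local_path):
        return True  # nothing mirrored yet
    st = os.stat(local_path)
    if st.st_size != remote_size:
        return True  # sizes differ
    return remote_mtime > st.st_mtime  # remote copy is newer
```

A file whose size and date both match the local copy is skipped, which is what keeps a nightly run cheap when little has changed.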

After this synchronisation process, the local mirror contains a web space, i.e. a set of files containing HTML code, that can be served by a web server.

Caveat: wget only sees what a browser would see. If the web server uses a script to generate HTML code, wget will see the HTML code and save it under the name of the URL. (Usually the name of the script plus arguments.)
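
To illustrate the caveat, here is a rough sketch of how a crawler might derive the local file name from a URL. It only approximates the idea of "script name plus arguments"; the exact naming rules of wget may differ, and the host, script name and arguments below are invented for the example.

```python
from urllib.parse import urlsplit


def saved_name(url):
    """Approximate the file name a crawler would save a URL under."""
    parts = urlsplit(url)
    name = parts.path.rsplit("/", 1)[-1] or "index.html"
    if parts.query:
        name += "?" + parts.query  # script name plus arguments
    return name


# Two views of the same (hypothetical) script become two separate files.
print(saved_name("http://mirror.example/archive/threads.php?expand=17"))
print(saved_name("http://mirror.example/archive/threads.php?expand=17,42"))
```

Every distinct argument string yields a distinct file, even though all of them come from a single script on the server.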

Which brings us to the limits of this approach.

Limits

Tech-Talk and Core-Talk archives

There are two limitations regarding the Tech-Talk and Core-Talk mail exploder archives:

  • The exploder pages are a large web of preprocessed mail messages. The 'by thread' and 'by date' views are two PHP scripts; the thread script generates thousands of different views depending on which threads are shown expanded.
    As explained in the caveat above, wget would save every single view in a separate file, multiplying the required disk space beyond reason.
  • The archive pages contain a search engine, which is one of the most important tools to access the mail archives. There is no way to mirror a search engine.

For these reasons, we exclude the Tech-Talk and Core-Talk areas from our mirror.
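
A wget call with such exclusions might be assembled as in the sketch below. This is not the script BESSY actually runs. The site URL, the mirror directory and the excluded paths are placeholders, and the function `build_wget_command` is made up for this example. The wget options used (`--mirror`, `--convert-links`, `--no-parent`, `--directory-prefix`, `--exclude-directories`) are standard ones.

```python
import shlex


def build_wget_command(site_url, mirror_dir, excluded_dirs):
    """Build a wget command line for a mirror run (nothing is executed)."""
    cmd = [
        "wget",
        "--mirror",          # recursive download with timestamp checking
        "--convert-links",   # rewrite links to point into the local copy
        "--no-parent",       # never climb above the start directory
        "--directory-prefix=" + mirror_dir,
    ]
    if excluded_dirs:
        cmd.append("--exclude-directories=" + ",".join(excluded_dirs))
    cmd.append(site_url)
    return cmd


# All three values below are placeholders, not the real site layout.
command = build_wget_command(
    "http://epics.example.org/epics/",
    "/srv/epics-mirror",
    ["/epics/tech-talk", "/epics/core-talk"],
)
print(shlex.join(command))
```

Further areas can be excluded by adding their paths to the list. The printed line is what a cron job script would then run.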

I give some pointers to the "right" way of doing it in the "Possible Improvements" section.

Wiki area

The Wiki principle is based on the fact that pages on a central server are editable through a web browser. This doesn't fit well with mirroring a web space:

  • either (without a local Wiki) you may get some read-only pages, but lots of links that lead nowhere,
  • or (with a local Wiki) you get a system where you can change things, but these local changes are overwritten by the next synchronisation.

For these reasons, we exclude the Wiki area from our mirror.

This should actually be another topic in the "Possible Improvements" section.

Download area

Not really a limit anymore:

Historically (before EPICS became Open Source), downloading the source tars required a signed license and an account on the APS web site. That's why we excluded the download area from our mirror site.

As the EPICS license has changed, we could mirror the download area as well. We've been running out of disk space lately, though, so I'm deferring this until the mirror has moved to a different server (or disk).


Installation

Choose your machine

  • Identify the machine(s) you will be using for the synchronisation and the web server.
  • Choose the location for the mirror and make sure enough disk space is accessible and writable there.
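
A quick check of a candidate location could look like the sketch below. It assumes the roughly 400 MB quoted above as the minimum; the function name `mirror_location_ok` and the example path in the usage note are hypothetical.

```python
import os
import shutil

REQUIRED_BYTES = 400 * 1024 * 1024  # about 400 MB, the size quoted above


def mirror_location_ok(path, required_bytes=REQUIRED_BYTES):
    """Check that path is a writable directory with enough free space."""
    if not os.path.isdir(path):
        return False
    if not os.access(path, os.W_OK | os.X_OK):
        return False
    return shutil.disk_usage(path).free >= required_bytes
```

Calling, for instance, `mirror_location_ok("/srv/epics-mirror")` as the user that will run the cron job shows whether that account can actually write to the directory.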

This document is under construction. More details will be here soon.



Possible Improvements

Mirror mail exploder archives

One way to mirror the Tech-Talk and Core-Talk archives would be to replicate the complete system:

  • install the needed tools (PHP, Perl, ...)
  • install the scripts to generate HTML from mail messages
  • install the scripts for the date and thread views
  • install the indexing and search engines
  • copy all mail archives for old mail from the APS web site
  • create a mail address
  • subscribe that address to Tech-Talk/Core-Talk and feed incoming mail into the web space as Andrew is doing at the APS.

I don't know if all that effort is worth it. Maybe it is. Make sure you take notes when you're doing it so others can do it, too. (I'd be quite interested...)


Mirror Wiki areas read-only

This document is under construction. More details will be here soon.