Introduction

Every journal that LOCKSS might preserve has some peculiarities; some are more peculiar than others. The LOCKSS daemon gets its knowledge of these peculiarities, be they large or small, from a plugin. The plugin for a complex journal is written mostly in Java, but for most simple journals the plugin is an XML file interpreted by a generic, or definable, plugin written by the LOCKSS team. Here we introduce the LOCKSS plugin generation tool, which provides a user interface that allows you to create and test a plugin for most simple journals with no programming. The user provides input describing the chosen journal, and the tool outputs an XML definition of the plugin for that journal. Where a single publishing platform, for example HighWire Press, publishes a large number of journals, a single (complex) plugin will support all of them. This document focuses on the simpler journals, whose plugins are within the capabilities of the plugin generation tool, so we assume below that there is a single plugin for each journal.

Eventually, the LOCKSS daemon will obtain this XML file via a plugin registry, a search facility linking bibliographic information about journals to the appropriate plugin. Until the plugin registry is working, the XML files defining plugins must be e-mailed to LOCKSS support. After testing, the LOCKSS team will include them in the daemon distribution.

One major function of the plugin for a journal is to divide the journal's content into manageable chunks called archival units (AUs). Typically, an AU will consist of a year's run of a journal, or a volume. The information the plugin needs about an AU includes its crawl rules, which tell the LOCKSS web crawler where to stop when it is trying to find newly published content to collect. Other information includes the publisher manifest page, which tells the crawler where to start, and the crawl interval, which tells it how often to start. We use examples of some real journals to show how you find this information from the journal's web site, how you feed it into the tool to generate a suitable plugin, and how you test the plugin to be sure you have the information correct.

Obtaining and Running the Tool

The latest version of the plugin tool is 0.10.2. It is available in two ways: by download or via CVS. You can also read the release notes for this version.

Download

You can download two versions: a .zip archive (whose SHA1 is 43934dd2145774544bf5ab2c7d1fd5de2f104122) or a .tgz archive (whose SHA1 is 401c6822c5f10e98421d7387eabf4bd51fa21559) from one of the following locations.
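One way to check a downloaded archive against the SHA1 values quoted above is sketched below in Python. The function names are illustrative, and the archive file name in the usage note is an assumption; substitute whatever name your download actually has.

```python
import hashlib

# SHA1 digests published above for the two archives.
EXPECTED_SHA1 = {
    "zip": "43934dd2145774544bf5ab2c7d1fd5de2f104122",
    "tgz": "401c6822c5f10e98421d7387eabf4bd51fa21559",
}


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 digest of the file at path, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, kind):
    """Compare the file's digest with the published one for kind ('zip' or 'tgz')."""
    return sha1_of_file(path) == EXPECTED_SHA1[kind]
```

For example, archive_is_intact("plugintool.zip", "zip") returns True only if the file's digest equals the published value for the .zip archive (the file name here is hypothetical).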

Download and unpack the archive into a directory, change into that directory, and invoke the script in that directory (plugintool for Linux and Mac OS X, and plugintool.bat for Windows).

CVS

You can check out the lockss-daemon project from lockss.cvs.sourceforge.net:/cvsroot/lockss (more instructions at http://sourceforge.net/cvs/?group_id=47774) and run the command:

ant run-tool -Dclass=org.lockss.devtools.plugindef.PluginDefinerApp

You will need a working Java environment, and the JAVA_HOME environment variable must be set.

Reporting Bugs

If you encounter any bugs or have feature requests, please put them in our SourceForge Issue Trackers, selecting "Plugin Tool" as the category.

Tutorial

See the Plugin Tool Tutorial for a brief overview of the LOCKSS Plugin Tool.

Evaluating Your Journal

Now you are ready to evaluate your chosen journal and collect the information you need to create a plugin for it. In this section you will collect the information; the following sections cover some of the issues you may run into that we haven't covered so far. Point your browser at your journal's home page and start answering the questions below:

  1. What is the URL of the journal's home page?
  2. Is the journal structured as years, volumes, a sequence of issues, or some other way?
  3. If there is a volume table of contents page:
    1. What is the URL for the volume before the current one?
    2. Does this page link to all other volumes, to the next and previous volumes, or to no other volumes?
    3. Choose a typical issue in this volume.
  4. If there is a year table of contents page:
    1. What is the URL for the year 2003?
    2. Does this page link to all other years, to the next and previous years, or to no other years?
    3. Choose a typical issue from 2003.
  5. Is there a table of contents page for your chosen issue? If so:
    1. What is its URL?
    2. Does this page link to all other issues, all other issues in this volume or year, the next and previous issues, or to no other issues?
  6. Is there a publisher manifest page? If there is, what is its URL? You will not be able to finish specifying a plugin for the journal or test the plugin you have specified until the publisher manifest page is in place.
  7. Choose a typical article in your chosen issue. What is its URL?
  8. Check the format in which the article is delivered (e.g. PDF, HTML, ...).
  9. If the article format is HTML does it contain:
    1. Advertisements? If so, what is the URL for a typical advertisement?
    2. Personalizations? If so, what do they look like?
    3. Images? If so, what is the URL for a typical image? (NB sites often use images for mathematical and other non-standard characters).
    4. Links to cited articles in the same journal? If so, what is the URL for a typical intra-journal citation?
    5. Links to cited articles in other journals? If so, what is the URL for a typical inter-journal citation?
    6. Links to sound clips? If so:
      1. What is the URL for a typical sound clip?
      2. Is the sound streamed or downloaded?
      3. What format is used?
    7. Links to movies? If so:
      1. What is the URL for a typical movie?
      2. Is the movie streamed or downloaded?
      3. What format is used?
    8. JavaScript? If so, what does the JavaScript implement (e.g. a search button)?
    9. Links to all other articles in the same volume, year or issue, the next and previous articles, or to no other articles (except for citations)?
    10. Hit refresh. Identify any elements on the page that have changed.
  10. Review your answers above. If you have pages that link to all other similar pages rather than simply to the next and previous similar pages, or if you have advertisements, personalizations or page elements that changed on refresh, the plugin for your journal is beyond the current capabilities of the plugin tool (see below). Please discuss your findings with the LOCKSS team. Otherwise you can proceed to analyze your journal's URL structure.

Journal Structure with Article IDs

Some journals encode the volume and issue in their URLs for articles. An example we used above is EMLS, whose article URLs are like http://www.shu.ac.uk/emls/09-3/finntabl.htm, which includes the volume (09) and issue (3). In this case the crawl rules should include everything that matches Start followed by base_url followed by volume padded with zeros to a field width of 2 followed by a literal string of - followed by Any Number followed by a string literal of / followed by Anything followed by End.
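The EMLS rule described above can be approximated with a regular expression. The sketch below is only an illustration in Python, not the format the tool emits: it assumes that base_url is http://www.shu.ac.uk/emls/, that "Any Number" corresponds to one or more digits, and that "Anything" corresponds to any run of characters. The function name emls_rule is hypothetical.

```python
import re


def emls_rule(base_url, volume):
    """Build: Start, base_url, volume zero-padded to width 2, '-', number, '/', anything, End."""
    return re.compile(
        "^" + re.escape(base_url) + "%02d" % volume + r"-\d+/.*$"
    )


# Assumed base_url for EMLS; volume 9 is padded to "09".
rule = emls_rule("http://www.shu.ac.uk/emls/", 9)

in_volume = rule.match("http://www.shu.ac.uk/emls/09-3/finntabl.htm") is not None
other_volume = rule.match("http://www.shu.ac.uk/emls/10-1/index.htm") is not None
print(in_volume, other_volume)
```

The article URL quoted above matches the rule for volume 9, while a URL under a different volume prefix (the second URL is made up for the comparison) does not.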

Some journals encode the issue but not the year or volume in their URLs for articles. An example we used above is Disputatio, whose article URLs are like http://disputatio.com/articles/016-3.pdf -- it is the third article in issue number 16 of the journal. In this case we want to use the exclusionary crawl rules described above.

Some journals use parameters on their URLs for articles. An example is Studies in Nonlinear Dynamics and Econometrics, hosted on BePress. A typical article has a URL like http://www.bepress.com/cgi/viewcontent.cgi?article=1208&context=snde. The parameters are the part after the ?. In this case the context= parameter selects the journal among all the journals hosted on BePress, and the article= parameter selects the individual article from the journal. Here we want to exclude everything that doesn't match base_url (http://www.bepress.com/) followed by cgi/viewcontent.cgi? followed by Anything followed by context=snde.
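As a rough illustration of the parameter-based rule above, the following Python sketch treats "Anything" as any run of characters and excludes every URL that fails to match. The names SNDE_ARTICLE and is_excluded are hypothetical, and the second URL in the usage note is invented for contrast.

```python
import re

BASE_URL = "http://www.bepress.com/"

# base_url, then cgi/viewcontent.cgi?, then Anything, then context=snde.
# The description names no End, so none is added here; as a result a
# context= value that merely begins with "snde" would also match.
SNDE_ARTICLE = re.compile(
    "^"
    + re.escape(BASE_URL)
    + re.escape("cgi/viewcontent.cgi?")
    + ".*"
    + re.escape("context=snde")
)


def is_excluded(url):
    """Exclude every URL that does not match the journal's article pattern."""
    return SNDE_ARTICLE.match(url) is None
```

With this sketch, is_excluded("http://www.bepress.com/cgi/viewcontent.cgi?article=1208&context=snde") is False, while the same URL with a different context= value, such as context=other, is excluded.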

Where a publisher platform hosts multiple journals, it may not be possible to tell from the article URLs whether an article is part of the journal of interest. If this is the case, a plugin for your journal is beyond the current capabilities of the plugin tool (see below). Please discuss your findings with the LOCKSS team.

Archival Unit Design

So far, we have assumed that the publisher manifest file is a given. In practice, it is typically the result of a negotiation between the library responsible for selecting a journal and generating the plugin, and the publisher. The questions that arise are two-fold:

  • How to divide up the publisher's content into Archival Units that are of a manageable size, and which do not last too long? As a rule of thumb, if the publisher divides the journal into volumes, each volume should be an AU. Otherwise, each year of the journal should be an AU.
  • What to put in the publisher manifest page, and where to put it? As a rule of thumb, if each issue of the journal has a table of contents page, then the publisher manifest page should point to the table of contents page for each issue that makes up the AU (volume or year). Otherwise, if there is a single table of contents page for all issues in a volume or year pointing directly to their contents, the manifest page should do the same.

Many journals are published using a sophisticated platform, such as HighWire Press, Atypon, Open Journal Systems or Project MUSE. In these cases a single plugin will generally work for all journals hosted on the publishing platform. To do this, the configuration parameters must be sufficient to distinguish between the AUs of the multiple journals on the platform. There are two common cases:

Crawl Rule Design

In the examples above we have seen two different approaches to designing crawl rules: inclusionary rules, which primarily list the URLs that should be fetched and exclude everything else, and exclusionary rules, which primarily list patterns for URLs that should not be fetched and fetch everything else. Which approach should you choose?

  • If the journal has a deep URL structure with a separate directory for each volume or year, write rules that exclude everything not under the appropriate volume or year directory. For example, if the URL looks like http://bmj.bmjjournals.com/cgi/content/full/328/7453/1405 then excluding everything that doesn't match base_url followed by cgi/content/full followed by volume is a good approach.
  • If the journal uses parameters in the URL and they include the volume or year, write rules that exclude everything that doesn't have the appropriate parameters.
  • If the journal has a flat structure, with everything in a single directory, write rules that include the files that are actually needed. For example, if the URL looks like http://disputatio.com/articles/016-3.pdf then including everything that matches base_url followed by articles/ followed by a number followed by - followed by a number followed by anything is a good approach.
  • If the journal uses parameters in the URL and they do not include the volume or year, write rules that include the files that are actually needed.
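The two styles in the guidelines above can be compared with a small Python sketch. It is not the daemon's rule engine: the action names include_match and exclude_no_match, the function should_fetch, and the exact regular expressions (including the slash between cgi/content/full and the volume) are assumptions made for illustration.

```python
import re


def should_fetch(url, rules, default=False):
    """Apply (action, pattern) rules in order; the first rule that decides wins."""
    for action, pattern in rules:
        matched = re.match(pattern, url) is not None
        if action == "include_match" and matched:
            return True
        if action == "exclude_no_match" and not matched:
            return False
    return default


# Deep structure (BMJ example): exclude everything outside volume 328,
# and fetch whatever is left (default=True when evaluating).
bmj_rules = [
    ("exclude_no_match", r"^http://bmj\.bmjjournals\.com/cgi/content/full/328/"),
]

# Flat structure (Disputatio example): include only number-number files
# under articles/, and exclude whatever is left (default=False).
disputatio_rules = [
    ("include_match", r"^http://disputatio\.com/articles/\d+-\d+.*$"),
]
```

Under these assumptions, should_fetch("http://bmj.bmjjournals.com/cgi/content/full/328/7453/1405", bmj_rules, default=True) is True, and should_fetch("http://disputatio.com/articles/016-3.pdf", disputatio_rules) is True, while URLs outside either pattern are rejected.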

Limitations of the Plugin Tool

The plugin generation tool has limitations; the plugins for many more complex journals will exceed them. In general, any plugin that requires knowledge of the Java classes in the daemon is likely to fall into this class. Although the Expert Mode described in the next section allows plugins to use some pre-defined Java classes, we aren't yet ready to explain how you find out what these classes are or what exactly they do for you. We are still working to expand the capabilities of these pre-defined classes as we build and test plugins for complex journals; anyone else attempting a complex journal is also likely at present to find their capabilities inadequate and need to add to them by writing Java.

There are a number of warning signs that a journal is complex enough to be beyond the capabilities of the plugin tool. If you find any of these features in your chosen journal you should consult the LOCKSS team:

  • Some journals add advertisements, personalizations and other dynamic content to the otherwise static pages. The LOCKSS daemon must filter the pages of these journals to remove the additions before comparing them with the same pages at other caches, which will have received different advertisements, etc. Filtering requires the use of special filter classes and the tool support for these is incomplete.
  • Some journals experience massive spikes of reader interest immediately after a new issue is published. They typically want LOCKSS not to crawl during these predictable periods of high load, and the LOCKSS daemon has classes that provide suitable crawl windows in time.
  • Some journals do not return normal HTTP error codes, such as 404 for "Page not found", but instead return "helpful" pages with a different return code. The LOCKSS daemon has classes that recognize these error pages and re-map them to be conventional errors.
  • Some journals have sophisticated access control methods or crawler traps to prevent theft of their valuable content.
  • Some journals have media types for which LOCKSS support is not yet available, for example streaming media. In some cases, such as Real Audio, LOCKSS does support non-streamed versions but this again is beyond the plugin tool's current capability.
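To see why filtering matters for the first warning sign above, consider the following Python sketch. It is not the daemon's filter mechanism, which uses Java filter classes; it only illustrates the idea of removing dynamic content before hashing. The assumption that advertisements are wrapped in a div with class "ad" is hypothetical, as are the sample pages.

```python
import hashlib
import re

# Hypothetical marker: assume the publisher wraps each advertisement in
# <div class="ad"> ... </div> with no nested div elements inside it.
AD_BLOCK = re.compile(r'<div class="ad">.*?</div>', re.DOTALL)


def raw_hash(html):
    """Hash the page exactly as received."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()


def filtered_hash(html):
    """Hash the page after removing the (assumed) advertisement blocks."""
    stable = AD_BLOCK.sub("", html)
    return hashlib.sha1(stable.encode("utf-8")).hexdigest()


# The same article as received by two caches with different advertisements.
page_at_cache_a = '<p>Article text</p><div class="ad">Buy widgets</div>'
page_at_cache_b = '<p>Article text</p><div class="ad">Visit our sponsor</div>'
```

Without filtering, the two caches compute different hashes for what is really the same article; after filtering, the hashes agree.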

Using Expert Mode

Selecting Expert Mode in the Plugin menu will bring up the remaining parameters:

Default Crawl Depth
The number of levels down from the starting page that a new content crawl should go to check for changes. The default is 1, in which case only the manifest page is checked for new content.
Crawl Window Class
The name of the class that encapsulates the temporal crawl restrictions for the publisher's site.
Filter Classes
The list of filters, by MIME type, applied to content before it is hashed.
Crawl Exception Class
The name of the class to be called when HTTP errors are generated. This only needs to be implemented if the site uses HTTP return codes in non-standard ways.
Cache Exception Map
A map of return codes to error handlers. This only needs to be used if the site's return codes must be remapped, for example when some responses that arrive with a 200 code should be treated as 404 errors.
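The idea behind a map from return codes to error handlers can be sketched in Python as follows. This is not the daemon's Java interface: the exception classes, the map contents and the function check_response are all hypothetical, and the codes shown are illustrative rather than taken from any real site.

```python
class NoSuchPage(Exception):
    """Hypothetical error meaning the requested page does not exist."""


class RetryLater(Exception):
    """Hypothetical error meaning the fetch should be retried later."""


# Hypothetical map: return code -> exception the crawler should raise.
EXCEPTION_MAP = {
    302: NoSuchPage,   # assumed site that redirects to a "helpful" page instead of sending 404
    503: RetryLater,   # assumed site that is temporarily overloaded
}


def check_response(status):
    """Raise the mapped exception for status, or return None if it is unmapped."""
    handler = EXCEPTION_MAP.get(status)
    if handler is not None:
        raise handler("return code %d" % status)
    return None
```

Codes that do not appear in the map pass through untouched, so only sites that use return codes in non-standard ways need any entries at all.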

Overriding Plugin Settings

Most of the information in a plugin is fixed, but some items can be overridden by the LOCKSS daemon's property settings, obtained locally or from one of the LOCKSS property servers. The list of such information is:

  • The refetch depth, which is the depth from the starting URL to which the crawler's search for new content will proceed. If each new article is linked directly from the publisher's manifest page, this depth should be 1. If the manifest page links to the table of contents for an issue, which in turn links to each new article, this depth should be 2, and so on. Each time it looks for new content, the crawler will re-fetch each URL up to this depth from the start URL, even if it is already in the cache.
  • The new content crawl interval.
  • A flag that enables and disables the crawl window (but not the crawl window itself).
  • The set of pre-configured titles.
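The effect of the refetch depth in the list above can be illustrated with a breadth-first walk over a small link graph. The Python sketch below follows one reading of the description: depth 1 re-fetches only the start (manifest) page, depth 2 also re-fetches the pages it links to, and so on. The function name and the sample site are hypothetical.

```python
from collections import deque


def pages_to_refetch(start_url, links, refetch_depth):
    """Return, in breadth-first order, the pages re-fetched for a given depth.

    links maps each URL to the list of URLs it links to. The start URL is
    level 1; links are followed only from pages above the refetch depth.
    """
    seen = {start_url}
    queue = deque([(start_url, 1)])
    result = []
    while queue:
        url, level = queue.popleft()
        result.append(url)
        if level < refetch_depth:
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, level + 1))
    return result


# Hypothetical site: the manifest links to issue tables of contents,
# which link to the articles.
site = {
    "manifest": ["toc-issue-1", "toc-issue-2"],
    "toc-issue-1": ["article-1", "article-2"],
    "toc-issue-2": ["article-3"],
}
```

With this site, a depth of 1 re-fetches only the manifest page, while a depth of 2 re-fetches the manifest and both tables of contents, which is what lets the crawler discover articles added to an existing issue.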