Overview

The LOCKSS crawler needs explicit permission to harvest content. It gets this permission from a publisher manifest page.

The publisher manifest page serves two purposes:

  • It grants permission for LOCKSS to collect and preserve the Archival Unit (AU). LOCKSS will not collect any content for an AU unless it finds one of the Supported Permission Statements.
  • It provides a starting point for collection. All material to be collected as part of the AU should be reachable by following links from this page. (Alternatively, the starting points may be defined by an OAI query.)
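
The two roles of the manifest page can be illustrated with a minimal Python sketch. It is not the LOCKSS crawler's actual implementation: the statement text in ASSUMED_STATEMENT is an assumed wording that should be checked against the Supported Permission Statements list, and ManifestParser and read_manifest are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed wording: verify against the Supported Permission Statements list.
ASSUMED_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, "
    "and serve this Archival Unit"
)


class ManifestParser(HTMLParser):
    """Collects the text and the link targets of one manifest page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the manifest page URL.
                self.links.append(urljoin(self.page_url, href))

    def handle_data(self, data):
        self.text_parts.append(data)


def read_manifest(page_url, html):
    """Return (permission_found, start_urls) for one manifest page."""
    parser = ManifestParser(page_url)
    parser.feed(html)
    parser.close()
    # Normalise whitespace so line breaks inside the statement do not matter.
    text = " ".join(" ".join(parser.text_parts).split())
    has_permission = ASSUMED_STATEMENT.lower() in text.lower()
    # Without a permission statement, nothing is collected.
    return has_permission, (parser.links if has_permission else [])
```

In this sketch a page without the statement yields no starting URLs at all, which mirrors the rule that no content is collected unless permission is found.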

The LOCKSS crawler then requires a plugin that defines the inclusion and exclusion rules it needs to collect the correct content. This plugin is generally developed by the LOCKSS team. Although we accept plugins written by publishers, we do not expect most publishers to write them.

Manifest pages for open access materials

Open access web sites are able to become LOCKSS compliant by doing the following:

  • Put one of the Supported Permission Statements on your web site. It can go anywhere on your site.
  • Contact the LOCKSS support team by email to tell us the URL at which you've put the permission statement.

In most instances, this one-off requirement is the only technical contribution that is needed on your part. The LOCKSS team will undertake the plugin development needed for the collection of continuing journal volume releases.

Manifest pages for subscription materials

LOCKSS collects content delimited by what we term Archival Units. These typically represent either a full year's worth of content or a complete journal volume.

A LOCKSS crawler runs from within the IP range registered for its host institution. As a result, it can only access content to which that institution subscribes. Note also that, because of how Archival Units are composed, the institution must have access to the complete volume or year run.
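
On the publisher side, IP-based access of this kind amounts to checking a client address against the ranges registered for a subscribing institution. The following Python sketch shows one way such a check could look. The ranges are reserved documentation addresses, and SUBSCRIBER_RANGES and is_subscriber_address are hypothetical names, not part of LOCKSS or of any particular authentication system.

```python
import ipaddress

# Hypothetical ranges registered for one subscribing institution
# (reserved documentation address blocks, not real data).
SUBSCRIBER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_subscriber_address(client_ip, ranges=SUBSCRIBER_RANGES):
    """Return True if client_ip falls inside any registered range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ranges)
```

A crawler at 192.0.2.17 would pass this check, while one at 203.0.113.5 would be refused, just as any other reader outside the institution's range would be.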

These requirements have implications for manifest page design and implementation. To achieve LOCKSS compliance with subscription-based content, you will need to make per-volume (or per-year) manifest pages.

  • The manifest page must be accessible only to IP addresses that can access the full content of the AU. (In some cases, the authentication system in use does not allow the manifest pages to be placed under such IP authentication. If access to the manifest page is not constrained, the manifest may instead contain a probe URL pointing to a page within that Archival Unit that is so constrained. See the Probe URL section below.)
  • The manifest page must reside at a URL that can be derived from a pattern in the plugin and the values of the AU's defining parameters. For example, the parameters are usually "base_url" and "volume". If the manifest pages are kept at URLs such as
http://jtitle.pub.com/clockss-manifests/vol_nn.html

where nn is the volume name/number, then the plugin could derive those URLs using a printf-style format string such as

sprintf("%s/clockss-manifests/vol_%s.html", base_url, volume);

In this way, we can access multiple volumes by setting parameter values such as

base_url=http://sometitle.somepub.com, volume=23
base_url=http://sometitle.somepub.com, volume=24
base_url=http://anothertitle.someotherpub.com, volume=2006
  • The manifest page contains starting URLs pointing to journal content. These starting URLs must allow the LOCKSS crawler to collect the content within that journal volume. You may point to a journal volume table of contents (from which we crawl the issue tables of contents and then collect the individual articles), or you may include links to the individual issue tables of contents. We prefer to crawl from the issue pages, because the user then ends up with a familiar page from which to navigate to content.
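
The URL derivation described above can be tried out with a short Python sketch that applies the same printf-style pattern to the example parameter values. The function name manifest_url is hypothetical, and stripping a trailing slash from base_url is an assumption added for robustness. Real plugins define their patterns in the plugin itself.

```python
def manifest_url(base_url, volume,
                 pattern="%s/clockss-manifests/vol_%s.html"):
    """Fill the printf-style pattern with one AU's parameters."""
    # Assumption: tolerate a trailing slash on base_url.
    return pattern % (base_url.rstrip("/"), volume)


# The three example AUs from the text above.
archival_units = [
    ("http://sometitle.somepub.com", "23"),
    ("http://sometitle.somepub.com", "24"),
    ("http://anothertitle.someotherpub.com", "2006"),
]

for base_url, volume in archival_units:
    print(manifest_url(base_url, volume))
```

Because every manifest URL follows from the pattern and two parameters, adding a new volume requires only a new parameter value, not a plugin change.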

See PluginToolTutorial for more information.

Sites with Multiple Hosts

The permission page must be present on the *same host* as the content to be collected. Sites that spread content over multiple hosts must place a permission page on each host.

This restriction is necessary to ensure that permission was granted by the same authority that controls the content. E.g., there's no reason to trust a manifest page at http://othersite.com/manifest pointing to (and granting permission to collect) content at http://publisher.com because there's no evidence that the operator of publisher.com has any control over (or is even aware of) othersite.com.

We have chosen not to relax the "same host" rule for sub-domains. E.g., http://example.com/manifest cannot grant permission to collect content at http://www.example.com/. The authority relationship between a parent- and sub-domain reflects only naming authority; there is no way to determine whether it also extends to content authority.
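
The "same host" rule can be expressed as an exact comparison of host names. The Python sketch below illustrates the rule as stated here. It is not taken from the LOCKSS code base, the function name same_host is hypothetical, and whether the real check also compares scheme or port is not covered by this page.

```python
from urllib.parse import urlsplit


def same_host(permission_url, content_url):
    """Return True only if both URLs name exactly the same host."""
    permission_host = urlsplit(permission_url).hostname
    content_host = urlsplit(content_url).hostname
    # Exact match only: a parent domain does not cover its sub-domains.
    return permission_host is not None and permission_host == content_host
```

With this check, http://example.com/manifest does not grant permission for http://www.example.com/, so a site that spreads content over several hosts needs a permission page on each of them.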

Publisher Manifest Template

See Publisher_Manifest_Template.

Breck Witte of Columbia University created an automated form for generating LOCKSS manifest pages (Automated Form).