Use of OAI-PMH in the LOCKSS System
May 2004
The LOCKSS team plans to use OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) in two ways: by exporting it and by importing it.
Export of OAI-PMH from LOCKSS systems
The LOCKSS system will be enhanced to act as an OAI-PMH Repository, i.e. to respond to the six OAI-PMH request types (verbs) with the metadata it has available about its content. The design of this facility is under way. Questions to be resolved include:
- The granularity at which an OAI-PMH "item" is represented. Is this a LOCKSS Archival Unit (AU), or a smaller unit defined by a journal such as an issue or an article?
- What OAI-PMH "unique identifier" to use for the "items"? The natural identifier in the LOCKSS context is the URL, but the details of how it would be used depend on the item granularity.
- What metadata will be included in OAI-PMH "records"? LOCKSS plugins will be enhanced so that the plugin responsible for a particular AU can export the metadata it collects from the two sources available to it:
  - Some metadata is available in the individual files preserved, including MIME type and in some cases bibliographic data.
  - Some metadata is available from the publisher manifest page.
- Is it possible to enhance the available format metadata using JHOVE?
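To make the open questions above concrete, here is a minimal sketch of how an exported record might look if the item's URL were used as the OAI-PMH identifier and the metadata were expressed as unqualified Dublin Core (`oai_dc`). The function name `build_record`, the example URL, and the choice of fields are assumptions made for illustration; the design described here has not settled any of them.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_record(item_url, datestamp, dc_fields):
    """Build an OAI-PMH <record> element for one item.

    item_url  -- URL used as the unique identifier (an assumption, see text)
    datestamp -- date of last change, e.g. "2004-05-01"
    dc_fields -- list of (dublin_core_element_name, value) pairs
    """
    record = ET.Element(f"{{{OAI_NS}}}record")

    # The header carries the identifier and datestamp.
    header = ET.SubElement(record, f"{{{OAI_NS}}}header")
    ET.SubElement(header, f"{{{OAI_NS}}}identifier").text = item_url
    ET.SubElement(header, f"{{{OAI_NS}}}datestamp").text = datestamp

    # The metadata part wraps a single oai_dc:dc container.
    metadata = ET.SubElement(record, f"{{{OAI_NS}}}metadata")
    dc = ET.SubElement(metadata, f"{{{OAI_DC_NS}}}dc")
    for name, value in dc_fields:
        ET.SubElement(dc, f"{{{DC_NS}}}{name}").text = value
    return record
```

A plugin could, for example, call `build_record("http://www.example.com/journal/v1/issue1/", "2004-05-01", [("format", "text/html")])` and serialize the result with `ET.tostring(...)`. Whether one record corresponds to an AU, an issue, or an article is exactly the granularity question left open above.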
Import of OAI-PMH by LOCKSS systems
Currently, the LOCKSS ingest mechanism relies on a "publisher manifest page" in HTML, through which publishers give LOCKSS systems permission to crawl their websites and to collect and preserve their content. The page includes a permission statement, optional Dublin Core metadata and links to the actual content in question. Each LOCKSS cache fetches and validates this page, then uses the links it contains as starting points for a web crawl. The rules that specify where the crawl stops are normally expressed in XML and obtained out-of-band, but they could also be obtained via a link on the publisher manifest page.
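The validate-then-crawl step can be sketched roughly as follows. This is an illustration, not the LOCKSS implementation: the wording in `PERMISSION_TEXT` is an assumption (the text above does not quote the permission statement), and the names `ManifestParser` and `crawl_seeds` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed wording of the permission statement; not quoted in this document.
PERMISSION_TEXT = (
    "LOCKSS system has permission to collect, preserve, and serve "
    "this Archival Unit"
)


class ManifestParser(HTMLParser):
    """Collect visible text and absolute link targets from a manifest page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the manifest page's URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl_seeds(manifest_html, base_url):
    """Return the crawl starting points, or [] if permission is missing."""
    parser = ManifestParser(base_url)
    parser.feed(manifest_html)
    parser.close()
    # Normalize whitespace so line breaks in the page do not hide the statement.
    text = " ".join(" ".join(parser.text_parts).split())
    if PERMISSION_TEXT not in text:
        return []
    return parser.links
```

The sketch stops at producing the starting URLs. Deciding where the crawl ends is the job of the separate XML crawl rules mentioned above, which are not modeled here.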
The LOCKSS team plans to add a second ingest mechanism which uses OAI-PMH rather than the publisher manifest page. Publishers would export OAI-PMH metadata describing their content. The LOCKSS OAI-PMH ingest mechanism would harvest this metadata and use it to control fetching of the content. This offers several advantages:
- The use of selective harvesting can make collection more efficient by identifying exactly what content has changed since the last crawl.
- OAI-PMH allows publishers to provide LOCKSS systems with better metadata.
- Publishers can tell LOCKSS that a document has a new version, rather than the LOCKSS systems having to infer that it has.
- Publishers can support both OAI-PMH and LOCKSS with a single mechanism.
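As an illustration of the selective-harvesting advantage listed above, the following sketch builds a `ListRecords` request that asks only for records changed since the last crawl. The `verb`, `metadataPrefix`, `from` and `set` arguments are defined by OAI-PMH; the function name `list_records_url` and the base URL in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, last_crawl=None, metadata_prefix="oai_dc",
                     set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    last_crawl -- datestamp of the previous harvest (e.g. "2004-04-01");
                  when given, only records changed since then are requested.
    set_spec   -- optional OAI-PMH set to restrict the harvest to.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_crawl is not None:
        params["from"] = last_crawl
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("http://publisher.example.org/oai", "2004-04-01")` requests only the records whose datestamp is on or after 1 April 2004, so the harvester need not re-examine unchanged content.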
The design of this facility is under way. It is clear that this mechanism will not be a generic OAI-PMH ingest mechanism. For both legal and technical reasons the publisher will have to conform to certain conventions with respect to the metadata records they export via OAI-PMH if the ingest mechanism is to work:
- The DMCA requires that LOCKSS systems get explicit permission from the publisher to collect and preserve the content. There will need to be a convention for how this permission is to be expressed in OAI-PMH records.
- OAI-PMH does not require that the metadata record for an item include a URL for the item's content. There will need to be a convention requiring this.
- When an OAI-PMH metadata record does include a URL for an item, it appears to implicitly assume that the item's content consists of the single file identified by the URL. In the LOCKSS context a single item is made up of many files, which can be found only by following the internal links in the file fetched from the original URL. Not all links in these files point to files forming part of the item; some are links to other items. Rules for distinguishing between them are required, and in most cases the LOCKSS system expresses these rules as XML. Conventions will be needed by which these XML rules may be encoded in OAI-PMH records so that the LOCKSS OAI-PMH ingest mechanism can ingest the correct set of files for each item.
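To show what the first two conventions might look like in practice, the sketch below checks a harvested record for a permission statement and a content URL before any fetching starts. The conventions it assumes are purely hypothetical: that permission is carried in a `dc:rights` element, that the starting URL is carried in a `dc:identifier` element, and that the statement uses the wording in `PERMISSION_TEXT`. None of these has been decided, and the encoding of crawl rules is not modeled at all.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumed wording of the permission statement (hypothetical convention).
PERMISSION_TEXT = (
    "LOCKSS system has permission to collect, preserve, and serve "
    "this Archival Unit"
)


def start_url_if_permitted(record_xml):
    """Return the item's starting URL, or None if the record is unusable.

    A record is usable only if it carries the permission statement in
    dc:rights AND an http(s) URL in dc:identifier (both assumed conventions).
    """
    record = ET.fromstring(record_xml)

    # Legal convention: explicit permission must be present.
    rights = [e.text or "" for e in record.iterfind(".//dc:rights", NS)]
    if not any(PERMISSION_TEXT in r for r in rights):
        return None

    # Technical convention: the record must name a URL to fetch.
    for element in record.iterfind(".//dc:identifier", NS):
        value = (element.text or "").strip()
        if value.startswith(("http://", "https://")):
            return value
    return None
```

Returning `None` for a non-conforming record reflects the point made above: the mechanism is not a generic OAI-PMH harvester, and records that do not follow the agreed conventions would simply not be ingested.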