The LOCKSS Program ingests content via web harvest and file transfer. This page is primarily for publishers with whom we work via web harvest. Ingest via web harvest has a number of advantages.
The web content is preserved within the web environment. This approach enables LOCKSS to direct readers to the publisher’s website. The publisher’s website serves content to the reader when it is available; LOCKSS serves content to the reader when the publisher’s website is unavailable. The publisher gets all available reader hits. The reader always receives the content as the publisher published it on their website.
Publishers must take several actions to preserve content in a LOCKSS network. The Stanford University LOCKSS staff provides technical support to qualified publishers; contact us for assistance.
Make Content Archivable. Making content available for Google indexing will help make content archivable. See: Steps to a Google-friendly site and Content guidelines. Stanford University Libraries has recently released guidelines on building archivable web sites.
Enable Content For Preservation. Publishers enable the LOCKSS software to preserve content by offering a LOCKSS manifest page with a LOCKSS permission statement, for each web-published “Archival Unit” to be preserved. An Archival Unit or “AU” is typically a journal volume or a book.
The LOCKSS permission statement can be one of the following texts, as appropriate:
- LOCKSS system has permission to collect, preserve, and serve this Archival Unit
- LOCKSS system has permission to collect, preserve, and serve this open access Archival Unit
- A recognized Creative Commons license
The permission statement should be visible only to IP addresses belonging to institutions authorized to access the full text of content in the AU. This is typically done by placing the permission statement on the manifest page (or some other page), and restricting access to the manifest (or that other page) to subscriber IPs only. Alternatively, the permission statement can be conditionally included (only for subscriber IPs) in a page that is not access controlled.
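The conditional-inclusion option described above could be sketched roughly as follows. This is a minimal illustration, not a LOCKSS-provided implementation: the function names and the subscriber IP ranges (reserved documentation addresses) are assumptions, and a real site would draw its ranges from its own subscription records.

```python
import ipaddress

# Exact wording of one of the LOCKSS permission statements.
PERMISSION_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, and serve this Archival Unit"
)

# Hypothetical subscriber ranges, for illustration only.
SUBSCRIBER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_subscriber(client_ip: str) -> bool:
    """Return True if the client address falls inside any subscriber range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in SUBSCRIBER_NETWORKS)


def render_page(body_html: str, client_ip: str) -> str:
    """Append the permission statement only for subscriber IP addresses."""
    if is_subscriber(client_ip):
        return body_html + "\n<p>" + PERMISSION_STATEMENT + "</p>"
    return body_html
```

With this approach the page itself stays publicly reachable, and only requests from authorized institutions see the statement.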
The LOCKSS manifest page for an Archival Unit tells the LOCKSS software where to collect the content to be preserved as part of that AU. The manifest page lists enough top-level URLs for the LOCKSS software to discover all the content in the AU. For example, it can contain a link to a journal volume table of contents, which then links to issue tables of contents, which in turn lead to individual articles.
The manifest page for an Archival Unit must reside at a URL that can be derived predictably for each AU, for example from a pattern combining a journal identifier (e.g. ISSN, short journal code, etc.) and the volume name.
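As one possible illustration of a predictable manifest URL and a minimal manifest page, consider the sketch below. The `/lockss-manifest/` path pattern, the function names, and the example host are all hypothetical; any pattern works as long as it can be derived from the journal identifier and volume name.

```python
from html import escape
from urllib.parse import quote

PERMISSION_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, and serve this Archival Unit"
)


def manifest_url(base_url: str, journal_id: str, volume: str) -> str:
    """Build a manifest URL from a journal identifier and a volume name.

    The path layout is an assumption chosen for this example.
    """
    return "{}/lockss-manifest/{}/{}/".format(
        base_url.rstrip("/"), quote(journal_id, safe=""), quote(volume, safe="")
    )


def render_manifest(title: str, start_urls: list) -> str:
    """Render a manifest page listing top-level URLs plus the permission statement."""
    links = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in start_urls
    )
    return (
        "<html><head><title>{title}</title></head><body>\n"
        "<h1>{title}</h1>\n<ul>\n{links}\n</ul>\n"
        "<p>{statement}</p>\n</body></html>\n"
    ).format(title=escape(title), links=links, statement=PERMISSION_STATEMENT)


if __name__ == "__main__":
    print(manifest_url("https://journal.example.org", "1234-5678", "12"))
    print(render_manifest("Example Journal, Volume 12",
                          ["https://journal.example.org/toc/12"]))
```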
Metadata is important. It ensures content is discoverable and assists with preservation. There are two ways to provide metadata: embed it in the HTML pages, or provide separate metadata files (for example BibTeX or RIS). The more metadata the publisher provides, the better the content will work with industry tools.
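For the separate-file option, a RIS record for a single article might be produced along the following lines. This is only a sketch: the function name and sample values are hypothetical, and the tag selection (TY, TI, AU, JO, VL, PY, UR, ER) is a small subset of the RIS format chosen for illustration.

```python
def to_ris(title, authors, journal, volume, year, url):
    """Format one journal article as a RIS record (subset of tags only)."""
    lines = ["TY  - JOUR", "TI  - " + title]
    lines += ["AU  - " + author for author in authors]  # one AU line per author
    lines += [
        "JO  - " + journal,
        "VL  - " + str(volume),
        "PY  - " + str(year),
        "UR  - " + url,
        "ER  - ",  # end-of-record marker
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values are invented for the example.
    print(to_ris(
        title="An Example Article",
        authors=["Doe, Jane", "Roe, Richard"],
        journal="Example Journal",
        volume=12,
        year=2020,
        url="https://journal.example.org/article/1",
    ))
```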
Versions. Content with an authoritative version can be preserved in total. Content that changes frequently or presents different views to different readers poses preservation challenges, because it is unclear which version to preserve. In most cases, in consultation with the publisher, the LOCKSS Program can preserve some version(s) of the content.
Statistics. The COUNTER Code of Practice says:
Activity generated by LOCKSS or a similar cache system during the process of loading, refreshing, or otherwise maintaining the cache must be excluded from all COUNTER reports.
Publishers receive hits from LOCKSS boxes when a LOCKSS box collects content to preserve or when a reader requests a page through a LOCKSS box.
- HTTP requests with the User-Agent header “LOCKSS cache” are part of the collection process and should be excluded from usage statistics.
- Readers’ requests for a URL proxied through LOCKSS boxes do not include this User-Agent header. These reader requests will contain a Via header identifying the particular LOCKSS box. LOCKSS forwards all reader requests to the publisher. If the LOCKSS box has the content (because it has previously collected it), it adds an If-Modified-Since header to the HTTP request. If your server returns a 304 (Not Modified) response or does not respond within a short time, the LOCKSS box serves its preserved copy to the reader. If your site returns content, that content will always be served to the reader.
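A usage-statistics filter based on the rules above might look something like this sketch. The request representation (a dict with a `headers` mapping) and the function names are assumptions, as is the case-insensitive substring match on the User-Agent value; adapt both to your own log format.

```python
def is_lockss_collection(headers: dict) -> bool:
    """Return True for requests made by a LOCKSS box while collecting content.

    Assumes the User-Agent value contains the text "LOCKSS cache".
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return "lockss cache" in normalized.get("user-agent", "").lower()


def countable_requests(requests: list) -> list:
    """Drop collection requests; keep reader requests, proxied or direct."""
    return [r for r in requests if not is_lockss_collection(r["headers"])]


if __name__ == "__main__":
    # Sample log entries, invented for the example.
    log = [
        {"url": "/toc/12", "headers": {"User-Agent": "LOCKSS cache"}},
        {"url": "/article/1", "headers": {"User-Agent": "Mozilla/5.0",
                                          "Via": "1.1 lockss-box.example.edu"}},
        {"url": "/article/2", "headers": {"User-Agent": "Mozilla/5.0"}},
    ]
    for request in countable_requests(log):
        print(request["url"])
```

Reader requests that arrive through a LOCKSS box carry a Via header but not the collection User-Agent, so this filter keeps them in the counts.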
Sites with multiple hosts. The permission statement must be present on the same host as the content to be preserved. Sites with content on multiple hosts must place a permission statement on each host.
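One way to check the per-host requirement is to compare the hosts that serve content against the hosts that carry a permission statement, as in this sketch. The function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def hosts_missing_permission(content_urls, permission_urls):
    """Return the content hosts that have no permission page, sorted by name."""
    content_hosts = {urlsplit(url).hostname for url in content_urls}
    permitted_hosts = {urlsplit(url).hostname for url in permission_urls}
    # Relative URLs have no hostname; ignore them here.
    content_hosts.discard(None)
    return sorted(content_hosts - permitted_hosts)


if __name__ == "__main__":
    content = [
        "https://www.example.org/article/1",
        "https://cdn.example.org/figures/1.png",
    ]
    permissions = ["https://www.example.org/lockss-manifest/1234-5678/12/"]
    print(hosts_missing_permission(content, permissions))
```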
Open access publishers face particular challenges.
Highlight your participation. Add the LOCKSS logo to your website.