In 2004, the LOCKSS technology won the Computer Science Research Award from the Association for Computing Machinery. In 2014, the Center for Research Libraries (CRL) audited the CLOCKSS Archive against the TRAC criteria and awarded it CRL’s first-ever perfect score in the Technologies, Technical Infrastructure, Security category for its use of the LOCKSS technology. This page explains in more detail what sets the LOCKSS software apart and how preservation works in LOCKSS networks.
Steps To Preservation
A publisher gives permission for the LOCKSS system to preserve authorized content by putting online a LOCKSS permission statement and a LOCKSS manifest. A library uses the LOCKSS software to turn a mid-range PC, or the hardware equivalent, into a digital preservation appliance called a LOCKSS Box.
Preservation requires three actions: a publisher gives permission for the target content to be preserved; a library brings online a LOCKSS Box that has authorized access to the content; and that LOCKSS Box is registered with one of a number of associated LOCKSS Alliance networks.
This section describes how a LOCKSS Box works. Specifically, a LOCKSS Box performs five main functions:
- It ingests content from target websites using a web crawler similar to those used by search engines.
- It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences.
- It delivers authoritative content to readers by acting as a web proxy or cache, or via metadata resolvers, when the publisher’s website is not available.
- It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content.
- It dynamically migrates content to new formats as needed for display.
Stanford University LOCKSS Program staff analyze the target content’s URL structure, file formats and delivery mechanisms. They design, implement and update a tailored, content-specific preservation action plan that serves publishers, librarians and readers.
The publisher permits the LOCKSS system to collect, preserve and provide access to the content by putting a LOCKSS manifest page on the content’s website. The manifest page contains a LOCKSS permission statement and links to the issues (or other parts) of the content as they are published. The required manifest page is ingested and preserved with the original content, negating the need for paper contracts.
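The manifest check described above can be sketched in a few lines of Python. This is a minimal illustration, not the LOCKSS implementation: the wording held in `PERMISSION_STATEMENT` is an assumption (publishers should take the exact text from the LOCKSS documentation), and the sample page, its URLs and the helper name `read_manifest` are invented for the example.

```python
from html.parser import HTMLParser

# Assumed wording of the permission statement; the exact required text
# should be taken from the LOCKSS documentation for publishers.
PERMISSION_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, and serve this Archival Unit"
)


class ManifestParser(HTMLParser):
    """Collects the visible text and the link targets of a manifest page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def read_manifest(html):
    """Return (permission_granted, links) for a manifest page given as a string."""
    parser = ManifestParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the page do not hide the statement.
    text = " ".join(" ".join(parser.text_parts).split())
    return PERMISSION_STATEMENT in text, parser.links


# Hypothetical manifest page for one volume of a journal.
SAMPLE_MANIFEST = """
<html><body>
<h1>Example Journal, Volume 12</h1>
<p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit</p>
<ul>
<li><a href="/vol12/issue1/">Issue 1</a></li>
<li><a href="/vol12/issue2/">Issue 2</a></li>
</ul>
</body></html>
"""

permitted, issue_links = read_manifest(SAMPLE_MANIFEST)
print(permitted, issue_links)
```

A crawler built this way would refuse to collect anything when `permitted` is false, and would otherwise start from the issue links found on the page.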
Software called a LOCKSS Plugin tells each institution’s LOCKSS Box where to find the publisher’s LOCKSS manifest page, and how far to follow the chains of web links. A LOCKSS Plugin encapsulates a publisher’s content model by listing parameters specific to each publishing platform. The LOCKSS team builds, tests and distributes plugins to LOCKSS Boxes registered with the LOCKSS Alliance.
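To make the idea of a plugin concrete, the sketch below models one as a small set of parameters (a start URL, a crawl depth and URL patterns to include) and runs a depth-limited crawl over an in-memory link graph. Everything here is hypothetical: real LOCKSS Plugins have their own format and far richer rules, and the `PLUGIN` dictionary, the `SITE` graph and the function names exist only for illustration.

```python
import re
from collections import deque

# Hypothetical stand-in for a plugin: where to start, how far to follow
# links, and which URLs belong to the content being preserved.
PLUGIN = {
    "start_url": "https://journal.example.org/lockss-manifest/vol12.html",
    "crawl_depth": 2,
    "include_patterns": [r"^https://journal\.example\.org/vol12/"],
}

# In-memory link graph standing in for the publisher's website.
SITE = {
    "https://journal.example.org/lockss-manifest/vol12.html": [
        "https://journal.example.org/vol12/issue1/",
        "https://journal.example.org/about.html",
    ],
    "https://journal.example.org/vol12/issue1/": [
        "https://journal.example.org/vol12/issue1/article1.html",
        "https://ads.example.net/banner",
    ],
    "https://journal.example.org/vol12/issue1/article1.html": [
        "https://journal.example.org/vol12/issue1/article1-supp.html",
    ],
}


def allowed(url, plugin):
    """A URL is in scope if it is the start URL or matches an include pattern."""
    return url == plugin["start_url"] or any(
        re.search(pattern, url) for pattern in plugin["include_patterns"]
    )


def crawl(plugin, fetch_links):
    """Breadth-first crawl that stops following links at plugin['crawl_depth']."""
    seen = {plugin["start_url"]}
    queue = deque([(plugin["start_url"], 0)])
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)
        if depth == plugin["crawl_depth"]:
            continue  # collect this page but do not follow its links
        for link in fetch_links(url):
            if link not in seen and allowed(link, plugin):
                seen.add(link)
                queue.append((link, depth + 1))
    return collected


collected = crawl(PLUGIN, lambda url: SITE.get(url, []))
print(collected)
```

In this run the "about" page and the advertisement are skipped because they match no include pattern, and the supplementary page is skipped because it lies beyond the configured depth.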
Every LOCKSS Box has an IP address, and this address falls within its parent university’s IP address range. Authorized LOCKSS Boxes independently collect ‘subscribed to’ content or ‘open access’ content directly from the publisher’s website. The publisher authorizes or denies a LOCKSS Box’s access to content through its IP address access control system. Thus, all LOCKSS activity is recorded in the publisher’s web logs. Publishers have access to real-time statistics through their own systems.
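IP-based access control of this kind amounts to testing whether a client address falls inside a subscriber’s registered range. The following sketch shows the check using Python’s standard `ipaddress` module; the address ranges are reserved documentation ranges, and the names `AUTHORIZED_RANGES` and `is_authorized` are assumptions made for the example rather than part of any publisher’s system.

```python
import ipaddress

# Hypothetical address ranges registered for one subscribing university.
AUTHORIZED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_authorized(client_ip, ranges=AUTHORIZED_RANGES):
    """Return True if the client address lies inside any authorized range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ranges)


print(is_authorized("192.0.2.17"))      # inside the first range
print(is_authorized("198.51.100.200"))  # outside the /25 range
```

Because a LOCKSS Box sits inside its institution’s range, it passes the same check as any other campus machine and needs no separate credentials.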
The LOCKSS software continually monitors the content in each LOCKSS Box to ensure that it is being properly preserved. Boxes cooperate over the Internet to compare their copies of the same content, using technology that won an ACM research award:
- Once ingest is complete, the monitoring technology ensures that each LOCKSS Box has collected all intended content, thus preserving the authoritative version.
- The software monitors LOCKSS Boxes at regular intervals to determine whether any content has been damaged or lost, and can arrange for content repair from another LOCKSS Box.
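The compare-and-repair idea can be illustrated with a deliberately simplified sketch: each box hashes its copy of every URL, the digest held by a majority of peers is treated as authoritative, and a box whose copy is missing or different fetches a repair from an agreeing peer. This is not the actual LOCKSS polling protocol, which is considerably more elaborate; the majority rule, the data layout and the function names are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def digest(content):
    """SHA-256 digest of one preserved file."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(local, peers):
    """Compare local copies with peers' copies and repair from the majority.

    local: dict mapping URL -> bytes for this box (modified in place)
    peers: list of dicts of the same shape, one per peer box
    Returns the list of URLs that were repaired.
    """
    repaired = []
    all_urls = set(local) | set().union(*peers)
    for url in sorted(all_urls):
        votes = Counter(digest(peer[url]) for peer in peers if url in peer)
        if not votes:
            continue  # no peer holds this URL, nothing to compare against
        winner, count = votes.most_common(1)[0]
        if count <= len(peers) // 2:
            continue  # no clear majority; leave the local copy alone
        mine = local.get(url)
        if mine is None or digest(mine) != winner:
            source = next(
                peer for peer in peers if url in peer and digest(peer[url]) == winner
            )
            local[url] = source[url]
            repaired.append(url)
    return repaired


# Hypothetical content: one damaged file and one missing file on this box.
good = {
    "https://journal.example.org/vol12/a.html": b"article A",
    "https://journal.example.org/vol12/b.html": b"article B",
}
local = {"https://journal.example.org/vol12/a.html": b"article A (bit rot)"}
peers = [dict(good), dict(good), dict(good)]

repaired = audit_and_repair(local, peers)
print(repaired)
```

After the call, the local box holds the same bytes as its peers for both URLs.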
The administrator of each LOCKSS Box can monitor the preservation status of the content in their Box, by looking at delivered content and the management tools available through the LOCKSS Box web administrative interface.
An institution’s LOCKSS Box can provide readers with continual, seamless access to branded publisher content. The LOCKSS system preserves content at its original URL, critically retaining the content’s relationship to other web resources. An institution’s LOCKSS Box delivers content to authorized readers only when the publisher’s website is unavailable (for example, when a subscription has been canceled, the network is congested, or the publisher’s server is down). The LOCKSS Program works to preserve, and to deliver to readers, the publisher’s original artifact: in other words, what the publisher published.
LOCKSS Boxes provide three main ways for readers to access the content they preserve: by proxying (i.e. acting like a web cache), by serving (acting like a web server) or by serving through integration with an OpenURL resolver.
Proxying: Institutions often run web proxies to allow off-campus users to access subscription content. Libraries that integrate their LOCKSS Box into a proxy (PAC Files, EZ Proxy, ICP, Squid) ensure a reader’s URL request is seamlessly fulfilled when the content is unavailable from the publisher’s website.
Basic Serving: In the basic serving model, articles are accessed using a local URL pointing to the LOCKSS Box. The LOCKSS Box checks if the publisher will provide content to fulfill a reader’s request. If the content is not available from the publisher, the LOCKSS Box serves its own copy to the reader.
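The fallback logic of the basic serving model can be sketched as follows. The sketch assumes that an unavailable publisher shows up either as a network error or as an empty response; the function `serve`, the stand-in fetchers and the example URL are hypothetical and only illustrate the order of the checks.

```python
def serve(url, fetch_from_publisher, preserved):
    """Prefer the publisher's copy; fall back to the locally preserved copy.

    fetch_from_publisher: callable returning bytes, or None if the publisher
        will not supply the content; it may raise OSError on network failure.
    preserved: dict mapping URL -> bytes held by the local box.
    Returns (source, body).
    """
    try:
        body = fetch_from_publisher(url)
        if body is not None:
            return "publisher", body
    except OSError:
        pass  # publisher unreachable; fall through to the preserved copy
    if url in preserved:
        return "lockss-box", preserved[url]
    return "not-found", None


# Stand-in fetchers for the two situations.
def publisher_up(url):
    return b"live copy"


def publisher_down(url):
    raise ConnectionError("publisher unreachable")


URL = "https://journal.example.org/vol12/issue1/article1.html"
preserved = {URL: b"preserved copy"}

print(serve(URL, publisher_up, preserved))
print(serve(URL, publisher_down, preserved))
```

The reader receives the publisher’s copy whenever it is available, so the publisher continues to see the traffic, and the preserved copy is used only as a fallback.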
Post-cancellation access to all preserved content is ensured because the content is in the library’s local custody.
Librarians administer their institution’s LOCKSS Boxes through a web interface that allows them to select new content for preservation, monitor the preservation status of content and perform a variety of other functions. The Stanford University LOCKSS staff provides support to LOCKSS Alliance participants.
Three audit and verification tools detail what content is in a library’s LOCKSS Box and the content’s preservation status.
- On demand, a LOCKSS Box produces a KBART (Knowledge Bases And Related Tools) report of the locally preserved content.
- A LOCKSS Box displays detailed preservation status for each Archival Unit. (An Archival Unit is typically a volume of a journal or a complete book.)
- A LOCKSS Box administrator can use a properly configured web browser from an authorized IP address to view preserved content through an “audit proxy.” The viewer sees the content as it was collected by the LOCKSS system.
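As an illustration of the first tool, a KBART report is a tab-separated file with one row per title and a fixed set of column names. The sketch below writes such a file with Python’s `csv` module. It uses only a subset of the KBART fields, and the holdings data and the function name `kbart_report` are invented for the example; it does not reproduce the report a LOCKSS Box actually generates.

```python
import csv
import io

# A subset of the KBART column names; a full report has more fields.
KBART_FIELDS = [
    "publication_title",
    "print_identifier",
    "online_identifier",
    "date_first_issue_online",
    "date_last_issue_online",
    "title_url",
    "publisher_name",
]


def kbart_report(holdings):
    """Render a list of holdings dicts as tab-separated KBART-style text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=KBART_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(holdings)
    return buffer.getvalue()


# Hypothetical locally preserved title.
holdings = [
    {
        "publication_title": "Example Journal",
        "print_identifier": "1234-5678",
        "online_identifier": "8765-4321",
        "date_first_issue_online": "2001",
        "date_last_issue_online": "2012",
        "title_url": "https://journal.example.org/",
        "publisher_name": "Example Press",
    }
]

report = kbart_report(holdings)
print(report)
```

A file in this shape can be loaded into knowledge bases and link resolvers that accept KBART, which is how locally preserved holdings become discoverable.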
Sustainable Format Migration
LOCKSS preserves all web-published formats (animations, datasets, moving images, still images, software, sound, text) and genres (journals, books, blogs, websites, scanned files, audio, video). The LOCKSS software is format-agnostic and preserves all content in its original format, as delivered from the publisher, including the format metadata that enables a browser to render the content.
There is a risk that web content becomes obsolete when a reader’s browser cannot render a requested format. This has yet to occur; however, the LOCKSS Program’s approach to this risk is to migrate content on access. When a format is obsolete, in other words, when a reader’s web browser cannot display the content, the LOCKSS Box dynamically migrates the content to a newer format for display. This method, called “migration on access,” leverages the capabilities built into HTTP. If a reader requests LOCKSS-preserved content and that reader’s browser cannot display the content in its original format, the LOCKSS Box converts the original format to a format that the browser can display (a temporary access copy) and delivers the content to the reader. This “on the fly” migration ensures that readers see the latest and best version of scholarly material.
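One way to picture migration on access is as HTTP content negotiation: the browser lists the media types it can display in its `Accept` header, and the box converts the stored original only when that list excludes the original format. The sketch below assumes this mechanism and simplifies it (q-values are ignored). The media type `image/x-legacy`, the `CONVERTERS` registry and the placeholder converter are hypothetical and perform no real format conversion.

```python
def parse_accept(header):
    """Return the set of media types named in an Accept header (q-values ignored)."""
    return {
        part.split(";")[0].strip().lower()
        for part in header.split(",")
        if part.strip()
    }


def acceptable(media_type, accepted):
    """True if the browser accepts this type exactly or through a wildcard."""
    major = media_type.split("/")[0]
    return media_type in accepted or f"{major}/*" in accepted or "*/*" in accepted


# Hypothetical registry: (stored format, target format) -> converter.
# The converter here is a placeholder that only tags the bytes.
CONVERTERS = {
    ("image/x-legacy", "image/png"): lambda data: b"PNG:" + data,
}


def deliver(stored_type, data, accept_header):
    """Return (media_type, body) to send, migrating only if the browser needs it."""
    accepted = parse_accept(accept_header)
    if acceptable(stored_type, accepted):
        return stored_type, data  # browser can render the original
    for (source, target), convert in CONVERTERS.items():
        if source == stored_type and acceptable(target, accepted):
            # Temporary access copy; the preserved original is left untouched.
            return target, convert(data)
    return stored_type, data  # no suitable converter; send the original


print(deliver("image/x-legacy", b"abc", "image/png, text/html;q=0.9"))
print(deliver("image/x-legacy", b"abc", "image/x-legacy, image/png"))
```

The stored bytes never change in this design. Only the response differs from one request to the next, depending on what the requesting browser reports it can display.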
The LOCKSS Program’s “migration on access” approach has significant advantages over “format normalization”: it preserves the original artifact, has much less overhead and expense, and uses the most up-to-date technology.
- Preserving the content in its original format satisfies archival requirements. It also allows the LOCKSS system to be frugal with storage space. We know of no preservation system that discards the original bits after migrating them to a new format. Migrating and keeping both the original and the migrated copy multiplies the storage requirements for a preservation system by the number of migrations.
- Preserved content is migrated by the most recent, and presumably best, technology available at the time the reader requests access.
- Preserved content is rarely accessed. Performing migration only when and if it is needed reduces the resource cost.
- Content can be migrated directly from the original to the current format, minimizing the effects of format conversion artifacts.
- The format converters, once developed, can themselves be preserved to document the original format.
The LOCKSS Program has been providing libraries with open-source software since its founding in 1998. Open-source software has the following advantages over proprietary, closed-source software. Open-source software:
- Promotes interoperability between systems.
- Provides vendor independence. In community-driven development efforts, the efforts of many are leveraged through collaboration.
- Provides long-term sustainability and low cost of maintenance because of the contributions of many.
- Promotes open standards.
- Provides transparency. A knowledgeable reader can read the software code and understand a program’s behavior.
- Provides forward growth. The community can evolve the software to meet the community’s evolving needs and requirements.
- With the Library of Congress, the LOCKSS Program completed the Certification and Accreditation process for FIPS 199 (Federal Information Processing Standard for categorizing security risks of federal information and systems).
- In 2014, the Center for Research Libraries (CRL) audited the CLOCKSS Archive, a Private LOCKSS Network, against the Trusted Repository Audit Criteria and awarded an overall score matching the previous best. All non-confidential materials submitted for this audit are available here, and there are blog posts announcing the conclusion of the audit, the audit process, some lessons to be learned from the audit, and how you can run the demos the auditors saw.
- The TRAC criteria are based on the OAIS Reference Model. OAIS is not a standard with rigorous conformance criteria; it is an architectural reference model. The TRAC criteria and their successor, ISO 16363, provide conformance criteria based on the conceptual architecture of OAIS. Most of these criteria apply to how organizations use technology, not to the technology itself. The LOCKSS technology satisfies the relevant criteria, and the satisfactory audit of the CLOCKSS Archive against the TRAC criteria shows that organizations using the technology can satisfy the full set of criteria. The CLOCKSS Archive’s matching of the concepts of the OAIS reference model is discussed in detail in a set of documents available here. Individual pages for each ISO 16363 criterion showing how the CLOCKSS Archive satisfies that particular criterion can be found here. A blog post includes discussion of some issues with OAIS that arose during the audit.
- Librarians can audit the content preserved in their LOCKSS Box via its user interface.
- KBART (Knowledge Bases And Related Tools) is a UKSG/NISO emerging best practice for data communication within the OpenURL supply chain. The LOCKSS system exports KBART bibliographic metadata to specify holdings and preservation status.
- The LOCKSS system interoperates with other repositories, including the Internet Archive, the German National Library, ContentDM, and DSpace, by exporting and importing WARC files.
- The LOCKSS system performs bit preservation and migrates content forward in time by leveraging the capabilities of HTTP and HTTPS.