Everything you ever wanted to know about LOCKSS, and more.
We continue to update and expand upon this knowledge base, based on the questions we receive and feedback from the community.
Table of contents:
What is LOCKSS?
LOCKSS is a principle, a program, a community, and a software application. What Is LOCKSS? »
Why use LOCKSS?
LOCKSS empowers communities to steward their most important digital materials. Why LOCKSS? »
How can I use LOCKSS?
What does it cost to use LOCKSS?
The LOCKSS software and technical documentation are available at no cost and under open licenses. Participation fees for the LOCKSS Alliance sustain ongoing improvement of core LOCKSS technologies and provide for technical support by the LOCKSS Program. Additional services are available from the LOCKSS Program for negotiable fees.
How does the LOCKSS technical architecture differ from that of other digital preservation systems?
The LOCKSS technical architecture is informed by and built to mitigate a threat model that weighs operator error, internal attack, and external attack more heavily than the threat models behind other digital preservation systems typically do. Preservation Principles »
What is the minimum number of recommended copies for a robust preservation system? (i.e., why does a LOCKSS system require "lots of copies" when other systems use fewer)?
LOCKSS stands for "Lots of Copies Keep Stuff Safe," a cornerstone principle for robust digital preservation. More copies of data will tend to make it safer, regardless of the system used to manage that data. A LOCKSS system, however, makes better use of the copies it manages, by enlisting them to validate integrity against each other, rather than relying uncritically on comparisons against a centralized fixity store.
Over the time horizons of concern for digital preservation (i.e., decades, centuries), it is reasonable to assume that one or more copies may be unavailable for an extended period of time. Over shorter time frames, one or more copies may also be temporarily unavailable.
If the integrity information supplied by the canonical fixity store cannot necessarily be trusted, any digital preservation system — not just LOCKSS — needs at least three copies of data, to allow for the possibility of a majority consensus on the "correct" integrity information. With two copies, if the integrity check yields disagreement, there is no way to know which is corrupted.
Considering the likelihood of at least one copy being unavailable at any given time, we recommend four copies as the minimum for LOCKSS networks, with more preferable, to increase the margin of copies that can be unavailable and still be able to achieve a majority consensus on the integrity values of the remaining copies. See a visual representation of this explanation, from Mark Jordan.
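The arithmetic behind this recommendation can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not LOCKSS code: the function names `majority_value` and `survives_outage` are hypothetical, each copy is reduced to a single hash string, and at most one of the available copies is assumed to be corrupted.

```python
from collections import Counter


def majority_value(hashes):
    """Return the hash reported by a strict majority of the available copies.

    Returns None when no value is held by more than half of the copies,
    which is what happens when two copies simply disagree.
    """
    if not hashes:
        return None
    value, count = Counter(hashes).most_common(1)[0]
    return value if count > len(hashes) / 2 else None


def survives_outage(total_copies, unavailable):
    """True if one corrupted copy can still be outvoted by the copies that remain.

    Assumption for this sketch: at most one available copy is corrupted, so
    at least three copies must be reachable for the good ones to form a majority.
    """
    return total_copies - unavailable >= 3
```

Under these assumptions, `majority_value(["a", "b"])` yields no answer, three copies tolerate no outage, and four copies tolerate one, which matches the reasoning above.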
How do LOCKSS technologies ensure the integrity of preserved data (i.e., how does LOCKSS "polling and repair" work)?
LOCKSS technologies have been built from the ground up to ensure the long-term integrity of preserved data. The unique architectural features of LOCKSS are extensively documented in published research and presentations. The core component safeguarding data is the Library Content Audit Protocol (LCAP). Through LCAP, on a continual basis, mutually-distrusting LOCKSS peers operating in a network conduct randomized polls to confer on the integrity values of co-preserved content. Nonces are employed to force re-hashing and prevent caching and replay of previously-computed integrity values.
Overwhelming agreement on integrity values provides reassurance that the preserved content is intact and unchanged across the conferring nodes. Overwhelming disagreement on integrity values indicates that the copy of data held by minority dissenting node(s) is likely corrupted, and a repair from the source (if available) or another peer is automatically prompted. In cases where a plurality of votes cluster around varying integrity values, no repair actions are undertaken and an alert is raised for the discrepancy to be investigated further.
Over time, a node's participation in polls and the results establish its reputation in the network, weighting it as more likely to be responsive and trustworthy for the co-preservation of the specific content in question. Multiple copies of data are read regularly but randomly, and repaired as needed.
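The polling idea described above can be illustrated with a simplified Python sketch. It is not the LCAP implementation: the function names `vote` and `tally`, the use of SHA-256, and the 75% "landslide" threshold are all assumptions made for this example, and reputation tracking is left out entirely.

```python
import hashlib
import os
from collections import Counter


def vote(nonce: bytes, content: bytes) -> str:
    """Hash a fresh nonce together with the content.

    Because the nonce changes with every poll, a peer cannot answer from a
    cached or replayed digest; it has to re-read and re-hash its copy.
    """
    return hashlib.sha256(nonce + content).hexdigest()


def tally(votes: dict, landslide: float = 0.75) -> dict:
    """Classify a poll from a mapping of peer name to digest.

    The threshold is a hypothetical value chosen for illustration.
    """
    if not votes:
        raise ValueError("a poll needs at least one vote")
    top_digest, top_count = Counter(votes.values()).most_common(1)[0]
    if top_count / len(votes) >= landslide:
        # Overwhelming agreement: any dissenting minority is a repair candidate.
        dissenters = sorted(p for p, d in votes.items() if d != top_digest)
        outcome = "repair" if dissenters else "agreement"
        return {"outcome": outcome, "repair": dissenters}
    # Votes are split with no clear winner: repair nothing, raise an alert.
    return {"outcome": "alert", "repair": []}
```

A poll in this sketch would generate a nonce with `os.urandom(16)`, collect `vote(nonce, copy)` from each peer, and pass the resulting mapping to `tally`.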
How do LOCKSS technologies compare to blockchain?
Unsurprisingly, blockchain often comes up in discussions about LOCKSS, as both are technologies for distributed networks that intend to provide reliable attestations without trusting peers. With our long history of development and support of peer-to-peer technology for the assurance of data integrity, we observe that blockchain-based systems frequently fail to achieve their touted decentralization, especially by neglecting centralization risks in their real-world implementation. You can read more about our perspective on blockchain on the blog of LOCKSS Chief Scientist Emeritus, Dr. David S.H. Rosenthal.
Can a LOCKSS system run on a virtual machine?
There are no special concerns for running a LOCKSS system on a virtual machine; the LOCKSS Program and many LOCKSS Alliance members already do so, and this is likely to only become more common.
Can a LOCKSS system run in a cloud environment?
Optimal support for running a LOCKSS system in a cloud environment has not historically been a major priority, given that doing so often means forgoing local content custody, a key part of the philosophy and value of LOCKSS. As a technical matter, a LOCKSS system can be run in the cloud just as well as it could be on a virtual machine in any other environment, with a few caveats. LOCKSS technologies cannot yet interface directly with back-end object storage (e.g., S3); they require a more traditional POSIX filesystem. The need to regularly read data may not align with the cost structure of certain cloud services. To minimize the potential for correlated risk, we recommend against relying on any individual cloud services provider as the platform for an entire network. As we continue to evolve LOCKSS technologies, cloud services may become an increasingly useful complement or supplement to LOCKSS networks and network services.
How is content ingested into a LOCKSS network?
LOCKSS networks can use a variety of mechanisms to ingest content. Currently, these methods are part of workflows that make content available for retrieval over the Web. The LOCKSS software currently expects to be able to retrieve content for ingest over HTTP, in keeping with the original use case of harvesting web-based scholarly publications. However, any digital content can just as well be ingested and preserved, provided it can be made accessible over HTTP.
The LOCKSS-O-Matic software built by Simon Fraser University removes the need to stage content on a web server for ingest into a LOCKSS network, instead providing a more familiar drag-and-drop interface for depositing and retrieving files. The next major version of the LOCKSS software will enable new, more flexible ingest methods that are not dependent on web harvesting.
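To make the "accessible over HTTP" precondition concrete, the following Python sketch exposes a local directory over HTTP using only the standard library. It illustrates staging in general and says nothing about how a LOCKSS network is actually configured to harvest from a server; the function name `start_staging_server` and the localhost binding are assumptions for this example.

```python
import functools
import http.server
import threading


def start_staging_server(directory: str, port: int = 0):
    """Serve the files under `directory` over HTTP on localhost.

    Port 0 asks the operating system for any free port; the chosen port is
    available afterwards as server.server_address[1]. The server runs in a
    daemon thread so the caller keeps control and can call server.shutdown().
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Once the server is running, each staged file has a URL of the form `http://127.0.0.1:<port>/<filename>`, which is the kind of address a web harvester needs.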
What is the value of open-source software for digital preservation?
Open-source software, such as the core LOCKSS technology, is a natural complement for digital preservation. Closed-source digital preservation, by contrast, is not fundamentally auditable. Open-source software additionally provides the ability to know how the software is working, the flexibility to extend the software to satisfy local needs, and the opportunity for more direct community participation.
Can LOCKSS scale to accommodate large files and/or collections?
We have not yet reached the operational limits of the LOCKSS software; the largest production network (the CLOCKSS Archive) preserves over 100 TB of content, with individual files in the multi-GB range. We are investing to support better horizontal scaling by enabling multiple instances of an individual web service to run as part of a composite LOCKSS system.
How do LOCKSS technologies relate to other web archiving systems?
The LOCKSS software was originally built with tightly-coupled digital preservation and web harvesting capabilities, focused on the use case of web-based scholarly publications. Initial development occurred around the same time as the Heritrix archival crawler and the Wayback Machine web archive replay platform.
While there was some overlap in the capabilities of LOCKSS with these other two platforms (e.g., highly-configurable crawl rules, link rewriting or proxy access), LOCKSS did not use the ARC file format and had unique functionality for its more specific web content preservation use case (e.g., parsing bibliographic metadata, stripping institution-specific personalization to enable logical fixity comparisons).
The web archiving ecosystem has matured and now boasts many more tools and libraries for working with the widely-accepted ARC file format successor, WARC. The next generation of the LOCKSS software will also natively support the WARC file format, aligning LOCKSS with the web archiving mainstream and enabling greater cross-pollination of technologies.
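The idea of a logical fixity comparison mentioned above, where institution-specific personalization is stripped before hashing, can be sketched as follows. This is an illustration only and not the LOCKSS filtering mechanism: the `institution-banner` markup, the regular expression, and the function name `logical_digest` are hypothetical.

```python
import hashlib
import re

# Hypothetical marker: this sketch assumes the publisher wraps the
# per-institution text in a <div class="institution-banner"> element.
PERSONALIZATION = re.compile(
    rb'<div class="institution-banner">.*?</div>', re.DOTALL
)


def logical_digest(html: bytes) -> str:
    """Hash a page after removing the personalized fragment.

    Two harvests of the same article that differ only in the banner then
    produce the same digest, even though their raw bytes differ.
    """
    filtered = PERSONALIZATION.sub(b"", html)
    return hashlib.sha256(filtered).hexdigest()
```

With a filter like this, copies harvested by different institutions can be compared on their shared content rather than on byte-for-byte identity.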
What software license applies to LOCKSS technologies?
Copyright (c) 2000-2018, Board of Trustees of Leland Stanford Jr. University All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
LOCKSS Architected As Web Services (LAAWS)
What is the LAAWS initiative?
Not unlike others of its vintage, the LOCKSS software was originally developed as a monolithic Java application. In the intervening two decades, we have made numerous enhancements but all largely within the constraints of the original architecture. To better align LOCKSS technologies with contemporary technology ecosystems, we undertook a major re-architecture effort starting in 2016.
When the project concludes, the formerly-monolithic LOCKSS software will have been broken apart into discrete web services corresponding to major areas of functionality. These components will be deployable in a constellation, for an end-to-end LOCKSS system, or, for the first time, as standalone components that can be integrated into other systems.
For more information about the software re-architecture, please see our recent presentations and press release announcing receipt of funding from the Mellon Foundation to support the project.
What benefits will LAAWS provide to LOCKSS users?
Major anticipated benefits include extensibility, developer documentation, and more efficient feature development.
LAAWS will enable LOCKSS technologies to be implemented and integrated in more flexible ways. Examples could include more seamless ingest of content into LOCKSS networks from diverse sources, integration of LOCKSS preservation capabilities into other digital preservation platforms, support for arbitrary storage back-ends for LOCKSS systems, and enhancement of curation workflows with LOCKSS metadata extraction tools.
Stable specifications for the new web services are already available and technical documentation describing how to deploy and configure them will be available in 2019. The LAAWS documentation effort will encompass system functions that have long been present but may not otherwise have been formally documented. This should lower the barrier to entry for software development leveraging LOCKSS technologies.
LAAWS will also allow us to swap in community-supported open-source software to enable new functionality in end-to-end LOCKSS systems, or in some cases to obviate the need to maintain existing functionality ourselves. This will simplify maintenance of the LOCKSS technology stack, yielding additional software engineering capacity to focus on other core improvements.
When will the LAAWS-enhanced LOCKSS software be available?
Stable API specifications for the LAAWS web services are already available. Major software development is ongoing and scheduled to conclude in 2018, with internal beta testing and deployment preparation through the end of the year. Production-level code and releases can be expected toward the end of 2018 or early 2019. Working code is available throughout the project on GitHub. We will continue to keep the community updated on the re-architecture progress.
What will be needed to upgrade existing LOCKSS systems?
We have not yet determined the precise upgrade path for systems participating in different LOCKSS networks. Likely steps include data conversion and either migration-in-place or provisioning a new system and copying the data onto it. We will disseminate more detailed information as we prepare and document deployment steps, and will work with LOCKSS Alliance Members to ensure a successful upgrade.
How are communities working together to support each other?
The collaboratively-hosted nature of LOCKSS networks fosters a community of practice around shared challenges and approaches to digital preservation. Communities operating LOCKSS networks contribute to open documentation, exchange best practices at the annual LOCKSS Alliance Meeting, and engage in other advocacy and educational efforts.
Recent examples include:
- The Council of Prairie and Pacific University Libraries (COPPUL) sharing their LOCKSS-O-Matic software and workflows with current and prospective LOCKSS users as a way to simplify content ingest and management for LOCKSS networks;
- Communities in Europe interested in national networks for the preservation of nationally-licensed electronic resources collaborating on model contractual language, use case elaboration for local content custody, and making the case to funders;
- The institutions behind the Digital Federal Depository Library Program hosting a node for the Canadian Government Information LOCKSS Network, and Instituto Brasileiro de Informação em Ciência e Tecnologia (IBICT) (which runs Cariniana) exploring hosting a PKP Preservation Network node, to help improve the robustness of other networks.
What is the difference between LOCKSS and CLOCKSS?
"LOCKSS" is sometimes used as a shorthand for the Global LOCKSS Network, the original LOCKSS network. The Global LOCKSS Network is a mechanism for individual libraries to secure post-cancellation and perpetual access to electronic resources that they are otherwise authorized to access (i.e., via license, or open access). Participating libraries run a LOCKSS system that caches eligible resources and can then make them available to their user communities when the publisher's website is unavailable.
The CLOCKSS Archive is an independent non-profit organization jointly governed by libraries and publishers that operates a globally-distributed LOCKSS-based dark archive to preserve the scholarly record. Content is only brightened from the archive when it becomes no longer available by any other means and pursuant to a trigger determination by the Board of Directors, at which point it becomes freely available to all under a Creative Commons license.
Both networks utilize LOCKSS technologies for preservation, use common tooling for content collected via web harvest, are substantially supported by libraries, and have significant overlap in terms of participating publishers. The LOCKSS Program supports the Global LOCKSS Network, the CLOCKSS Archive, and many more LOCKSS networks, representing a wide range of use cases.
How are communities using LOCKSS?
LOCKSS is the foundation for many community-based digital preservation services, preserving all types of content. Case Studies »
Can LOCKSS technologies serve use cases outside of the academic and cultural heritage sectors?
We see potential for the application of LOCKSS technologies where in-sourced data control and integrity assurance are paramount — for example, government, legal, medical, and non-profit organizations. Decentralized administrative control of the nodes participating in a preservation network yields the strongest data protection, which favors community-based implementations. However, an intra-organizational deployment could still realize many of the benefits, particularly with the right administrative controls.
How can publishers have their content preserved in a LOCKSS network?
Depending on the publisher's preservation needs, available resources, business model, publishing platform, and content focus, one or more of a number of LOCKSS networks might be an appropriate fit. For Publishers »
How can I assess whether LOCKSS is a good fit for my use case?
We encourage you to learn more about the preservation principles that make LOCKSS different, take a look at how other communities are using LOCKSS, reach out to those communities directly to ask questions, download the LOCKSS software and try it out for yourself, and contact us if you would like to explore partnering further.