Plugins
Plugin Layer
- Tutorial and instructions for getting and using the plugin tool: Plugin Tool
- Submit bug reports and feature requests to the SourceForge issue trackers: http://sourceforge.net/tracker/?group_id=47774
What is the purpose of the plugin layer?
A few parts of the collection and preservation process require specific knowledge about publisher platforms and journal formats. For example, when collecting a volume, the crawler must follow links to pages that are part of the same volume, but should not fetch linked pages that are external to the volume. The rules about which URLs are "part of the volume" depend on site layout details and naming conventions that vary from publisher to publisher.
The plugin layer allows the bulk of the daemon code to be generic, with all the publisher-specific knowledge and code packaged into a set of separate "plugin" modules. This makes it possible to add new platforms and journals incrementally without changing the daemon itself. Plugins can be written and supplied by publishers, the community, or the LOCKSS staff.
What knowledge does a plugin have?
Plugin knowledge can be divided into three areas:
- The publisher's web site
- The structure of journals on the site
- Content within journals
The following are representative examples; this is not a complete listing of plugin knowledge.
Web sites may vary in areas such as:
- Authentication. Some are based on IP address; others may use HTTP authentication or web forms.
- Error handling. Some servers, instead of returning an HTTP error for a page not found, indicate success and return an HTML page that says "not found". These pages must be recognized and treated as errors, not as content pages.
- Crawl limits. Some sites may impose either time or bandwidth limitations on crawlers such as LOCKSS caches.
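The error-handling case above can be illustrated with a minimal sketch. The class `SoftNotFoundDetector` and its marker phrases are hypothetical, introduced here only for illustration; they are not part of the LOCKSS plugin API, and a real plugin would use phrases specific to the publisher's site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not the LOCKSS API): flags "not found" pages served with a success status. */
final class SoftNotFoundDetector {
    // Assumed marker phrases, stored in lower case for case-insensitive matching.
    private final List<String> markers = new ArrayList<>();

    SoftNotFoundDetector(List<String> markerPhrases) {
        for (String phrase : markerPhrases) {
            markers.add(phrase.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true when the response should be treated as an error rather than as content. */
    boolean isError(int httpStatus, String body) {
        if (httpStatus >= 400) {
            return true; // a genuine HTTP error status
        }
        String lower = body.toLowerCase(Locale.ROOT);
        for (String marker : markers) {
            if (lower.contains(marker)) {
                return true; // success status, but the page itself says "not found"
            }
        }
        return false;
    }
}
```

Matching on plain phrases is deliberately crude; it risks flagging a legitimate article that happens to contain the phrase, so a real rule would likely be tied to the site's actual error page layout.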
The structure of the journals on a site can cause them to vary in areas such as:
- Crawl schedule. Some journals are published monthly while others are published quarterly. The system must know when to check for new content and how often to audit the preserved content.
- Crawl rules. In order to crawl (collect) a journal volume or issue, the system must have a place to start crawling (one or more starting URLs) and rules that determine which URLs (found in links) are considered to be part of the volume being collected.
- Configuration information. The plugin for each site must request, through the user interface, the information it needs to distinguish between the journals on that site. For many journals this may be as simple as a starting URL and the name and volume of the journal being collected. Others may require more complex information.
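As a rough illustration of crawl rules, the sketch below models them as an ordered list of include and exclude patterns. `CrawlRuleSet`, its methods, and the first-match-wins policy are assumptions made for this example; they are not the actual LOCKSS crawl rule classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of crawl rules; names and policy are illustrative, not the LOCKSS API. */
final class CrawlRuleSet {
    private static final class Rule {
        final Pattern pattern;
        final boolean include;

        Rule(String regex, boolean include) {
            this.pattern = Pattern.compile(regex);
            this.include = include;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Adds a rule that marks matching URLs as part of the volume. */
    CrawlRuleSet include(String regex) {
        rules.add(new Rule(regex, true));
        return this;
    }

    /** Adds a rule that marks matching URLs as outside the volume. */
    CrawlRuleSet exclude(String regex) {
        rules.add(new Rule(regex, false));
        return this;
    }

    /** The first rule whose pattern matches the whole URL decides; unmatched URLs are not fetched. */
    boolean shouldFetch(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                return rule.include;
            }
        }
        return false;
    }
}
```

For a made-up volume at `https://journal.example.org/vol12/`, a rule set might first exclude `https://journal\.example\.org/vol12/.*\?print=1` and then include `https://journal\.example\.org/vol12/.*`. Links to other volumes or to external sites match no rule and are therefore skipped.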
Variation in the content within journals can affect areas such as:
- Filter rules. Caches regularly audit the content they are preserving to ensure it is complete and undamaged. This is done by, in effect, comparing content with other caches. But many of the preserved pages have components that are not part of the core content, and some of these, such as ads or personalization, are not the same each time the page is fetched. Since each cache collects its own copy of the content from the publisher, different caches may have different versions of the same page, even though they agree on the core content. Journal- (or publisher-) specific rules tell the system which parts of a page are allowed to be different.
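The idea behind filter rules can be sketched as follows: remove the parts of a page that are allowed to differ, then compare digests of what remains. This is a simplified illustration, not the LOCKSS audit protocol or filter API. The class `HashFilter`, the assumed ad markup (`<div class="ad">` with no nested `div` elements), and the choice of SHA-256 are all assumptions for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

/** Hypothetical sketch of a hash filter; not the LOCKSS filter API. */
final class HashFilter {
    // Assumed markup: ads live in <div class="ad">...</div> with no nested divs.
    private static final Pattern AD_BLOCK =
        Pattern.compile("<div class=\"ad\">.*?</div>", Pattern.DOTALL);

    /** Removes the parts of a page that are allowed to differ between fetches. */
    static String filter(String html) {
        return AD_BLOCK.matcher(html).replaceAll("");
    }

    /** Computes a digest over the filtered page only. */
    static byte[] digest(String html) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(filter(html).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Two copies agree when their filtered digests match. */
    static boolean agree(String pageA, String pageB) {
        return MessageDigest.isEqual(digest(pageA), digest(pageB));
    }
}
```

With this filter, two copies of a page that differ only in their ad block agree, while a copy whose article text has changed does not. A regular expression is used only to keep the sketch short; real pages would normally call for an HTML-aware filter.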
How is the plugin layer structured?
LOCKSS includes a Java framework to support plugin writers. At the most basic level, this comprises a handful of interfaces for which every plugin must provide an implementation. At every point in the system where special knowledge or behavior might be required, a call is made to a method of one of these interfaces. The Javadoc of the current plugin API is available online and is updated nightly.
Most plugins won't need to take any special action in the majority of cases, so we have written a set of base classes that implement the plugin interfaces and provide default behavior for most of the methods. The Javadoc for the base plugin API is also available online.
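The relationship between the interfaces and the base classes can be sketched as follows. All names here (`JournalPlugin`, `BaseJournalPlugin`, `ExampleJournalPlugin`, and their methods and default values) are hypothetical and chosen only to show the pattern; the real interfaces and base classes are described in the Javadoc and differ from this sketch.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical interface; the real LOCKSS plugin interfaces have different names and methods. */
interface JournalPlugin {
    List<String> startUrls();
    boolean shouldFetch(String url);
    long minDelayBetweenFetchesMillis();
    String filterForHash(String html);
}

/** Hypothetical base class supplying default behavior for most methods. */
abstract class BaseJournalPlugin implements JournalPlugin {
    @Override
    public long minDelayBetweenFetchesMillis() {
        return 6_000; // assumed default, for illustration only
    }

    @Override
    public String filterForHash(String html) {
        return html; // default: no filtering
    }

    @Override
    public boolean shouldFetch(String url) {
        // Default rule: fetch anything located under one of the starting URLs.
        for (String start : startUrls()) {
            if (url.startsWith(start)) {
                return true;
            }
        }
        return false;
    }
}

/** A plugin that implements the one required method and overrides a single default. */
final class ExampleJournalPlugin extends BaseJournalPlugin {
    @Override
    public List<String> startUrls() {
        return Collections.singletonList("https://journal.example.org/vol12/");
    }

    @Override
    public long minDelayBetweenFetchesMillis() {
        return 10_000; // this made-up site asks crawlers to go more slowly
    }
}
```

The daemon would call only the interface methods, so a plugin written this way inherits sensible behavior everywhere it does not explicitly override it.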
What must be done to write and test a plugin?
We have developed a tool to help plugin developers write and test plugins.
- Tutorial and instructions for getting and using the plugin tool: Plugin Tool
- Submit bug reports and feature requests to the SourceForge issue trackers: http://sourceforge.net/tracker/?group_id=47774
We expect most plugins to be written using this plugin tool. Some more complicated plugins may need to be written by extending our base classes, implementing a couple of required methods (e.g. to specify crawl rules) and otherwise overriding only those methods for which the default behavior is not suitable. Typically this will consist of a couple of classes with four or five methods each.
There is no requirement to use our base classes. If the publisher's site or the content type is sufficiently different from anything we have envisioned, the plugin writer is free to write complete implementations of any or all of the plugin interfaces. Additionally, we plan to develop a set of generic plugins that can be used for the most common types of publishing platforms. A plugin "developer" would configure one of these plugins by answering a series of questions, rather than writing Java code.
