How MetaArchive Works
MetaArchive uses the free, open-source LOCKSS archiving software to operate a network of preservation servers. Because participation costs are low, it is affordable for libraries of all sizes. LOCKSS is an ACM award-winning digital preservation technology that preserves all formats and genres of web-published content, from full-fledged websites to simple web-hosted directories.
Content is stored in and restored to its original format. Participating institutions identify valuable digital assets that they wish to preserve safely. They make the corresponding digital content accessible to MetaArchive network servers, so-called LOCKSS caches, which are configured to copy content, update it to its latest versions on a regular basis, and ensure its integrity over time.
All content is stored in multiple copies on multiple caches at geographically dispersed locations. The MetaArchive network manages the number of replicas so that the loss of all copies becomes extremely unlikely. If an institution loses preserved content for whatever reason, its content is restored in its original form.
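The value of this redundancy can be illustrated with a simple back-of-envelope calculation. The figures below are hypothetical, and real failures are never fully independent, but the sketch shows why several dispersed copies make losing every copy extremely unlikely:

```python
# Illustrative only: if each of n independent caches loses a given collection in
# a year with probability p, the chance that every copy is lost that year is p**n.
def probability_all_copies_lost(p_single_loss: float, n_copies: int) -> float:
    """Assumes independent failures, which geographic dispersion approximates."""
    return p_single_loss ** n_copies

# Hypothetical numbers: a 1% annual loss rate per cache and seven copies.
print(probability_all_copies_lost(0.01, 7))  # 1e-14, i.e. effectively negligible
```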
For some collections, Content Preparation is very easy; in other cases, more effort may be necessary. The Ingest Content workflow provides a more practical step-by-step walkthrough of the final stages of content preparation (plugin development, manifest pages, Conspectus entries, etc.). A general outline follows for the average reader.
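For orientation, a manifest page is at heart an ordinary HTML page served from the staging area that grants the network permission to crawl and links to the staged content. The snippet below is a hypothetical sketch of generating such a page; the exact permission wording, file layout, and URLs should be taken from the Ingest Content workflow documentation rather than from this example.

```python
from pathlib import Path

# Hypothetical staging layout: one manifest page at the collection's base_url,
# linking to the directories (archival units) below it. The paths, collection
# name, and permission wording here are placeholders.
PERMISSION = "LOCKSS system has permission to collect, preserve, and serve this Archival Unit."

def write_manifest(staging_dir: Path, collection_title: str) -> None:
    links = "\n".join(
        f'<li><a href="{p.name}/">{p.name}</a></li>'
        for p in sorted(staging_dir.iterdir()) if p.is_dir()
    )
    html = (
        f"<html><head><title>{collection_title} manifest</title></head>"
        f"<body><p>{PERMISSION}</p><ul>{links}</ul></body></html>"
    )
    (staging_dir / "manifest.html").write_text(html, encoding="utf-8")

write_manifest(Path("/var/www/staging/etds"), "Electronic Theses and Dissertations")
```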
- A Content Owner identifies valuable digital content that needs to be safely preserved, for example:
- electronic theses and dissertations
- data sets
- image masters
- journals
- other
- The Content Owner prepares (or stages) content for preservation by:
- making content accessible in a firewalled, web-hosted directory;
- organizing content so that document files and metadata can be harvested together by LOCKSS caches; and
- discussing with the MetaArchive central staff, when needed, how to harvest content files and METS/OAI metadata from a database-backed institutional repository (CONTENTdm, DSpace, homegrown, etc.)
- The Content Owner prepares a collection description in the MetaArchive's Conspectus tool:
- gives the collection a title and archive designation;
- enters the source URL (base_url) for the web-hosted directory (see above); and
- provides some descriptive metadata for the collection
- A Technical Person reviews the prepared (or staged) content by:
- planning the crawl procedure used by LOCKSS caches when ingesting/updating content;
- tailoring this procedure to the website being crawled (i.e., defining Plugin crawl rules; an illustrative sketch follows this list), for example:
- defining rules to ignore links to ephemeral information such as 'Recent Announcements', 'Latest News', etc.; and
- defining rules to include all intended files (e.g., TIFFs) but exclude all unintended files (e.g., low-resolution JPEGs)
- planning how to organize content so that LOCKSS caches archive large collections in manageable archival units (generally between 1 GB and 30 GB); and
- making sure that the harvesting procedure will guide LOCKSS caches to copy all content needed to restore the collection in the event of total loss of the originals
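The crawl rules mentioned above can be thought of as an ordered list of include/exclude patterns tested against every URL the crawler discovers. The sketch below expresses that idea in Python; the patterns, URLs, and default behaviour are hypothetical illustrations, not an actual MetaArchive plugin, which is written in the LOCKSS plugin format itself.

```python
import re

# Illustrative crawl rules as ordered (action, pattern) pairs; the first rule
# that matches a URL decides whether it is collected. All patterns and URLs
# below are hypothetical examples.
CRAWL_RULES = [
    ("exclude", re.compile(r"/(recent-announcements|latest-news)/")),  # ephemeral pages
    ("exclude", re.compile(r"_lowres\.jpe?g$", re.IGNORECASE)),        # derivative images
    ("include", re.compile(r"\.tiff?$", re.IGNORECASE)),               # preservation masters
    ("include", re.compile(r"/etd-collection/")),                      # everything else in scope
]

def should_collect(url: str) -> bool:
    """Apply the rules in order; the first match decides, and the default is exclude."""
    for action, pattern in CRAWL_RULES:
        if pattern.search(url):
            return action == "include"
    return False

assert should_collect("https://example.edu/etd-collection/2004/thesis_0001.tif")
assert not should_collect("https://example.edu/etd-collection/latest-news/index.html")
assert not should_collect("https://example.edu/etd-collection/2004/thesis_0001_lowres.jpg")
```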
Member institutions prepare content for preservation, producing packages of content according to their local needs and workflows.
Phase 2 starts once the Content Owner and the Tech Person agree that the approach taken will preserve the intended content. This happens when:
- The Tech Person publishes the crawl procedure in the MetaArchive code repository;
- The Tech Person and/or Content Owner enters, in the MetaArchive Conspectus tool, the configuration parameters for the content that is now available for preservation; and
- The MetaArchive Central Staff adds the configuration parameters to the MetaArchive title database
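As a rough illustration, the configuration parameters that identify a collection to the network boil down to the plugin that knows how to crawl it plus the parameter values (such as base_url) that bound each archival unit. The field names, values, and identifier format below are hypothetical and do not reflect the actual Conspectus schema or title-database syntax.

```python
# Hypothetical archival-unit configuration; names and values are illustrative.
au_configuration = {
    "title": "Example University ETDs, 2004",
    "plugin": "edu.example.lockss.plugin.EtdPlugin",  # hypothetical plugin identifier
    "base_url": "https://staging.example.edu/etds/",  # source URL of the staged content
    "year": "2004",                                   # parameter bounding this archival unit
}

def au_id(config: dict) -> str:
    """Derive a stable identifier from the plugin and its defining parameters."""
    params = "&".join(f"{k}={v}" for k, v in sorted(config.items()) if k != "title")
    return f"{config['plugin']}|{params}"

print(au_id(au_configuration))
```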
Ingest of the new content becomes possible once MetaArchive servers have noticed the new configuration settings. This happens when:
- The MetaArchive Central Staff assigns the LOCKSS caches that are to preserve the new content;
- Decisions are guided by the goal of replicating all content seven times, so seven caches are identified (see the sketch following this list);
- Chosen caches need to have sufficient disk space to store the content; and
- Caches chosen are geographically dispersed
- A Tech Person at each assigned location adds the identified content to his/her cache's configuration; and
- The LOCKSS software starts to take care of the new content along with other preserved content.
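The selection itself is an administrative decision made by the central staff, but the criteria above (enough free disk space, geographic dispersion, seven copies) can be illustrated with a small, entirely hypothetical script:

```python
from itertools import groupby

# Entirely hypothetical cache inventory; the real assignment is made by staff.
CACHES = [
    {"name": "cache-a", "region": "southeast", "free_gb": 900},
    {"name": "cache-b", "region": "southeast", "free_gb": 120},
    {"name": "cache-c", "region": "midwest",   "free_gb": 650},
    {"name": "cache-d", "region": "west",      "free_gb": 700},
    {"name": "cache-e", "region": "northeast", "free_gb": 300},
    {"name": "cache-f", "region": "west",      "free_gb": 80},
    {"name": "cache-g", "region": "midwest",   "free_gb": 500},
    {"name": "cache-h", "region": "southeast", "free_gb": 400},
]

def assign_caches(au_size_gb: int, copies: int = 7) -> list[str]:
    # Only caches with enough free space are eligible.
    eligible = sorted((c for c in CACHES if c["free_gb"] >= au_size_gb),
                      key=lambda c: c["region"])
    # Round-robin across regions so the chosen caches are geographically dispersed.
    regions = [list(group) for _, group in groupby(eligible, key=lambda c: c["region"])]
    chosen: list[str] = []
    while len(chosen) < copies and any(regions):
        for group in regions:
            if group and len(chosen) < copies:
                chosen.append(group.pop(0)["name"])
    if len(chosen) < copies:
        raise RuntimeError("not enough eligible caches for the requested replication")
    return chosen

print(assign_caches(au_size_gb=30))
```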
The member creates an entry in the MetaArchive Conspectus, describing the collection(s) it is submitting for ingest.
The member makes content available to the network via a web “staging” server.
Five member storage nodes are assigned to ingest the new content.
- The LOCKSS software running on each cache executes its processes on a routine schedule by:
- harvesting content through the Internet from the URL locations given in its configuration;
- including all content that passes the crawl rules defined for the collections;
- initiating polls about the makeup of preserved content with other LOCKSS caches in the network;
- voting in polls initiated by peers;
- repairing content if a poll result shows convincingly that the content stored locally is inconsistent with the copies held by the majority of caches (memory failure can lead to such a situation); and
- helping other caches to repair content (a simplified sketch of polling and repair follows this list);
- The MetaArchive Central Staff uses the Cache Manager and Conspectus tools to monitor the network status to ensure that:
- content is replicated sufficiently; and
- the network operates correctly
- The Content Owner revisits Phase 1 decisions and solutions whenever content structure changes; and
- The Content Owner and Tech Person regularly audit content as it is preserved on caches to ensure that the harvesting procedures put in place guide the LOCKSS caches correctly when ingesting and updating content (If not, please go back to Phase 1, Step 4)
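The polling, voting, and repair steps above can be pictured as caches comparing cryptographic hashes of their copies and bringing any disagreeing copy back in line with the majority. The sketch below captures only that intuition; the actual LOCKSS polling protocol is a peer-to-peer, tamper-resistant protocol and differs substantially in detail. All names and data here are hypothetical.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()

def poll_and_repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """copies maps a cache name to the bytes it holds for one archival unit."""
    votes = Counter(digest(content) for content in copies.values())
    majority_hash, _ = votes.most_common(1)[0]
    good_copy = next(c for c in copies.values() if digest(c) == majority_hash)
    repaired = {}
    for cache, content in copies.items():
        if digest(content) != majority_hash:
            print(f"{cache} disagrees with the majority; repairing from a peer")
            content = good_copy
        repaired[cache] = content
    return repaired

copies = {f"cache-{i}": b"original bits" for i in range(1, 7)}
copies["cache-7"] = b"corrupted bits"  # e.g. an undetected local storage fault
copies = poll_and_repair(copies)
assert len({digest(c) for c in copies.values()}) == 1  # all copies agree again
```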
Members can monitor progress and completion of the ingest process via the Conspectus.
After ingest, storage nodes regularly and iteratively check in with each other (called "polling and voting") to make sure that all five copies of the content remain identical over time.
If a mismatch is detected between two nodes, the servers determine which copies are correct and which do not match; the network then repairs the corrupted files and records that action.
If the original data is lost:
- The MetaArchive Central Staff helps the institution identify the caches where its content is replicated; and
- The Tech Staff use the proxy feature of one of these caches to restore preserved content
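As a rough illustration of the proxy-based restore, a cache that has been configured to act as an HTTP proxy can serve the preserved copy back under the content's original URL. The host name, port, and URL below are placeholders; the actual procedure is coordinated with the MetaArchive central staff and the cache's administrator.

```python
import requests

# Placeholders throughout: the cache host, port, and content URL are hypothetical.
CACHE_PROXY = {
    "http": "http://cache-a.example.edu:8080",
    "https": "http://cache-a.example.edu:8080",
}

original_url = "https://repository.example.edu/etds/2004/thesis_0001.pdf"
response = requests.get(original_url, proxies=CACHE_PROXY, timeout=60)
response.raise_for_status()

with open("thesis_0001.pdf", "wb") as restored:
    restored.write(response.content)  # the preserved copy, served by the cache
```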
The member alerts the network administrator via email and requests a preserved copy.
The network administrator retrieves a copy from the network and makes it available via download to the member institution.
In case of a local disaster or hardware failure, members can contact the network administrator to set up a replacement server and recover the content from the MetaArchive.