Demo of CAST

A walkthrough of how CAST is used for web archiving: crawling, analysing and packaging web sites.

Overview of CAST

The picture above shows an overview of CAST (the blue area). Today we use CAST to collect sites published on the Internet, but we intend to use it for other types of material as well.

Start crawl

Collection in CAST is done with the crawl tool Heritrix. Heritrix is designed specifically for preservation of the web and is used in the largest web collections around the world. (Read more about Heritrix on the page CAST.)

Heritrix is complicated to use, so we built an interface that simplifies starting a collection. It was used earlier in the project Test Platform; nowadays we configure each new collection directly in Heritrix, because we want to customize collections in a more fine-grained way.
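As an aside for readers who want to script this step, here is a minimal sketch of driving a crawl through the Heritrix 3 REST API; the engine URL, credentials and job name are illustrative assumptions, and this is not necessarily how CAST talks to Heritrix.

    # A minimal sketch of driving a Heritrix 3 crawl through its REST API.
    # Assumes a local Heritrix instance on the default port with a job
    # named "demo-crawl" already configured (crawler-beans.cxml in place).
    import requests
    from requests.auth import HTTPDigestAuth

    ENGINE = "https://localhost:8443/engine"           # default Heritrix engine URL
    AUTH = HTTPDigestAuth("admin", "admin-password")   # set when Heritrix is started

    def job_action(job, action):
        """Send one lifecycle action (build, launch, unpause) to a crawl job."""
        resp = requests.post(
            f"{ENGINE}/job/{job}",
            data={"action": action},
            auth=AUTH,
            verify=False,  # Heritrix ships with a self-signed certificate
        )
        resp.raise_for_status()

    # Build the job from its configuration, launch it, then unpause to crawl.
    for step in ("build", "launch", "unpause"):
        job_action("demo-crawl", step)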

One of Heritrix's strengths is that it can preserve collected sites in the WARC file format, created for this purpose by the people behind the Internet Archive and the IIPC. (Read more about WARC on the page CAST.)
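To give a feel for what a WARC file contains, here is a short sketch that reads the records of one (hypothetical) WARC file with the open-source warcio library; warcio is our choice for illustration, not a tool named in the text.

    # Read the records inside a WARC file with warcio (pip install warcio);
    # the file name is illustrative.
    from warcio.archiveiterator import ArchiveIterator

    with open("collection-00000.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":  # one archived HTTP response
                print(record.rec_headers.get_header("WARC-Target-URI"),
                      record.rec_headers.get_header("WARC-Date"))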

View the collection

When a collection is completed, after a few hours or days depending on its size, it is managed through a webpage.

For each collection, the page displays the date it was made, the URL of the site, a description and its name. The columns to the right give three options:

  • View the page as it appeared at the time of collection
  • View analysis
  • Delete the collection

Visual analysis - view the page in the Wayback Machine

View website ("Visa") shows the collected web site using the Wayback Machine software, installed on our own server. The user chooses a date from a table listing every occasion on which this particular site was crawled by us.

Wayback renders the WARC files just as they were when published on the Internet, i.e. you can click through menus and links, and everything looks as it did on the live website.
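The replay addresses follow a simple convention: a 14-digit capture timestamp is placed between the Wayback server path and the original URL. A minimal sketch, with an illustrative server name:

    # Build a Wayback replay URL for one capture of a page; the server
    # name and path here are illustrative, not our actual installation.
    from datetime import datetime

    def replay_url(original_url, capture_time,
                   server="http://wayback.example.org/wayback"):
        timestamp = capture_time.strftime("%Y%m%d%H%M%S")  # e.g. 20111114103000
        return f"{server}/{timestamp}/{original_url}"

    print(replay_url("http://www.ltu.se/", datetime(2011, 11, 14, 10, 30)))
    # -> http://wayback.example.org/wayback/20111114103000/http://www.ltu.se/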

What does not work are search boxes and dynamic pages whose content depends on choices made by the user. External links are visible but not clickable.

The address bar shows that this is a downloaded site on our server at the LDP center. There is also information about the date and time when the collection was made. (More about the Wayback Machine on the page CAST.)

Content Analysis

In the picture above, we have chosen the second option from the table, View analysis ("Visa analys").

Above the table is an ID for the package that contains all the downloaded files, the date of collection and the domain being crawled. The total number of files collected in this analysis was 22,862.

The two tables present the files by MIME type and version: which file types existed at the time of collection, how many files there were of each type, and three references for each file (a code sketch follows the list):

  1. Where the specific file can be found in the collection
  2. Where it is published on the website
  3. Which pages contain a reference (link) to the file.
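As a rough illustration (not CAST's actual implementation), a per-MIME-type tally like the one in the first table can be computed straight from the WARC files, for example with the warcio library:

    # Tally collected files by MIME type, roughly what the first
    # analysis table summarizes; the file name is illustrative.
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("collection-00000.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response" and record.http_headers:
                ctype = record.http_headers.get_header("Content-Type", "unknown")
                counts[ctype.split(";")[0].strip()] += 1  # drop charset parameters

    for mime, n in counts.most_common():
        print(f"{mime}: {n}")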

Outside the picture there is more information: one table with files that could not be reliably identified, one listing all HTTP status codes (e.g. 404, file not found) and one with the results of the virus scan, all with links to every file and its references.

We summarize all of this in an analysis report, intended both for creating better archive packages and for making the published site better for today's users and easier to preserve in the future.

Metadata

CAST was developed to create metadata files, fill them automatically with data from the tools used, and finally verify them against schemas. The primary purpose of these functions is to make things easier for the authorities, as the metadata structures were built in collaboration with the Swedish National Archives. Another major advantage is a lower error rate.
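As a sketch of that final verification step, a metadata XML file can be validated against an XSD schema, for example with the lxml library; the file names below are placeholders, not CAST's actual paths:

    # Validate a metadata file against its schema with lxml (pip install lxml).
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("addml.xsd"))
    doc = etree.parse("addml-metadata.xml")

    if schema.validate(doc):
        print("Metadata file is valid against the schema.")
    else:
        for error in schema.error_log:  # report each validation failure
            print(error.message)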

Some of the required metadata cannot be collected automatically and must be filled in manually by the user. For this we have created a form where the user enters information such as organization and registration number. The form in the picture is for creating National Archives ADDML files.

We also offer to develop customized forms to meet other organizations' needs.

Create information packages

Finally, CAST creates an information package of the collected web site files, including log files and report files from the tools. At this stage the user can choose which metadata files to include in the package, although some are mandatory in our tool.

The physical package

The customer receives an information package in the form of a TAR file. The package includes the collection (one or more WARC files, depending on size), metadata files (ADDML, PREMIS, METS and file info, all in XML) and a number of reports and log files from the crawl tool, including checksums.
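For illustration only, assembling such a package can be sketched with Python's standard tarfile and hashlib modules; the file names and the choice of SHA-256 as checksum algorithm are assumptions, since the text does not specify them:

    # Assemble a TAR information package with a checksum manifest.
    import hashlib
    import tarfile

    contents = [
        "collection-00000.warc.gz",             # the crawl itself
        "addml.xml", "premis.xml", "mets.xml",  # metadata files
        "fileinfo.xml",
        "crawl.log",                            # reports and logs
    ]

    # Write a checksum for every file into a manifest, then pack the
    # manifest together with the files into one TAR archive.
    with open("checksums.sha256", "w") as manifest:
        for name in contents:
            with open(name, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            manifest.write(f"{digest}  {name}\n")

    with tarfile.open("package.tar", "w") as tar:
        for name in contents + ["checksums.sha256"]:
            tar.add(name)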

Finally, we deliver the TAR file and the analysis report digitally as agreed, for example via FTP or on storage media.
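A minimal sketch of the FTP option, with placeholder host, credentials and file names:

    # Upload the package and report over FTP with Python's standard ftplib.
    from ftplib import FTP

    with FTP("ftp.example.org") as ftp:
        ftp.login("user", "password")
        for name in ("package.tar", "analysis-report.pdf"):
            with open(name, "rb") as fh:
                ftp.storbinary(f"STOR {name}", fh)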

Published: 14 November 2011

Updated: 16 November 2011

Luleå University of Technology