Metadata extraction tools

In our project CODA-META (2008), four tools for extracting metadata were reviewed.

CODA-META

The CODA-META project (2008) reviewed four tools for extracting metadata. The evaluation used a total of 27 files of different formats and versions. The files were assessed against a number of metadata lists produced by the CODA group.

The aim was to evaluate and rank the four tools by how well they fulfil these metadata lists. For more information, see the final report, CODA-META.

Software tests

The software tests in this project aimed to analyze how well the tools extract metadata from different types of files. The requirements were to obtain the metadata specified in the metadata lists and to examine how well the tools access specific technical metadata for certain designated file formats. All tools are open source and were downloaded from the web.

The tools tested were:

  • Exiftool

  • Jhove

  • Metadata extraction tool

  • File identifier

Test material

The test material consisted of 27 files, 16 of which were categorised as follows:

  • Text (7 items)

  • Images (7 items)

  • Sound/video (2 items)

The remaining 11 files were used in the specific technical metadata tests. The selection of file types was based on a compilation of the format lists made by all institutions in the CODA project (National Library, National Archives of Recorded Sounds and Moving Image and National Archives of Sweden). The files were created in different ways: by scanner, digital camera, software and conversion from one format to another.

In 16 of the files, extra metadata was added to make the metadata content of the test files as rich as possible. The files were then examined with a hexadecimal editor (HxD hex editor) to check whether any metadata had been missed. All metadata for each file was then documented so that it was available during the tests.
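The manual check with a hex editor can be approximated in code. The following is only a minimal sketch, not the procedure used in the project: it searches the raw bytes of a file for a few well-known signatures that usually mark embedded metadata blocks. The function names and the choice of signatures are assumptions made for illustration.

```python
# Byte signatures that commonly introduce embedded metadata blocks.
# This short list is an illustrative assumption, not a complete inventory.
SIGNATURES = {
    "EXIF": b"Exif\x00\x00",           # start of the EXIF payload in a JPEG APP1 segment
    "XMP": b"<x:xmpmeta",              # opening element of an XMP packet
    "IPTC/IRB": b"Photoshop 3.0\x00",  # Photoshop resource block that carries IPTC data
}


def find_metadata_markers(data):
    """Return {signature name: offset of first occurrence} for signatures found in data."""
    hits = {}
    for name, signature in SIGNATURES.items():
        offset = data.find(signature)
        if offset != -1:
            hits[name] = offset
    return hits


def scan_file(path):
    """Read a file as raw bytes and report which metadata signatures it contains."""
    with open(path, "rb") as handle:
        return find_metadata_markers(handle.read())


# A tiny hand-made byte string that imitates the start of a JPEG with an EXIF segment.
sample = b"\xff\xd8\xff\xe1\x00\x10" + b"Exif\x00\x00" + b"MM\x00\x2a"
print(find_metadata_markers(sample))
```

A hit only shows that a block of that kind is present. The documented field values still have to be compared by hand with what each tool reports.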

Method

The tools were rated against a number of metadata lists from the CODA project group.

The lists are as follows:

  • General technical and descriptive metadata from the National Library of Sweden

  • Technical metadata for the TIFF format

  • Technical metadata for text formats

  • Technical metadata for sound formats

  • Technical metadata for video formats

First, the tools were evaluated against the general metadata list from the National Library, which covers general technical and descriptive metadata. After that, an extended check was made of what technical metadata the tools could deliver: a number of text formats, video and audio formats and the TIFF format were checked against the lists of metadata. Finally, all test data was put together and served as the basis for the final evaluation and rating.
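One simple way to express "how well a tool fulfils a metadata list" is the share of listed fields that the tool actually returned. The sketch below is an assumption about how such a score could be computed. It is not the rating scheme used in the CODA-META report, and the field and tool names are hypothetical.

```python
def coverage(required_fields, extracted_fields):
    """Fraction of required fields found in a tool's output (case-insensitive match)."""
    required = {name.lower() for name in required_fields}
    if not required:
        return 0.0
    found = required & {name.lower() for name in extracted_fields}
    return len(found) / len(required)


def rank_tools(required_fields, results_per_tool):
    """Return (tool, score) pairs sorted from best to worst coverage."""
    scores = {
        tool: coverage(required_fields, fields)
        for tool, fields in results_per_tool.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical metadata list and hypothetical tool output, for illustration only.
required = ["Title", "Creator", "Format", "Date"]
results = {
    "tool_a": ["title", "format"],
    "tool_b": ["Title", "Creator", "Date"],
}
ranking = rank_tools(required, results)
print(ranking)
```

A score of this kind treats every field as equally important. A real evaluation may weight fields differently or judge the correctness of the extracted values as well.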

Exiftool

The metadata program Exiftool is a platform-independent Perl library and a command-line application. It is also available as a Windows executable and as a Mac OS X package.

The program can read, write and edit metadata in image, audio and video files. It supports a large number of different types of metadata.

Input/Output data
Input is given through the command-line window (exiftool.exe) and can be a single file or a whole directory. Output can be written to a text file or an HTML file, or shown on screen, depending on how the metadata needs to be presented.

The generated information can also be sorted and structured in different ways, for example by group (EXIF, XMP and so on), or limited to a single category of metadata (for example only IPTC). This is done with command-line options when running the software.

Output can be presented as a colon-separated list or a tab-separated list. More information is available in the program's help file. The version tested here was Exiftool 7.30.
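The options described above can be combined on the command line. The following sketch only assembles an argument list and does not run anything. It assumes an `exiftool` executable on the search path, and the helper name `build_exiftool_args` is hypothetical. The options used (`-G` for group names, `-t` for tab-separated output, `-h` for HTML output, `-r` for recursing into directories, and `-GROUP:all` to select one group) should be checked against the help file of the installed version.

```python
def build_exiftool_args(target, group=None, tab_separated=False,
                        html=False, recursive=False, show_groups=True):
    """Assemble an exiftool command line as a list of arguments (nothing is executed)."""
    if tab_separated and html:
        raise ValueError("choose either tab-separated or HTML output, not both")
    args = ["exiftool"]
    if show_groups:
        args.append("-G")             # prefix each tag with its group (EXIF, XMP, ...)
    if tab_separated:
        args.append("-t")             # tab-separated list instead of the colon-separated default
    if html:
        args.append("-h")             # HTML table output
    if recursive:
        args.append("-r")             # process sub-directories as well
    if group:
        args.append(f"-{group}:all")  # limit output to one metadata group, e.g. IPTC
    args.append(target)               # a single file or a directory
    return args


# Only IPTC metadata from one (hypothetical) file, as a tab-separated list.
print(build_exiftool_args("photo.jpg", group="IPTC", tab_separated=True))
```

If exiftool is installed, the list can be passed to `subprocess.run(args, capture_output=True, text=True)` to collect the output as text.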

Exiftool

[http://www.sno.phy.queensu.ca/~phil/exiftool/]

Jhove

Jhove (JSTOR/Harvard Object Validation Environment) version 1.1 is a tool for identifying and validating file formats and extracting metadata from them, developed by the Harvard University Library. The tool is written in Java, platform independent and fully documented.

Every file format has its own module, written in Java, that identifies, validates and extracts metadata. At this point there are 12 format modules for Jhove.

As the source code is available under the GNU Lesser General Public License (LGPL), new format modules can be developed in Java. Development is ongoing and a new version of Jhove is in progress.

The program is operated through a graphical interface (jhoveview.jar) or a command-line interface (jhoveapp.jar).

Input/Output data
Files can be submitted for extraction one at a time or as whole directories.

Results from Jhove can be shown on screen, as XML-structured output or as a text file. An audit file can also be generated. The information is very rich: file name, version, status and a large amount of metadata can be obtained from the file.
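As a rough illustration of the command interface, the sketch below assembles a Jhove command line without executing it. It assumes a launcher named `jhove` on the search path (in practice the command interface may instead be started through jhoveapp.jar with Java), and the helper name `build_jhove_args` is hypothetical. The options `-m` (format module), `-h` (output handler) and `-o` (output file), and the example module name `TIFF-hul`, should be verified against the documentation of the installed version.

```python
def build_jhove_args(target, module=None, handler="XML", output=None):
    """Assemble a Jhove command line as a list of arguments (nothing is executed)."""
    args = ["jhove"]
    if module:
        args += ["-m", module]  # use one format module instead of trying all of them
    args += ["-h", handler]     # output handler, for example XML, Text or Audit
    if output:
        args += ["-o", output]  # write the result to a file instead of the screen
    args.append(target)         # a single file or a directory
    return args


# XML output for one (hypothetical) TIFF file, written to scan.xml.
print(build_jhove_args("scan.tif", module="TIFF-hul", output="scan.xml"))
```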

Jhove

[http://hul.harvard.edu/jhove/]

Metadata extraction tool

Metadata extraction tool was built by Sytec Resources for the National Library of New Zealand in 2003. In 2007 it was decided to release the tool as open source. The program is built to extract metadata from files for digital preservation. It is written in Java and has a command-line interface for Unix, and a graphical interface and a command-line window for Windows.

Rich documentation is included in the downloaded compressed program package, together with an installation guide, information on the software structure and a user manual. The source code is available for further development and/or for customizing the tool for internal needs. The program is built as a number of modules that extract metadata from files.

Input/Output data
Files can be input through the graphical user interface or through a list that the tool works from. Metadata can be extracted from a single file, a directory of files or every file in a list.

The result can be shown on screen or exported as a structured XML file. There are two output alternatives: the metadata of each file in its own XML file (one per file), or the metadata of all analyzed files in one XML file, depending on the settings in the application.

As the source code of Metadata extraction tool is available, some further development of the software can be expected. The latest version, 3.1A, was tested here.

Metadata extraction tool

[http://meta-extractor.sourceforge.net/]

File identifier

File identifier 0.6.1 is a beta version produced to identify file formats and extract metadata from them. The program is made by Optima SC Inc and the version tested is freeware. File identifier supports 32-bit Windows and Linux x86. The program is operated through a command prompt (file.exe), and at this point about 600 file formats are supported for identification and about 30 for extraction of metadata.

Input/Output data
Files can be analyzed for metadata one by one or as a whole directory. The output from the program is file name, file class, MIME type, absolute path to the file and some metadata such as creation date, date of modification and some data specific to the file class.

The result from File identifier can be shown on screen, as an HTML report or as an SFV report.

File identifier

[http://www.optimasc.com/products/fileid/index.html]

Result

In the CODA-META report, a table presents which of the 16 files each tool could open and read.

The result also shows how well the tools fulfil the lists of metadata made for the tests. Altogether more than 150 tests were done and all tests are presented in detail with text and tables in the report CODA-META.

CODA-META - in Swedish

[https://ldb.project.ltu.se/main.php/ldb.project.ltu.se/main.php/projects/portalproject/docs/Publikationer/Svenska%20publikationer/CODA-META.pdf?fileitem=7473811]

CODA-META - in English

[https://ldb.project.ltu.se/main.php/ldb.project.ltu.se/main.php/projects/portalproject/docs/Publikationer/Svenska%20publikationer/CODA-META_English.pdf?fileitem=7475396]

Published: 18 December 2008

Updated: 22 February 2011

Luleå University of Technology