What is METS/ALTO?

METS and ALTO are XML standards maintained by the Library of Congress.

The METS standard is a flexible schema for describing a complex digital object (like a digitized newspaper issue). METS describes the structure of the object but does not encode its actual textual content. The ALTO standard fills this gap by encoding the textual content of a digitized page in great detail, including styles and layout. In addition to the digitized text itself, ALTO encodes the spatial coordinates of every column, line, and word as it appears on the page.

The combination of METS and ALTO (often written METS/ALTO) is the current industry standard for newspaper digitization, used by hundreds of modern, large-scale newspaper digitization projects (and lots of smaller projects too!). A very small sample of projects using METS/ALTO is listed below.

 

More about METS

The Metadata Encoding and Transmission Standard (METS) is a highly flexible schema for encoding descriptive, administrative, and structural metadata to describe complex digital objects. In a METS file you find information such as the title, author, publisher and date of the original work, and also information about the digital object itself, including the digitization process and the physical and logical structure of the object.

When METS is used to describe digitized newspapers, there is typically a single METS file for each newspaper issue.

What METS XML contains

As typically used for digitized newspapers, a METS XML file has five main sections. Each section describes a different aspect of the digital object.

Section 1 – Descriptive Metadata: <dmdSec>

Uses MODS or similar metadata to describe the object itself. Here you find the title of the object, as well as other information like author, publisher, and date.

Section 2 – Administrative Metadata: <amdSec>

Uses MIX or a similar metadata schema to describe the digitization process and the resulting digital files. Here you find information about the scanning process, hardware, digitization software, compression, file types and more.

Section 3 – File Section: <fileSec>

Lists, describes, and links to the files that make up the complex digital object described by the METS file. For a newspaper issue those files typically include page-level images (in TIFF and/or JPEG 2000 format), ALTO XML files describing the layout and content of each individual page, and page-level and/or issue-level PDF files.

Section 4 – Physical Structure: <structMap LABEL="Physical Structure">

Describes the physical structure of a complex digital object. For a digitized newspaper this section “points to” and describes the pages that make up the newspaper issue. It includes metadata associated with the physical pages (e.g. page numbers and/or ordering information) and links to files (e.g. images and ALTO XML files) that describe each page.

Section 5 – Logical Structure: <structMap LABEL="Logical Structure">

Describes the “logical” structure of a complex digital object. For newspapers, if articles have been identified during digitization, this section lists the “table of contents” of articles in the newspaper issue, as well as any metadata (e.g. headlines and bylines) associated with individual articles.
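To make the five sections more concrete, here is a minimal Python sketch that walks a METS file with the standard library. The embedded sample document, its file names (page1.tif, page1.xml), the USE and TYPE values, and the helper names list_sections and pages are all hypothetical; real METS profiles differ in how they label file groups and structure maps, so treat this as an illustration rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily trimmed METS document for a one-page issue.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <dmdSec ID="DMD1"/>
  <amdSec ID="AMD1"/>
  <fileSec>
    <fileGrp USE="image">
      <file ID="IMG1" MIMETYPE="image/tiff">
        <FLocat LOCTYPE="URL" xlink:href="page1.tif"/>
      </file>
    </fileGrp>
    <fileGrp USE="alto">
      <file ID="ALTO1" MIMETYPE="text/xml">
        <FLocat LOCTYPE="URL" xlink:href="page1.xml"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap LABEL="Physical Structure">
    <div TYPE="issue">
      <div TYPE="page" ORDER="1">
        <fptr FILEID="IMG1"/>
        <fptr FILEID="ALTO1"/>
      </div>
    </div>
  </structMap>
  <structMap LABEL="Logical Structure">
    <div TYPE="issue"/>
  </structMap>
</mets>"""


def list_sections(root):
    """Return the top-level section names, with LABEL where present."""
    names = []
    for child in root:
        tag = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        label = child.get("LABEL")
        names.append(f"{tag} ({label})" if label else tag)
    return names


def pages(root):
    """Return (order, [file hrefs]) for each page in the physical structMap."""
    # Build a lookup from file ID to its location in the file section.
    lookup = {}
    for f in root.findall("mets:fileSec//mets:file", NS):
        loc = f.find("mets:FLocat", NS)
        lookup[f.get("ID")] = loc.get(XLINK_HREF) if loc is not None else None

    physical = next(
        sm for sm in root.findall("mets:structMap", NS)
        if sm.get("LABEL") == "Physical Structure"
    )
    result = []
    for div in physical.iter(f"{{{METS_NS}}}div"):
        if div.get("TYPE") != "page":
            continue
        hrefs = [lookup[p.get("FILEID")] for p in div.findall("mets:fptr", NS)]
        result.append((int(div.get("ORDER")), hrefs))
    return result


root = ET.fromstring(SAMPLE_METS)
print(list_sections(root))
print(pages(root))
```

The sketch resolves each page's file pointers through the file section, which mirrors how the physical structure "points to" the images and ALTO files that describe a page.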

 

More about ALTO

The Analyzed Layout and Text Object (ALTO) is a schema for capturing the word content, styles, and layout elements on a digitized textual page, including the spatial coordinates of text elements like columns and lines. It is often used in tandem with METS XML, which provides descriptive and administrative metadata about the object to which the ALTO XML file belongs.

What ALTO XML contains

An ALTO XML document comprises the physical description, composition, and the page content of digital objects. ALTO files generally have 3 sections.

Section 1 – Description
    <Description>
      <MeasurementUnit>mm10</MeasurementUnit>
      <SourceImageInformation>
          …

The Description section contains descriptive information pertaining to the ALTO file itself, including measurement units, source file information, processing software and creator, and OCR information.

Section 2 – Styles
    <Styles>
          …

The Styles section contains descriptions of fonts and paragraphs. Common information includes the font-family and size, font styling, and paragraph alignments and line spacing.
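As an illustration of how style descriptions might be consumed, the following Python sketch collects the entries in a Styles section and resolves them for a text block. It assumes the usual ALTO pattern in which TextStyle and ParagraphStyle elements carry an ID and layout elements refer to them through a STYLEREFS attribute; the sample document, the attribute values, and the helper names styles_by_id and block_styles are hypothetical. The namespace wildcard ({*}) requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed ALTO document; attribute values are made up.
SAMPLE_STYLES = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10" FONTSTYLE="bold"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00001" STYLEREFS="TXT_0 PAR_0"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def styles_by_id(root):
    """Map each style ID to a plain dict of its attributes."""
    return {el.get("ID"): dict(el.attrib) for el in root.findall("{*}Styles/*")}


def block_styles(root):
    """Map each TextBlock ID to the list of styles it references."""
    lookup = styles_by_id(root)
    resolved = {}
    for block in root.findall(".//{*}TextBlock"):
        refs = block.get("STYLEREFS", "").split()
        resolved[block.get("ID")] = [lookup[r] for r in refs if r in lookup]
    return resolved


styles_root = ET.fromstring(SAMPLE_STYLES)
for block_id, styles in block_styles(styles_root).items():
    print(block_id, styles)
```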

Section 3 – Layout
    <Layout>
      <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="4516" HEIGHT="323"/>
      <LeftMargin ID="p1_LM00001" HPOS="0" VPOS="323" WIDTH="133" HEIGHT="5981"/>
          …
      <PrintSpace …>
        <TextBlock …>
          <TextLine ID="p1_TL00001" HPOS="163" VPOS="1909" WIDTH="4198" HEIGHT="23">
            <String …/>
            <SP …/>
            …

The Layout section is where the actual content (String) and dimensions (HPOS, VPOS, WIDTH, and HEIGHT) are located. Each block of text is listed and positioned absolutely, relative to the top-left corner of the page, in the measurement unit declared in the Description section (typically fractions of an inch or of a millimeter). Further detail and positioning are provided for every line and each word of content on the page. The Layout section also describes and positions any other objects on the page, such as pictures, tables, and formulas.
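The relationship between the measurement unit and the coordinates can be sketched in a few lines of Python. The sample page below is hypothetical (the words, IDs, and most coordinates are invented for illustration), as are the helper names to_mm and read_words. The sketch assumes the unit is either mm10 (tenths of a millimeter) or inch1200 (1/1200 of an inch); pixel units are not converted because that would require the image resolution. The namespace wildcard ({*}) requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed ALTO page; text and coordinates are made up.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <MeasurementUnit>mm10</MeasurementUnit>
  </Description>
  <Layout>
    <Page ID="P1" WIDTH="4516" HEIGHT="6304">
      <PrintSpace HPOS="133" VPOS="323" WIDTH="4250" HEIGHT="5981">
        <TextBlock ID="P1_TB00001" HPOS="163" VPOS="1909" WIDTH="4198" HEIGHT="23">
          <TextLine ID="P1_TL00001" HPOS="163" VPOS="1909" WIDTH="4198" HEIGHT="23">
            <String CONTENT="Daily" HPOS="163" VPOS="1909" WIDTH="120" HEIGHT="23"/>
            <SP HPOS="283" VPOS="1909" WIDTH="10"/>
            <String CONTENT="News" HPOS="293" VPOS="1909" WIDTH="110" HEIGHT="23"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def to_mm(value, unit):
    """Convert an ALTO coordinate to millimeters for the units handled here."""
    if unit == "mm10":
        return value / 10            # tenths of a millimeter
    if unit == "inch1200":
        return value * 25.4 / 1200   # 1/1200 of an inch
    raise ValueError(f"cannot convert unit {unit!r} without more information")


def read_words(alto_xml):
    """Return the measurement unit and one record per word on the page."""
    root = ET.fromstring(alto_xml)
    unit = root.findtext("{*}Description/{*}MeasurementUnit")
    words = []
    for line in root.findall(".//{*}TextLine"):
        for s in line.findall("{*}String"):
            words.append({
                "text": s.get("CONTENT"),
                "line": line.get("ID"),
                "hpos_mm": to_mm(float(s.get("HPOS")), unit),
                "vpos_mm": to_mm(float(s.get("VPOS")), unit),
            })
    return unit, words


unit, words = read_words(SAMPLE_ALTO)
print(unit)
for w in words:
    print(w["line"], w["text"], w["hpos_mm"], w["vpos_mm"])
```

Because every String carries its own position, the same loop could be used to highlight search hits on the page image, provided the coordinates are scaled to the image's pixel dimensions.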