alto_search

What does this program do?

This application searches for terms in ALTO files, which are XML files that store the output of OCR. A term can consist of several words, and multiple terms can be searched at the same time. The output is in XML format and contains the coordinates of the words that were found as well as the textual context around the hits. The search is case-insensitive: everything is converted to lowercase before comparison. In addition, any punctuation marks (including brackets and similar characters) at the beginning or end of a term are ignored for comparison purposes. Both the detection of punctuation marks and the conversion to lowercase are based on Unicode 5.2 data.
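The comparison rule described above can be sketched in Python. This is only an illustration, not the program's actual implementation: the function name normalize_term is hypothetical, "punctuation" is assumed to mean the Unicode general categories starting with "P" (which include brackets and hyphens), and Python's unicodedata module uses a newer Unicode version than 5.2.

```python
import unicodedata


def normalize_term(word: str) -> str:
    """Lowercase a word and strip punctuation from both ends.

    Hypothetical helper: it mirrors the behaviour described in the text,
    under the assumption that "punctuation" means Unicode categories P*.
    """
    def is_punct(ch: str) -> bool:
        # Categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P";
        # Ps and Pe are the opening and closing brackets.
        return unicodedata.category(ch).startswith("P")

    start, end = 0, len(word)
    # Skip punctuation at the beginning of the word.
    while start < end and is_punct(word[start]):
        start += 1
    # Skip punctuation at the end of the word.
    while end > start and is_punct(word[end - 1]):
        end -= 1
    # Punctuation inside the word is kept; only the ends are trimmed.
    return word[start:end].lower()
```

With this sketch, "(Remich)." and "remich" normalize to the same string and would therefore compare as equal.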

Usage

alto_search -alto filename -term XXXX [-term XXXX] [-utf8term XX,YY,ZZ] [-help] [-block YYYY] [-mode ZZZZZ]

  • alto : "filename" gives the full pathname of the ALTO file in which to search (must be UTF-8)
  • term : "XXXX" is a search term in UTF-8 (one term can be several words enclosed in quotation marks). There can be several -term parameters.
  • utf8term : This is meant for the case where you cannot pass non-ANSI characters to the program. Instead of the term itself, pass the bytes of its UTF-8 encoding as decimal numbers separated by commas.
  • block : "YYYY" is the ID of a TextBlock in the ALTO file. As soon as at least one block ID is given, the search is restricted to the given blocks. There can be several -block parameters.
  • mode : "ZZZZZ" selects how much context is returned. It is one of the following: no_context, context, blocks, all_blocks
  • help : prints a help text in XML
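
The value for -utf8term can be produced with a few lines of Python. The sketch below is an assumption-laden illustration: the helper names utf8term_argument and build_command are hypothetical, the executable is assumed to be reachable as alto_search, and how the program treats a multi-word term passed through -utf8term is not stated in this document.

```python
def utf8term_argument(term: str) -> str:
    """Return the UTF-8 bytes of a term as comma-separated decimal numbers."""
    return ",".join(str(byte) for byte in term.encode("utf-8"))


def build_command(alto_path, terms, blocks=(), mode=None):
    """Build an argument list using only the options listed under Usage.

    Hypothetical helper: pure ASCII terms are passed with -term, all
    other terms are passed byte by byte with -utf8term.
    """
    cmd = ["alto_search", "-alto", alto_path]
    for term in terms:
        if term.isascii():
            cmd += ["-term", term]
        else:
            cmd += ["-utf8term", utf8term_argument(term)]
    for block_id in blocks:
        cmd += ["-block", block_id]
    if mode is not None:
        cmd += ["-mode", mode]
    return cmd
```

The resulting list can be handed to subprocess.run, which avoids shell quoting problems for terms that contain spaces.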

Example output

<?xml version="1.0" encoding="UTF-8"?>
<ResultSet>
    <Description>
        <MeasurementUnit>mm10</MeasurementUnit>
    </Description>
    <result BLOCKID="P2_TB00017" NHITS="1" CHARS_HIT="9">
        <line>
            <text>Abgang U Uhr Morgens;  Ankunft zu Mer-</text>
        </line>
        <line>
            <text>tert 6 Uhr 24;</text>
        </line>
        <line>
            <text>zu</text>
            <text HIT="HIT" HPOS="1013" VPOS="1014" WIDTH="142" HEIGHT="34">Oetringen</text>
            <text>7 Uhr 36;</text>
        </line>
        <line>
            <text>zu Remich 11 Uhr.</text>
        </line>
        <line>
            <text>Remich à Grevenmacher.</text>
        </line>
    </result>
    <info>
        <time_elapsed>50</time_elapsed>
    </info>
</ResultSet>

Explanation of result XML

  • MeasurementUnit

The measurement unit used in the ALTO file. This is usually mm10 (tenths of a millimetre). To convert a mm10 value into pixels, use the formula pixel = mm10 * dpi / 254, where dpi is the resolution of the image that the ALTO file corresponds to.

  • ResultSet

The root element, which contains the whole result.

  • result BLOCKID="P2_TB00017" NHITS="1" CHARS_HIT="9"

A result with its context. The BLOCKID is the ID of the TextBlock in which the hit was found. NHITS tells you how many times a term was found in the block (the same term found several times counts several times). CHARS_HIT gives the total number of characters in the terms that were found. This can be used to rank results by the number of characters hit.

  • line

A textline as defined by the ALTO file.

  • text

The full text from the ALTO file. A word that is not a hit has no attributes. When a word is a hit, the attributes specify its position on the page.

  • text HIT="HIT" HPOS="1013" VPOS="1014" WIDTH="142" HEIGHT="34"

A hit. This is always one single word (as defined by the segmentation in the ALTO). One word can be split into two hits if it was hyphenated on the original page. If a search term consists of several words, they will be adjacent hits. HPOS is the horizontal position, VPOS is the vertical position, WIDTH is the width and HEIGHT is the height of the block that delimits the word on the image. These values are in the unit defined in MeasurementUnit.

  • info

Information about the time needed to perform the search
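
The elements described above can be read with any XML parser. The following Python sketch collects all hits and converts their coordinates to pixels with the formula given for MeasurementUnit. The function names mm10_to_pixels and extract_hits are hypothetical, the dpi value has to be supplied by the caller, and values in any unit other than mm10 are simply returned unchanged.

```python
import xml.etree.ElementTree as ET


def mm10_to_pixels(value_mm10: float, dpi: float) -> float:
    """Apply pixel = mm10 * dpi / 254 (254 tenths of a millimetre per inch)."""
    return value_mm10 * dpi / 254


def extract_hits(result_xml, dpi: float):
    """Return one dictionary per hit word found in a ResultSet document."""
    root = ET.fromstring(result_xml)
    unit = root.findtext("Description/MeasurementUnit")
    hits = []
    for result in root.iter("result"):
        for text in result.iter("text"):
            # Only hit words carry the HIT and position attributes.
            if text.get("HIT") != "HIT":
                continue
            box = {key: float(text.get(key))
                   for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")}
            if unit == "mm10":
                box = {key: mm10_to_pixels(value, dpi)
                       for key, value in box.items()}
            hits.append({"block": result.get("BLOCKID"),
                         "word": text.text,
                         **box})
    return hits
```

For the example output above and an image scanned at 300 dpi, the single hit "Oetringen" would get a horizontal position of 1013 * 300 / 254, which is roughly 1196 pixels.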

 

Last edit: Bib nationale de Luxembourg 2011-11-23