Main Page
We should probably split this up into separate pages at some point.
About the TEI-Comparator
What is the TEI-Comparator?
The TEI-Comparator is a text-comparison engine designed to compare two XML files in which items of paragraph-like granularity have been moved around, split up, or otherwise reorganized. It is called the 'TEI-Comparator' because it was originally designed to compare two large XML files that follow the Guidelines of the Text Encoding Initiative (TEI). However, the TEI-Comparator does not specifically require TEI XML and should work with any XML as long as the units being compared are of a similar paragraph-like size. It is a database-backed Java web application (using MySQL, HSQL, or indeed any database supported by Hibernate) built using the Google Web Toolkit API. Texts are prepared in advance by having unique namespaced IDs applied to them and, optionally, by having the comparator attempt to find proposed matches for each ID. Texts are then loaded into the Comparator, which provides a web interface for confirming and deleting proposed matches, making new matches, and annotating either the links between matching items or the items themselves. The links and annotations can then be output as a TEI file.
Why was it made?
The TEI-Comparator was initially made by the Research Technology Services of the Oxford University Computing Services at the University of Oxford to provide assistance for the Holinshed Project. Holinshed's Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed's Chronicles was first printed in 1577, and a second, revised and expanded edition followed in 1587. The project is interested in examining the differences between the two editions. The EEBO-TCP project had already encoded a version of the 1587 edition, and for the Holinshed Project they created a version based on the 1577 edition using the same methodology. The TEI-Comparator enabled a research assistant to confirm proposed matches, delete any incorrect ones, make new ones, and provide annotations.
What preprocessing is necessary?
There are two steps to the pre-processing:
- ID'ing the files: Deciding which elements will be compared and adding special text-comparator IDs to each of these elements. The TEI-Comparator comes with a script, markup-input-file.sh, which helps accomplish this task but should be customised to the elements that are important to your project.
- Initial Comparison: An initial comparison is run which, for each ID'ed element, looks through all the similarly ID'ed elements in the other file, attempting to ascertain which of these elements might match. This can be time-consuming with very large files (for Holinshed's Chronicles, which consists of about 20,000 paragraphs, a full comparison takes about 2 hours, measured on a two-year-old MacBook Pro) but only needs to be done once. The TEI-Comparator script for doing this is initial-comparison.sh.
What does the TEI-Comparator need to work?
The TEI-Comparator is a Java web application and so runs in an appropriate servlet container such as Apache Tomcat. It is written using the Google Web Toolkit.
Getting the TEI-Comparator
Download the latest released version from Sourceforge or get the latest version directly from the SVN. The released version comes with a prebuilt webapp. When checking out the latest version from the SVN you have to build the webapp with Google's GWT compiler (http://code.google.com/intl/de-DE/webtoolkit/).
Access to the SVN
The TEI-Comparator is available from Sourceforge subversion at "https://tei-comparator.svn.sourceforge.net/svnroot/tei-comparator/TEI-Comparator". You can check out a copy of this repository with the following command:
svn checkout https://tei-comparator.svn.sourceforge.net/svnroot/tei-comparator/TEI-Comparator ./TEI-Comparator
This will get you the most recent version of the TEI-Comparator.
Installation of TEI-Comparator
Configuration
In src/properties.xml there are a number of properties that you can adjust. These include:
<entry key="input.file1">/usr/local/TEI-Comparator/resources/1577.xml</entry>
<entry key="input.file2">/usr/local/TEI-Comparator/resources/1587.xml</entry>
<entry key="stylesheets.TEItoHTML">/usr/local/TEI-Comparator/resources/render.xsl</entry>
where you can set the paths for the two input files and the XSLT stylesheet used to render them for presentation inside the web app. There are also settings used for rendering and presentation of the XML texts.
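Before deploying, it can be useful to check that the configured paths actually exist. The following is a minimal sketch in Python (not part of the TEI-Comparator itself). It assumes that properties.xml uses the standard Java XML properties layout, with a properties root element wrapping the entry elements; only the entry lines are shown in this documentation, so the wrapper element and the helper names read_entries and missing_paths are assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET


def read_entries(xml_text):
    """Return a dict of key -> value for every <entry key="..."> element."""
    root = ET.fromstring(xml_text)
    return {e.get("key"): (e.text or "").strip() for e in root.iter("entry")}


def missing_paths(entries, keys):
    """Return the keys whose configured path does not exist on disk."""
    return [k for k in keys if not os.path.exists(entries.get(k, ""))]


# Assumed wrapper element; only the <entry> lines come from the documentation.
SAMPLE = """<properties>
<entry key="input.file1">/usr/local/TEI-Comparator/resources/1577.xml</entry>
<entry key="input.file2">/usr/local/TEI-Comparator/resources/1587.xml</entry>
<entry key="stylesheets.TEItoHTML">/usr/local/TEI-Comparator/resources/render.xsl</entry>
</properties>"""

entries = read_entries(SAMPLE)
path_keys = ["input.file1", "input.file2", "stylesheets.TEItoHTML"]
for key in missing_paths(entries, path_keys):
    print("missing:", key, "->", entries[key])
```

To check a real installation, the contents of src/properties.xml would be read from disk and passed to read_entries in place of SAMPLE.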
If you want to use a database other than HSQL (for example MySQL) then you must adjust the properties in src/hibernate.cfg.xml to match that database.
Build the web app
To build a new TEI-Comparator.war you have to zip up everything in the war directory and name the file TEI-Comparator.war (if you have checked out the TEI-Comparator from the SVN, you first have to build the web app using the Google GWT Compiler). This file can then be deployed in any servlet container (e.g. Tomcat). Alternatively, you can simply copy everything in the war directory into a subdirectory of your servlet container's web-app directory. After deployment, the comparator will be available at http://localhost:8080/TEI-Comparator/TEI_Comparator.html (8080 is the default port and TEI-Comparator is the name of your .war file).
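Since a .war file is an ordinary zip archive, the zipping step can be scripted. The sketch below uses Python's standard zipfile module; the function name build_war is hypothetical and the script is not shipped with the TEI-Comparator. It stores paths relative to the war directory so that its contents (rather than the directory itself) sit at the root of the archive.

```python
import os
import zipfile


def build_war(war_dir, war_path):
    """Zip the contents of war_dir (not the directory itself) into war_path."""
    with zipfile.ZipFile(war_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(war_dir):
            for name in files:
                full = os.path.join(folder, name)
                # Store paths relative to war_dir so WEB-INF sits at the archive root.
                archive.write(full, os.path.relpath(full, war_dir))
    return war_path
```

A call such as build_war("war", "TEI-Comparator.war") would then produce the archive. The output file should be written outside the war directory so that the archive does not end up including itself.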
Set up the database
The TEI-Comparator has primarily been used with HSQL as it is simple and straightforward. However, it should be possible to run it under other database systems such as MySQL.
Setting up and using the HSQL database involves several steps:
- start the HSQL server using "nohup sudo start-hsql.sh &". This should start the HSQL database running in the background.
- start the HSQL manager using "./start-hsql-manager.sh"
- configure the database to use jdbc:hsqldb:hsql://localhost (sql standalone) in the HSQL manager.
- create the database tables by running the SQL script "doc/db/hsql/build_hsql.sql" in the HSQL manager.
(Screenshot of HSQL setup? )
Pre-processing the files
Marking-up input files
The file markup-input-file.sh should be customized to apply IDs to the appropriate elements. The defaults, used for the Holinshed Project for which the TEI-Comparator was initially made, are: <p>, <q>, <sp>, <stage>, <list>, <table>, <lg>, <epigraph>, <byline>, <closer>, and <opener>. These were what we identified as paragraph-like chunks in the Holinshed files that we were using, but your needs may differ. Running this script is non-destructive in that it neither alters ID attributes that are already there nor replaces previous Text-Comparator IDs. The Text-Comparator IDs are added in their own namespace. To run it, use:
markup-input-file.sh [inputFileName] [outputFileName]
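To illustrate what non-destructive ID'ing in a separate namespace means, here is a rough Python sketch. It is not the markup-input-file.sh script and makes no claim about how that script works: the namespace URI, the attribute name, the ID format, and the function names are all hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and attribute name, chosen only for this sketch.
TC_NS = "http://example.org/ns/text-comparator"
TC_ID = "{%s}id" % TC_NS
ELEMENTS = {"p", "q", "sp", "stage", "list", "table", "lg",
            "epigraph", "byline", "closer", "opener"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def add_ids(root, prefix):
    """Add a comparator ID to each paragraph-like element that lacks one.

    Existing attributes (including earlier comparator IDs) are left untouched.
    Returns the number of IDs added.
    """
    ET.register_namespace("tc", TC_NS)
    used = {e.get(TC_ID) for e in root.iter() if e.get(TC_ID)}
    added, counter = 0, 0
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        if local_name(element.tag) in ELEMENTS and TC_ID not in element.attrib:
            counter += 1
            while "%s-%d" % (prefix, counter) in used:
                counter += 1  # never reuse an ID that is already present
            element.set(TC_ID, "%s-%d" % (prefix, counter))
            added += 1
    return added


doc = ET.fromstring('<text><p>One</p><p xml:id="x">Two</p><note>n</note></text>')
added = add_ids(doc, "a")
print(added, ET.tostring(doc, encoding="unicode"))
```

Running add_ids a second time on the same tree adds nothing, which is the behaviour the paragraph above describes as non-destructive.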
Running an initial comparison
In order to pre-load the database with suggested comparisons, an initial comparison must be run. This uses a bespoke fuzzy text comparison algorithm, based on the n-gram overlap approach, that was designed by Arno Mittelbach, the lead technical developer of the TEI-Comparator.
About the Shingle Cloud algorithm
This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack's n-grams against the needle's and constructs a long token string recording where they match. This token string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and, if so, where. To create the token string that will later form the basis of the match operation, it tests each n-gram from the haystack (in the order in which they occur) to see whether it is also present in the set of n-grams extracted from the needle. The result of each comparison is recorded in the token string as a match token if the n-gram is matched, a non-match token if the n-gram does not exist among the needle's n-grams, or a group token if a paragraph boundary was crossed. Furthermore, there is a concept of so-called magic shingles, which allows the insertion of “magic” n-grams that match anything. This was included in the algorithm to achieve better results with paragraphs that were only partly legible. The resulting string is called a shingle cloud, which gave the algorithm its name.
In the creation of the haystack and the matching of the needle, a number of parameters can be adjusted to give the algorithm flexibility in the way it works. These are:
- size of n-grams: The algorithm works with any size of n-grams. The smaller n is, the less the algorithm penalizes the rearrangement of single tokens.
- number of zeros between matches: If a high value is chosen, the algorithm will ignore large inserts without splitting the match.
- minimum number of ones per match: The smaller the value, the higher the risk of matching n-grams that matched only by chance. If the value is too high, one can easily miss smaller matches.
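The description and the three parameters above can be made concrete with a small Python sketch. This is a simplified illustration written for this page, not the TEI-Comparator's actual implementation: it tokenizes on whitespace (an assumption), uses '1' and '0' as the match and non-match tokens, and leaves out group tokens and magic shingles entirely. The function names are hypothetical.

```python
def ngrams(text, n):
    """Return the word n-grams of text, in order of occurrence."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shingle_cloud(needle, haystack, n):
    """Return one token per haystack n-gram: '1' if it occurs in the needle, else '0'."""
    needle_set = set(ngrams(needle, n))
    return "".join("1" if g in needle_set else "0" for g in ngrams(haystack, n))


def find_matches(cloud, max_zeros, min_ones):
    """Return (start, end) index pairs of runs in the cloud that tolerate up to
    max_zeros consecutive '0's and contain at least min_ones '1's."""
    matches = []
    start = last_one = None
    ones = 0
    for i, token in enumerate(cloud):
        if token == "1":
            if start is None:
                start, ones = i, 0
            ones += 1
            last_one = i
        elif start is not None and i - last_one > max_zeros:
            # Too many zeros in a row: close the current run.
            if ones >= min_ones:
                matches.append((start, last_one))
            start = None
    if start is not None and ones >= min_ones:
        matches.append((start, last_one))
    return matches


needle = "the king rode to london with his army"
haystack = "in that year the king rode to london with his army and was crowned"
cloud = shingle_cloud(needle, haystack, 2)
print(cloud)                      # 0001111111000
print(find_matches(cloud, 1, 3))  # [(3, 9)]
```

Raising max_zeros lets a run survive longer insertions, while raising min_ones discards short runs that may have matched by chance, mirroring the trade-offs listed above.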
The algorithm runs in linear time. The comparison may still take a long time for large texts like Holinshed's Chronicles (it takes about 200 ms to match one paragraph, and Holinshed's Chronicles consist of roughly 20,000 paragraphs), but this initial comparison only needs to be run once. Comparisons made in the frontend web interface use the same technique but only compare a single item to all the others, and so are much quicker than the initial comparison, which compares all the items in one edition with all the items in the other. This approach works well for paragraph-like chunks of text; however, it may lose its efficiency if used for much smaller textual components, such as individual words, which will have many matches.
A first evaluation, in which we compared the revised set of matches (created by the project's main research assistant using the TEI-Comparator) to the initial comparison proposed by the TEI-Comparator, shows that the comparison worked quite well. In the configuration used, the ShingleCloud algorithm achieved a recall of 92% and a precision of 98%.
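For readers unfamiliar with these two measures, the following Python sketch shows how precision and recall are conventionally computed from a set of proposed matches and a set of confirmed matches. The ID pairs are invented for illustration and are not project data, and the function name is hypothetical.

```python
def precision_recall(proposed, confirmed):
    """Precision: share of proposed matches that were confirmed.
    Recall: share of confirmed matches that had been proposed."""
    proposed, confirmed = set(proposed), set(confirmed)
    correct = len(proposed & confirmed)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(confirmed) if confirmed else 0.0
    return precision, recall


# Invented example pairs of (ID in first file, ID in second file).
proposed = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
confirmed = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
precision, recall = precision_recall(proposed, confirmed)
print(round(precision, 2), recall)
```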
To perform an initial comparison run:
initial-comparison.sh [inputFile1] [inputFile2] [outputFile]
Loading the comparisons into the database
Once the initial comparison has finished, the output file needs to be loaded into the database, so HSQL (or MySQL if you are using that) should already be running. To load the comparisons into the database run:
load-db.sh [comparisonFile]
Deploying the web app
Once the database has been primed with initial comparisons, the Web ARchive (.war) file can be built and deployed. To do this using a running Tomcat you can simply put a .war archive (a zip of the contents of the 'war' directory) into the Tomcat 'webapps/' directory. The default location for the TEI-Comparator will be: http://localhost:8080/TEI-Comparator/TEI-Comparator.html
Backing up modified comparisons
To back up the current state of the database of links and annotations there is a db2comparison.sh script. You need to provide the names of the input files concerned and an output file. Hence you can dump the database with:
db2comparison.sh [inputFile1] [inputFile2] [outputFile]
It might be good practice to have a cron-job output this to a file on a regular basis during active periods of use.
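One way to keep successive dumps from overwriting each other is to give each output file a timestamped name. The Python wrapper below is only a sketch: the script location (./db2comparison.sh in the current directory), the backup file naming scheme, and the function names are assumptions, and only the argument order is taken from the command shown above.

```python
import datetime
import subprocess


def backup_command(input1, input2, backup_dir, now=None):
    """Build the db2comparison.sh call with a timestamped output file name."""
    now = now or datetime.datetime.now()
    output = "%s/comparison-%s.xml" % (backup_dir.rstrip("/"),
                                       now.strftime("%Y%m%d-%H%M%S"))
    # Assumes the script is in the current working directory.
    return ["./db2comparison.sh", input1, input2, output]


def run_backup(input1, input2, backup_dir):
    """Run the dump and raise an error if the script exits with a failure."""
    subprocess.run(backup_command(input1, input2, backup_dir), check=True)
```

A cron entry could then invoke a small script that calls run_backup with the two input files and a backup directory.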
Using the TEI-Comparator front-end
The TEI-Comparator comes with a web interface allowing someone to confirm, remove, annotate, or create new links between one edition and the other. This interface was constructed in Java using the Google Web Toolkit API. It has been tested running under the Java servlet container Apache Tomcat, allowing a research assistant to confirm or create new links between the two texts, provide notes, and otherwise interact with the TEI-Comparator as a web application. The TEI-Comparator can be run locally on one's own machine or made available online.
The meanings of the background colors
The current status of a paragraph is indicated by its background color. The defaults are:
- White Background: No match is proposed for this paragraph
- Yellow Background: There is an as-yet unconfirmed match for this paragraph
- Green Background: One or more matches have been confirmed for this paragraph
- Blue Background: This paragraph is currently selected.
Confirming matches
To confirm a match, select the paragraph to match on the left-hand side, double-click on it, select the proposed match on the right-hand side that you wish to confirm, and choose the 'Confirm Match' item from the 'Match' menu. If multiple proposed matches from pre-processing are detected for a paragraph, these appear as tabs. Each needs to be selected and confirmed (or deleted) separately.
Deleting matches
To delete an unconfirmed match, select the paragraph to match on the left-hand side, double-click on it, select the proposed match on the right-hand side that you wish to delete, and choose the 'Delete Match' item from the 'Match' menu. If multiple proposed matches from pre-processing are detected for a paragraph, these appear as tabs. Each needs to be selected and deleted separately.
Adding new (or additional) links
If one searches for more matches using the 'Find Additional Matches' menu and selecting 'Propose Matches', the TEI-Comparator will propose more matches based on a threshold set in the configuration file. The relationship between matched paragraphs is not necessarily one-to-one but really many-to-many, since multiple forms of fragmentation and rearrangement can take place in both editions. The beginning of the text of each proposed match is displayed on the right-hand side, and currently confirmed matches are shaded in green. More information about any individual proposed match is available, including the entire text of the match or links out to a static version of the text or images, if available. The matches are ordered by their indirect ranking percentage, but their algorithmic direct ranking is also shown. The direct ranking is the number of matching n-grams in a match divided by the number of n-grams, whereas the indirect ranking is the number of matching n-grams divided by the number of n-grams in the needle (that is, the paragraph-like object we are looking for).
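The two rankings can be written out as a short Python sketch. It assumes that the denominator of the direct ranking is the number of n-grams in the matched span of the haystack, which is one reading of the definition above; the function and parameter names are hypothetical.

```python
def rankings(matching, match_total, needle_total):
    """Return (direct, indirect) rankings as fractions between 0 and 1.

    matching     -- number of matching n-grams in the match
    match_total  -- number of n-grams in the matched span (assumed denominator)
    needle_total -- number of n-grams in the needle
    """
    direct = matching / match_total if match_total else 0.0
    indirect = matching / needle_total if needle_total else 0.0
    return direct, indirect


# A match with 6 matching n-grams inside a span of 8, for a needle of 12 n-grams.
print(rankings(6, 8, 12))  # (0.75, 0.5)
```

Under this reading, a short but dense match can score a high direct ranking while covering only part of the needle, which is reflected in a lower indirect ranking.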
Manual Navigation or Searching
In some cases the TEI-Comparator may not find the correct match. In the Holinshed project this was owing to a large number of illegible passages marked as such in some paragraphs. In these cases it might be more efficient to have the user select these matches manually, either by navigating to the correct paragraph or by doing a literal string search for some of the text in the paragraph to be matched. These are available as the 'Manual Navigation' and 'Manual Search' menu items on the 'Find Additional Matches' menu.
Highlighting Matching Text
Once a candidate match has been selected, the user is able to highlight the matching words of the source and matching paragraphs in order to help confirm that this match is accurate. This also helps with such tasks as finding the location of a short paragraph in an originally longer or unfragmented paragraph. The algorithm responsible for highlighting the matching words is an adapted version of Greedy String Tiling (GST) that was changed to work on XML input. To highlight the matching text in the base text, use the 'Highlight Match in Source' menu item from the 'Highlight' menu on the right-hand side. The matching text will be highlighted by coloring its background bright green.
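For orientation, a basic form of Greedy String Tiling on plain word lists can be sketched in Python as follows. This is a straightforward textbook-style variant written for this page, not the adapted XML-aware version used by the TEI-Comparator, and the function name and min_length parameter are assumptions. It repeatedly finds the longest common runs of still-unmarked words and marks them as tiles.

```python
def greedy_string_tiling(a, b, min_length=2):
    """Return tiles (start_a, start_b, length) of matching, non-overlapping word runs."""
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best, candidates = min_length, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best:
                    best, candidates = k, [(i, j, k)]
                elif k == best:
                    candidates.append((i, j, k))
        if not candidates:
            break
        for i, j, k in candidates:
            # Skip candidates that overlap a tile placed earlier in this round.
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                marked_a[i:i + k] = [True] * k
                marked_b[j:j + k] = [True] * k
                tiles.append((i, j, k))
    return tiles


source = "the king rode to london".split()
match = "to london the king rode".split()
print(greedy_string_tiling(source, match))  # [(0, 2, 3), (3, 0, 2)]
```

Each tile identifies a run of words that could be highlighted in both paragraphs, even when the runs appear in a different order.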
Annotating links or items
Any paragraph that is being matched (in either edition) can be annotated. In addition, a link between the two paragraphs can be annotated. To annotate a paragraph in the base text on the left-hand side, select the paragraph and then select the 'Add Note to Paragraph' menu item from the 'Selected Paragraph' menu. To add a note to a matched paragraph on the right-hand side, select the paragraph and then use the 'Add Note to Paragraph' menu item from the 'Match' menu. To add a note to the link between these two paragraphs instead, select both paragraphs and then use the 'Add Note to Match' menu item from the right-hand 'Match' menu.
Troubleshooting
The FAQs and Known Issues sections of the documentation will be filled out when we receive more bug reports on the TEI-Comparator.
FAQs
[none yet]
Known Issues
[none yet]
ToDos
- Putting the configuration of the initial comparison into a config file
- Putting the configuration of shingle cloud into a config file
