File Date Author Commit
 lib 2024-07-31 Tom Rainsford Tom Rainsford [210f15] Modified error message for fixed aligner if tok...
 sample 2020-01-20 Tom Rainsford Tom Rainsford [5d3881] Added sample files
 README.TXT 2023-08-09 Tom Rainsford Tom Rainsford [6ec35d] Added XML-W as file format
 tta.py 2024-07-30 Tom Rainsford Tom Rainsford [c877b7] Added mismatch option to inject from multiple B...

Read Me

############################################################################
# Tokenized Text Aligner                                                   #
# (c) T. M. Rainsford, Universität Stuttgart, 2015-2023                    #
############################################################################

All the software in this repository is released free of charge under the GNU
General Public License, version 3.
http://www.gnu.org/licenses/gpl-3.0.de.html

By downloading and running this software, you agree to the terms set out in the
license.

OVERVIEW
========
The tokenized text aligner performs two main functions:

	1. Alignment: Token-by-token alignment of two files containing broadly "the
	same" text.
	2. Injection: Copying of non-word properties from the aligned text into the
base text.

The script currently supports four file formats: headerless CSV/TSV, CSV with header,
CoNLL-U as defined at https://universaldependencies.org/format.html, and XML-W.

INSTRUCTIONS FOR USE
====================

1. Installation
2. Preparing files
3. Alignment and injection
4. Interpreting the alignment data
5. Getting the best results

1. Installation
****************
	- Download the source code.
	- Ensure that you have Python 3 (minimum: Python 3.2) installed.
	- Launch the script "tta.py" using the Python 3 interpreter. The
program's root directory must be either the current working directory or on
your Python Path.

	$python3 tta.py -h

2. Preparing files
******************
- The texts to be processed must be saved in one of four formats:
	1. Headerless CSV/TSV: first column contains a unique ID and second column
	contains the tokens. Subsequent columns may contain annotation. 
	Delimiter must be comma, quote character must be double quotation mark (").
	2. CSV with header. Must contain a column "id" with a unique ID and
	a column "word" containing the token. Delimiter must be comma, quote
	character must be double quotation mark (").
	3. CoNLL-U file as defined at https://universaldependencies.org/format.html. 
	4. NEW Aug 2023: XML-W file. Each <w> element must be on a single line,
	and only ONE <w> element is allowed per line. The word must be stored
	as a single text node situated directly before the closing </w> tag,
	and any tags to inject must be stored as attributes of the <w> element.
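
As an illustration of the XML-W layout described above, the following is a
minimal sketch (not the TTA's own parser) that reads one such line with the
Python standard library. The function name "parse_w_line" and the attribute
names "id", "lemma" and "pos" are hypothetical, and the sketch assumes that
the <w> element has no child elements.

```python
import xml.etree.ElementTree as ET

def parse_w_line(line):
    """Return (word, attributes) for a line holding exactly one <w> element."""
    elem = ET.fromstring(line.strip())
    if elem.tag != "w":
        raise ValueError("expected a <w> element")
    # With no child elements, the word is the element's only text node.
    return elem.text, dict(elem.attrib)

# Hypothetical example line; the attributes are what would be injected.
word, attrs = parse_w_line('<w id="w1" lemma="chanter" pos="VER">chante</w>')
print(word, attrs)
```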

- On all platforms, the default encoding is UTF-8. You may override this
default with the "-e" argument on the command line. Encoding must be supported
by Python (see https://docs.python.org/3.5/library/codecs.html#standard-encodings).
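
To check in advance whether an encoding name is known to your Python
installation, a short sketch along the following lines may help. It uses only
the standard "codecs" module; the helper name "is_supported_encoding" is an
assumption made for this example and is not part of the TTA.

```python
import codecs

def is_supported_encoding(name):
    """Return True if Python can find a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

print(is_supported_encoding("utf-8"))
print(is_supported_encoding("latin-1"))
print(is_supported_encoding("no-such-encoding"))
```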

- You can find sample files in .conllu and .csv format in the "sample" folder. 

3. Alignment and injection
**************************
The TTA can align, inject or align and then inject depending on the
command line arguments passed.

Align Mode
----------
Pass two positional arguments (base text and text to align) and do not
specify anything to inject:

	$python3 tta.py base-text.conllu text-to-align.conllu
	
The only output is a .csv file showing the alignment between the two texts, by 
default called "aligned.csv". This file name can be changed by using the
--alignout argument, e.g.:

	$python3 tta.py base-text.conllu text-to-align.conllu --alignout my-alignment-data.csv
	
Further arguments can be passed to fine-tune the alignment process:
--variants, --threshold, --caps and --ratio (see below "Getting the best results").
	
Inject Mode
-----------
Pass three positional arguments: base text, text to align, and .csv file 
containing alignment data in the same format as that produced in align
mode. Also, set either the -I flag (inject all) or use the --inject argument.

	- To inject all columns (except ID and WORD) in a CoNLL-U file:
		$python3 tta.py -I base-text.conllu text-to-align.conllu alignment-data.csv
		
	- To inject only column 3 (LEMMA) in a CoNLL-U file:
		$python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --inject 3
		
The only output is a new version of the base text with the extra data injected,
by default called "out.conllu" or "out.csv", depending on the input format.
This file name can be changed by using the --output argument, e.g.:

	- To inject only column 3 (LEMMA) in a CoNLL-U file, saving it as "new-text.conllu":
		$python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --output new-text.conllu --inject 3

Align > Inject
--------------
Pass two positional arguments and set either the -I flag (inject all) or use the
--inject argument:

	- To inject all columns (except ID and WORD) in a CoNLL-U file:
		$python3 tta.py -I base-text.conllu text-to-align.conllu
		
	- To inject only column 3 (LEMMA) in a CoNLL-U file:
		$python3 tta.py base-text.conllu text-to-align.conllu --inject 3
		
Two output files will be produced: a .csv containing the alignment data (as in
Align mode) and a new version of the base text with the extra data injected
(as in Inject mode).

A note on file types
--------------------
- The aligner will try to guess the format of the two input texts from the 
filename extension, but this can be overridden with the --aformat and --bformat
arguments. In particular ".csv" is assumed by default to indicate a HEADERLESS
.csv file, so all CSV files with headers must be specified as such on the
command line, e.g.:
	
	- align two headered CSVs and inject the column "lemma":
	$python3 tta.py base-text.csv text-to-align.csv --aformat csv_header --bformat csv_header --inject lemma
	
- There is no requirement for the base text and the text to align to have the
same format; however, the output will be in the same format as the base-text.
	
4. Interpreting the alignment data
**********************************
The aligner maps text B onto text A, and outputs a five-column CSV file
containing:
	- column 1: text A ID (CoNLL: line number - 1)
	- column 2: text A token
	- column 3: corresponding text B ID(s) (CoNLL: line number - 1)
	- column 4: corresponding text B token(s)
	- column 5: human-readable notes
	
The notes are largely self-explanatory:

TOKENIZATION DIFFERENCES:
	- "absent": text A token not present in B
	- "tokenization_a": one token in B = two tokens in A
	- "tokenization_b": one token in A = two tokens in B
	- "add_b_tokens": B contains additional tokens after the matched tokens
	which are not present in A.
	- "add_b_tokens_before": B contains additional tokens before the matched
	token which are not present in A.
	- MISMATCH. There's a tokenization difference, but the aligner can't be
	more specific.

WITHIN-TOKEN DIFFERENCES:
	- '"x" is not "y"': string "x" in A is replaced by string "y" in B.
	- '"x" missing in b': string "x" in A is missing in B.
	- 'additional "y" in b': string "y" in B is missing in A.
	
The alignment data file can be manually edited to correct mistakes and then
reloaded by the TTA in "Inject" mode.
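
When checking the alignment data by hand, it can help to list only the rows
that carry a note. The following is a minimal sketch, assuming the file has
the five columns described above in that order and no header row; the
function name "rows_with_notes" and the sample rows are invented for
illustration.

```python
import csv
import io

def rows_with_notes(csv_file):
    """Yield alignment rows whose fifth column (notes) is not empty."""
    for row in csv.reader(csv_file):
        if len(row) >= 5 and row[4].strip():
            yield row

# Build a small in-memory sample; with a real file, pass an open file
# object instead, e.g. open("aligned.csv", newline="", encoding="utf-8").
sample = io.StringIO()
csv.writer(sample).writerows([
    ["1", "Li", "1", "Li", ""],
    ["2", "reis", "2", "rois", '"e" is not "o"'],
    ["3", "vint", "", "", "absent"],
])
sample.seek(0)

flagged = list(rows_with_notes(sample))
for row in flagged:
    print(row)
```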
	
5. Getting the best results
***************************

5.1 Use similar texts
---------------------
The more similar the texts are, the better the aligner works. If you are
aware of consistent differences (e.g. text B has no capital letters or no
diacritics), it is best to eliminate them BEFORE passing the text to the 
aligner.

5.1.1 The similarity ratio
--------------------------
Large, dissimilar texts may take a long time for the diff algorithm to process.
Before beginning the alignment, the TTA makes a rough estimate of how
similar the texts are on a scale of 0 (totally different) to 99 (identical).
If this figure is below 90, the aligner aborts with an error message.

This can be overridden by setting the --ratio argument on the command line to a
lower value. However, if the aligner estimates similarity to be substantially
below 90, it's worth first checking WHY the texts are so dissimilar and trying
to improve the situation:
	- check the file names are correct;
	- check that the files contain the same extract of a text.

The only case of dissimilarity that the aligner will handle well
is where one of the two texts contains only the start of the other. Here, the
"--ratio" argument can be used to lower the threshold and there are no
performance issues.
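
To get a feel for this kind of estimate before running the aligner, a rough
figure can be computed with Python's difflib. This is only a sketch: the
function name "rough_similarity" is hypothetical, the TTA's own estimate may
be computed differently, and the scale here runs from 0 to 100.

```python
import difflib

def rough_similarity(tokens_a, tokens_b):
    """Cheap upper-bound similarity of two token lists, as an integer 0-100."""
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    # quick_ratio() ignores token order, so it is fast but optimistic.
    return int(matcher.quick_ratio() * 100)

full = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
start_only = full[:5]  # text B contains only the start of text A

print(rough_similarity(full, full))
print(rough_similarity(full, start_only))
```

A low figure for the second pair illustrates why the threshold has to be
lowered when one text contains only the start of the other.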

5.1.2 Caps and variants
-----------------------
If you don't wish to modify the input texts, but don't want to have every
single divergence signalled, you can use:
	- the "-c" argument to suppress messages signalling capitalization
	mismatches;
	- a "variants" file. This is a two-column CSV file containing:
		- column 1: standardized character
		- column 2: regex denoting characters to standardize to this
		character.
Both of these suppress the "within-token differences" messages in the output,
but do not affect the underlying text on which the alignment is performed.
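
The effect of these two options can be pictured with the sketch below, which
normalizes two tokens before comparing them. It is an illustration under
stated assumptions, not the TTA's implementation: the function names
"load_variants" and "normalise" and the two sample variant rules are
hypothetical.

```python
import csv
import io
import re

def load_variants(csv_file):
    """Read (standardized character, regex) rows into compiled patterns."""
    return [(re.compile(row[1]), row[0])
            for row in csv.reader(csv_file) if len(row) >= 2]

def normalise(token, variants, ignore_caps=False):
    """Apply every variant rule, then optionally ignore capitalization."""
    for pattern, standard in variants:
        token = pattern.sub(standard, token)
    return token.lower() if ignore_caps else token

# Hypothetical variants file: treat u/v as "u" and i/j/y as "i".
variants = load_variants(io.StringIO("u,[uv]\ni,[ijy]\n"))

# Tokens that are equal after normalization would not need a message.
print(normalise("jvs", variants) == normalise("ius", variants))
print(normalise("Roy", variants, ignore_caps=True) == normalise("roi", variants))
```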

5.2 Split your text in half
---------------------------
The underlying diff algorithm runs in quadratic time in the worst case 
(https://docs.python.org/3.5/library/difflib.html). A long text may therefore
be processed more quickly by splitting it into two parts and aligning each
part separately.

5.3 The threshold parameter
---------------------------
The diff is run in two passes. The first scans the entire text ignoring common
characters ("junk"), and identifies all sequences of length "threshold" which
do not contain common characters. These are treated as definite matches, and
the second pass fills in the gaps between them, this time including common
characters. This two-pass approach is substantially faster than a single pass
without junk.

	- If the threshold is too low, there is a risk that repeated sequences
	will be incorrectly matched in the first pass, and the alignment will be
	inaccurate.
	- If the threshold is too high, there is a risk that very few sequences
	will be matched in the first pass, and processing time on the second
	pass will be very high.
	
The default threshold is 20 characters.
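
The two-pass idea can be sketched with difflib as follows. This is a
simplified illustration, not the TTA's code: the function name
"two_pass_blocks" and the choice of a space as the only junk character are
assumptions, and difflib may extend a match across adjacent junk characters,
so the anchors here are not guaranteed to be junk-free.

```python
import difflib

def two_pass_blocks(a, b, threshold=20, junk=" "):
    """Return (start_a, start_b, size) matching blocks found in two passes."""
    # Pass 1: junk characters cannot start a match; keep only blocks of at
    # least `threshold` characters and treat them as definite anchors.
    first = difflib.SequenceMatcher(lambda ch: ch in junk, a, b, autojunk=False)
    anchors = [(m.a, m.b, m.size) for m in first.get_matching_blocks()
               if m.size >= threshold]
    anchors.append((len(a), len(b), 0))  # sentinel closing the last gap

    blocks = []
    pos_a = pos_b = 0
    for start_a, start_b, size in anchors:
        # Pass 2: diff only the gap before this anchor, junk included.
        gap = difflib.SequenceMatcher(None, a[pos_a:start_a], b[pos_b:start_b],
                                      autojunk=False)
        for m in gap.get_matching_blocks():
            if m.size:
                blocks.append((pos_a + m.a, pos_b + m.b, m.size))
        if size:
            blocks.append((start_a, start_b, size))
        pos_a, pos_b = start_a + size, start_b + size
    return blocks

text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown cat jumps over the lazy dog"
blocks = two_pass_blocks(text_a, text_b, threshold=10)
for start_a, start_b, size in blocks:
    print(repr(text_a[start_a:start_a + size]))
```

Because the second pass only ever compares the short gaps between anchors,
the quadratic worst case applies to those gaps rather than to the whole text.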

5.4 Align, Check, Inject 
------------------------
- The aligner is intended to be 100% accurate when the ONLY differences 
between the two texts are due to token division.

- When there are other differences (e.g. capitalization, use of punctuation,
diacritics), the aligner is not 100% accurate, and it is strongly recommended to
check and correct the alignment data manually using the "notes" column before
injecting.

- The aligner is particularly liable to produce inaccurate results:
	(i) where the same sequence of tokens is repeated in close proximity,
	causing the aligner to match the FIRST instance in the base
	text to the SECOND in the text to align, or vice versa;
	(ii) where sections of text are in the wrong order. Once a section of 
	text has been marked as "missing", it will not be identified should it
	occur later in the file.

- Occasionally the underlying diff just "gets lost": a long sequence of A tokens
is signalled as absent, and a long sequence of B tokens is signalled as "to
add".