Tokenized Text Aligner Code
Aligns tokens in two versions of a text with differing tokenization.
Status: Alpha
Brought to you by:
rainsfordtm
| File | Date | Author | Commit |
|---|---|---|---|
| lib | 2024-07-31 | | [210f15] Modified error message for fixed aligner if tok... |
| sample | 2020-01-20 | | [5d3881] Added sample files |
| README.TXT | 2023-08-09 | | [6ec35d] Added XML-W as file format |
| tta.py | 2024-07-30 | | [c877b7] Added mismatch option to inject from multiple B... |
############################################################################
# Tokenized Text Aligner #
# (c) T. M. Rainsford, Universität Stuttgart, 2015-2023 #
############################################################################
All the software in this repository is released free of charge under the GNU
General Public License, version 3.
http://www.gnu.org/licenses/gpl-3.0.de.html
By downloading and running this software, you agree to the terms set out in the
license.
OVERVIEW
========
The tokenized text aligner performs two main functions:
1. Alignment: Token-by-token alignment of two files containing broadly "the
same" text.
2. Injection: Copying of non-word properties from the aligned text into the
base text.
The script currently supports four file formats: headerless CSV/TSV, CSV with
header, CoNLL-U as defined at https://universaldependencies.org/format.html,
and XML-W.
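To illustrate the two functions, here is a minimal Python sketch. It is not the
code used in tta.py. It assumes the simplest possible case, in which both token
lists spell out exactly the same characters and differ only in where the token
boundaries fall. The names align_by_offsets and inject_annotations, and the "+"
joiner used when several B tokens map onto one A token, are hypothetical and
chosen only for this example.

```python
def align_by_offsets(tokens_a, tokens_b):
    """Group tokens of A and B that cover the same stretch of characters.

    Assumption: both token lists concatenate to the same string.
    Returns a list of (indices_in_a, indices_in_b) pairs.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("texts differ in more than tokenization")
    groups = []
    i = j = 0
    while i < len(tokens_a) and j < len(tokens_b):
        group_a, group_b = [i], [j]
        len_a, len_b = len(tokens_a[i]), len(tokens_b[j])
        # Extend whichever side is shorter until both cover the same span.
        while len_a != len_b:
            if len_a < len_b:
                i += 1
                group_a.append(i)
                len_a += len(tokens_a[i])
            else:
                j += 1
                group_b.append(j)
                len_b += len(tokens_b[j])
        groups.append((group_a, group_b))
        i += 1
        j += 1
    return groups


def inject_annotations(groups, annotations_b, size_a, joiner="+"):
    """Copy one annotation per B token onto the aligned A tokens.

    Joining several B annotations with `joiner` is an illustrative choice,
    not a description of how the TTA resolves such cases.
    """
    injected = [None] * size_a
    for group_a, group_b in groups:
        value = joiner.join(annotations_b[j] for j in group_b)
        for i in group_a:
            injected[i] = value
    return injected


base = ["can", "not", "go"]        # text A (base text)
other = ["cannot", "go"]           # text B (text to align)
groups = align_by_offsets(base, other)
tags = inject_annotations(groups, ["AUX", "VERB"], len(base))
```

Here the first two base tokens are grouped with the single token "cannot" and
both receive its tag. Real texts rarely match character for character, which is
why the TTA relies on a diff rather than on plain offsets.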
INSTRUCTIONS FOR USE
====================
1. Installation
2. Preparing files
3. Alignment and injection
4. Interpreting the alignment data
5. Getting the best results
1. Installation
****************
- Download the source code.
- Ensure that you have Python 3 (minimum: Python 3.2) installed.
- Launch the script "tta.py" using the Python 3 interpreter. The
program's root directory must be either the current working directory or on
your Python Path.
$python3 tta.py -h
2. Preparing files
******************
- The texts to be processed must be saved in one of four formats:
1. Headerless CSV/TSV: first column contains a unique ID and second column
contains the tokens. Subsequent columns may contain annotation.
Delimiter must be comma, quote character must be the double quotation mark (").
2. CSV with header. Must contain a column "id" with a unique ID and
a column "word" containing the token. Delimiter must be comma, quote
character must be the double quotation mark (").
3. CoNLL-U file as defined at https://universaldependencies.org/format.html.
4. NEW Aug 2023: XML-W file. Each <w> element must be on a single line,
and only ONE <w> element is allowed per line. The word must be stored
as a single text node situated directly before the closing </w> element
and any tags to inject must be stored as attributes in the <w> element node.
- On all platforms, the default encoding is UTF-8. You may override this
default with the "-e" argument on the command line. Encoding must be supported
by Python (see https://docs.python.org/3.5/library/codecs.html#standard-encodings).
- You can find sample files in .conllu and .csv format in the "sample" folder.
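The two CSV layouts described above can be checked with Python's standard csv
module before passing a file to the aligner. This is a sketch for validating
your own input, not the TTA's reader; the function names and the sample rows
are invented for illustration.

```python
import csv
import io


def read_headerless(stream):
    """Return (id, token, annotations) for a headerless CSV stream."""
    reader = csv.reader(stream, delimiter=",", quotechar='"')
    return [(row[0], row[1], row[2:]) for row in reader if row]


def read_with_header(stream):
    """Return (id, token, annotations) for a CSV stream with a header row.

    The header must contain the columns "id" and "word".
    """
    reader = csv.DictReader(stream, delimiter=",", quotechar='"')
    rows = []
    for row in reader:
        extra = {k: v for k, v in row.items() if k not in ("id", "word")}
        rows.append((row["id"], row["word"], extra))
    return rows


# A comma token has to be quoted, because the comma is also the delimiter.
headerless = io.StringIO('1,Hello,INTJ\n2,",",PUNCT\n')
with_header = io.StringIO('id,word,lemma\n1,dogs,dog\n')

plain_rows = read_headerless(headerless)
header_rows = read_with_header(with_header)
```

For a file on disk, open it with the encoding you intend to pass via "-e"
(UTF-8 by default) and with newline="" as the csv module recommends.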
3. Alignment and injection
**************************
The TTA can align, inject or align and then inject depending on the
command line arguments passed.
Align Mode
----------
Pass two positional arguments (base text and text to align) and do not
specify anything to inject:
$python3 tta.py base-text.conllu text-to-align.conllu
The only output is a .csv file showing the alignment between the two texts, by
default called "aligned.csv". This file name can be changed by using the
--alignout argument, e.g.:
$python3 tta.py base-text.conllu text-to-align.conllu --alignout my-alignment-data.csv
Further arguments can be passed to fine-tune the alignment process:
--variants, --threshold, --caps and --ratio (see below "Getting the best results").
Inject Mode
-----------
Pass three positional arguments: base text, text to align, and a .csv file
containing alignment data in the same format as that produced in align
mode. Also, set either the -I flag (inject all) or the --inject argument.
- To inject all columns (except ID and WORD) in a CoNLL-U file:
$python3 tta.py -I base-text.conllu text-to-align.conllu alignment-data.csv
- To inject only column 3 (LEMMA) in a CoNLL-U file:
$python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --inject 3
The only output is a new version of the base text with the extra data injected,
by default called "out.conllu" or "out.csv", depending on the input format.
This file name can be changed by using the --output argument, e.g.:
- To inject only column 3 (LEMMA) in a CoNLL-U file, saving it as "new-text.conllu":
$python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --output new-text.conllu --inject 3
Align > Inject
--------------
Pass two positional arguments and set either -I flag (inject all) or use the
--inject argument:
- To inject all columns (except ID and WORD) in a CoNLL-U file:
$python3 tta.py -I base-text.conllu text-to-align.conllu
- To inject only column 3 (LEMMA) in a CoNLL-U file:
$python3 tta.py base-text.conllu text-to-align.conllu --inject 3
Two output files will be produced: a .csv containing the alignment data (as in
Align mode) and a new version of the base text with the extra data injected
(as in Inject mode).
A note on file types
--------------------
- The aligner will try to guess the format of the two input texts from the
filename extension, but this can be overridden with the --aformat and --bformat
arguments. In particular ".csv" is assumed by default to indicate a HEADERLESS
.csv file, so all CSV files with headers must be specified as such on the
command line, e.g.:
- align two headered CSVs and inject the column "lemma":
$python3 tta.py base-text.csv text-to-align.csv --aformat csv_header --bformat csv_header --inject lemma
- There is no requirement for the base text and the text to align to have the
same format; however, the output will be in the same format as the base-text.
4. Interpreting the alignment data
**********************************
The aligner maps text B onto text A, and outputs a five-column CSV file
containing:
- column 1: text A ID (CoNLL: line number - 1)
- column 2: text A token
- column 3: corresponding text B ID(s) (CoNLL: line number - 1)
- column 4: corresponding text B token(s)
- column 5: human-readable notes
The notes are quite self-explanatory:
TOKENIZATION DIFFERENCES:
- "absent": text A token not present in B
- "tokenization_a": one token in B = two tokens in A
- "tokenization_b": one token in A = two tokens in B
- "add_b_tokens": B contains additional tokens after the matched tokens
which are not present in A.
- "add_b_tokens_before": B contains additional tokens before the matched
token which are not present in A.
- MISMATCH. There's a tokenization difference, but the aligner can't be
more specific.
WITHIN-TOKEN DIFFERENCES:
- '"x" is not "y"': string "x" in A is replaced by string "y" in B.
- '"x" missing in b': string "x" in A is missing in B.
- 'additional "y" in b': string "y" in B is missing in A.
The alignment data file can be manually edited to correct mistakes and then
reloaded by the TTA in "Inject" mode.
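When checking a large alignment file by hand, it can help to list only the
rows that carry a note. The sketch below assumes the five-column layout
described above, a comma delimiter and no header row; whether the file written
by the TTA has a header row is not stated here, so adjust accordingly. The
function name and the sample rows are invented for illustration.

```python
import csv
import io


def rows_to_review(stream):
    """Return the alignment rows whose notes column (column 5) is not empty."""
    flagged = []
    for row in csv.reader(stream):
        if len(row) < 5:
            continue  # skip blank or incomplete lines
        a_id, a_token, b_ids, b_tokens, notes = row[:5]
        if notes.strip():
            flagged.append((a_id, a_token, b_ids, b_tokens, notes))
    return flagged


# Invented sample: "can" + "not" in text A correspond to "cannot" in text B.
sample = io.StringIO(
    "1,can,1,cannot,tokenization_a\n"
    "2,not,1,cannot,tokenization_a\n"
    "3,go,2,go,\n"
)
flagged = rows_to_review(sample)
for a_id, a_token, b_ids, b_tokens, notes in flagged:
    print(a_id, a_token, "->", b_ids, b_tokens, "[" + notes + "]")
```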
5. Getting the best results
***************************
5.1 Use similar texts
---------------------
The more similar the texts are, the better the aligner works. If you are
aware of consistent differences (e.g. text B has no capital letters or no
diacritics), it is best to eliminate them BEFORE passing the text to the
aligner.
5.1.1 The similarity ratio
--------------------------
Large, dissimilar texts may take a long time for the diff algorithm to process.
Before beginning the alignment, the TTA makes a rough estimate of how
similar the texts are on a scale of 0 (totally different) to 99 (identical).
If this figure is below 90, the aligner aborts with an error message.
This can be overridden by setting the --ratio on the command line to a lower
value. However, if the aligner estimates similarity to be substantially below 90,
it's worth first checking WHY the texts are so dissimilar and trying to improve
the situation:
- check the file names are correct;
- check that the files contain the same extract of a text.
The only case of dissimilarity that the aligner will handle well
is where one of the two texts contains only the start of the other. Here, the
"--ratio" argument can be used to lower the threshold and there are no
performance issues.
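A rough similarity estimate of this kind can be obtained cheaply with Python's
difflib. The sketch below is an assumption about how such a check could be
written, not the TTA's own code: it uses SequenceMatcher.quick_ratio(), which
returns an upper bound on the true ratio, and scales it to a percentage. The
function names and the default minimum of 90 mirror the description above but
are otherwise hypothetical.

```python
import difflib


def similarity_percent(text_a, text_b):
    """Quick upper-bound estimate of similarity, scaled to 0..100."""
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    return int(matcher.quick_ratio() * 100)


def check_similarity(text_a, text_b, minimum=90):
    """Raise ValueError if the estimated similarity is below `minimum`."""
    score = similarity_percent(text_a, text_b)
    if score < minimum:
        raise ValueError(
            "estimated similarity %d is below the minimum %d" % (score, minimum)
        )
    return score


full_text = "the cat sat on the mat"
start_only = "the cat sat"

# One text contains only the start of the other, so the estimate is low
# and the threshold has to be lowered explicitly.
score = check_similarity(start_only, full_text, minimum=50)
```

Because quick_ratio() only compares character counts, it is fast even on large
files, but it can overestimate similarity when the same characters occur in a
different order.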
5.1.2 Caps and variants
-----------------------
If you don't wish to modify the input texts, but don't want to have every
single divergence signalled, you can use:
- the "-c" argument to suppress messages signalling capitalization
mismatches;
- a "variants" file. This is a two column CSV file containing:
- column 1: standardized character
- column 2: regex denoting characters to standardize to this
character.
Both of these suppress the "within-token differences" messages in the output,
but do not affect the underlying text on which the alignment is performed.
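As an illustration of how a variants file could be applied when comparing two
tokens, here is a minimal sketch. It follows the two-column layout described
above; everything else (the function names, the sample patterns, and the use
of lower-casing to ignore capitalization) is an assumption made for the
example and may differ from what tta.py does.

```python
import csv
import io
import re


def load_variants(stream):
    """Read (standardized character, compiled regex) pairs from a CSV stream."""
    variants = []
    for row in csv.reader(stream):
        if len(row) >= 2:
            variants.append((row[0], re.compile(row[1])))
    return variants


def normalize(token, variants, ignore_caps=False):
    """Return a comparison form of `token`; the token itself is unchanged."""
    for standard, pattern in variants:
        # A function replacement avoids backslash handling in `standard`.
        token = pattern.sub(lambda match, s=standard: s, token)
    return token.lower() if ignore_caps else token


def differs(token_a, token_b, variants, ignore_caps=False):
    """True if the tokens still differ after normalization."""
    return normalize(token_a, variants, ignore_caps) != normalize(
        token_b, variants, ignore_caps
    )


# Invented variants file: accented e-forms count as "e", long s counts as "s".
variants = load_variants(io.StringIO("e,[éèê]\ns,ſ\n"))
```

With these variants, "été" and "ete" are treated as identical, while "Roi" and
"roi" are only treated as identical when ignore_caps is set.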
5.2 Split your text in half
---------------------------
The underlying diff algorithm is quadratic time in the worst case
(https://docs.python.org/3.5/library/difflib.html). A long text may therefore
be processed more quickly by splitting it into two parts and aligning each
part separately.
5.3 The threshold parameter
---------------------------
The diff is run using two passes. The first scans the entire text ignoring common
characters ("junk"), and identifies all sequences of length "threshold" which
do not contain common characters. These are treated as definite matches, and
the second pass fills in the gaps between them, this time including common
characters. This two-pass approach provides a substantial timing improvement
over a one-pass, no-junk approach.
- If the threshold is too low, there is a risk that repeated sequences
will be incorrectly matched in the first pass, and the alignment will be
inaccurate.
- If the threshold is too high, there is a risk that very few sequences
will be matched in the first pass, and processing time on the second
pass will be very high.
The default threshold is 20 characters.
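The two-pass idea can be sketched with difflib.SequenceMatcher. This is a
simplified illustration under stated assumptions, not the TTA's
implementation: the space character stands in for the "common characters", the
function name two_pass_blocks is hypothetical, and difflib may still extend a
first-pass match across adjacent junk characters, so the anchors are not
strictly junk-free.

```python
import difflib


def two_pass_blocks(a, b, threshold=20, junk=" "):
    """Return matching blocks (start_a, start_b, size) found in two passes."""
    # Pass 1: diff the whole text, treating `junk` characters as junk, and
    # keep only long matches as definite anchors.
    first = difflib.SequenceMatcher(lambda ch: ch in junk, a, b, autojunk=False)
    anchors = [m for m in first.get_matching_blocks() if m.size >= threshold]

    blocks = []
    pos_a = pos_b = 0
    # A zero-length sentinel makes the loop also process the final gap.
    for start_a, start_b, size in list(anchors) + [(len(a), len(b), 0)]:
        # Pass 2: diff only the gap before this anchor, with no junk.
        gap = difflib.SequenceMatcher(
            None, a[pos_a:start_a], b[pos_b:start_b], autojunk=False
        )
        for g in gap.get_matching_blocks():
            if g.size:
                blocks.append((pos_a + g.a, pos_b + g.b, g.size))
        if size:
            blocks.append((start_a, start_b, size))
        pos_a, pos_b = start_a + size, start_b + size
    return blocks


text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown fox jumped over the lazy dog"
blocks = two_pass_blocks(text_a, text_b, threshold=5)
```

Each gap diff in the second pass works on a short slice, which is where the
time saving comes from. Raising the threshold leaves fewer anchors and
therefore longer gaps.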
5.4 Align, Check, Inject
------------------------
- The aligner is intended to be 100% accurate when the ONLY differences
between the two texts are due to token division.
- When there are other differences (e.g. capitalization, use of punctuation,
diacritics), the aligner is not 100% accurate, and it is strongly recommended
to check and correct the alignment data manually using the "notes" column
before injecting.
- The aligner is particularly liable to produce inaccurate results:
(i) where the same sequence of tokens is repeated in close proximity,
causing the aligner to match the FIRST instance in the base
text to the SECOND in the text to align, or vice versa;
(ii) where sections of text are in the wrong order. Once a section of
text has been marked as "missing", it will not be identified should it
occur later in the file.
- Occasionally the underlying diff just "gets lost": a long sequence of A tokens
is signalled as absent, and a long sequence of B tokens is signalled as "to
add".