Tokenized Text Aligner Code
Aligns tokens in two versions of a text with differing tokenization.
Status: Alpha
Brought to you by:
rainsfordtm
File | Date | Author | Commit |
---|---|---|---|
lib | 2024-07-31 | Tom Rainsford | [210f15] Modified error message for fixed aligner if tok... |
sample | 2020-01-20 | Tom Rainsford | [5d3881] Added sample files |
README.TXT | 2023-08-09 | Tom Rainsford | [6ec35d] Added XML-W as file format |
tta.py | 2024-07-30 | Tom Rainsford | [c877b7] Added mismatch option to inject from multiple B... |
############################################################################
# Tokenized Text Aligner                                                   #
# (c) T. M. Rainsford, Universität Stuttgart, 2015-2023                    #
############################################################################

All the software in this repository is released free of charge under the
GNU General Public License, version 3.

http://www.gnu.org/licenses/gpl-3.0.de.html

By downloading and running this software, you agree to the terms set out
in the license.

OVERVIEW
========

The tokenized text aligner performs two main functions:
1. Alignment: Token-by-token alignment of two files containing broadly
   "the same" text.
2. Injection: Copying of non-word properties from the aligned text into
   the base text.

The script currently supports four file formats: headerless CSV/TSV, CSV
with header, CoNLL-U as defined at
https://universaldependencies.org/format.html, and (since August 2023)
XML-W.

INSTRUCTIONS FOR USE
====================

1. Installation
2. Preparing files
3. Alignment and injection
4. Interpreting the alignment data
5. Getting the best results

1. Installation
***************

- Download the source code.
- Ensure that you have Python 3 (minimum: Python 3.2) installed.
- Launch the script "tta.py" using the Python 3 interpreter. The program's
  root directory must be either the current working directory or on your
  Python path.

    $python3 tta.py -h

2. Preparing files
******************

- The texts to be processed must be saved in one of four formats:
    1. Headerless CSV/TSV: first column contains a unique ID and second
       column contains the tokens. Subsequent columns may contain
       annotation. Delimiter must be comma, quote character must be
       speech mark.
    2. CSV with header. Must contain a column "id" with a unique ID and a
       column "word" containing the token. Delimiter must be comma, quote
       character must be speech mark.
    3. CoNLL-U file as defined at
       https://universaldependencies.org/format.html.
    4. NEW Aug 2023: XML-W file.
       Each <w> element must be on a single line, and only ONE <w> element
       is allowed per line. The word must be stored as a single text node
       situated directly before the closing </w> element and any tags to
       inject must be stored as attributes in the <w> element node.
- On all platforms, the default encoding is UTF-8. You may override this
  default with the "-e" argument on the command line. Encoding must be
  supported by Python (see
  https://docs.python.org/3.5/library/codecs.html#standard-encodings).
- You can find sample files in .conllu and .csv format in the "sample"
  folder.

3. Alignment and injection
**************************

The TTA can align, inject, or align and then inject, depending on the
command line arguments passed.

Align Mode
----------

Pass two positional arguments (base text and text to align) and do not
specify anything to inject:

    $python3 tta.py base-text.conllu text-to-align.conllu

The only output is a .csv file showing the alignment between the two
texts, by default called "aligned.csv". This file name can be changed by
using the --alignout argument, e.g.:

    $python3 tta.py base-text.conllu text-to-align.conllu --alignout my-alignment-data.csv

Further arguments can be passed to fine-tune the alignment process:
--variants, --threshold, --caps and --ratio (see below "Getting the best
results").

Inject Mode
-----------

Pass three positional arguments: base text, text to align, and a .csv file
containing alignment data in the same format as the output of align mode.
Also, either set the -I flag (inject all) or use the --inject argument.

- To inject all columns (except ID and WORD) in a CoNLL-U file:

    $python3 tta.py -I base-text.conllu text-to-align.conllu alignment-data.csv

- To inject only column 3 (LEMMA) in a CoNLL-U file:

    $python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --inject 3

The only output is a new version of the base text with the extra data
injected, by default called "out.conllu" or "out.csv", depending on the
input format. This file name can be changed by using the --output
argument, e.g.:

- To inject only column 3 (LEMMA) in a CoNLL-U file, saving it as
  "new-text.conllu":

    $python3 tta.py base-text.conllu text-to-align.conllu alignment-data.csv --output new-text.conllu --inject 3

Align > Inject
--------------

Pass two positional arguments and either set the -I flag (inject all) or
use the --inject argument:

- To inject all columns (except ID and WORD) in a CoNLL-U file:

    $python3 tta.py -I base-text.conllu text-to-align.conllu

- To inject only column 3 (LEMMA) in a CoNLL-U file:

    $python3 tta.py base-text.conllu text-to-align.conllu --inject 3

Two output files will be produced: a .csv containing the alignment data
(as in ALIGN mode) and a new version of the base text with the extra data
injected (as in INJECT mode).

A note on file types
--------------------

- The aligner will try to guess the format of the two input texts from the
  filename extension, but this can be overridden with the --aformat and
  --bformat arguments. In particular, ".csv" is assumed by default to
  indicate a HEADERLESS .csv file, so all CSV files with headers must be
  specified as such on the command line, e.g.:

    - align two headered CSVs and inject the column "lemma":

        $python3 tta.py base-text.csv text-to-align.csv --aformat csv_header --bformat csv_header --inject lemma

- There is no requirement for the base text and the text to align to have
  the same format; however, the output will be in the same format as the
  base text.

4. Interpreting the alignment data
**********************************

The aligner maps text B onto text A, and outputs a five-column CSV file
containing:
- column 1: text A ID (CoNLL: line number - 1)
- column 2: text A token
- column 3: corresponding text B ID(s) (CoNLL: line number - 1)
- column 4: corresponding text B token(s)
- column 5: human-readable notes

The notes are quite self-explanatory:

TOKENIZATION DIFFERENCES:
- "absent": text A token not present in B.
- "tokenization_a": one token in B = two tokens in A.
- "tokenization_b": one token in A = two tokens in B.
- "add_b_tokens": B contains additional tokens after the matched tokens
  which are not present in A.
- "add_b_tokens_before": B contains additional tokens before the matched
  token which are not present in A.
- MISMATCH: there is a tokenization difference, but the aligner cannot be
  more specific.

WITHIN-TOKEN DIFFERENCES:
- '"x" is not "y"': string "x" in A is replaced by string "y" in B.
- '"x" missing in b': string "x" in A is missing in B.
- 'additional "y" in b': string "y" in B is missing in A.

The alignment data file can be manually edited to correct mistakes and
then reloaded by the TTA in "Inject" mode.

5. Getting the best results
***************************

5.1 Use similar texts
---------------------

The more similar the texts are, the better the aligner works. If you are
aware of consistent differences (e.g. text B has no capital letters or no
diacritics), it is best to eliminate them BEFORE passing the text to the
aligner.

5.1.1 The similarity ratio
--------------------------

Large, dissimilar texts may take a long time for the diff algorithm to
process. Before beginning the alignment, the TTA makes a rough estimate of
how similar the texts are on a scale of 0 (totally different) to 99
(identical). If this figure is below 90, the aligner aborts with an error
message. This can be overridden by setting the --ratio argument on the
command line to a lower value.

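As a rough illustration of this kind of pre-check, the sketch below
estimates similarity with Python's difflib module and stops if the score
falls below a minimum. The function names, the integer 0-100 scale and the
choice of quick_ratio() are assumptions made for this example only; the
estimate actually computed by tta.py may differ.

```python
import difflib


def estimate_similarity(tokens_a, tokens_b):
    """Return a rough similarity score from 0 to 100 (hypothetical helper)."""
    # Join the tokens so that differences in token division do not count.
    text_a = "".join(tokens_a)
    text_b = "".join(tokens_b)
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    # quick_ratio() only compares character counts, so it is cheap to
    # compute and gives an upper bound on the full ratio().
    return int(matcher.quick_ratio() * 100)


def check_ratio(tokens_a, tokens_b, minimum=90):
    """Raise ValueError if the texts look too dissimilar to align."""
    score = estimate_similarity(tokens_a, tokens_b)
    if score < minimum:
        raise ValueError(
            "Similarity estimate {} is below the minimum {}".format(score, minimum)
        )
    return score
```

Because quick_ratio() ignores character order, it can only rule out
clearly unrelated texts; it does not guarantee that the full diff will be
fast.
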
However, if the aligner estimates similarity to be substantially below 90,
it is worth first checking WHY the texts are so dissimilar and trying to
improve the situation:
- check the file names are correct;
- check that the files contain the same extract of a text.

The only case of dissimilarity that the aligner will handle well is where
one of the two texts contains only the start of the other. Here, the
"--ratio" argument can be used to lower the threshold and there are no
performance issues.

5.1.2 Caps and variants
-----------------------

If you don't wish to modify the input texts, but don't want to have every
single divergence signalled, you can use:
- the "-c" argument to suppress messages signalling capitalization
  mismatches;
- a "variants" file. This is a two-column CSV file containing:
    - column 1: standardized character
    - column 2: regex denoting characters to standardize to this character.

Both of these suppress the "within-token differences" messages in the
output, but do not affect the underlying text on which the alignment is
performed.

5.2 Split your text in half
---------------------------

The underlying diff algorithm is quadratic time in the worst case
(https://docs.python.org/3.5/library/difflib.html). A long text may
therefore be more quickly processed by splitting it into two parts.

5.3 The threshold parameter
---------------------------

The diff is run using two passes. The first scans the entire text ignoring
common characters ("junk"), and identifies all sequences of length
"threshold" which do not contain common characters. These are treated as
definite matches, and the second pass fills in the gaps between them, this
time including common characters. This two-pass approach provides a
substantial timing improvement on a one-pass, no-junk approach.
- If the threshold is too low, there is a risk that repeated sequences
  will be incorrectly matched in the first pass, and the alignment will be
  inaccurate.
- If the threshold is too high, there is a risk that very few sequences
  will be matched in the first pass, and processing time on the second
  pass will be very high.

The default threshold is 20 characters.

5.4 Align, Check, Inject
------------------------

- The aligner is intended to be 100% accurate when the ONLY differences
  between the two texts are due to token division. If there are other
  kinds of differences, it is strongly recommended to manually check
  alignment results before re-injection.
- When there are other differences (e.g. capitalization, use of
  punctuation, diacritics), the aligner is not 100% accurate, and it is
  recommended to check and correct the alignment data manually using the
  "notes" column before injecting.
- The aligner is particularly liable to produce inaccurate results:
    (i) where the same sequence of tokens is repeated in close proximity,
        causing the aligner to match the FIRST instance in the base text
        to the SECOND in the text to align, or vice versa;
    (ii) where sections of text are in the wrong order. Once a section of
         text has been marked as "missing", it will not be identified
         should it occur later in the file.
- Occasionally the underlying diff just "gets lost": a long sequence of A
  tokens is signalled as absent, and a long sequence of B tokens is
  signalled as "to add".
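
To support the manual check recommended in 5.4, the following sketch lists
the rows of an alignment file whose "notes" column is not empty. It is not
part of tta.py: the function name is hypothetical, and it assumes that the
alignment file is a comma-delimited, headerless CSV with the five columns
described in section 4 and that rows without a note have an empty fifth
column.

```python
import csv


def rows_to_check(alignment_file):
    """Return (a_id, a_token, b_tokens, note) for every row with a note.

    alignment_file is an open text file (or any iterable of lines)
    holding five-column alignment data; this layout is an assumption.
    """
    flagged = []
    for row in csv.reader(alignment_file):
        if len(row) < 5:
            # Skip blank or incomplete lines rather than guessing.
            continue
        a_id, a_token, b_ids, b_tokens, note = row[:5]
        if note.strip():
            flagged.append((a_id, a_token, b_tokens, note))
    return flagged


if __name__ == "__main__":
    # "aligned.csv" is the default alignment output name given above.
    with open("aligned.csv", newline="", encoding="utf-8") as handle:
        for a_id, a_token, b_tokens, note in rows_to_check(handle):
            print(a_id, a_token, b_tokens, note, sep="\t")
```

Reviewing only the flagged rows keeps the check manageable on long texts,
but rows without notes can still be misaligned in the cases listed above
(repeated sequences, reordered sections).
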