This tool aligns two versions of a text token by token when their tokenization differs, by interpreting the output of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in preparing annotated linguistic corpora, where differences in tokenization may arise (i) after corrections or modifications to the source text or (ii) when different layers of annotation (part-of-speech, treebank) require different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and it can also inject token-level annotation from text B into text A. The Aligner class on which the default implementation is based can be incorporated into more complex workflows.
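The general idea of diff-based token alignment can be illustrated with a minimal sketch. This is not the project's Aligner class, whose interface is not described here. The function names `align_tokens` and `to_csv`, the column headers, and the sample tokens are all hypothetical, and the sketch assumes each text is already available as a list of tokens. It uses only `difflib.SequenceMatcher` and `csv` from the Python standard library.

```python
import csv
import io
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Pair up tokens from two tokenizations of the same text.

    Hypothetical helper: returns a list of (tokens from A, tokens from B)
    tuples, where each side is a list of zero or more tokens.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs are aligned one token to one token.
            for a, b in zip(tokens_a[i1:i2], tokens_b[j1:j2]):
                rows.append(([a], [b]))
        else:
            # replace / delete / insert: keep the differing span as a
            # single many-to-many row (either side may be empty).
            rows.append((tokens_a[i1:i2], tokens_b[j1:j2]))
    return rows


def to_csv(rows):
    """Render aligned rows as a two-column CSV string (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_a", "text_b"])
    for a, b in rows:
        writer.writerow([" ".join(a), " ".join(b)])
    return buffer.getvalue()


# Sample data: the same sentence with and without a split clitic.
tokens_a = ["I", "don't", "know", "."]
tokens_b = ["I", "do", "n't", "know", "."]
rows = align_tokens(tokens_a, tokens_b)
table = to_csv(rows)
print(table)
```

In this sketch, spans that differ are kept together as one row rather than split further, so a token in text A that corresponds to several tokens in text B appears on a single line of the table. A real workflow would likely need a finer-grained treatment of such spans before copying annotation across.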
Tokenized Text Aligner
Aligns tokens in two versions of a text with differing tokenization.
Status: Alpha
Brought to you by:
rainsfordtm
Linux