This tool aligns, token by token, two versions of a text that have been tokenized differently, by interpreting the results of a file diff computed with Python's difflib (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) that require different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and it can also inject token-level annotation from text B into text A. The Aligner class on which the default implementation is based can be incorporated into more complex workflows.
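The general idea of diff-based token alignment can be illustrated with a minimal sketch built directly on difflib.SequenceMatcher. This is not the project's Aligner class: the function names align_tokens and to_csv, the two-column CSV layout, and the choice to keep each unmatched stretch as a single many-to-many group are all assumptions made for illustration.

```python
import csv
import io
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Pair up two token lists using difflib opcodes.

    Hypothetical helper, not part of the project. Returns a list of
    (a_tokens, b_tokens) pairs that together cover both inputs in order.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs are aligned one token to one token.
            for a_tok, b_tok in zip(tokens_a[i1:i2], tokens_b[j1:j2]):
                rows.append(([a_tok], [b_tok]))
        else:
            # "replace", "delete" and "insert" spans are kept as one group;
            # one side may be empty.
            rows.append((tokens_a[i1:i2], tokens_b[j1:j2]))
    return rows


def to_csv(rows):
    """Render aligned rows as a two-column CSV string (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_a", "text_b"])
    for a_tokens, b_tokens in rows:
        writer.writerow([" ".join(a_tokens), " ".join(b_tokens)])
    return buffer.getvalue()


if __name__ == "__main__":
    text_a = ["I", "can", "not", "go", "."]
    text_b = ["I", "cannot", "go", "."]
    print(to_csv(align_tokens(text_a, text_b)))
```

In this sketch the two tokens "can" and "not" in text A end up in the same row as the single token "cannot" in text B, which is the kind of tokenization mismatch described above. A real workflow would need further logic to split such groups or to carry annotations across them.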

Categories

Linguistics

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Operating Systems

BSD, Linux

Intended Audience

Advanced End Users, Science/Research

User Interface

Command-line

Programming Language

Python

Registered

2015-09-23