README file for multiLing_STannotations.txt (first and only release so far)

The multiLing_STannotations.txt file is a tab-delimited table containing token-level annotations for the multiLing source texts in the CRITT TPR-DB. These annotations were created by Haruka Ogawa, Devin Gilbert, and Samar Almazroei for the following publication: "redBird: Rendering Entropy Data and ST-Based Information into a Rich Discourse on Translation: Investigating relationships between MT output and human translation" which is a chapter in the book "Explorations in Empirical Translation Process Research," edited by Michael Carl as part of the Springer book series "Machine Translation: Technologies and Applications" (series editor Andy Way).

The "Text" and "Id" columns contain integers: the number of the text each token comes from and the index (1–n) of the token within that text. These values are identical to what you will find in any TPR-DB .st table. The "Text_Id" column contains strings formed by concatenating the first two columns.
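As a rough illustration of how the table might be read, here is a minimal Python sketch using only the standard library. It is not part of the dataset or the TPR-DB tooling. The function name read_annotations, the sample rows, and the underscore used in the sample "Text_Id" values are assumptions made for the example; check the actual file for its exact column order and for how "Text_Id" is written.

```python
import csv
import io


def read_annotations(stream):
    """Parse a tab-delimited annotation table into a list of dicts.

    "Text" and "Id" are converted to integers; all other columns stay
    as strings (possibly empty).
    """
    # QUOTE_NONE keeps tokens that are themselves quote characters intact.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["Text"] = int(row["Text"])
        row["Id"] = int(row["Id"])
        rows.append(row)
    return rows


# Invented sample rows, NOT taken from the real file. The underscore in
# the Text_Id values is an assumption about how the two columns are joined.
SAMPLE = (
    "Text\tId\tText_Id\tSToken\n"
    "1\t1\t1_1\tKiller\n"
    "1\t2\t1_2\tnurse\n"
)

rows = read_annotations(io.StringIO(SAMPLE))

# Index rows by (Text, Id), the same pair that identifies a token in a .st table.
by_key = {(r["Text"], r["Id"]): r for r in rows}
print(by_key[(1, 2)]["SToken"])
```

To read the real file, the same function could be called on open("multiLing_STannotations.txt", encoding="utf-8", newline=""); the UTF-8 encoding is likewise an assumption.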

All remaining columns contain strings, which may be empty. "SToken" is the source token and is identical to what you would find in a TPR-DB .st table. "wordClass" is a more general part-of-speech category based on the "PoS" column. "PoS" is the corrected part-of-speech tag (see the above-mentioned study for more details on this). "old_PoS" is the former PoS tag that was automatically generated by the TPR-DB NLP chain; this column only has a value for those PoS tags that were corrected by the researchers (otherwise it is an empty string).

Most of the remaining columns ("Figurative", "Passive", "Anaphora") contain annotations that are described in the above-mentioned study. These annotations are binary (either an empty string, or "x" when the column's category applies to the token in the current row), except for the "Figurative" column, which contains "m" for metaphorical expressions and "x" for fixed expressions (i.e., idiomatic expressions). The only annotation column that was not discussed in the above-mentioned study is "adjNouns", which marks any token that is an adjectival noun.
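The annotation codes described above could be decoded along the following lines. This is a sketch under stated assumptions, not code from the study: the function name decode_row and the output keys are hypothetical, the example row and its PoS value are invented, and it is assumed that "adjNouns" uses the same "x" marker as the binary columns, which this README does not state explicitly.

```python
# Codes for the "Figurative" column, as described in this README.
FIGURATIVE_LABELS = {"": None, "m": "metaphorical", "x": "fixed"}

# Columns treated as binary. Treating "adjNouns" as an "x" flag is an
# assumption; the README only says it marks adjectival nouns.
BINARY_COLUMNS = ("Passive", "Anaphora", "adjNouns")


def decode_row(row):
    """Turn the string codes of one table row into Python values."""
    figurative = (row.get("Figurative") or "").strip()
    if figurative not in FIGURATIVE_LABELS:
        raise ValueError(f"unexpected Figurative code: {figurative!r}")
    decoded = {"figurative": FIGURATIVE_LABELS[figurative]}
    for column in BINARY_COLUMNS:
        decoded[column] = (row.get(column) or "").strip() == "x"
    # A non-empty old_PoS means the automatic tag was corrected by hand.
    decoded["pos_corrected"] = (row.get("old_PoS") or "").strip() != ""
    return decoded


# Invented example row, not taken from the real file.
example = {
    "Figurative": "m",
    "Passive": "",
    "Anaphora": "x",
    "adjNouns": "",
    "old_PoS": "NN",
}
result = decode_row(example)
print(result)
```

Raising an error on an unknown "Figurative" code, rather than silently mapping it to None, is a design choice intended to surface values this README does not document.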
Source: README.txt, updated 2021-03-31