#42 Tokenizer and the euphonic letter "t" - TXM 0.6

Windows
open
nobody
None
5
2012-06-11
2012-06-11
Anonymous
No

The set of tokens produced by the TXM tokenizer differs from the one produced by the usual tokenizer.pl for expressions such as "chanta-t-il".
This leads TreeTagger to tag incorrectly: the isolated euphonic "t" is tagged VER:pper.
Here is a small example:

1/ Results extracted from TXM (TreeTagger format, for easier comparison)

147 Chanta VER:simp chanter
148 - PUN -
149 t VER:pper t
150 - PUN -
151 il PRO:PER il
152 pour PRP pour
153 m' PRO:PER me
154 écouter VER:infi écouter
155 sans PRP sans
156 s' PRO:PER se
157 assoupir VER:infi assoupir
158 ? SENT ?

2/ Results with tokenizer.pl

Chanta
-t-il
pour
m'
écouter
sans
s'
assoupir
?

Results with TreeTagger for case 2/

Chanta VER:simp chanter f VER:simp
-t-il PRO:PER il f PRO:PER
pour PRP pour f KON PRP
m' PRO:PER me f PRO:PER
écouter VER:infi écouter f VER:infi
sans PRP sans f KON PRP
s' PRO:PER se f KON PRO:PER
assoupir VER:infi assoupir f VER:infi
? SENT ? f SENT
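
The difference between the two tokenizations can be illustrated with a minimal Python sketch. It is not the code of tokenizer.pl or of TXM: the clitic lists below are simplified assumptions, and the names naive_tokenize and tokenize are hypothetical. The first function splits on every hyphen and so reproduces the shape of case 1/; the second keeps "-t-il" as a single token, as in case 2/.

```python
import re

# Assumed, simplified clitic patterns (illustrative only, not taken from any script).
PROCLITIC = re.compile(r"^([dcjlmnst]'|qu'|jusqu'|lorsqu')(.+)$", re.IGNORECASE)
ENCLITIC = re.compile(
    r"^(.+?)(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-nous|-on|-tu|-vous)$",
    re.IGNORECASE,
)
PUNCT = ".,;:!?"


def naive_tokenize(text):
    """Split on every non-word character: the euphonic "t" ends up isolated."""
    return re.findall(r"\w+'?|[^\w\s]", text)


def tokenize(text):
    """Split elided proclitics (m', s') and keep inverted clitics (-t-il) whole."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel sentence punctuation off the end of the chunk.
        while chunk and chunk[-1] in PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Split leading elided forms such as m' or s'.
        while chunk:
            m = PROCLITIC.match(chunk)
            if not m:
                break
            tokens.append(m.group(1))
            chunk = m.group(2)
        # Split off a trailing inverted-subject clitic as one token.
        m = ENCLITIC.match(chunk)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


sentence = "Chanta-t-il pour m'écouter sans s'assoupir ?"
print(naive_tokenize(sentence))
print(tokenize(sentence))
```

With the second function, TreeTagger receives "-t-il" as one unit, so the euphonic "t" is never tagged on its own.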

Incidentally, we also looked at the XML file and found that
the pronouns m' and s' (n="153" and n="156") have no "type" attribute. Is this a bug?

<w id="w_propreounon_147" n="147" type="w"><txm:form>Chanta</txm:form><txm:ana resp="#txm" type="#frutf8pos">VER:simp</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">chanter</txm:ana></w>
<w id="w_propreounon_148" n="148" type="pon"><txm:form>-</txm:form><txm:ana resp="#txm" type="#frutf8pos">PUN</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">-</txm:ana></w>
<w id="w_propreounon_149" n="149" type="w"><txm:form>t</txm:form><txm:ana resp="#txm" type="#frutf8pos">VER:pper</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">t</txm:ana></w>
<w id="w_propreounon_150" n="150" type="pon"><txm:form>-</txm:form><txm:ana resp="#txm" type="#frutf8pos">PUN</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">-</txm:ana></w>
<w id="w_propreounon_151" n="151" type="w"><txm:form>il</txm:form><txm:ana resp="#txm" type="#frutf8pos">PRO:PER</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">il</txm:ana></w>
<w id="w_propreounon_152" n="152" type="w"><txm:form>pour</txm:form><txm:ana resp="#txm" type="#frutf8pos">PRP</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">pour</txm:ana></w>
<w id="w_propreounon_153" n="153"><txm:form>m'</txm:form><txm:ana resp="#txm" type="#frutf8pos">PRO:PER</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">me</txm:ana></w>
<w id="w_propreounon_154" n="154" type="w"><txm:form>écouter</txm:form><txm:ana resp="#txm" type="#frutf8pos">VER:infi</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">écouter</txm:ana></w>
<w id="w_propreounon_155" n="155" type="w"><txm:form>sans</txm:form><txm:ana resp="#txm" type="#frutf8pos">PRP</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">sans</txm:ana></w>
<w id="w_propreounon_156" n="156"><txm:form>s'</txm:form><txm:ana resp="#txm" type="#frutf8pos">PRO:PER</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">se</txm:ana></w>
<w id="w_propreounon_157" n="157" type="w"><txm:form>assoupir</txm:form><txm:ana resp="#txm" type="#frutf8pos">VER:infi</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">assoupir</txm:ana></w>
<w id="w_propreounon_158" n="158" type="pon"><txm:form>?</txm:form><txm:ana resp="#txm" type="#frutf8pos">SENT</txm:ana><txm:ana resp="#txm" type="#frutf8lemma">?</txm:ana></w></s>
</p></text></TEI>
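
One way to list the words that lack a "type" attribute is a short Python check like the sketch below. It makes several assumptions: the namespace URI bound to the txm: prefix is a guess and should be taken from the real file, the SAMPLE fragment is a shortened copy of the excerpt above, and words_missing_type is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the txm: prefix; check the root element of the real file.
TXM_NS = "http://textometrie.org/1.0"

# Shortened copy of the excerpt, wrapped so that the txm: prefix is declared.
SAMPLE = f"""<s xmlns:txm="{TXM_NS}">
<w id="w_propreounon_152" n="152" type="w"><txm:form>pour</txm:form></w>
<w id="w_propreounon_153" n="153"><txm:form>m'</txm:form></w>
<w id="w_propreounon_156" n="156"><txm:form>s'</txm:form></w>
</s>"""


def words_missing_type(xml_text):
    """Return (n, form) for every <w> element that has no "type" attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        # Ignore any namespace part of the tag name, e.g. "{uri}w" -> "w".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "w" and "type" not in elem.attrib:
            form = elem.find(f"{{{TXM_NS}}}form")
            missing.append((elem.get("n"), form.text if form is not None else None))
    return missing


print(words_missing_type(SAMPLE))
```

On the sample fragment this reports words 153 and 156, the two elided pronouns.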

Discussion

  • Hi,

    We are aware of this problem and plan to resolve it in two ways in upcoming releases of TXM.

    1) We have reimplemented the token.pl script of TreeTagger in Groovy,
    but we have not yet had time to plug it in as an option for TXM tokenization.
    See in the Toolbox sources: src/groovy/filters/TTTokenizer.groovy

    2) We also have another tokenizer, with more parameters, which is able to solve this problem. It is not plugged in yet either.
    See in the Toolbox sources: src/groovy/filters/TEITokenizer.groovy

    Matthieu Decorde