Hi there,
I wonder whether it might be more sensible if the tokenizer only split non-trivial punctuation marks such as ",", ";", "'", ... and treated "." as a split point only when it is at the end of a sentence.
Since the Tokenizer expects one sentence per line (or did I misunderstand something?), this would avoid having to deal with things like "e.g." or "i.e." twice. The SentenceDetector recognizes these abbreviations accurately, but the Tokenizer turns "e.g." into "e.g . ". Since "." is a very strong sentence delimiter, it might be more reasonable to restrict the Tokenizer's handling of "." to the sentence ending.
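A minimal rule-based sketch of what this proposal would look like (this is not how OpenNLP's statistical tokenizer actually works; the class name, the punctuation whitelist, and the one-sentence-per-line assumption are only illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rule-based sketch of the proposal above: split "ordinary" punctuation
// everywhere, but detach "." only from the last word of the line.
public class SentenceFinalPeriodSplitter {

    private static final List<Character> ALWAYS_SPLIT = Arrays.asList(',', ';', ':', '?', '!');

    // Assumes one sentence per line, as the Tokenizer's input convention suggests.
    public static List<String> tokenize(String line) {
        String[] words = line.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (word.isEmpty()) {
                continue;
            }
            boolean lastWord = (i == words.length - 1);
            char tail = word.charAt(word.length() - 1);
            boolean split = ALWAYS_SPLIT.contains(tail)
                    || (tail == '.' && lastWord);        // "." only at the sentence end
            if (split && word.length() > 1) {
                tokens.add(word.substring(0, word.length() - 1));
                tokens.add(String.valueOf(tail));
            } else {
                tokens.add(word);                        // "e.g." stays intact mid-sentence
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Abbreviations, e.g. this one, are kept whole."));
        // -> [Abbreviations, ,, e.g., this, one, ,, are, kept, whole, .]
    }
}
```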
I hope I've made myself clear ;)
Sincerely
Charly
Hi,
The tokenizer needs to split things like "don't" between the "o" and the "n", so there isn't just a small set of characters to consider.
I couldn't think of an example (in English) where you'd want to split a period that isn't sentence-final, so I did a corpus search to see if such things exist. The ellipsis "..." is a problematic case I found. I also found out there are a ton of tokenization errors in the treebank involving periods, which probably explains the model's poor performance. I'll try to correct them and retrain for the next release. Thanks...Tom
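For reference, this is roughly how the statistical tokenizer is invoked; the class names follow the later OpenNLP 1.5-style API and "en-token.bin" is the pre-trained English model, so treat the exact names as assumptions rather than the setup discussed in this thread:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerDemo {
    public static void main(String[] args) throws Exception {
        // "en-token.bin" is the pre-trained English tokenizer model; adjust the path.
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            String[] tokens = tokenizer.tokenize("I don't think so...");
            System.out.println(String.join(" | ", tokens));
            // Treebank-style output should split "don't" into "do" and "n't";
            // the ellipsis "..." is one of the harder cases mentioned above.
        }
    }
}
```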
Hi Tom,
you are right that there isn't a small set of characters to consider; I was referring to the "." punctuation mark. I tried the test sentence:
"The SentenceDetector tries to split the sentences up, i.e. every sentence is returned as a separate String."
and got
"The SentenceDetector tries to split the sentences up , i.e . every sentence is returned as a separate String ."
Thanks for your answer ;)
Charly
There is definitely an issue with the tokenization of "i.e." and "e.g.". There are a couple of ways I see to address this:
The tokenizer hasn't seen these forms, and we could give it some training data that contains them (there are two examples of "i.e." and none of "e.g." in the training data).
My search for inter-sentential periods revealed a bunch of cases where the tokenization in the Penn Treebank is wrong; those cases could be fixed and the tokenizer retrained.
Lastly, I think the best way to accommodate the trend you've pointed out is to add a feature which tells the model when it is processing the last token (sketched below). This may help it better distinguish the cases it is currently having problems with.
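A rough sketch of what such a feature could look like; this is not OpenNLP's actual context-generator code, and the method and feature names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// When the model considers a candidate split position, add a feature saying
// whether the surrounding word is the last one in the sentence, on its own
// and combined with the character at that position.
public class LastTokenFeature {

    public static List<String> extraFeatures(String word, int splitPos, boolean isLastWord) {
        List<String> features = new ArrayList<>();
        if (isLastWord) {
            features.add("last");                               // word is sentence-final
            features.add("last,cur=" + word.charAt(splitPos));  // combined with the current char
        }
        return features;
    }

    public static void main(String[] args) {
        // "." inside "e.g." mid-sentence vs. "." at the end of "String." at the sentence end:
        System.out.println(extraFeatures("e.g.", 3, false));   // -> []
        System.out.println(extraFeatures("String.", 6, true)); // -> [last, last,cur=.]
    }
}
```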
This case demonstrates my preference not to try to fix annotation errors in the processing code. I think the best place to address those errors is in the models themselves, and I prefer solutions which help the model better capture something, either with new features or training data.
If you're interested in helping with this process, let me know. What probably needs to be done is:
Add a feature that captures the last token and think about which other features it should be combined with.
Clean up the tokenization errors in the treebank which involve this case (there's not a huge number).
Retrain the tokenizer on the cleaner data with the new features. Let me know...Tom
Right, such a feature would be great. Alternatively, one could use the same features as for sentence boundary detection for dealing with cases such as "e.g.". The sentence boundary detection features seem to work well on this... this could be another conditional predicate: if such a token is recognized, it should not be split by the tokenizer. There would be no need for a new implementation, since these features already exist.
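One way to approximate that idea without retraining is to post-process the tokenizer output using the kind of abbreviation list a sentence detector already maintains; this is only a sketch, and the class name and the small hard-coded abbreviation set are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// If a known abbreviation was split into "e.g" + ".", glue it back together.
public class AbbreviationRejoiner {

    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("e.g.", "i.e.", "etc.", "Dr.", "Mr."));

    public static List<String> rejoin(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i + 1 < tokens.length
                    && ".".equals(tokens[i + 1])
                    && ABBREVIATIONS.contains(tokens[i] + ".")) {
                out.add(tokens[i] + ".");   // merge "i.e" and "." back into "i.e."
                i++;                        // skip the consumed "."
            } else {
                out.add(tokens[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] split = {"up", ",", "i.e", ".", "every", "sentence", "."};
        System.out.println(rejoin(split));
        // -> [up, ,, i.e., every, sentence, .]
    }
}
```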
In any case, I would be glad to help however I can.
Where is the treebank, so that I can have a look at it and clean up the tokenization errors?
Thanks
Charly