Hi everyone,

the tutorials which describe how to train the tokenizer usually use already tokenized
training data, which must be de-tokenized to be usable as training data.

Jason wrote a small perl script for this task:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Detokenizing_script
I suggest that we add a Java Detokenizer to the tokenize package. A Detokenizer
can also be useful for other tasks, such as machine translation, where the text in the
target language must be de-tokenized again.
The Detokenizer assigns an operation to every token: the token is either attached
to the token before it, attached to the token which follows it, or left as it is.

This could be described by this interface:
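As a minimal sketch of what such an interface might look like (the names
Detokenizer, DetokenizationOperation and detokenize are illustrative
assumptions, not a settled API):

public interface Detokenizer {

  // The operation to apply to a token when the text is rebuilt.
  enum DetokenizationOperation {
    MERGE_TO_RIGHT,  // attach the token to the token which follows it
    MERGE_TO_LEFT,   // attach the token to the token before it
    NO_OPERATION     // leave the token as it is
  }

  // Computes one operation per input token; the returned array
  // has the same length as the tokens array.
  DetokenizationOperation[] detokenize(String[] tokens);
}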
A rule-based implementation simply looks each token up in a dictionary which maps it to
MERGE_TO_RIGHT, MERGE_TO_LEFT or RIGHT_LEFT_MATCHING. If the token is not in the dictionary,
no operation is performed and the token is not merged into one of the surrounding tokens.

RIGHT_LEFT_MATCHING could be used for tokens like " which must be merged to the right on the
first occurrence and to the left on the second occurrence.
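A sketch of such a rule-based implementation, assuming the interface above
(the class name DictionaryDetokenizer and the map-based dictionary are
illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

public class DictionaryDetokenizer implements Detokenizer {

  // The dictionary tag for a token; RIGHT_LEFT_MATCHING is resolved
  // to a concrete merge operation by counting occurrences.
  public enum DictionaryOperation {
    MERGE_TO_RIGHT, MERGE_TO_LEFT, RIGHT_LEFT_MATCHING
  }

  private final Map<String, DictionaryOperation> dict;

  public DictionaryDetokenizer(Map<String, DictionaryOperation> dict) {
    this.dict = dict;
  }

  @Override
  public DetokenizationOperation[] detokenize(String[] tokens) {
    DetokenizationOperation[] operations = new DetokenizationOperation[tokens.length];

    // Counts how often each RIGHT_LEFT_MATCHING token has been seen so far,
    // so that " merges right on the first occurrence and left on the second.
    Map<String, Integer> matchingCounts = new HashMap<>();

    for (int i = 0; i < tokens.length; i++) {
      DictionaryOperation op = dict.get(tokens[i]);

      if (op == null) {
        // Token is not in the dictionary: leave it as it is.
        operations[i] = DetokenizationOperation.NO_OPERATION;
      } else if (op == DictionaryOperation.MERGE_TO_RIGHT) {
        operations[i] = DetokenizationOperation.MERGE_TO_RIGHT;
      } else if (op == DictionaryOperation.MERGE_TO_LEFT) {
        operations[i] = DetokenizationOperation.MERGE_TO_LEFT;
      } else { // RIGHT_LEFT_MATCHING
        int count = matchingCounts.merge(tokens[i], 1, Integer::sum);
        operations[i] = (count % 2 == 1)
            ? DetokenizationOperation.MERGE_TO_RIGHT  // 1st, 3rd, ... occurrence
            : DetokenizationOperation.MERGE_TO_LEFT;  // 2nd, 4th, ... occurrence
      }
    }
    return operations;
  }
}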
Here is a small sample:
Dictionary:
. MERGE_TO_LEFT
" RIGHT_LEFT_MATCHING
This tokenized sentence should be de-tokenized:
He said " This is a test " .
The tokens would get these tags based on the dictionary:
He -> NO_OPERATION
said -> NO_OPERATION
" -> MERGE_TO_RIGHT
This -> NO_OPERATION
is -> NO_OPERATION
a -> NO_OPERATION
test -> NO_OPERATION
" -> MERGE_TO_LEFT
. -> MERGE_TO_LEFT
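To make the merge semantics concrete, here is an illustrative helper (the
name applyOperations is an assumption) that rebuilds the text by inserting a
space between two tokens unless the left one merges right or the right one
merges left:

public static String applyOperations(String[] tokens,
    Detokenizer.DetokenizationOperation[] ops) {
  StringBuilder text = new StringBuilder();
  for (int i = 0; i < tokens.length; i++) {
    // No space after a MERGE_TO_RIGHT token or before a MERGE_TO_LEFT token.
    if (i > 0
        && ops[i - 1] != Detokenizer.DetokenizationOperation.MERGE_TO_RIGHT
        && ops[i] != Detokenizer.DetokenizationOperation.MERGE_TO_LEFT) {
      text.append(' ');
    }
    text.append(tokens[i]);
  }
  return text.toString();
}

For the tagged sample above this produces: He said "This is a test".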
Any opinions?
Jörn
No opinion other than it looks good.
+1