Hi everyone,

the tutorials which describe how to train the tokenizer usually use already tokenized
training data, which must be de-tokenized to be usable as training data.

Jason wrote a small perl script for this task:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Detokenizing_script
I suggest that we add a Java Detokenizer to the tokenize package. A Detokenizer
can also be useful for other tasks, such as machine translation, where the text in the
target language must be de-tokenized again.
The Detokenizer assigns an operation to every token: the token is either attached
to the token before it, attached to the token which follows it, or left as it is.

This could be described by this interface:
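As a minimal sketch of what such an interface might look like (the names
Detokenizer, DetokenizationOperation and detokenize are illustrative
assumptions, not a settled API):

public interface Detokenizer {

  // The operation to apply to a token when the text is rebuilt.
  enum DetokenizationOperation {
    MERGE_TO_RIGHT,  // attach the token to the token which follows it
    MERGE_TO_LEFT,   // attach the token to the token before it
    NO_OPERATION     // leave the token as it is
  }

  // Computes one operation per input token; the returned array
  // has the same length as the tokens array.
  DetokenizationOperation[] detokenize(String[] tokens);
}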
A rule-based implementation simply looks each token up in a dictionary which maps it to
MERGE_TO_RIGHT, MERGE_TO_LEFT or RIGHT_LEFT_MATCHING. If the token is not in the dictionary,
no operation is performed and the token is not merged into one of the surrounding tokens.

RIGHT_LEFT_MATCHING could be used for tokens like " which must be merged to the right on the
first occurrence and to the left on the second occurrence.
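A sketch of such a rule-based implementation, assuming the interface above
(the class name DictionaryDetokenizer and the map-based dictionary are
illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

public class DictionaryDetokenizer implements Detokenizer {

  // The dictionary tag for a token; RIGHT_LEFT_MATCHING is resolved
  // to a concrete merge operation by counting occurrences.
  public enum DictionaryOperation {
    MERGE_TO_RIGHT, MERGE_TO_LEFT, RIGHT_LEFT_MATCHING
  }

  private final Map<String, DictionaryOperation> dict;

  public DictionaryDetokenizer(Map<String, DictionaryOperation> dict) {
    this.dict = dict;
  }

  @Override
  public DetokenizationOperation[] detokenize(String[] tokens) {
    DetokenizationOperation[] operations = new DetokenizationOperation[tokens.length];

    // Counts how often each RIGHT_LEFT_MATCHING token has been seen so far,
    // so that " merges right on the first occurrence and left on the second.
    Map<String, Integer> matchingCounts = new HashMap<>();

    for (int i = 0; i < tokens.length; i++) {
      DictionaryOperation op = dict.get(tokens[i]);

      if (op == null) {
        // Token is not in the dictionary: leave it as it is.
        operations[i] = DetokenizationOperation.NO_OPERATION;
      } else if (op == DictionaryOperation.MERGE_TO_RIGHT) {
        operations[i] = DetokenizationOperation.MERGE_TO_RIGHT;
      } else if (op == DictionaryOperation.MERGE_TO_LEFT) {
        operations[i] = DetokenizationOperation.MERGE_TO_LEFT;
      } else { // RIGHT_LEFT_MATCHING
        int count = matchingCounts.merge(tokens[i], 1, Integer::sum);
        operations[i] = (count % 2 == 1)
            ? DetokenizationOperation.MERGE_TO_RIGHT  // 1st, 3rd, ... occurrence
            : DetokenizationOperation.MERGE_TO_LEFT;  // 2nd, 4th, ... occurrence
      }
    }
    return operations;
  }
}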
Here is a small sample:
Dictionary:
. MERGE_TO_LEFT
" RIGHT_LEFT_MATCHING
This tokenized sentence should be de-tokenized:
He said " This is a test " .
The tokens would get these tags based on the dictionary:
He -> NO_OPERATION
said -> NO_OPERATION
" -> MERGE_TO_RIGHT
This -> NO_OPERATION
is -> NO_OPERATION
a -> NO_OPERATION
test -> NO_OPERATION
" -> MERGE_TO_LEFT
. -> MERGE_TO_LEFT
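To make the merge semantics concrete, here is an illustrative helper (the
name applyOperations is an assumption) that rebuilds the text by inserting a
space between two tokens unless the left one merges right or the right one
merges left:

public static String applyOperations(String[] tokens,
    Detokenizer.DetokenizationOperation[] ops) {
  StringBuilder text = new StringBuilder();
  for (int i = 0; i < tokens.length; i++) {
    // No space after a MERGE_TO_RIGHT token or before a MERGE_TO_LEFT token.
    if (i > 0
        && ops[i - 1] != Detokenizer.DetokenizationOperation.MERGE_TO_RIGHT
        && ops[i] != Detokenizer.DetokenizationOperation.MERGE_TO_LEFT) {
      text.append(' ');
    }
    text.append(tokens[i]);
  }
  return text.toString();
}

For the tagged sample above this produces: He said "This is a test".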
Any opinions?
Jörn
No opinion other than it looks good.
+1