Detokenizer

Developers
2010-08-12
2013-04-16
  • Joern Kottmann
    2010-08-12

    Hi everyone,

    the tutorials which describe how to train the tokenizer usually rely on already tokenized
    corpora, which must be de-tokenized before they can be used as training data.

    Jason wrote a small perl script for this task:
    https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Detokenizing_script

    I suggest that we add a Java Detokenizer to the tokenize package. A Detokenizer
    can also be useful for other tasks such as machine translation, where the text in the
    target language must be de-tokenized again.

    The Detokenizer assigns an operation to every token: a token should either be attached
    to the token before it, attached to the token which follows it, or left as it is.

    This could be described by this interface:

    public interface Detokenizer {
      
      /**
       * The operations which describe how a token should be merged with its
       * neighbors to form the detokenized text.
       */
      public static enum DetokenizationOperation {
        /**
         * The current token should be attached to the token on its right side.
         */
        MERGE_TO_RIGHT,
        
        /**
         * The current token should be attached to the string on the left side.
         */
        MERGE_TO_LEFT,
        
        /**
         * Do not perform a merge operation for this token, but it is possible that another
         * token will be attached to the left or right side of this one.
         */
        NO_OPERATION
      }
      
      /**
       * Detokenize the input tokens.
       * 
       * @param tokens the tokens to detokenize.
       * @return the merge operations to detokenize the input tokens.
       */
      DetokenizationOperation[] detokenize(String[] tokens);
    }
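
    For illustration, a caller could turn the returned operations back into text roughly
    like this (just a sketch, the helper class and method name are made up):

    public class DetokenizerDemo {

      static String apply(String[] tokens,
          Detokenizer.DetokenizationOperation[] ops) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
          text.append(tokens[i]);
          // No space if this token merges to the right,
          // or the next token merges to the left.
          boolean noSpace =
              ops[i] == Detokenizer.DetokenizationOperation.MERGE_TO_RIGHT
              || (i + 1 < tokens.length
                  && ops[i + 1] == Detokenizer.DetokenizationOperation.MERGE_TO_LEFT);
          if (i + 1 < tokens.length && !noSpace) {
            text.append(' ');
          }
        }
        return text.toString();
      }
    }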
    

    A rule-based implementation just looks up each token in a dictionary which maps it to either MERGE_TO_RIGHT,
    MERGE_TO_LEFT or RIGHT_LEFT_MATCHING. If the token is not in the dictionary, no operation is performed
    and the token is not merged into one of the surrounding tokens.

    RIGHT_LEFT_MATCHING could be used for tokens like " which must be merged to the right on the first occurrence
    and to the left on the second occurrence (a small sketch of such an implementation follows the example below).

    Here is a small sample:

    Dictionary:
    . MERGE_TO_LEFT
    " RIGHT_LEFT_MATCHING

    This tokenized sentence should be de-tokenized:

    He said " This is a test " .

    The tokens would get these tags based on the dictionary:

    He -> NO_OPERATION
    said -> NO_OPERATION
    " -> MERGE_TO_RIGHT
    This -> NO_OPERATION
    is -> NO_OPERATION
    a -> NO_OPERATION
    test -> NO_OPERATION
    " -> MERGE_TO_LEFT
    . -> MERGE_TO_LEFT
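
    A dictionary-based implementation could look roughly like this (only a sketch to
    illustrate the idea, the class name and dictionary handling are made up;
    RIGHT_LEFT_MATCHING is resolved by alternating between the two merge operations):

    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryDetokenizer implements Detokenizer {

      /** Rules a dictionary entry can carry, RIGHT_LEFT_MATCHING alternates. */
      public enum Rule { MERGE_TO_LEFT, MERGE_TO_RIGHT, RIGHT_LEFT_MATCHING }

      private final Map<String, Rule> dict = new HashMap<String, Rule>();

      public DictionaryDetokenizer() {
        // The sample dictionary from above.
        dict.put(".", Rule.MERGE_TO_LEFT);
        dict.put("\"", Rule.RIGHT_LEFT_MATCHING);
      }

      public DetokenizationOperation[] detokenize(String[] tokens) {
        DetokenizationOperation[] ops = new DetokenizationOperation[tokens.length];
        // Remembers whether a matching token is currently "open", so the next one closes.
        Map<String, Boolean> open = new HashMap<String, Boolean>();

        for (int i = 0; i < tokens.length; i++) {
          Rule rule = dict.get(tokens[i]);
          if (rule == null) {
            ops[i] = DetokenizationOperation.NO_OPERATION;
          } else if (rule == Rule.MERGE_TO_LEFT) {
            ops[i] = DetokenizationOperation.MERGE_TO_LEFT;
          } else if (rule == Rule.MERGE_TO_RIGHT) {
            ops[i] = DetokenizationOperation.MERGE_TO_RIGHT;
          } else {
            // First occurrence merges to the right, second one to the left.
            boolean isOpen = Boolean.TRUE.equals(open.get(tokens[i]));
            ops[i] = isOpen ? DetokenizationOperation.MERGE_TO_LEFT
                : DetokenizationOperation.MERGE_TO_RIGHT;
            open.put(tokens[i], !isOpen);
          }
        }
        return ops;
      }
    }

    Applied to the sample sentence this produces the tags listed above, and the helper
    sketched earlier would assemble them into: He said "This is a test".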

    Any opinions?

    Jörn

  • No opinion other than it looks good.

    +1