Tokenizer problem encountered

2006-01-18
2013-04-16
  • harry20051219

    harry20051219 - 2006-01-18

    There are two sentences that the tokenizer cannot tokenize correctly. Since the sentences were written by our editor, I assume they are correct English.

    "What they signify, however, has not changed." is tokenized to What | they | signi | f | y | , | however | has | not | changed | .

    "You will be spending a lot of time in your baby’s nursery, so comfort and safety are key!" is tokenized to You | will | be | spending | a | lot | of | time | in | your | bab | y’s | nursery | , | so | comfort | and | safety | are | key | !

    The problem is that "signify" is split into "signi", "f", and "y", and "baby’s" is split into "bab" and "y’s".
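
    For anyone who wants to check their own tokenizer output for this kind of split, here is a minimal sketch. It is not part of the tokenizer or its API; the function name `find_mid_word_splits` is made up for illustration. It assumes the tokens appear in the sentence in order, aligns each one back to the original text, and reports any boundary that falls between two letters with no whitespace in between.

```python
def find_mid_word_splits(sentence, tokens):
    """Return (left, right) token pairs whose boundary falls between two letters.

    Hypothetical helper for sanity-checking tokenizer output; it assumes the
    tokens occur in the sentence in the order given.
    """
    splits = []
    pos = 0            # where to resume searching in the sentence
    prev_end = None    # end offset of the previous token
    prev_tok = None
    for tok in tokens:
        tok = tok.strip()
        start = sentence.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        # Adjacent tokens with a letter on each side of the boundary
        # mean that a word was cut in the middle.
        if (prev_end is not None and start == prev_end
                and sentence[start - 1].isalpha()
                and sentence[start].isalpha()):
            splits.append((prev_tok, tok))
        prev_tok = tok
        prev_end = start + len(tok)
        pos = prev_end
    return splits


if __name__ == "__main__":
    sentence = "What they signify, however, has not changed."
    reported = ["What", "they", "signi", "f", "y", ",", "however",
                "has", "not", "changed", "."]
    print(find_mid_word_splits(sentence, reported))
    # prints [('signi', 'f'), ('f', 'y')]
```

    Punctuation that is split off a word, such as the comma after "signify", is not flagged, because only letter-to-letter boundaries count as a split.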

     
    • Thomas Morton

      Thomas Morton - 2006-01-18

      Hi,
        These are perfectly fine sentences and I verified your output.  I'll take a look and see if I can figure out what's going on in the next day or so.  Thanks...Tom

       
    • mfkilgore

      mfkilgore - 2006-02-03

      Interesting post, I have a similar problem where the unexpected results are:
      (NP (NP (NN coug) (NN hing))
      (NP (NN chest) (JJ tigh) (NN tness))

      If I can provide further information to assist your efforts, let me know.

      Full Parse Output-
      (TOP (S (NP (NNP Asthma)) (VP (VBZ is) (NP (NP (DT a) (JJ chronic) (JJ inflammatory) (NN disease)) (PP (IN of) (NP (NP (DT the) (NNS airways)) (, ,) (VP (VBN characterized) (PP (IN by) (NP (NP (NN coug) (NN hing)) (, ,) (NP (NNP wheezing)) (, ,) (NP (NN chest) (JJ tigh) (NN tness)) (, ,) (CC and) (NP (NN difficu) (JJ lt) (NN breathing))))))))) (. .)))

       
    • mfkilgore

      mfkilgore - 2006-02-13

      Just checking in to see if there is any update on this thread...

      Thanks again.

       
      • Thomas Morton

        Thomas Morton - 2006-02-13

        Hi,
           This is still on my radar, but I haven't had time to track down what might be the source of it.  I'll look at it this week and post an update here by next week.  Thanks...Tom

         
    • Thomas Morton

      Thomas Morton - 2006-02-16

      Hi,
         I looked at this in more depth yesterday, and it turns out that the smoothing option in the modeling is producing odd (although not unpredictable) effects. I re-trained the model without this option, and the cases cited are now tokenized correctly. Overall performance on the development set (which is reported during training) was down slightly compared to the smoothed model, but I'd need to perform a more rigorous evaluation on unseen data to determine if there is a real effect. I also tokenized a document I use for sanity checking and it seemed fine. I'm going to hold off on updating the model until I can do more rigorous testing, but I'm making the new model available at:

      http://opennlp.sourceforge.net/EnglishTok.bin.gz

      I encourage you to try it out and let me know if it's working better on your data. If my evaluation and your feedback are good, I'll update the models in the release.

      Thanks for reporting these issues...Tom

       
    • mfkilgore

      mfkilgore - 2006-02-17

      Tom,

      I have downloaded and tried the new model, and your change appears to be working great. I will do some additional testing and let you know of any other issues. Also, at least for my samples, I did not notice any performance degradation.

      Thanks again!

       
