
#25 ArrayIndexOutOfBoundsException in BeamSearch.java

open
None
5
2010-09-14
2008-09-13
No

When PosTagger is used with a tag dictionary, an ArrayIndexOutOfBoundsException is sometimes raised at BeamSearch.java:155.

The cause seems to be an unchecked reference to index 0 here:
%<==================
return bestSequences(1, sequence, additionalContext, zeroLog)[0];
%<==================

It seems to happen when, because of tag dictionary restrictions, bestSequences returns no sequences.

Excerpt from exception stack:
%<==================
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at opennlp.tools.util.BeamSearch.bestSequence(BeamSearch.java:155)
at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:180)

%<==================
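
A minimal sketch of the kind of guard that would avoid the unchecked `[0]` access. It does not use the OpenNLP classes: `BestSequenceGuard` and `firstOrEmpty` are hypothetical names, and the array parameter only stands in for whatever `bestSequences` returns.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of OpenNLP.
final class BestSequenceGuard {
    private BestSequenceGuard() {}

    /**
     * Returns the first candidate, or an empty Optional when the search
     * produced no candidates (for example, because every tag was ruled out).
     */
    static <T> Optional<T> firstOrEmpty(T[] candidates) {
        if (candidates == null || candidates.length == 0) {
            return Optional.empty();
        }
        return Optional.ofNullable(candidates[0]);
    }
}
```

With a guard like this, the caller can decide what an empty result means (fall back to an unrestricted search, or report which token had no usable tag) instead of failing with an index error.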

Discussion

  • Thomas Morton

    Thomas Morton - 2008-09-14
    • assigned_to: nobody --> tsmorton
     
  • Thomas Morton

    Thomas Morton - 2008-09-14

    Hi,
    There are several posts on this topic. This occurs when there is a word in your tag dictionary which is mapped solely to tags that the model has not seen. In most cases this would seem to be a typo in your tag dictionary.

    One option for catching this more gracefully is to add code that validates the tag dictionary against the model when it's loaded. This, however, will add some overhead to loading the model every time it's used.

    Please check your dictionary and report back if this is what is going on in your case. Thanks...Tom
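
The validation idea above could look roughly like the following sketch. It assumes the dictionary and the model's tagset have already been read into plain Java collections; `TagDictionaryCheck` and `wordsWithNoUsableTag` are hypothetical names, and no OpenNLP API is used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of OpenNLP.
final class TagDictionaryCheck {
    private TagDictionaryCheck() {}

    /**
     * Returns, in sorted order, every dictionary word whose allowed tags
     * share nothing with the tags the model knows. These are the entries
     * that leave the tagger with no candidate at all.
     */
    static List<String> wordsWithNoUsableTag(Map<String, Set<String>> dictionary,
                                             Set<String> modelTags) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : dictionary.entrySet()) {
            // disjoint() is true when the two collections have no common element
            if (Collections.disjoint(entry.getValue(), modelTags)) {
                offenders.add(entry.getKey());
            }
        }
        Collections.sort(offenders);
        return offenders;
    }
}
```

The check makes one pass over the dictionary, so its cost grows with the dictionary size; that is the load-time overhead mentioned above.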

     
  • Aliaksandr Autayeu

    Hi Thomas,

    Thank you for the clarification! We've used the standard English tag dictionary "tagdict" from the tools distribution with a custom model. The model was trained by our colleagues on 8000+ tokens (a DMOZ subset). This subset is biased towards nouns and adjectives, so the reason you gave seems to be the most probable one, but I'm curious and will explore this issue further.

    I think the overhead of checking the tag dictionary is acceptable in our case.

    Thank you very much for the idea and explanation!

     
  • Aliaksandr Autayeu

    This class tests the dictionary against the model.

     
  • Aliaksandr Autayeu

    So, further exploration has shown that you were absolutely right.

    a) I've found out that our model was trained on the 1st version of the Penn Treebank tagset and uses NP instead of NNP, NPS instead of NNPS, and PP instead of PRP, so there is a slight mismatch between the dictionary and model tagsets.
    b) Our model was trained on a dataset that is biased towards NN, JJ, and CC, and therefore there are tags "the model has not seen", like -RRB- and NNP from a).
    c) And there were entries in the dictionary "mapped solely to tags that the model has not seen".

    Attached is a small program I used to check the dictionary against the model. It might be of use to somebody else.

    Thank you very much for your help in resolving this issue!
    File Added: TestDictionaryAgainstModel.java
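
For the tagset mismatch in a), one possible remedy (not proposed in this thread, and not taken from the attached file) is to rewrite the dictionary's tags into the model's tagset before use. The sketch below covers only the three pairs named above; `TagsetMapper` and `toModelTags` are hypothetical names.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of OpenNLP.
final class TagsetMapper {
    private TagsetMapper() {}

    // Dictionary tag -> model tag, limited to the pairs mentioned in the thread.
    static final Map<String, String> DICTIONARY_TO_MODEL = new HashMap<>();
    static {
        DICTIONARY_TO_MODEL.put("NNP", "NP");
        DICTIONARY_TO_MODEL.put("NNPS", "NPS");
        DICTIONARY_TO_MODEL.put("PRP", "PP");
    }

    /** Rewrites each tag to the model's name for it; unknown tags pass through unchanged. */
    static Set<String> toModelTags(Set<String> dictionaryTags) {
        Set<String> mapped = new LinkedHashSet<>();
        for (String tag : dictionaryTags) {
            mapped.add(DICTIONARY_TO_MODEL.getOrDefault(tag, tag));
        }
        return mapped;
    }
}
```

Renaming tags only addresses a); tags the model never saw during training, such as -RRB- in b), would still need to be handled separately.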

     
  • Aliaksandr Autayeu

    • status: open --> closed
     
  • Joern Kottmann

    Joern Kottmann - 2010-09-14

    We should add such a check to the new pos model package.

     
  • Joern Kottmann

    Joern Kottmann - 2010-09-14
    • status: closed --> open
     
