Hi,
There are a couple of things to say. Your setup looks fine from a "technically correct" perspective.
Unfortunately, maxent is poorly suited for language modeling or really any task where there are a large number of outcomes (say more than 100). This is because the code is set up to always produce the distribution for all outcomes in order to normalize that distribution.
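To illustrate why the outcome count matters (this is a generic sketch of conditional maxent scoring, not the actual library code; the function and parameter names are hypothetical):

```python
import math

def maxent_dist(weights, features, outcomes):
    """Conditional maxent: p(y|x) = exp(w . f(x, y)) / Z(x).

    Computing the normalizer Z(x) requires a score for *every*
    outcome, so the cost of a single prediction grows linearly
    with the size of the outcome set -- fine for a handful of
    classes, painful for a vocabulary-sized outcome set.
    """
    # weights is a dict keyed by (feature, outcome) pairs
    scores = {y: sum(weights.get((f, y), 0.0) for f in features)
              for y in outcomes}
    z = sum(math.exp(s) for s in scores.values())  # touches all outcomes
    return {y: math.exp(s) / z for y, s in scores.items()}
```

With a word vocabulary as the outcome set, that inner loop runs over tens of thousands of outcomes for every event, which is where the poor fit comes from.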
For unknown outcomes you need to simulate their occurrence in the training data. To do this you might convert some selection of your data (say all words occurring just once) to be treated as unknown.
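A minimal sketch of that preprocessing step (the function name and the `<UNK>` token are my own choices, not anything the library prescribes):

```python
from collections import Counter

def replace_rare_with_unk(corpus, min_count=2, unk="<UNK>"):
    """Map every word seen fewer than min_count times to a single
    unknown token, so the trained model assigns that token a
    probability which can stand in for unseen words at test time."""
    counts = Counter(w for sent in corpus for w in sent)
    return [[w if counts[w] >= min_count else unk for w in sent]
            for sent in corpus]
```

Any word that never appears in training (like "thanks" in your example) would then be looked up as `<UNK>` when you query the model.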
Hope this helps...Tom
Hi,
I want to create a unigram language model that is dependent on a second modality, in this case position. For example:
pos=home word=hello hello
pos=home word=hello hello
pos=home word=goodbye goodbye
pos=out word=hello hello
Is this an appropriate way to structure the data?
How do I add unseen outcomes, for example the word "thanks", which does not appear in the data?