Hi,
There are a couple of things to say. Your setup looks fine from a "technically correct" perspective.
Unfortunately, maxent is poorly suited for language modeling or really any task where there are a large number of outcomes (say more than 100). This is because the code is set up to always produce the distribution for all outcomes in order to normalize that distribution.
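To illustrate why the outcome count matters (this is a generic sketch of conditional maxent scoring, not the actual library code; the function and parameter names are hypothetical):

```python
import math

def maxent_dist(weights, features, outcomes):
    """Conditional maxent: p(y|x) = exp(w . f(x, y)) / Z(x).

    Computing the normalizer Z(x) requires a score for *every*
    outcome, so the cost of a single prediction grows linearly
    with the size of the outcome set -- fine for a handful of
    classes, painful for a vocabulary-sized outcome set.
    """
    # weights is a dict keyed by (feature, outcome) pairs
    scores = {y: sum(weights.get((f, y), 0.0) for f in features)
              for y in outcomes}
    z = sum(math.exp(s) for s in scores.values())  # touches all outcomes
    return {y: math.exp(s) / z for y, s in scores.items()}
```

With a word vocabulary as the outcome set, that inner loop runs over tens of thousands of outcomes for every event, which is where the poor fit comes from.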
For unknown outcomes you need to simulate their occurrence in the training data. To do this you might convert some selection of your data (say all words occurring just once) to be treated as unknown.
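A minimal sketch of that preprocessing step (the function name and the `<UNK>` token are my own choices, not anything the library prescribes):

```python
from collections import Counter

def replace_rare_with_unk(corpus, min_count=2, unk="<UNK>"):
    """Map every word seen fewer than min_count times to a single
    unknown token, so the trained model assigns that token a
    probability which can stand in for unseen words at test time."""
    counts = Counter(w for sent in corpus for w in sent)
    return [[w if counts[w] >= min_count else unk for w in sent]
            for sent in corpus]
```

Any word that never appears in training (like "thanks" in your example) would then be looked up as `<UNK>` when you query the model.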
Hope this helps...Tom
Hi,
I want to create a unigram language model that is dependent on a second modality, in this case position. For example:
pos=home word=hello hello
pos=home word=hello hello
pos=home word=goodbye goodbye
pos=out word=hello hello
Is this an appropriate way to structure the data?
How do I add unseen outcomes, for example the word "thanks", which does not appear in the data?