I posted much of this in another thread, but thought it might be worth putting in its own thread.
Quick summary: I'm trying to use the CMU SLMT to create my own language model. I have tried two different text corpora and get the same result both times: a NullPointerException.
The problem is definitely with the toolkit:
I put my smaller text corpus (660 words) through the online QuickLM tool, and that LM works for me.
When I run that exact same corpus through the CMU SLMT, I get the null pointer exception, specifically:
Exception in thread "main" java.lang.NullPointerException
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.getInitialSearchState(LexTreeLinguist.java:461)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.compileGrammar(LexTreeLinguist.java:487)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:406)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:323)
at edu.cmu.sphinx.decoder.Decoder.allocate(Decoder.java:109)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:182)
====================================================================
I noticed three differences between the two LMs:
1. The working LM has all units in UPPERCASE, while the non-working LM has all units in lowercase.
2. The working LM has entries for the silence tags (the sentence start/end markers <S> and </S>), while the non-working LM does not.
3. The working LM doesn't have an entry for the unknown tag <UNK>, while the non-working LM does.
I also noted that the weights for identical words differ significantly between the two LMs.
Any ideas?
Well, I figured it out...
The silence tags must be included in the original text corpus; the toolkit does not add them for you.
I am also converting everything to UPPERCASE before adding the silence tags (using tr and sed).
Should be fun...
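For anyone who hits the same thing, here is a rough sketch of what the tr and sed step could look like. The function name prepare_corpus and the file names are made up for illustration, and the lowercase <s> / </s> spelling of the tags is an assumption; use whatever spelling your toolkit and decoder expect. It uppercases the text first, drops blank lines, trims each line, and then wraps each line in the tags.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CMU toolkit): read a corpus on stdin,
# uppercase it, then wrap every non-blank line in sentence tags.
# ASSUMPTION: the tags are spelled <s> and </s>; adjust if your setup differs.
prepare_corpus() {
    # Uppercase first, so the tags added afterwards keep their own case.
    tr '[:lower:]' '[:upper:]' |
        sed -e '/^[[:space:]]*$/d' \
            -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e 's|.*|<s> & </s>|'
}

# Usage (file names are placeholders):
#   prepare_corpus < corpus.txt > corpus.tagged.txt
```

This treats every input line as one sentence, so the corpus needs to be one sentence per line before it goes in.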
Ren