I'm trying to get the Language Modeling Toolkit to emulate the web-based Language Modeling Toolkit, but I'm having a problem with the discounting information, the back-off weights, or both (assuming those are the numbers at the front and back of each line under the n-gram headers). Most lines end up with -99.9990 as the first number, when they should fall between -0.2 and -3. The web-based tool produces these values correctly. Because of this problem, testing with the language model built by the Toolkit results in no text being recognized. Does anybody know which tools and/or commands the web-based tool uses, so I can generate these values correctly?
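In case it helps anyone reproduce this, here is a minimal Python sketch for measuring how many n-gram lines carry the suspicious value. It assumes the model is a plain-text ARPA-style file, with `\1-grams:`-style section headers and the log probability as the first field on each line. The function name, the sample text, and the -90 cutoff are my own choices for illustration, not part of either tool.

```python
import re

# Tiny made-up model in the assumed ARPA-style layout (not real toolkit output).
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-99.9990 <s> -0.3010
-0.4771 hello -0.3010
-0.4771 </s>

\\2-grams:
-99.9990 <s> hello
-0.3010 hello </s>

\\end\\
"""


def count_suspicious_logprobs(arpa_text, cutoff=-90.0):
    """Return {order: (suspicious, total)} for each n-gram section.

    A line counts as suspicious when its first field (assumed to be the
    log10 probability) is at or below `cutoff`.
    """
    counts = {}
    order = None  # current n-gram order, None outside any n-gram section
    for raw in arpa_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            counts[order] = [0, 0]
            continue
        if line == "\\end\\":
            order = None
            continue
        if order is None or not line:
            continue
        try:
            logprob = float(line.split()[0])
        except ValueError:
            continue  # skip anything that is not an n-gram entry
        counts[order][1] += 1
        if logprob <= cutoff:
            counts[order][0] += 1
    return {n: tuple(c) for n, c in counts.items()}


if __name__ == "__main__":
    for n, (bad, total) in sorted(count_suspicious_logprobs(SAMPLE).items()):
        print(f"{n}-grams: {bad} of {total} lines at or below the cutoff")
```

Running the same function on the text of the Toolkit's output and of the web-based tool's output should show whether the -99.9990 values are limited to a few entries or spread across most of the model.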
Also, the Toolkit doesn't have an obvious way to include context cues. I tried using a .ccs file, but it basically ignored the <s> and </s> in that file and mapped all references to <s> to <UNK>, as if they were out-of-vocabulary words. The web-based tool gives me a .sent file, but this concept doesn't seem to exist in the Toolkit.
Thanks for any help you can provide.