Hi Nickolay,
Here is a bit of a puzzler. Since OpenEars 1.x, which uses
pocketsphinx/sphinxbase 0.7, it is apparently no longer possible to load .DMP
files that were compiled on Linux (DMP files compiled internally in OpenEars
work fine). Here is what the logging says when trying to load hub4.5000.DMP
with hub4.5000.dic and the wsj acoustic model:
INFO: ngram_model_arpa.c(79): No \data\ mark in LM file
INFO: ngram_model_dmp.c(142): Will use memory-mapped I/O for LM file
INFO: ngram_model_dmp.c(197): ngrams 1=5001, 2=436879, 3=418286
INFO: ngram_model_dmp.c(244): 5001 = LM.unigrams(+trailer) read
1: offset is 81672 // This is my added logging to see what the offset value is
at the beginning of if (do_mmap) at line 247
filesize is 5642542 // This is my added logging to see what the overall
filesize is
2: offset is 5324232 // This is my added logging to see what the offset value
is at the beginning of (do_mmap) at line 273
INFO: ngram_model_dmp.c(297): 436879 = LM.bigrams(+trailer) read
3: offset is 8670520 // This is my added logging to see what the offset value
is at the beginning of if (do_mmap) at line 303
INFO: ngram_model_dmp.c(326): 418286 = LM.trigrams read
4: offset is 8670520 // This is my added logging to see what the offset value
is at the beginning of if (do_mmap) { at line 335
fread error detected // This is my added logging of the error condition after
ngram_model_dmp.c line 339 "if (fread(&k, sizeof(k), 1, fp) != 1) {" which is
apparently the cause of the problem
premature eof detected // This is my added logging of the eof condition for
the line of code mentioned in the previous comment
Position: 8670520 // This is my added logging of the seek position when the
eof is happening
ERROR: "ngram_search.c", line 211: Failed to read language model file:
/Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/035669A4-AB0E-4E2C-852C-C771386CB4DF/OpenEarsSampleApp.app/hub4.5000.DMP
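For anyone wanting to reproduce the extra diagnostics in the log above, here is a minimal sketch of that kind of check. It is not the actual sphinxbase code: `read_count_checked` and `seek_checked` are hypothetical helpers, and the assumption is simply that a 32-bit count is read from an already-open `FILE *`.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper (not part of sphinxbase): read one 32-bit count
 * and report why the read failed. Returns 0 on success, -1 on failure. */
static int read_count_checked(FILE *fp, int32_t *out)
{
    if (fread(out, sizeof(*out), 1, fp) == 1)
        return 0;

    if (feof(fp))
        fprintf(stderr, "premature eof detected\n");
    else if (ferror(fp))
        fprintf(stderr, "fread error detected\n");

    /* Report where the stream was positioned when the read failed. */
    fprintf(stderr, "Position: %ld\n", ftell(fp));
    return -1;
}

/* Hypothetical helper: refuse to seek to an offset outside the file,
 * so a bad offset is caught before the next read fails. */
static int seek_checked(FILE *fp, long offset, long filesize)
{
    if (offset < 0 || offset > filesize) {
        fprintf(stderr, "offset %ld is outside file of size %ld\n",
                offset, filesize);
        return -1;
    }
    return fseek(fp, offset, SEEK_SET);
}
```

On POSIX systems `fseek` to a position past the end of the file normally succeeds, so without a bounds check like `seek_checked` the problem only shows up at the following `fread`.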
I've decompiled the .DMP in question into an ARPA file and verified that those
1-gram/2-gram/3-gram counts are correct.
Do you have any idea why this fails in the section of ngram_model_dmp.c under
the comment that follows, after apparently reading the n-grams correctly:
/* read n_prob2 and prob2 array (in memory) */
Thanks for any leads you can give me in troubleshooting this. I have checked
in sphinx_config.h whether any type sizes are set wrongly (since this seems to
be about the seek position in a binary file, and it could be due to wrong byte
sizes), but I can't see any.
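One way to test the byte-size theory is to work backwards from the logged offsets. The sketch below assumes that each offset advances by the entry count times the record size, and that the bigram block holds one extra trailer entry (as the "+trailer" in the log suggests). `implied_record_size` is a hypothetical helper, not a sphinxbase function.

```c
/* Hypothetical helper: given the offsets before and after a block of
 * `count` fixed-size records, return the record size in bytes that the
 * offsets imply, or -1 if the span is not an exact multiple of count. */
long implied_record_size(long start, long end, long count)
{
    long span = end - start;

    if (count <= 0 || span < 0 || span % count != 0)
        return -1;
    return span / count;
}
```

With the values from the log, `implied_record_size(81672, 5324232, 436879 + 1)` gives 12 bytes per bigram and `implied_record_size(5324232, 8670520, 418286)` gives 8 bytes per trigram. The final offset, 8670520, is also larger than the logged file size of 5642542, which would explain the premature EOF. If the bigram and trigram records in this DMP format are meant to be smaller than that (an assumption worth checking against the struct definitions in the sphinxbase headers), then the offsets are being advanced with the wrong record sizes.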
This seems to be a regression introduced recently; one needs to check the
change history to find where it broke.
OK, are you saying it's probably a regression in one of the 0.7 distributions
of sphinxbase?
Well, on further investigation this turned out to be an own goal. I had made
some changes to the ARPA/DMP model templating system to fix some other library
issues and it broke DMP reading for most DMPs. Thanks for your help.