Hi. I have been using Sphinx3 with the HUB4 "open source" acoustic model and the WSJ 5k language model. I wanted to see how Sphinx2 compares in terms of speed, so I built Sphinx2 and tried to run it with the Sphinx2 HUB4 "open source" acoustic models and the WSJ 5k language model. Unfortunately, Sphinx2 cannot load wsj5k.DMP. It aborts with the error message:
INFO: lm_3g.c(864): Reading LM file model/lm/wsj5k.DMP (name "")
FATAL_ERROR: "lm_3g.c", line 522: No \data\ mark in LM file
Does Sphinx2 use a different LM format? I could not find anything about this in the documentation.
Regards,
Mike
1) Use sphinx3_lm_convert to convert the binary compressed model back to text (see the sketch below).
2) Use PocketSphinx instead of Sphinx2; it's even faster and more efficient.
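For option 1, the invocation is roughly like this (a sketch from memory of the sphinxbase-era conversion tools, so treat the flag names as assumptions and check the tool's own usage output):
# Convert the binary .DMP language model back to ARPA text format.
# The -i/-o flag names are an assumption; run sphinx3_lm_convert
# with no arguments to see its actual usage.
sphinx3_lm_convert -i model/lm/wsj5k.DMP -o model/lm/wsj5k.arpa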
Thanks! I uncompressed the LM.
Sphinx2 now complains about the dictionary being too large.
INFO: lm_3g.c(901): 130615 words in dictionary
FATAL_ERROR: "lm_3g.c", line 918: #dict-words(130615) > 65534
Strange that cmudict is too large for CMU Sphinx2. I'll try pocketsphinx.
--Mike
I suppose you can either strip cmudict down to the unigrams in wsj5k.DMP, or use the swb model included in PocketSphinx. It should not be worse.
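A minimal sketch of the stripping approach, assuming the LM has already been converted to ARPA text as above (file names here are placeholders, and cmudict.dic stands for whatever cmudict file you have):
# Pull the unigram word list out of the ARPA-format LM.
sed -n '/\\1-grams:/,/\\2-grams:/p' wsj5k.arpa \
    | awk 'NF >= 2 { print $2 }' > wsj5k.vocab
# Keep only the cmudict entries whose head word is in that list,
# stripping alternate-pronunciation markers like WORD(2) first.
awk 'NR == FNR { vocab[$1]; next }
     { w = $1; sub(/\([0-9]+\)$/, "", w); if (w in vocab) print }' \
    wsj5k.vocab cmudict.dic > wsj5k.dic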
Thanks. That worked. Now I'm able to use the HUB4 LM (not sure why it failed before -- must have been the dictionary issue) and the WSJ 8kHz AM (the one that comes with PocketSphinx) with the swb.dic file. I get 70.9% word accuracy on my WSJ test set. Sphinx3 gets 82.7% on the same set of sentences (but with 16 kHz bandwidth audio and AM). Does this sound like the accuracy you would expect?
Best regards,
Mike
Hi,
There is a pretty big vocabulary and language model mismatch, but that still seems pretty far out of line.
With the "standard" WSJ5k bigram model and the 8khz AM that comes with PocketSphinx, I get between 8.0 and 8.5% WER depending on the beam settings.
This is on the si_et_05 test set which is a bit harder than the si_dt_05 development set.
Hmm.. something's wrong then. I'm evaluating on si_et_20. Is the standard WSJ5k bigram model publicly available? Thanks!
--Mike
Hi. I didn't have WSJ0, so I had to order it from LDC. Now I'm set up to test on si_et_05. If I use the WSJ5k bigram model and the 8kHz AM that comes with PocketSphinx, I get 79.0% word accuracy. By contrast, I get 92.4% accuracy with the HUB4 AM and WSJ5k LM on Sphinx3.
This is for PocketSphinx 0.4.1. The latest version from svn compiles but does not pass "make check". Same with the latest nightly build.
--Mike
Hmm, that's definitely strange. With PocketSphinx 0.4.1, on Linux, I get 8.05% WER (91.95% accuracy). Here is the script I use for testing on si_et_05. I have the unshortened .sph files in the directory ./si_et_05, and wsj_test.fileids looks like this:
si_et_05/440/440c0201
si_et_05/440/440c0202
...
On a 3.0GHz Pentium4, this runs at an average of 0.16 xRT.
#!/bin/sh
expt=$1
if [ x"$expt" = x ]; then
    >&2 echo "Usage: $0 EXPTID [DECODER]"
    exit 1
fi
decode=${2:-../src/programs/pocketsphinx_batch}
$decode \
    -hmm ../model/hmm/wsj1 \
    -dict bcb05cnp.dic \
    -lm bcb05cnp.z.DMP \
    -lw 7.5 -wip 0.5 \
    -beam 1e-60 -wbeam 1e-40 -bestpathlw 11.5 \
    -cepdir . -cepext .sph \
    -adcin yes -adchdr 1024 \
    -ctl wsj_test.fileids \
    -hyp $expt.hyp \
    -latsize 50000 \
    > $expt.log 2>&1
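(To reproduce this, save the script under any name, e.g. run_si_et_05.sh -- that name is mine, not part of the original setup -- and invoke it as "sh run_si_et_05.sh myexpt". The hypotheses land in myexpt.hyp and the full decoder log in myexpt.log.)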
65534 (or 65536?) is the maximum number of words Sphinx2 and PocketSphinx can accommodate.
CB
PocketSphinx also complains that CMUdict is too big. Are word frequencies available for CMUdict? Are there tools to prune infrequently used words?
It looks like Sphinx2 and PocketSphinx cannot handle a dictionary with more than 65534 words.
Thanks!
--Mike
Yes, this is an annoying bug in the Sphinx2 language model code, which PocketSphinx inherited up through version 0.4.1.
The development version of PocketSphinx in the Subversion repository has removed that limit (there is still a limit of 65536 words in a .DMP-format language model, since that file format stores word IDs as 16-bit integers).
Ahh I just realized that you are using the rather lousy WSJ5k language model that's included for testing purposes with PocketSphinx. That is not the same as the standard (bcb05cnp.Z) language model which comes with the WSJ0 corpus.
Unfortunately we can't redistribute the bcb05cnp language model, and it's not at all clear what data was used to train it, so I just trained a language model from the acoustic model transcripts to use for testing purposes.
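For the record, training an LM from transcripts like that follows the usual CMU-Cambridge LM toolkit recipe; something along these lines, where transcripts.txt and the output names are placeholders and the exact options may differ by toolkit version:
# Count word frequencies and derive a vocabulary from the transcripts.
text2wfreq < transcripts.txt | wfreq2vocab > wsj.vocab
# Map the text to id n-grams, then estimate an ARPA-format model.
text2idngram -vocab wsj.vocab -idngram wsj.idngram < transcripts.txt
idngram2lm -vocab_type 0 -idngram wsj.idngram -vocab wsj.vocab \
    -arpa wsj5k.arpa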
Thanks, David. With bcb05cnp the accuracy is actually worse (77.4% compared to 79.0% with wsj5k). Perhaps it is an acoustic problem. What parameters do you use for feature extraction?
--Mike
Hmm, very strange. I am using the default parameters from the wsj1 acoustic model:
-lowerf 1
-upperf 4000
-nfilt 20
-transform dct
-round_filters no
-remove_dc yes
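If you run feature extraction as a separate step, those parameters go straight onto the sphinx_fe command line, roughly like this (a sketch assuming sphinxbase's sphinx_fe and the file layout from the test script above, so verify the flags against sphinx_fe's usage output):
# Hypothetical sphinx_fe run with the wsj1 defaults listed above.
sphinx_fe -samprate 8000 \
    -lowerf 1 -upperf 4000 -nfilt 20 \
    -transform dct -round_filters no -remove_dc yes \
    -c wsj_test.fileids -di . -ei sph -do feat -eo mfc -nist yes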
Using those parameters doesn't change the score at all.
I also tried feature extraction directly from the wideband speech (rather than the downsampled speech) and that did not change the score much.
I think the only thing left is the dictionary. You are using "bcb05cnp.dic" (which does not seem to be included with WSJ0) and I am using "swb.dic". Where did bcb05cnp.dic come from?
--Mike
Ah, there's your problem. You have a big mismatch between the language model and the dictionary. bcb05cnp.dic is a dictionary I generated from the bcb05cnp language model and cmudict. I used the 'ngram_pronounce' tool from the (unreleased but available from SVN) CMU language modeling toolkit to do this, but for your convenience I've put a copy of it at:
https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/regression/bcb05cnp.dic
Thanks, David! That was it. WER is now 6.9%. It makes sense that restricting the vocabulary to the proper domain would bring up the accuracy.
--Mike
Great! Actually it's not a matter of restricting the vocabulary; the problem is just that the vocabulary in the language model has to match (or be a subset of) the one in the dictionary. The swb and bcb05cnp language models have different vocabularies (swb is trained on telephone conversations, bcb05cnp on financial news stories), and swb.dic only contains the words that are in the swb language model. So if you use it with the bcb05cnp language model, you are actually only able to recognize the intersection of the two vocabularies, which is (probably) considerably fewer than 5000 words.
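(A quick way to see how small that intersection is, assuming you have extracted a one-word-per-line vocabulary list from each LM as in the earlier snippet -- swb.vocab and bcb05cnp.vocab are assumed names, not shipped files:)
# Count the words the two LM vocabularies have in common.
sort -u swb.vocab > swb.sorted
sort -u bcb05cnp.vocab > bcb05cnp.sorted
comm -12 swb.sorted bcb05cnp.sorted | wc -l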
I wrote a little Perl script to read in CMUDICT and the bcb20cnp language model, and write out a new bcb20cnp.dic dictionary that is small enough for PocketSphinx to load. Even with this configuration, PocketSphinx achieves 73.4% accuracy while Sphinx3 achieves 82.9% word accuracy on the si_et_20 set. Is it expected that PocketSphinx accuracy is comparable to that of Sphinx3 for smaller vocabularies but worse for larger vocabularies?
--Mike
Hi,
It depends on the acoustic model, but in a general sense (and using the default acoustic models), yes.
Also, you are using the .wv1 files, not the .wv2 files from WSJ0, right?
Yes, the scores I reported were for the wv1 (Sennheiser) files. One possible difference is that I used Matlab to downsample these files from a 16000 Hz sample rate to 8000 Hz before performing feature extraction.
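One way to rule Matlab out would be to downsample with sox instead and re-run; something like this (an untested sketch, and older sox builds need the explicit resample effect spelled out):
# Hypothetical sox equivalent of the Matlab downsampling step.
for f in si_et_05/*/*.sph; do
    sox "$f" -r 8000 "${f%.sph}_8k.sph" resample
done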
--Mike