Hi,
I would like to get some information about the data on which the en-us-semi model is trained. How many hours of data were used? Was it collected in a clean lab environment or a mobile environment? Is it trained on any standard dataset(s)?
thanks
asm
400 hours.
Mixed environments.
As a close approximation you can use the TED-LIUM database.
How do I convert a sendump file, generated by sphinxtrain-1.0.8, to text format? I am trying to use the printp executable as "printp -mixwfn sendump", but it throws an error:
INFO: main.c(419): Reading sendumpand normalizing.
ERROR: "s3io.c", line 164: No SPHINX-III file ID at beginning of file
ERROR: "s3io.c", line 265: Error reading header for sendump
What should I use to get a text version of the sendump file?
regards
asm
There is no such tool, unfortunately.
There is sendump.py in sphinxtrain to convert a sendump file into a mixture_weights file.
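A possible invocation (the script's location and argument order here are my assumption, not verified; check your sphinxtrain checkout):

```shell
# Convert the packed sendump into an s3-format mixture_weights file
# (argument order is assumed; run the script without arguments for usage)
python sendump.py sendump mixture_weights

# printp should then be able to dump the result as text
printp -mixwfn mixture_weights
```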
Hi,
There are two generic English acoustic model packages available for download on the sphinx website: en-us-semi and en-us-semi-full. I would like to know the difference between the two models.
regards
asm
The full version includes the mixture_weights file, which is the uncompressed sendump file. It is useful for adaptation.
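For example, the MAP adaptation step reads it roughly like this (paths are placeholders and the flags are recalled from the sphinxtrain adaptation tutorial; double-check them against your sphinxtrain version):

```shell
# MAP adaptation of a semi-continuous model; -mixwfn needs the
# uncompressed mixture_weights file shipped in en-us-semi-full
./map_adapt \
    -moddeffn en-us-semi/mdef \
    -ts2cbfn .semi. \
    -meanfn en-us-semi/means \
    -varfn en-us-semi/variances \
    -mixwfn en-us-semi/mixture_weights \
    -tmatfn en-us-semi/transition_matrices \
    -accumdir . \
    -mapmeanfn en-us-semi-adapt/means \
    -mapvarfn en-us-semi-adapt/variances \
    -mapmixwfn en-us-semi-adapt/mixture_weights \
    -maptmatfn en-us-semi-adapt/transition_matrices
```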
Hi,
I am doing some tests using the en-us-semi model and find that its footprint is rather larger than my requirement. Is there any way I can trim down the number of mixtures in that model (to, say, 64)? Is it possible for the sphinx team to produce en-us-semi models of various complexities?
regards
asm
What is your requirement, then?
Hi,
I would like to have the en-us-semi model with 64 mixtures (instead of 512).
thanks
asm
That would be very inaccurate. If you want to create a model of a specific size, you should rather provide the size. You should also provide information about the application you want to implement.
The project is low-memory word recognition, that is, recognizing a couple of one- or two-word menu/navigation commands.
The memory limit in my framework for means and variances is around 20 kB. The problem is that the means and variances are 80 kB each for the en-us-semi model. I need a model of the same quality as en-us-semi but with a smaller number of mixtures, to fit in the allowed memory space.
What are my options in such a scenario? Can I trim the existing en-us-semi model down to 64 mixtures without affecting the accuracy? I do not have the data to train a model of en-us-semi quality myself.
regards
asm
Means are a tiny part of the model; the sendump / mixture weights file is 3 MB compared to 40 kB of means. It makes much more sense to reduce the mixture weights than the means. I don't quite understand you because of that.
Hi,
I have encountered a troublesome problem: zh_broadcastnews_16k_ptm256_8000.tar.bz2 is too large for my device, which has only 6 MB of flash. Apart from the system image of about 4 MB, only 2 MB is left for the word recognition application. Can you give me some advice on trimming the zh_broadcastnews_16k_ptm256_8000 acoustic model? Thank you very much!
Yes, you are right about the sendump; I will have to look into that in the future as well. When deciding on each component of the model (means, variances, mixture weights, etc.), I have a memory limit placed on it. Currently I can afford up to 10 kB for means and variances and roughly 1.5 MB for mixture weights. I thought working on the means/variances was more important, as the ratio of reduction is much larger (from 80 kB to 10 kB) than for the sendump (3 MB to 1 MB). I also figured out that 64 Gaussians would fit the requirement for means/variances.
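As a rough sanity check on these numbers, here is a sketch of the arithmetic (my assumptions: a single 39-dimensional feature stream and 4-byte floats; the real semi-continuous layout may use several streams, so treat this as approximate):

```python
# Approximate size of a shared Gaussian codebook (means or variances).
# Assumes one 39-dimensional stream of 4-byte floats; ignores file headers.

def gaussian_param_bytes(n_density, n_dims=39, bytes_per_float=4):
    return n_density * n_dims * bytes_per_float

print(gaussian_param_bytes(512))  # 79872 bytes, i.e. ~80 kB
print(gaussian_param_bytes(64))   # 9984 bytes, i.e. ~10 kB
```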
regards
From 3 MB to 1 MB you save 2 MB. From 80 kB to 10 kB you save only 70 kB. You should probably think twice about that ;)
I can train you a PTM model that fits in 2 MB total; let me know if that is OK for you.
Yes, that's quite right about the savings.
I am treating the sendump as a configurable parameter in the prototype and the means/variances as a fixed component in my design.
Configurable information will be taken care of by the app layer, whereas the fixed parameters (means/variances) will be coded in the lib section.
I am focusing on the fixed section, which includes the Gaussian parameters. So I need the en-us-semi model with 64 Gaussian parameters. The sendump can stay at its current complexity (6000 senones), since it is configurable in my design anyway.
Thanks for offering to train.
asm
Hi,
How do I use multiple pronunciations in a dictionary? Is the following correct for the word "hello"?
hello HH AH L OW
hello(2) HH AE L OW
Are the triphone HMMs for all these variants connected in parallel when they are present in the dictionary?
regards
asm
Yes
Yes
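For illustration only (this is a hypothetical sketch of the convention, not pocketsphinx source): entries suffixed with (2), (3), … are alternative pronunciations of the same base word, and a reader can group them like this:

```python
import re
from collections import OrderedDict

DICT_TEXT = """\
hello HH AH L OW
hello(2) HH AE L OW
world W ER L D
"""

def load_variants(text):
    """Group dictionary lines into {base word: [pronunciations]}."""
    variants = OrderedDict()
    for line in text.splitlines():
        word, phones = line.split(None, 1)
        base = re.sub(r'\(\d+\)$', '', word)  # strip the (2), (3), ... suffix
        variants.setdefault(base, []).append(phones.split())
    return variants

print(load_variants(DICT_TEXT)["hello"])
# [['HH', 'AH', 'L', 'OW'], ['HH', 'AE', 'L', 'OW']]
```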
Hi,
My keyphrase is "hello world". As I said before, I have two pronunciations of the word "hello" in the dictionary, as follows.
hello HH AH L OW
hello(2) HH AE L OW
The dictionary stores these words as dict->word[0] ("hello") and dict->word[1] ("hello(2)") respectively, with the associated ciphone entries.
I run the test with -kws "hello world".
I observe that, in the function kws_search_reinit(..), the id for the word "hello" is picked from the dictionary as follows:
wid = dict_wordid(dict, wrdptr[i]);
and then all the phones for the first entry of "hello" are picked and linked. The question, then, is: what happens to the alternative entry of "hello", i.e. hello(2)?
Am I giving the keyword correctly? Is there a setting to make PS consider the alternative pronunciation?
regards
asm
Hi, asm.
It seems you're using a slightly outdated pocketsphinx. In the current version you can provide a file with a list of keyphrases via "-kws", or a single keyphrase with "-keyphrase".
Alternative pronunciations are currently ignored in kws_search. It would be greatly appreciated if you could provide a patch to fix this omission. Join the #cmusphinx IRC channel to find help on implementation issues.
The other way is to specify two keyphrases, each having the special pronunciation case as a separate word.
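Concretely, that workaround could look like this (the spelling "hello2" is just an example name). Keyphrase list file passed via "-kws":

```
hello world
hello2 world
```

with matching dictionary entries:

```
hello HH AH L OW
hello2 HH AE L OW
world W ER L D
```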
Oh, I didn't realize that. Thanks for the pointers.
One more issue: when I use the keyword "good day" and the recording actually contains "mood day", it is detected as the keyword. I am using the en-us-semi model for testing. Is it normal to confuse "good" with "mood"? What can I do to prevent or minimize such false alarms, other than playing with the threshold?
regards
asm
yes
Not that much; the threshold should help.
Though some papers suggest using words that are similar to the keywords as part of the garbage model. I.e., you can specify both "good day" and "mood day" for spotting and set appropriate thresholds for them to gain additional discrimination.
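With the keyphrase list format that could look like this (the thresholds are made-up starting points to tune per phrase; the application then simply discards "mood day" detections):

```
good day /1e-30/
mood day /1e-30/
```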
I am also observing that I can detect "good day" accurately when I speak it in a normal style. When there is a short (say, 1 s) silence between the words "good" and "day", detection fails.
So, how do I incorporate silences between the words for detection?
One thing I tried is adding SIL to the words in the dictionary, as in SIL G UH D SIL. I tried adding SIL to either or both ends of the words, but it didn't improve the accuracy.
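For reference, the dictionary change I tried looks like this (it makes SIL a fixed part of each pronunciation):

```
good SIL G UH D SIL
day SIL D EY SIL
```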
Share the audio you're trying to recognize (add "-rawlogdir /some/directory") and info on the models you use.
Hi,
What is the significance of the word position (b, e, i, s) of a triphone for keyword spotting? Does PS actually use this information while linking the triphones? Where in the code can I see that?
If I need the triphone "x y z",i and my acoustic model only has "x,y,z",b and "x,y,z",e, what happens? Does PS use the triphone at word position b or e instead of i? Is it OK in practice to make such a change directly in the mdef file (i.e., changing position b to i, when i is required and only b and e are present)?
thanks
asm