Folks-
I'm new to speech recognition. Sphinx 4 seems great, but I do have a few questions...
For instance, what is the purpose of the fillerdict? If I replace cmudict.0.6d with something I created at the CMU lmtool site, do I just leave the fillerdict alone? I tried combining content from various fillerdicts, and the program usually complained.
Also, I gather that the lm setup uses the SimpleNgramModel, and that if I'm using a *.DMP file (a binary version of the lm) I should configure the LargeTrigramModel instead. I tried simply substituting an lm for a DMP, but the run complained about the missing binary. I would like to convert to DMP as discussed above (sphinx3_lm_convert), but was unable to find the utility mentioned in the share cvs branch (Mar 24, 2008).
Basically, I would like to process a multimedia "stripped" wav file using a transcript-generated lm or dmp.
Please help! I'm demo'ing soon to NBCU.
Thanks,
-Dan Cleary
> For instance, what is the purpose of the fillerdict?
It lists fillers - words that can be inserted after every word of the jsgf grammar or lm during search. For example, you can match breath noise using the word ++BREATH++ with the transcription +BREATH+. Or you can use a filler HM with the transcription HH M to recognize paralinguistic sounds. You can add the word SAY to the fillers to strip a garbage word your speaker keeps using. The advantage of filler words is that you don't need to add them to the language model; they are inserted into the search space automatically. They also carry a lower weight, so a real word is matched first, and only then a filler.
But remember that the phones used in the transcriptions in the filler dictionary must be present in the acoustic model. For example, if your model doesn't have the phone +BREATH+, you won't be able to add that word as a filler. Thus it's not practical to combine fillers from different models.
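To make it concrete, a filler dictionary is just a plain-text file with one filler word and its phone transcription per line. A rough sketch following the WSJ model's conventions (the phone names on the right must be phones your acoustic model actually defines):

    <s>          SIL
    </s>         SIL
    <sil>        SIL
    ++BREATH++   +BREATH+
    ++UH++       +UH+
    ++NOISE++    +NOISE+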
> but was unable to find the utility mentioned in the share cvs branch. (Mar 24, 2008).
It's in the sphinx3 package.
Nickolay-
Thanks for the great answer. It seems I have a lot of learning to do! Do you also know of any examples of a generic wav-file read and transcription using an lm model? I'm not looking for great results just now, mostly the pipeline or process flow.
For instance, I'd like to transcribe a news story and create an lm and dict that would give reasonable results when I process a corresponding wav file (my voice). Then I'd like to do the same for another media genre, say sports or weather. Either I could customize each lm / dict separately, or develop one that works fairly well, say 60-70%, for every story. The purpose is purely demo at this point... Do you know of OR can you suggest a starting point - a Java example, etc.?
BTW, am I correct about the SimpleNgramModel / LargeTrigramModel, i.e. use Simple for lm and Large for DMP?
Thanks again for all. You are most kind!
-Dan
> Do you know of OR can you suggest a starting point - a Java example, etc.?
The demos in sphinx4 are a good start. Actually, if Java is not a strong requirement, I suggest you try sphinx3 first; it will give you better accuracy and has some features sphinx4 is missing. But it's OK to use sphinx4 as well. Start with the transcriber demo, or try this modification:
http://www.mediafire.com/?5uxxffxsjop
It transcribes a wave file, though with a pretty unusual language model.
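Roughly, the heart of the transcriber demo is only a few lines. A sketch from memory - check the demo source; config.xml, the "recognizer" and "audioFileDataSource" component names, and story.wav are placeholders, and older sphinx4 releases may use a StreamDataSource instead of AudioFileDataSource:

    import java.net.URL;
    import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class SimpleTranscriber {
        public static void main(String[] args) throws Exception {
            // Load the XML config that wires up front end, linguist and decoder.
            ConfigurationManager cm = new ConfigurationManager(
                    SimpleTranscriber.class.getResource("config.xml"));

            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            // Point the front end at the wav file to transcribe.
            AudioFileDataSource dataSource =
                    (AudioFileDataSource) cm.lookup("audioFileDataSource");
            dataSource.setAudioFile(new URL("file:story.wav"), null);

            // Decode utterance by utterance until the file runs out.
            Result result;
            while ((result = recognizer.recognize()) != null) {
                System.out.println(result.getBestFinalResultNoFiller());
            }
            recognizer.deallocate();
        }
    }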
> BTW, am I correct about the SimpleNgramModel / LargeTrigramModel, i.e. use Simple for lm and Large for DMP?
It's just about speed and memory. If your language model is large and sphinx4 is slow, you can switch to the DMP variant and to the LargeTrigramModel.
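The switch itself is just a config change. A sketch of the component definition (property names as in the sphinx4 demo configs; the component name and the DMP path are placeholders for your own setup):

    <component name="trigramModel"
               type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
        <property name="location" value="file:models/language/mystory.DMP"/>
        <property name="logMath" value="logMath"/>
        <property name="dictionary" value="dictionary"/>
        <property name="maxDepth" value="3"/>
        <property name="unigramWeight" value="0.7"/>
    </component>

Then point the linguist's languageModel property at this component instead of the SimpleNgramModel one.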
> Either I could customize each lm / dict separately, or develop one that works fairly well, say 60-70%, for every story.
Well, first of all, try a simple setup with a single language model. It will work fine, but you can improve the quality significantly with the following tricks:
1) Adapt the acoustic model to your own voice (if you'll be the only speaker, it makes sense to adapt the model to your voice to get better performance). This will require sphinx3.
2) Select the language model on the fly according to the topic (see the sketch after this list). This will require you to train several language models, but again it can pay off in the end.
There are many more methods to improve things, but you should really start with the basic setup first.
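For (2), the simplest way in sphinx4 is to define one recognizer per topic in the config and pick one at runtime. A rough sketch - the component names and the topic decision are hypothetical, you'd supply your own classifier:

    // "newsRecognizer" and "sportsRecognizer" would be two recognizer
    // components in config.xml, each wired to its own language model.
    String name = isSports ? "sportsRecognizer" : "newsRecognizer";
    Recognizer recognizer = (Recognizer) cm.lookup(name);
    recognizer.allocate();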
Hello all,
I have generated a .lm file at the following site:
http://www.speech.cs.cmu.edu/tools/lmtool.html
How can I convert it into a .DMP file? How can I use it in my application? Do I have to create my own dictionary, or can I use cmudict itself?
Thank you in advance.
Nickolay-
Great advice! Thanks so much!
I'll get started immediately.
Regards,
-Dan
> How can I convert it into a .DMP file?
With the sphinx3_lm_convert tool from the sphinx3 package.
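For example (a sketch - flag names can differ between versions, so check sphinx3_lm_convert -help; mymodel.lm and mymodel.DMP are placeholders):

    sphinx3_lm_convert -i mymodel.lm -o mymodel.DMP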
> How can I use it in my application?
By using a proper config, like in the confidence demo, for example.
> Do I have to create my own dictionary, or can I use cmudict itself?
When lmtool generates a language model, it also generates a matching dictionary. Download everything in the tar archive and use that.
Hi,
Thank you for the reply. If I use my own language model with the existing WSJ acoustic model to convert wav to text, do you think it will work?
Thank you again.