Good afternoon.
I have recently developed an application for Windows 7 that uses speech recognition.
I built the speech recognition with Windows SAPI, but lately I wasn't getting the accuracy I wanted, so I decided to try another engine, and Sphinx seems like a good fit for me.
I just need some guidelines on which direction to take: choosing the decoder, whether I need acoustic model training or not, etc.
What I have is a command-and-control app with very well defined expressions that I will use. Other than these I use numbers. So I am guessing a grammar would be the way to go.
I thought about using PocketSphinx because it is C, so I can wrap it in C# faster, but should I spend more time with Sphinx4 instead? Would it be better in the long run? It seems better documented and more recently updated.
Would creating or adapting an acoustic model with recordings of the people who will use it, covering only the necessary expressions, be a valid solution?
Dumb question maybe, but would the ideal solution for good speech recognition be an acoustic model for each user, so that depending on who runs the app, their "profile" is loaded? Or is it better to have one acoustic model built from recordings of every person who will use the app? Or even keep the default acoustic model (I think it's the WSJ one) and just adapt the language model / dictionary / grammar to the expressions I need?
Thanks in advance for your time.
Hello
Controlling the computer with speech actually has limited usability. In the long term you need to target dictation.
I think you should try PocketSphinx.
We think adaptation is a good strategy to take.
The optimal solution would be to store an adaptation profile for each user. That includes both an adapted language model and an adapted acoustic model. You can have an enrollment stage to create them.
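To make the per-user profile idea concrete, here is a minimal sketch with the PocketSphinx C API; the directory names are hypothetical, and a user without a profile would simply be pointed at the default model directory instead:

    #include <pocketsphinx.h>

    /* Hypothetical helper: build a decoder from a given user's adapted models,
       falling back to the stock model directory for users with no profile yet. */
    static ps_decoder_t *load_profile(const char *hmm_dir)
    {
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                "-hmm",  hmm_dir,         /* adapted (or default) acoustic model */
                "-lm",   "commands.lm",   /* adapted language model              */
                "-dict", "commands.dic",  /* application dictionary              */
                NULL);
        return config ? ps_init(config) : NULL;
    }

    /* e.g. load_profile("profiles/ricardo/hmm") after the user is identified,
       or load_profile("model/hub4wsj_sc_8k") when no profile exists. */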
I'm not trying to control the computer. What I meant was that my app has defined command inputs that the user learns to use. These are small, well defined expressions, enough for what we need. I don't think dictation will be easier on the users if they can say "whatever" they want. Or were you referring to something else?
Thanks for your responses.
Great, then an open language interface will be a feature for the next version of your application.
:)
So I think I'll start with PocketSphinx and try to do an acoustic model adaptation.
Just a couple more questions:
- I will probably need number recognition up to 1 million. Should I record only the components? Like 1, 2, 3... 10, 20, 30... 100, 200, 300... 1000, 2000, 3000?
- The grammars I use now are SAPI compliant, so they are GRXML files. I can access semantics in them too. Is there a feature like this for the JSGF grammar files PocketSphinx uses? I need this feature to split a sentence into its meaningful parts:
Example: <person> travels to <city>
If it recognizes "Ricardo travels to Lisboa", with semantics I can extract the name and the city... know what I mean?
Thanks for your help.
I'm sorry, but what did you mean by "You can have an enrollment stage to create them"?
Sorry, I don't quite understand how "record" is applicable here.
There is no sense extraction right now; you have to analyze the output from the recognizer yourself. But such a framework will probably appear soon. At least there is the Olympus project doing that:
http://wiki.speech.cs.cmu.edu/olympus/index.php/Olympus
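For what it's worth, here is a rough JSGF sketch of such a grammar (rule names and word lists are invented for illustration, and teens are omitted); the decoder only returns the recognized word string, so the <person> / <city> / number slots still have to be parsed out of that string by the application:

    #JSGF V1.0;
    grammar commands;

    public <command> = <travel> | <amount>;

    <travel> = <person> travels to <city>;
    <person> = ricardo | maria;
    <city>   = lisboa | porto;

    // Numbers up to 999,999 composed from a small set of words,
    // so only these component words need dictionary entries.
    <amount>      = <subthousand> [ thousand [<subthousand>] ];
    <subthousand> = <ones> [hundred [<tens>] [<ones>]]
                  | <tens> [<ones>];
    <ones> = one | two | three | four | five | six | seven | eight | nine;
    <tens> = ten | twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety;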
I meant that you can give the user a small text to read and then use that short recording for acoustic model adaptation.
Sorry, I don't quite understand how "record" is applicable here.
I am talking about that "enrollment stage" for the acoustic model adaptation. I thought that for this, all the users I will have in the app could record their voice saying the expressions my app needs. I was asking whether, when recording numbers, I just need those component expressions, e.g. record 1000, 200 and 50, or record 1250 as well.
Or maybe, as you said, just recording a small text from the user would be enough for acoustic model adaptation.
In summary: for acoustic model adaptation, is it better to have the users record all the expressions I need, or just a portion of text?
There is no need to read boring numbers. Give them adaptation text that's easy and fun to read. Try to select a few variants and present one to each particular user at random.
Have you seen Dragon 10? They suggest the user read a paragraph from "Dogbert's Top Secret Management Handbook", for example.
Oh OK... since I can train my users (they are specific people, it's not a global app), I will look for that book.
When I do acoustic adaptation, does this mean I will get a new acoustic model that gives better results for the person whose audio I collected?
I am not quite clear whether I will have one acoustic model per user, or keep building a single acoustic model with adaptation from each user, i.e. only one acoustic model, but adapted to all the users.
What if a user I have never adapted for uses the app? Will the results be less accurate, but a LOT less accurate?
Thanks again for your time and patience.
You still need to take care of them, though.
Yes, adaptation gives you a new acoustic model that performs better for the person whose audio you collected.
If you can identify users, it's better to keep separate models for them. You can also have one model adapted to all your users, but it will be less accurate than per-user models.
For a user you have never adapted for, I wouldn't use an adapted model; it's better to start with the default model.
Hey again.
I started the tutorial for model adaptation (http://cmusphinx.sourceforge.net/wiki/tutorialadapt) and a question arose.
The dic file (arctic20.dic) has all the words from the arctic20 text.
If I give my users a text to read that is smaller than all the expressions I need (like you suggested previously), the dictionary should still have ALL the expressions I need, right? Not just the ones the users will record... is that it?
Another thing, I don't know if you can help, but I already have a little C# wrapper, and I want PocketSphinx running in the background. This should involve threading so it doesn't "freeze" my user interface, right?
Take care
Forgot to ask:
Do I have access in PocketSphinx to choosing the input/output interface? Meaning, can I choose which microphone to use if I have more than one available, or is it always the default one?
Thank you.
If you want to recognize a word, it must be in the dictionary.
Yes, the decoding should run on its own thread so it doesn't block the user interface.
On Linux there is the -adcdev option that lets you choose the device. On Windows, only the default one is used.
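As a rough sketch (plain C with pthreads purely for illustration; a C# wrapper would use its own threading, and error handling and end-of-utterance detection are left out), the capture-and-decode loop can live on a worker thread, and on Linux a specific microphone can be selected with ad_open_dev instead of the default device (API names as in the 0.6-era PocketSphinx):

    #include <pocketsphinx.h>
    #include <sphinxbase/ad.h>
    #include <pthread.h>

    static void *recognize_loop(void *arg)
    {
        ps_decoder_t *ps = (ps_decoder_t *)arg;
        /* "plughw:1,0" is just an example ALSA name; ad_open() takes the default mic */
        ad_rec_t *ad = ad_open_dev("plughw:1,0", 16000);
        int16 buf[2048];
        int32 n;

        ad_start_rec(ad);
        ps_start_utt(ps, NULL);
        for (;;) {                     /* a real app would check a stop flag */
            n = ad_read(ad, buf, 2048);
            if (n > 0)
                ps_process_raw(ps, buf, n, FALSE, FALSE);
            /* ...on silence: ps_end_utt(ps); read ps_get_hyp(); hand the string
               back to the UI thread through a queue; then ps_start_utt() again */
        }
        return NULL;
    }

    /* from the UI thread:  pthread_create(&tid, NULL, recognize_loop, ps);  */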
Good morning!
Not needed for now, but not being able to choose which microphone my app uses could bring some trouble in the future.
Let me rephrase my first question, just to be clear and not have to ask again.
I have identified all the expressions I need in my recognition application, and I have created a dictionary file that covers all of them.
Now, you said I could use a random text for the "enrollment stage". This text has phrases with words that aren't covered by the dictionary file created earlier, meaning it contains expressions that won't need to be recognized in my application.
So, for the adaptation stage, does the dictionary file have to contain all the expressions I need to recognize plus the words from the random training text? Or is the dictionary only used / important in the recognition phase, so it should only have the words that will be recognized?
Thank you
Yes, for the adaptation stage the dictionary has to cover the enrollment text as well.
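In other words, the adaptation dictionary can freely mix the application's command words with words that only occur in the enrollment text. A few illustrative entries in CMUdict format (double-check the exact pronunciations against cmudict):

    TRAVELS   T R AE V AH L Z
    ONE       W AH N
    TWENTY    T W EH N T IY
    HUNDRED   HH AH N D R AH D
    THOUSAND  TH AW Z AH N D
    HANDBOOK  HH AE N D B UH K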
Right.
What if I present random sentences to the user in the enrollment stage? Will I have to create a dictionary for those sentences specifically to adapt the acoustic model?
By the way, does the transcription file (http://www.speech.cs.cmu.edu/cmusphinx/moindocs/arctic20.transcription) have to have upper case letters?
And for adaptation you suggest a Linux environment, right?
Yes, those sentences would need dictionary entries too.
Sorry, I don't quite understand your problem. I didn't tell you to present random sentences; I told you to use interesting sentences, not random ones. You can choose 2-3 paragraphs from some nice book. There is no issue with adding those paragraphs to the dictionary.
In case you do want random ones, you can first of all check that all the words are present in cmudict. If they are not, move on to the next sample; cmudict is quite representative.
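A quick way to do that check, sketched as a shell snippet (the cmudict file name depends on the version you have):

    # words in candidate.txt that have no entry in cmudict
    tr 'A-Z' 'a-z' < candidate.txt | tr -cs 'a-z' '\n' | sort -u > words.txt
    cut -d' ' -f1 cmudict.0.7a | tr 'A-Z' 'a-z' | sort -u > dictwords.txt
    comm -23 words.txt dictwords.txt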
No, adaptation works on Linux, Windows and Mac OS. There is no environment restriction.
Sorry if I used "random". I will select 2-3 paragraphs of a nice book and add them to the dictionary.
Maybe what I'm confused about is the dictionary file. I thought the dictionary reflects ONLY the words I want to be recognized. If I create a dictionary file for the model adaptation with the words I want recognized plus the ones from the "nice book", is that not a problem when I run my application doing recognition?
I was asking about doing the adaptation tutorial, because I compiled sphinxbase, pocketsphinx and SphinxTrain on Windows, and I can't find pocketsphinx_mdef_convert anywhere.
Thank you
I did a search on the forums but found nothing about this.
In the adaptation tutorial, when I run
    sphinx_fe `cat wsj1/feat.params` -samprate 16000 -c arctic20.listoffiles -di . -do . -ei raw -eo mfc -raw yes
I get an error.
I checked the sphinx_fe parameters, and in my feat.params file (from the HMM models in pocketsphinx 0.6.1) there are parameters not accepted by sphinx_fe, such as -svspec. Should I not use this HMM model?
Hello
No, the dictionary can be bigger.
It will just work.
Ah, sorry about that. We need to create a project to compile that binary. If you can submit one, that would be helpful!
Sorry, we will update the tutorial to match pocketsphinx 0.6.1 very soon. You only need to pass -svspec as an argument to the bw command in the next stage of the adaptation; you should filter it out of the sphinx_fe options. You can use hub4_wsj; moreover, it's recommended to use this model. If you could update the tutorial or suggest a way to make it clearer, that would be much appreciated!
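A sketch of what that could look like in a shell (treat it as an outline rather than a tested recipe; the exact bw arguments come from the adaptation tutorial):

    # keep -svspec out of the arguments handed to sphinx_fe
    grep -v svspec wsj1/feat.params > fe.params
    sphinx_fe `cat fe.params` -samprate 16000 -c arctic20.listoffiles \
        -di . -do . -ei raw -eo mfc -raw yes

    # later, give the same -svspec value (copied from feat.params) to bw
    # in the statistics-collection step of the adaptation.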
Hello.
I've been meaning to reply to you but was working on another project.
Before I saw this, I moved my work over to Linux. If I get the time in the future I will try this on Windows.
I saw someone (you?) updated the tutorial. Thanks.
Now I have an adapted acoustic model after following the tutorial. What differences should I notice when running, for example, pocketsphinx_continuous?
By the way, what's the best way to create grammars suitable for PocketSphinx?
Thank you.
Improved accuracy?
If you want human-readable grammars, you can use the JSGF grammar format. For machine-generated grammars, use the fsg_* API in sphinxbase.
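For a quick test (assuming a pocketsphinx build with JSGF support), the grammar can be passed straight on the command line together with the adapted model, for example:

    pocketsphinx_continuous -hmm adapted_model_dir -dict commands.dic -jsgf commands.gram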
Yes, of course, but is there a way of measuring that? I saw something about a WER variable, but I think that's just for Sphinx4, right?
After asking you that question I saw another thread saying we could use JSGF grammars too. I wasn't aware of that; I thought FSG was the only accepted type.