Hi all,
I am currently trying to use Sphinx to perform speech recognition with an Australian voice model. I've gathered all the raw recordings for the voice model and am ready to start training Sphinx (I am using Sphinx4, by the way) to recognise them.
I've got a few questions however:
1. Can I assume that an Australian voice model and an Australian language model mean the same thing? I've come across many occurrences of "using my own language model" in the Sphinx FAQ and tutorials on CMU's page, but I wasn't too sure whether they actually meant the same thing.
2. Could you please tell me where the best place would be to learn more about training Sphinx to understand my raw data?
Thanks,
Suwandy
Hi,
I apologise if I seem impatient, but if this isn't the appropriate forum for me to post my question, could somebody please enlighten me? I am really new to all this speech recognition stuff, and especially Sphinx, so I really need some kind of guidance to get started.
Thanks,
Suwandy
"Voice Models" are usually called "Acoustic Models" and are not the same as "Language Models". A language model is sort of like a grammar of a language, while an acoustic model is kind of like the pronunciation of a language. Since the difference between the pronunciation of Australian English and American is markedly different, you will get much better results by first training an acoustic model. There are differences in "grammar", too, so training a language model can help. Just remember that a language model is task dependent, so doing something like training on fiction and then using it for news broadcasts won't really be all that helpful.
Anyway, as for training the acoustic model, you might consider "adapting" the already existing English models with your new Australian data. I haven't done this before, but I think I saw a post or two on the topic.
Best Wishes,
Robbie
Hi,
Thanks for the quick reply and the helpful answer :).
When you say an "acoustic model is kind of like the pronunciation of a language", does that also cover the different accents people may have?
The data I have is ANDOSL (part of a government-funded project). As quoted from the website: "Current data are from native speakers of Australian English and from non-native speakers of Australian English (first generation migrants having a non-English native language)."
That said, can I assume I don't need to train a new language model, since it's English anyway?
The answer to your question about pronunciation is: absolutely! When you train a speech recognizer on the "accent" of a single person, you can get much better results than if you train it on speech from people with lots of different accents.
As for the language model, I would say that you will probably be okay not training a new one, but it's hard to make a blanket statement like that. What will you use the recognizer for? The best thing to do would probably be to add entries to the dictionary as necessary (all of the "Australianisms" that appear in your acoustic-model training corpus that aren't already in there), train your acoustic model, and then test your system with either the WSJ or HUB4 language models (or, if your application is even simpler, maybe even just a JSGF grammar). At that point, you might start to get a feel for how far training a new language model will get you.
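For example, added dictionary entries are just lines of a word followed by its phone sequence, in the ARPAbet-style phone set the acoustic model uses. A sketch (these words and pronunciations are invented purely for illustration, not taken from any real dictionary):

    ARVO      AA R V OW
    BARBIE    B AA R B IY
    FOOTY     F UH T IY

Sphinx looks words up here when building its search space, so anything in your transcripts but missing from the dictionary will cause trouble during training.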
regards,
Robbie
Hi,
Thanks again for the really quick, helpful answer; it is very much appreciated.
My speech recognition tool is, I guess, directed at me for a start. To be precise, I am actually working on a thesis project about implementing speech recognition on the Aibo (ERS-7) robot. Of course, the hard work will all be done on a PC.
The ultimate goal of my thesis (if feasible) is to let me speak to the robot (it has a microphone), which will then transfer the voice it captures to a PC running Sphinx. The PC will then send the recognition result (in text form, e.g. "Sit down" or "Stand up") back to the dog, which will execute the command.
In short, my speech recognition will probably involve simple speech, like "walk for 5 seconds", "walk 10 meters away", "kick a ball", etc., and is not intended to understand (say) news broadcasts or anything really complicated.
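Roughly, I picture the PC side as a loop like the one below (just a sketch modelled on the Sphinx4 HelloWorld demo pattern; AiboCommander, aibo.config.xml, and sendToRobot are names I made up for illustration):

    import java.net.URL;

    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class AiboCommander {

        public static void main(String[] args) {
            // Load the recognizer and microphone defined in the XML config
            URL url = AiboCommander.class.getResource("aibo.config.xml");
            ConfigurationManager cm = new ConfigurationManager(url);

            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            Microphone microphone = (Microphone) cm.lookup("microphone");
            if (microphone.startRecording()) {
                while (true) {
                    // Blocks until an utterance has been decoded
                    Result result = recognizer.recognize();
                    if (result != null) {
                        String command = result.getBestFinalResultNoFiller();
                        sendToRobot(command); // placeholder for the Aibo link
                    }
                }
            }
        }

        private static void sendToRobot(String command) {
            // Placeholder: push the recognised text over whatever transport
            // the robot uses (e.g. a Wi-Fi socket).
            System.out.println("Robot command: " + command);
        }
    }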
What would your suggestion be in this case?
Cheers,
Suwandy
I worked on a similar project (adding voice recognition to a robot). My particular part of the project was simply "proof of concept" so it didn't have to recognize very much. I found the JSGF grammar to work fairly well for that purpose.
I would definitely start with a JSGF grammar because it is much simpler to implement and you get to be the architect; you can even weight different parts of the grammar to make them more probable. Almost all of the demos in the /demos directory use a JSGF grammar.
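For example, a weighted JSGF grammar for your robot might look something like this (just a sketch; the rule names, commands, and weights are illustrative):

    #JSGF V1.0;

    grammar robot;

    public <command> = <action> [ <modifier> ];

    // Higher /weights/ make an alternative more probable
    <action>   = /10/ walk | /5/ kick a ball | /3/ sit down | /3/ stand up;
    <modifier> = for five seconds | ten meters away;

In the Sphinx4 configuration you then just point the grammar component at this file.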
If you do decide to train an LM, just remember that LMs are task specific. Training on newspaper text (e.g. WSJ) will give horrific results for your robot commands. You will need a fairly large corpus of robot commands, and, ideally, each command will occur with about the same frequency as the robot will be commanded. Since this is not likely, you're probably just as well off using a JSGF grammar.
If you do decide to use an LM, check out the HelloNGram demo, which uses a very simple LM to do something similar to what you are trying to do.
Best wishes,
Robbie
Hi,
Thanks for the helpful reply.
I think I'll go ahead and use JSGF, since it's part of Java and I'm using Sphinx4 anyway. However, I am still curious: what is the difference between the JSGF, WSJ, and Hub4 grammars, and how significant is that difference with regard to what I am trying to do?
The audio database I was provided has all the wav files along with the sentence list each speaker was required to say (there are about 200 predefined sentences, so each wav file matches one sentence).
Thanks,
Suwandy
WSJ and Hub4 are language models, not grammars. Language models provide probabilities of word sequences, generally the probability of a third word given the previous two. This is not rule based, just probabilistic. For example, if you see the word "Merry", the next word that follows is "Christmas" with an extremely high probability (at least in American English; not so for the British).
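For concreteness, the n-gram models Sphinx reads are usually stored in the ARPA text format, which looks roughly like this (the words and log10 probabilities below are invented for illustration):

    \data\
    ngram 1=3
    ngram 2=2

    \1-grams:
    -0.9 christmas -0.3
    -0.6 merry     -0.4
    -1.3 wish      -0.2

    \2-grams:
    -0.1 merry christmas
    -1.4 wish merry

    \end\

Each line is a log10 probability, the n-gram itself, and (for the lower-order entries) a backoff weight.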
On the other hand, a grammar is more like a list of rules that constrain the recognizer. In its simplest form, it can be just a limited list of possible words. A language model considers every word possible (though some words are more probable than others), so you don't have the ability to constrain an LM like this. With a grammar, you can also specify arbitrarily long sequences, rather than just three words as in a trigram language model.
The main reason I recommended a word list (besides being able to limit the number of words to only those in your application) is that a language model reflects what it was trained on, meaning that it will tend to recognize sentences that are very similar to what it was trained on. Your application doesn't seem to be very close to what you have in your training corpus.
Even with a JSGF grammar you can add probabilistic elements, which is highly recommended to increase accuracy. Just weight the more frequent words/phrases higher than the rest.
Good luck,
Robbie
Hi,
I think I need to revise what I said earlier about the application I intend to write. I spoke to my thesis supervisor about this, and he apparently wants me to get Sphinx4 (or the robot) to understand a large vocabulary (although, in this case, it will only be as large as the vocabulary available in the database).
Do I need a different application here? Like a different linguist, or a different grammar? I spoke to a guy who knows a bit about Sphinx4, and he said I need to write my own grammar for my application.
Your advice is very much appreciated :)
Thanks,
Suwandy
A grammar can recognize as many words as you are willing to build into it. The problem is that, unless you are able to pre-program probabilities, your accuracy is going to degrade as you add more rules to your grammar.
A review of your options: (1) a grammar-based approach, or (2) an LM approach. Approach (1) is usually used for command-and-control type applications, and robot commands generally fit in this category. My first instinct is to say that you want a grammar (which is what the other fellow recommended as well). The problem with (1) is that you are going to have to hand-code rules, including every possible word that can be said. For even a medium-sized vocabulary this can be very tedious. While you can add probabilities to a JSGF grammar, doing so accurately would be difficult, and not doing it at all would mean that a large grammar would have low accuracy.
(2) An LM is extremely easy to train, provided you have a corpus. It provides probabilities, so your accuracy is potentially better. However, for an LM to work, you MUST have a training corpus of sentences that are very similar to the ones you will be using. In other words, is your corpus a corpus of robot commands? If not, you will probably be very disappointed by the level of accuracy you get.
A middle-ground approach would be to try to build a corpus of robot commands, though it might be hard to ensure that the commands you will use most often also occur most often in the corpus. Maybe you will just assume that each command is as likely as the next. I believe this is the approach used in the HelloNGram demo.
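If you go that route, generating the pseudo-corpus is trivial. A minimal sketch in Java (the command templates and file name are made up; every sentence is written exactly once, so each command comes out equally likely):

    import java.io.IOException;
    import java.io.PrintWriter;

    public class RobotCorpus {
        public static void main(String[] args) throws IOException {
            String[] actions   = { "walk", "run" };
            String[] durations = { "five", "ten", "twenty" };
            try (PrintWriter out = new PrintWriter("robot-corpus.txt")) {
                // Expand each template once per combination
                for (String action : actions) {
                    for (String d : durations) {
                        out.println("<s> " + action + " for " + d + " seconds </s>");
                        out.println("<s> " + action + " " + d + " meters away </s>");
                    }
                }
                out.println("<s> kick a ball </s>");
                out.println("<s> sit down </s>");
                out.println("<s> stand up </s>");
            }
        }
    }

Feed the resulting file to the CMU language-model toolkit to get an ARPA-format LM like the one sketched earlier.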
If you take approach (1), you will use either the FlatLinguist or (more likely) the DynamicFlatLinguist. If you take approach (2), you will probably use the LexTreeLinguist, though for smaller vocabularies it is possible to use the ones mentioned above.
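In the Sphinx4 XML configuration, the choice is just which linguist component the decoder points at. A rough sketch (both components need more properties in practice, and the component names other than the class names are your choice; see the demo configs for the full set):

    <component name="flatLinguist"
               type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
        <property name="grammar" value="jsgfGrammar"/>
        <property name="acousticModel" value="wsj"/>
    </component>

    <component name="lexTreeLinguist"
               type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
        <property name="languageModel" value="trigramModel"/>
        <property name="dictionary" value="dictionary"/>
        <property name="acousticModel" value="wsj"/>
    </component>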
To summarize: the real question is, what type of sentences does your training corpus contain? If you do not have access to a corpus of robot commands, you will probably choose to either write a grammar or build a pseudo-corpus to train an LM on.
Hope this helps,
Robbie