Hello,
I have been given the task of creating a speech recognition application for a social robot.
The goal is for the robot to understand Polish words and react differently after hearing different commands or parts of sentences. Of course, it doesn't have to understand a lot of words; a small or medium dictionary would be enough. How big should the acoustic model be? Is it possible to first create a small model from a limited amount of data, to check whether it recognizes anything, and then, if it does, develop it into a bigger model? How many recordings do I need to make the app able to understand multiple speakers (I mean different people speaking at different times, not two or more people speaking at the same time)?
Would "keyword search" be suitable for such a task? I find it quite problematic to set the threshold value for each of the commands. I think using a JSGF grammar might be more convenient for making the robot understand what people say. The problem is that with a grammar the system tries to match every utterance to the rules, even if it is something completely different.
Best regards,
Artur Zygadlo
I have been given the task of creating a speech recognition application for a social robot.
This is cool.
How big should the acoustic model be? Is it possible to first create a small model from a limited amount of data, to check whether it recognizes anything, and then, if it does, develop it into a bigger model? How many recordings do I need to make the app able to understand multiple speakers (I mean different people speaking at different times, not two or more people speaking at the same time)?
This is covered in the acoustic model training tutorial: http://cmusphinx.sourceforge.net/wiki/tutorialam
Would the "keyword search" be suitable for such a task? I find it quite problematic to set the threshold value for each of the commands.
Keyword spotting is useful for catching the activation keyphrase; grammar recognition is for recognizing the actual input. You can find demos on YouTube. You can also check how Amazon Echo works.
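To make that concrete, here is a rough sketch with the pocketsphinx Python bindings: one keyword-spotting search that waits for an activation phrase, and one JSGF search that decodes the command that follows. The model paths, keyphrase, grammar file and threshold below are placeholders to replace with your own; the threshold in particular has to be tuned per keyphrase.

```
from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths: point these at your own Polish model, dictionary and grammar.
config = Decoder.default_config()
config.set_string('-hmm', 'model/pl')          # acoustic model directory
config.set_string('-dict', 'model/pl.dict')    # pronunciation dictionary
config.set_float('-kws_threshold', 1e-20)      # tune per keyphrase (e.g. 1e-40 .. 1e-10)
decoder = Decoder(config)

decoder.set_keyphrase('wakeup', 'czesc robocie')    # activation keyphrase search
decoder.set_jsgf_file('commands', 'commands.gram')  # command grammar search
decoder.set_search('wakeup')

stream = open('input.raw', 'rb')   # 16 kHz, 16-bit mono PCM; a mic stream works the same way
decoder.start_utt()
was_in_speech = False
while True:
    buf = stream.read(2048)
    if not buf:
        break
    decoder.process_raw(buf, False, False)

    if decoder.get_search() == 'wakeup':
        if decoder.hyp() is not None:
            # Activation phrase spotted: restart the utterance with the grammar search.
            decoder.end_utt()
            decoder.set_search('commands')
            decoder.start_utt()
            was_in_speech = False
    else:
        in_speech = decoder.get_in_speech()
        if was_in_speech and not in_speech:
            # Speech segment ended: read the command, then go back to keyword spotting.
            decoder.end_utt()
            if decoder.hyp() is not None:
                print('command:', decoder.hyp().hypstr)
            decoder.set_search('wakeup')
            decoder.start_utt()
        was_in_speech = in_speech
decoder.end_utt()
```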
Dear Nickolay,
Thank you for your answer.
As I am not experienced (yet) in speech recognition, I have another question. Do you think it is necessary to create a new acoustic model for the Polish language, or would adapting an existing model be enough?
In the acoustic model training tutorial it is mentioned that a language model is needed. Does that mean I have to create a statistical language model even if I decide to use a grammar or keyword search later? To what extent should the language model (and the acoustic model) contain the exact vocabulary that will be used during conversations with the robot?
Should I rather prepare "5 hours of recordings of 200 speakers for command and control for many speakers" or "50 hours of recordings of 200 speakers for many speakers dictation"? Would it be a good idea to prepare a short text and ask these 200 people to read it? Or should the recordings differ in their content?
Sorry for asking so many (stupid) questions, but I'm trying to estimate how much time and work will be needed to get satisfactory results.
Best regards,
Artur Zygadlo
Do you think it is necessary to create a new acoustic model for the Polish language
Yes, you need to create a new model.
Does that mean I have to create a statistical language model even if I decide to use a grammar or keyword search later?
Yes, you need a simple language model; you can create it from a list of words.
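For example, a small sketch of that: write the command sentences (the Polish phrases below are made-up placeholders) into a one-sentence-per-line corpus file, which can then be turned into an .lm and matching .dic with the online lmtool or the cmuclmtk tools.

```
# Hypothetical command list; replace with the robot's real Polish commands.
commands = [
    "podejdź do mnie",
    "zatrzymaj się",
    "obróć się w lewo",
    "obróć się w prawo",
    "powiedz cześć",
]

# One sentence per line; feed this file to the online lmtool (or to cmuclmtk)
# to obtain the statistical language model and a matching dictionary.
with open("commands_corpus.txt", "w", encoding="utf-8") as corpus:
    for command in commands:
        corpus.write(command + "\n")
```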
Should I rather prepare "5 hours of recordings of 200 speakers for command and control for many speakers" or "50 hours of recordings of 200 speakers for many speakers dictation"? Would it be a good idea to prepare a short text and ask these 200 people to read it?
You have to decide yourself whether you want command and control or dictation.
Ideally, the texts for recording should differ; using the same text for all speakers is not a good idea.
Hi,
I just wanted to report that I trained my first acoustic model today :)
My training database is about 1 hour of audiobook recordings (20 different voices), and the training dictionary contains about 3,900 words (eSpeak's g2p helped a bit).
The test database is 6 minutes of 6 voices speaking the specific commands I would like the robot to recognize for now (about 50 words in total).
My language model is simply the command list.
The results seem quite nice when speaking directly into the mic: I get a WER of 11% and an SER of 19% on the test database (recorded in a normal environment, quite close to the mic). Are these error numbers OK considering the amount of data involved?
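(By WER I mean the word-level edit distance, i.e. substitutions, deletions and insertions, divided by the number of reference words, and by SER the fraction of test sentences with at least one error; roughly what this little sketch computes.)

```
# Rough scoring sketch: word error rate and sentence error rate
# over (reference, hypothesis) sentence pairs.
def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)]

def wer_ser(pairs):
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    bad_sentences = sum(1 for ref, hyp in pairs if word_errors(ref, hyp) > 0)
    return 100.0 * errors / words, 100.0 * bad_sentences / len(pairs)
```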
However, I want to recognize speech from a distance. Do you think using a boundary microphone would be a good idea? I get a lot of false positives even in a not-so-noisy environment, but for now my mic is not a very good one. I guess my next step will be to run keyword spotting in a second program thread in order to reduce the false positives (I mean: sentences will only be recognized when the activation word has been said first). In the future I would like to give ManyEars and the 8SoundsUSB a try, but I'm not sure whether they would improve distant recognition or only source separation.
Artur
Congratulations, Artur. The results are acceptable, I think.
If you want to recognize distant speech, you probably want to train your model on a multi-condition database. You can make several copies of your audio and modify each copy with a different room effect; this is a common approach in modern research on reverberant speech recognition.
For more details on the recent state of the art and ideas, see the results of the ASpIRE challenge, and also the aspire training scripts in the Kaldi speech recognition engine.
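As a rough illustration of the idea (not an exact recipe), one reverberant copy can be made by convolving the clean audio with a recorded or simulated room impulse response; the file names below are placeholders, mono 16-bit PCM is assumed, and numpy/scipy need to be installed.

```
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Placeholders: a clean training utterance and a room impulse response (RIR).
# Different RIRs (and added noise) give the different "room effect" copies.
rate, clean = wavfile.read('clean/utt001.wav')
rir_rate, rir = wavfile.read('rirs/living_room.wav')
assert rate == rir_rate, 'resample the RIR to the audio sample rate first'

clean = clean.astype(np.float32)
rir = rir.astype(np.float32)
rir /= np.max(np.abs(rir))            # normalize the impulse response

# Convolve the clean speech with the RIR to simulate distant, reverberant capture.
reverberant = fftconvolve(clean, rir)[:len(clean)]

# Rescale to avoid clipping and write the multi-condition copy as 16-bit PCM.
reverberant *= np.max(np.abs(clean)) / (np.max(np.abs(reverberant)) + 1e-9)
wavfile.write('reverb/utt001.wav', rate, reverberant.astype(np.int16))
```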