My goal is to implement a speech recognition engine that extracts one or two specific keywords from a very large amount of spoken words while listening. As far as I know, no reliable acoustic model is available for my language.
I have considered building an acoustic model. From the tutorial I understood that I would have to collect 5 hours of recordings and write down every spoken word in the recording files, including words I don't need for my specific keyword detection.
My question is: is there an efficient way to build my acoustic model without transcribing all those other, unneeded words?
I have also heard about something called “keyword spotting”, but I'm fairly sure it is not usable on its own without an acoustic model.
Many thanks in advance,
Daniel Schmidt
It would be much easier to answer your question if you named the language you are interested in and your particular use case. It is not clear whether you want a fixed set of words or whether you may want to change the words in the future.
The language I'm interested in is Hebrew. Yes, my set of words is fixed and well known; I won't need to change it in the near future.
Also, if it helps, my keywords can also be pronounced in Spanish.
What are the words exactly?
"LI-SHÓN" "LA-KUM"
Those are rather short to be detected reliably in a continuous stream. You need 3, preferably 4-5 syllables.
You can record 100-200 samples of your keywords and about an hour of random speech and build a model from that; it should be sufficient. An alternative would be to train a dedicated DNN model to detect the words you need, the way Google does for keyword spotting. A DNN model will be more reliable, but it will require more keyword examples.
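For reference, keyword spotting in pocketsphinx is normally driven by a keyword list with a per-keyword detection threshold. The sketch below is only a minimal illustration using the classic pocketsphinx Python bindings; the model path, dictionary, keyword spellings (LISHON, LAKUM) and threshold values are placeholders that would have to match your own Hebrew acoustic model and pronunciation dictionary.

    # keywords.list -- one keyword per line with a detection threshold
    # (placeholder spellings, thresholds need tuning on real recordings):
    #   LISHON /1e-20/
    #   LAKUM /1e-20/

    from pocketsphinx.pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/he')        # trained Hebrew acoustic model (placeholder path)
    config.set_string('-dict', 'model/he.dict')  # dictionary with pronunciations for LISHON and LAKUM
    config.set_string('-kws', 'keywords.list')   # keyword list shown above
    decoder = Decoder(config)

    decoder.start_utt()
    with open('audio.raw', 'rb') as stream:      # 16 kHz, 16-bit mono raw audio
        while True:
            buf = stream.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
            if decoder.hyp() is not None:
                print('detected:', decoder.hyp().hypstr)
                decoder.end_utt()                # reset and keep listening
                decoder.start_utt()
    decoder.end_utt()

Lowering a threshold makes detection more sensitive but produces more false alarms, which is one reason longer keywords are easier to spot reliably.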
Thank you for your answer.
I assume your second suggestion isn't covered by the built-in CMUSphinx features.
If so, is there any available tutorial you can recommend? I tried searching but found only academic studies.
(From your answer I take it the first option would be less effective for me, since I care a lot about reliability.)
Sorry for my hasty assumption; I just didn't know which other names for 'DNN' to search for.
I have some questions about the first approach you suggested (the acoustic model):
1. In the transcription files, would I need to write out all the recorded words, or just the specific keywords I need, surrounded by <sil> segments?
2. How fast would the response time be for these keywords after building a model from (at least) 3 hours of recordings?
Thank you for any additional response.
In the transcription file you place the transcription of what was actually recorded. You need to record what you need to recognize, not just the keywords.
Response time does not depend on the amount of training data. It is about 0.1 seconds.
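For question 1, the SphinxTrain training data layout is roughly as sketched below: a .fileids file listing the recordings, and a .transcription file with the full transcript of each recording between <s> and </s>, followed by the utterance id in parentheses. The database name, ids and sentences here are placeholders, and every word appearing in the transcripts also needs an entry in the phonetic dictionary.

    etc/mydb.fileids:
        speaker_1/rec_0001
        speaker_1/rec_0002

    etc/mydb.transcription:
        <s> an ordinary sentence that happens to contain LISHON somewhere </s> (rec_0001)
        <s> another fully transcribed sentence without any keyword </s> (rec_0002)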
Sorry for the confusion, I probably didn't explain myself well at first.
All I need from my model is to recognize the 2 words I mentioned before, and only those; I won't need to recognize any other word except those two.
I need my model to recognize those words even when they are spoken as part of a sentence.
Would it be better for me to build a new acoustic model, or maybe to use another existing one?
* I obviously care a lot about accuracy.
Thank you for your patience.
That is a misconception. You still need to recognize that the keywords are not present when they are not present, so you need to recognize the other parts of the speech in order to tell that the keywords do not occur there.
I am not aware of a public Hebrew acoustic model, so you will have to build one yourself.
Hypothetically, if I wanted to understand the underlying principles behind that concept, is there a readable source of information you can recommend?
This thesis has a good survey:
http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf