
one keyword detection, no acoustic model

Help
2016-06-25
2016-07-06
  • Daniel Schmidt

    Daniel Schmidt - 2016-06-25

    My goal is to implement a speech recognition engine that extracts one or two keywords from a very large amount of spoken words during the listening process. As far as I know, no reliable acoustic model is available for my language.

    I considered building an acoustic model. From the tutorial I gathered that I would have to make 5 hours of recordings and write down every spoken word from the recording files, including the words I don’t need for my keyword detection.

    My question is: is there an efficient way to build my acoustic model without transcribing those other unneeded words?

    I also heard about something called “keyword spotting”, which I’m almost sure does not work on its own without an acoustic model.

    Many thanks in advance,
    Daniel Schmidt

     
    • Nickolay V. Shmyrev

      It would have been much easier to answer your question if you had named the language you are interested in and your particular use case. It is not clear whether you want a fixed set of words or expect to change the words in the future.

       
  • Daniel Schmidt

    Daniel Schmidt - 2016-06-26

    The language I'm interested in is Hebrew. Yes, my set of words is known in advance. I won't need to change it in the near future.
    Also, if it helps, my keywords can also be pronounced in Spanish.

     
    • Nickolay V. Shmyrev

      What are the words exactly?

       
  • Daniel Schmidt

    Daniel Schmidt - 2016-06-27

    "LI-SHÓN" "LA-KUM"

     
    • Nickolay V. Shmyrev

      Those are pretty short to be detected reliably in a continuous stream. You need 3, preferably 4-5, syllables.

      You can record 100-200 samples of your keyword and about an hour of random speech and build a model from that. It should be sufficient. An alternative would be to train a dedicated DNN model to detect the word you need, as Google does for keyword spotting. A DNN model will be more reliable, but will require more keyword examples.
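For the non-DNN route, CMUSphinx keyword spotting is usually driven by a keyphrase list with one detection threshold per entry. The snippet below is only a minimal sketch of writing such a list. The spellings `lishon` and `lakum` and the threshold values are assumptions made for illustration: the entries must match whatever spelling the pronunciation dictionary uses, and thresholds have to be tuned on real recordings.

```python
def format_keyphrase_list(thresholds):
    """Return keyphrase-list text: one 'phrase /threshold/' entry per line."""
    lines = []
    for phrase, threshold in thresholds.items():
        # Thresholds are written in scientific notation, e.g. 1e-10.
        lines.append(f"{phrase.lower()} /{threshold:.0e}/")
    return "\n".join(lines) + "\n"


# Hypothetical dictionary spellings and starting thresholds (assumptions).
keyphrases = {"lishon": 1e-10, "lakum": 1e-10}
kws_text = format_keyphrase_list(keyphrases)
print(kws_text, end="")
```

The resulting text would be saved to a file and passed to the decoder's keyword-list option. Short keywords generally need looser thresholds and produce more false alarms, which is the reliability concern raised above.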

       
      • Daniel Schmidt

        Daniel Schmidt - 2016-06-29

        Thank you for your answer.
        I assume your second suggestion isn't included among the CMUSphinx features.
        If so, is there any tutorial you can recommend? I tried to search but found only studies.
        (From your answer I believe the first approach would be less effective, since I care a lot about reliability.)

         
  • Daniel Schmidt

    Daniel Schmidt - 2016-07-04

    Sorry for my hasty assumption; I just didn't know exactly which other names for 'DNN' to search for.
    I have some questions about the first approach you suggested (the acoustic model):
    1. In the transcription files, would I need to include all the recorded words, or just the specific keywords I need, followed by <sil> slices?
    2. How fast would the response time be for these keywords, after building a model with (at least) 3 hours of recordings?
    Thank you for any additional response.

     

    Last edit: Daniel Schmidt 2016-07-04
    • Nickolay V. Shmyrev

      1. In the transcription files, would I need to include all the recorded words, or just the specific keywords I need, followed by <sil> slices?

      In the transcription file you place the transcription of what was recorded. You need to record what you need to recognize, not just keywords.
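To make the transcription requirement concrete, here is a minimal sketch of generating the two training lists in the layout SphinxTrain expects: a file-ID list, and a transcription file where each line holds every spoken word wrapped in `<s> ... </s>` and followed by the utterance ID in parentheses. The utterance IDs and the words `word1`, `word2`, `word3` are placeholders, not real data.

```python
def transcription_line(words, utt_id):
    """Format one transcription entry: '<s> w1 w2 ... </s> (utt_id)'."""
    return f"<s> {' '.join(words)} </s> ({utt_id})"


def build_training_lists(utterances):
    """Build file-ID and transcription text from (utt_id, transcript) pairs.

    Each transcript must contain every word spoken in the recording,
    not only the keywords.
    """
    fileids = []
    transcripts = []
    for utt_id, text in utterances:
        fileids.append(utt_id)
        transcripts.append(transcription_line(text.split(), utt_id))
    return "\n".join(fileids) + "\n", "\n".join(transcripts) + "\n"


# Placeholder utterances (assumptions): the keyword appears inside a sentence.
utterances = [
    ("utt_0001", "word1 lishon word2"),
    ("utt_0002", "word3 word1 lakum"),
]
fileids_text, transcription_text = build_training_lists(utterances)
print(transcription_text, end="")
```

The two strings would be written to the file-ID and transcription files of the training database, with the entries of both files in the same order.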

      2. How fast would the response time be for these keywords, after building a model with (at least) 3 hours of recordings?

      Response time does not depend on amount of training data. It is about 0.1s.

       
  • Daniel Schmidt

    Daniel Schmidt - 2016-07-05

    Sorry for the confusion; I probably didn't explain myself well at first.
    All I need from my model is to recognize only the 2 words I mentioned before. I won't need to recognize any other word except those two.
    I need my model to recognize those words even if they are spoken as part of a sentence.
    Would it be better for me to build a new acoustic model,
    or maybe to use another existing one?
    * I obviously care a lot about accuracy.
    Thank you for your patience.

     
    • Nickolay V. Shmyrev

      I won't need to recognize any other word except those two.

      This is your misconception. You still need to recognize that words are not present if they are not present. So you need to recognize other parts of the speech in order to understand that words are not present there.
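A toy illustration of this point, not a description of how any real decoder is implemented: a detector typically compares how well a keyword model explains a stretch of audio against how well a background (everything else) model explains it. Without the background model there is nothing to compare the keyword score to. The one-dimensional Gaussian "models", their means, and the frame values below are all assumptions chosen for the example.

```python
import math


def gaussian_loglik(x, mean, var=1.0):
    """Log-likelihood of one scalar frame under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def segment_llr(frames, kw_mean, bg_mean):
    """Log-likelihood ratio of keyword model versus background model."""
    kw = sum(gaussian_loglik(x, kw_mean) for x in frames)
    bg = sum(gaussian_loglik(x, bg_mean) for x in frames)
    return kw - bg


def detect(frames, kw_mean=2.0, bg_mean=0.0, threshold=0.0):
    """Report a keyword only when it beats the background model."""
    return segment_llr(frames, kw_mean, bg_mean) > threshold


keyword_like = [1.8, 2.1, 2.3, 1.9]   # frames close to the keyword model
other_speech = [0.2, -0.4, 0.1, 0.5]  # frames close to the background model

print(detect(keyword_like))
print(detect(other_speech))
```

The background model can only be trained from transcribed speech that does not contain the keywords, which is why the rest of the recordings still has to be covered.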

      Would it be better for me to build a new acoustic model?

      I am not aware of a public Hebrew acoustic model, so you have to build it yourself.

       
  • Daniel Schmidt

    Daniel Schmidt - 2016-07-05

    This is your misconception. You still need to recognize that words are not present if they are not present. So you need to recognize other parts of the speech in order to understand that words are not present there

    If, hypothetically, I wanted to understand the principles behind this idea in depth, is there any readable source of information you can recommend?

     

    Last edit: Daniel Schmidt 2016-07-05
