
Long audio alignment in Pocketsphinx

  • Daniel Wolf

    Daniel Wolf - 2016-04-26

    I'm using Pocketsphinx. I need to align a long audio file (~15 min) with a transcript.

    I realize that Pocketsphinx doesn't have a built-in long audio aligner. So I must first split the transcript into smaller parts that match the utterances in the audio.

    In a previous post, Nickolay said that this can be solved by constructing a grammar from the transcript.

    How do I generate a grammar that allows me to split the transcript into utterances? Ideally, this grammar should also be capable of handling errors in the transcript.

     
    • Nickolay V. Shmyrev

      You need to construct an FSG. For "how are you doing today" it would be something like:

      start state 0
      end state 5
      how 0 1
      are 1 2
      you 2 3
      doing 3 4
      today 4 5
      

      Then, to handle errors, you can add loops in the grammar or add a garbage word from every node. You can check Fig. 1 on page 2 of http://www.danielpovey.com/files/2015_icassp_librispeech.pdf for details.
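
      For reference, the same chain written in the sphinxbase FSG file format (keywords as far as I remember them) would look roughly like the sketch below; the last transition illustrates the garbage-word idea and assumes a <garbage> filler word is defined in the dictionary:

      FSG_BEGIN align
      NUM_STATES 6
      START_STATE 0
      FINAL_STATE 5
      TRANSITION 0 1 1.0 how
      TRANSITION 1 2 1.0 are
      TRANSITION 2 3 1.0 you
      TRANSITION 3 4 1.0 doing
      TRANSITION 4 5 1.0 today
      TRANSITION 2 4 0.1 <garbage>
      FSG_END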

       
  • Daniel Wolf

    Daniel Wolf - 2016-04-26

    Thank you for the explanation and the link!

    As I understand it, they first perform word recognition on the audio. Then they align the transcript to the recognized words using the Smith-Waterman alignment algorithm (based on phone similarity). They use this alignment to split the transcript into utterances corresponding to the recording. Only then do they construct a grammar to fine-align these short fragments of the transcript with the recorded utterances.

    So in their paper, they don't use the generated grammar for the long audio alignment, but only for the fine-alignment.
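
    Just to check my understanding of that first alignment pass, here is a toy version in C. It scores on exact word identity instead of the phone similarity they use, with made-up sequences, and it only finds where the best local match ends (a real implementation would also do the traceback):

    #include <stdio.h>
    #include <string.h>

    #define MATCH     2
    #define MISMATCH -1
    #define GAP      -1

    static int max4(int a, int b, int c, int d)
    {
        int m = a;
        if (b > m) m = b;
        if (c > m) m = c;
        if (d > m) m = d;
        return m;
    }

    int main(void)
    {
        const char *hyp[] = { "hello", "how", "are", "you", "doing" };  /* recognizer output */
        const char *ref[] = { "how", "are", "you", "doing", "today" };  /* transcript */
        const int nh = 5, nr = 5;
        int H[6][6] = { { 0 } };  /* (nh+1) x (nr+1) score matrix, all zeros */
        int best = 0, bi = 0, bj = 0;

        for (int i = 1; i <= nh; i++) {
            for (int j = 1; j <= nr; j++) {
                int s = strcmp(hyp[i - 1], ref[j - 1]) == 0 ? MATCH : MISMATCH;
                H[i][j] = max4(0,
                               H[i - 1][j - 1] + s,  /* substitute or match */
                               H[i - 1][j] + GAP,    /* word only in the hypothesis */
                               H[i][j - 1] + GAP);   /* word only in the transcript */
                if (H[i][j] > best) {
                    best = H[i][j];
                    bi = i;
                    bj = j;
                }
            }
        }
        printf("best local score %d, ending at hyp word %d / transcript word %d\n",
               best, bi, bj);
        return 0;
    }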

    So I wonder: Is it really possible to directly use a grammar for long audio alignment?

     

    Last edit: Daniel Wolf 2016-04-28
    • Nickolay V. Shmyrev

      So I wonder: Is it really possible to directly use a grammar for long audio alignment?

      Grammars usually (that is, statistically) fail on longer files because they impose too strict a search space. Say you have a file of one minute: most likely there will be an acoustic error somewhere, the grammar will get confused, and the alignment will fail. Otherwise you have to use a very generic grammar that never gets confused. For that reason, if you have more than 30 seconds of speech it is better to use an n-gram model for alignment; it is essentially a grammar too, just more relaxed, and it strikes the right balance between accuracy and grammar complexity. For smaller segments you can use grammars; they work well in that case since they are more strict.
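
      The recognition pass itself is just ordinary pocketsphinx decoding; roughly like this sketch in C (signatures as I remember them from pocketsphinx 5prealpha, older versions take extra arguments in ps_start_utt and ps_seg_iter, and all file paths are placeholders). The word time stamps from the segment iterator are what you later align against the transcript:

      #include <pocketsphinx.h>
      #include <stdio.h>

      /* Sketch: decode a raw 16 kHz mono file with a transcript-derived LM and
       * print word time stamps (frames are centiseconds at the default 100 fps). */
      int main(void)
      {
          cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
              "-hmm", "en-us",                 /* acoustic model directory */
              "-lm", "transcript.lm",          /* LM built from the transcript */
              "-dict", "cmudict-en-us.dict",   /* full pronunciation dictionary */
              NULL);
          ps_decoder_t *ps = ps_init(config);
          FILE *fh = fopen("recording.raw", "rb");
          int16 buf[512];
          size_t n;

          ps_start_utt(ps);
          while ((n = fread(buf, sizeof(int16), 512, fh)) > 0)
              ps_process_raw(ps, buf, n, FALSE, FALSE);
          ps_end_utt(ps);

          /* Print each recognized word with its start and end frame. */
          for (ps_seg_t *seg = ps_seg_iter(ps); seg; seg = ps_seg_next(seg)) {
              int sf, ef;
              ps_seg_frames(seg, &sf, &ef);
              printf("%s %d %d\n", ps_seg_word(seg), sf, ef);
          }

          fclose(fh);
          ps_free(ps);
          cmd_ln_free_r(config);
          return 0;
      }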

       
      • Daniel Wolf

        Daniel Wolf - 2016-05-01

        So if I understand you correctly, I can generate a special-purpose language model based only on the transcript. Then I can perform simple word recognition using that language model and the full dictionary that comes with Pocketsphinx.

        The result won't necessarily be identical with the transcript. Pocketsphinx will try to use the words, word pairs and triples from the transcript. On the other hand, if the transcript is incorrect in places, I will get reasonable word detection in those places, even if the words weren't part of the generated language model.

        In other words, Pocketsphinx will still use the entire dictionary and will successfully recognize words that weren't in the transcript.

        Is that correct?

         
        • Nickolay V. Shmyrev

          Then I can perform simple word recognition using that language model and the full dictionary that comes with Pocketsphinx.

          This is the first step in most alignment algorithms. But please note that the decoder uses the language model exclusively to determine which words to look for. For that reason, for such biased decoding you need to build a biased model, i.e. take a specialized model and interpolate it with a generic large-vocabulary model, giving the generic one a smaller weight. From the dictionary you only take the pronunciations.

           
          • Daniel Wolf

            Daniel Wolf - 2016-05-03

            Thanks, Nickolay! I didn't fully understand the role of the language model. I've now done some research and things are starting to make sense.

            I've experimented with the Sphinx Knowledge Base Tool and there are two concepts I don't understand yet: discount mass and the ratio method for backoffs. Maybe you can help me?

            • I've noticed that the 1-gram probabilities generated by the Sphinx Knowledge Base Tool add up to 0.5, not to 1. A comment says, 'The (fixed) discount mass is 0.5.', so my guess is that this is intentional. What is a discount mass and why is it used?
            • Another comment says, 'The backoffs are computed using the ratio method.' What is this ratio method?

            It would be great if you could explain these concepts. Maybe you have a link?

             
            • Nickolay V. Shmyrev

               
              • Daniel Wolf

                Daniel Wolf - 2016-05-06

                Thanks for the links! The second one was great for understanding the theory, the first one for an actual working example.

                Now I need to learn how to merge two existing language models into a single, biased one. Do you have any articles or actual code that I can look at?

                 
                • Nickolay V. Shmyrev

                  You calculate the probability with one model, then calculate the probability with the other model, and then simply take the weighted average. Sphinxbase has the ngram_model_set class for that; see ngram_model_set_init.
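
                  A rough sketch in C (file names and the weights are placeholders; as far as I remember, ngram_model_set_init is declared in ngram_model.h and the set owns the submodels afterwards):

                  #include <sphinxbase/ngram_model.h>
                  #include <sphinxbase/logmath.h>

                  /* Sketch: interpolate a transcript-specific LM with a generic LM.
                   * File names and the 0.9/0.1 weights are placeholders. */
                  int main(void)
                  {
                      logmath_t *lmath = logmath_init(1.0001, 0, 0);
                      ngram_model_t *models[2];
                      const char *names[] = { "transcript", "generic" };
                      float32 weights[] = { 0.9f, 0.1f };  /* smaller weight on the generic model */
                      ngram_model_t *biased;

                      models[0] = ngram_model_read(NULL, "transcript.lm", NGRAM_ARPA, lmath);
                      models[1] = ngram_model_read(NULL, "en-us.lm.bin", NGRAM_AUTO, lmath);

                      /* The set scores words as the weighted combination of both models. */
                      biased = ngram_model_set_init(NULL, models, (char **)names, weights, 2);

                      /* ... pass 'biased' to the decoder instead of a plain LM ... */

                      ngram_model_free(biased);
                      logmath_free(lmath);
                      return 0;
                  }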

                   
                  • Daniel Wolf

                    Daniel Wolf - 2016-05-06

                    Thanks -- I'll have a look at it!

                     
  • Daniel Wolf

    Daniel Wolf - 2016-06-05

    I managed to calculate n-gram probabilities and backoff weights on the fly in C++. Now I'd like to create an ngram_model_t instance directly from this data (rather than writing it to a file and reading it back via ngram_model_read).

    I've hit a little problem:

    • To initialize an ngram_model_t, I need to call ngram_model_init, which is declared in ngram_model_internal.h. This function takes an ngram_funcs_t* value as an argument. So I need an instance of this type to pass along.
    • ngram_model_trie.c defines a static instance of this type, but I don't see a way to access this value.
    • I could try to define an identical value myself, but its definition uses the functions ngram_model_trie_free, trie_apply_weights and four others. All these functions are defined directly within ngram_model_trie.c and not declared in any header file.

    So the only way I see is to declare these functions myself, have the linker use the definitions in ngram_model_trie.c, and define my own instance of type ngram_funcs_t*. Or is there a better way?

     
  • Daniel Wolf

    Daniel Wolf - 2016-06-05

    I just realized that these functions are static as well. So I cannot use them at all.

    Is there any way to create an ngram_model_t instance from code?

     
    • Nickolay V. Shmyrev

      Unfortunately, there is no way to do that yet; you are welcome to submit a patch. We'd also be interested in an n-gram model that can be initialized from raw text.

       
      • Daniel Wolf

        Daniel Wolf - 2016-06-07

        I'll give it a try. I can't make any promises, though -- I'm more at home with C++ than with plain C.

        One question in advance: ARPA models have all their n-grams in alphabetical order, so reading them automatically populates the ngram_model_t sub-structures in alphabetical order. Is this a requirement, or can I use any order?

         
        • Nickolay V. Shmyrev

          Since your LM is small and you do not need very efficient storage, you can use an unsorted list of ngrams_raw structures and then simply sort it with qsort.
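
          For example, something like this (the struct is just a stand-in for the raw n-gram record, not the real ngrams_raw definition; only the comparator and the qsort call matter):

          #include <stdio.h>
          #include <stdlib.h>

          /* Stand-in for a raw trigram record; substitute the actual sphinxbase
           * structure. Word IDs run from the first to the last word of the n-gram. */
          typedef struct {
              unsigned int words[3];
              float prob;
              float backoff;
          } raw_ngram_t;

          /* Lexicographic order on the word IDs, the order the model expects in the end. */
          static int cmp_raw_ngram(const void *a, const void *b)
          {
              const raw_ngram_t *x = a, *y = b;
              for (int i = 0; i < 3; i++) {
                  if (x->words[i] != y->words[i])
                      return x->words[i] < y->words[i] ? -1 : 1;
              }
              return 0;
          }

          int main(void)
          {
              raw_ngram_t grams[] = {
                  { { 2, 3, 4 }, -0.3f, -0.1f },
                  { { 1, 2, 3 }, -0.2f, -0.1f },
                  { { 1, 2, 2 }, -0.5f, -0.1f },
              };
              qsort(grams, 3, sizeof(raw_ngram_t), cmp_raw_ngram);
              for (int i = 0; i < 3; i++)
                  printf("%u %u %u\n", grams[i].words[0], grams[i].words[1], grams[i].words[2]);
              return 0;
          }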

           
          • Daniel Wolf

            Daniel Wolf - 2016-06-07

            I'll take that as a 'yes': they have to be sorted in the end?

             
            • Nickolay V. Shmyrev

              Yes

               
  • Daniel Wolf

    Daniel Wolf - 2016-06-07

    I'm giving up. If I had a few spare days, I'd love to implement the clean solution: Add a new function to ngram_model_trie.c that takes normalized text, extracts 1..n-grams, calculates probabilities and backoff weights, then creates an ngram_model_trie_t from them.

    Sadly, I just don't have the time right now. I have already implemented all but the last step in C++, so I'm going to take the hacky route: export the LM to a temporary ARPA file (that's trivial), then read it back using ngram_model_read.
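
    For the record, the read-back step looks roughly like this sketch (unigram section only, with made-up log10 values; the real code also writes the 2-gram and 3-gram sections before the end marker):

    #include <stdio.h>
    #include <sphinxbase/ngram_model.h>
    #include <sphinxbase/logmath.h>

    /* Sketch: dump the LM to a temporary ARPA file, then read it back with
     * ngram_model_read. Probabilities and backoffs here are placeholders. */
    int main(void)
    {
        const char *words[] = { "<s>", "</s>", "how", "are", "you" };
        const double logp[] = { -99.0, -0.7, -0.7, -0.7, -0.7 };
        const double bo[]   = { -0.3, 0.0, -0.3, -0.3, -0.3 };
        const int n = 5;
        const char *path = "biased.tmp.arpa";

        FILE *fh = fopen(path, "w");
        fprintf(fh, "\\data\\\n");
        fprintf(fh, "ngram 1=%d\n\n", n);
        fprintf(fh, "\\1-grams:\n");
        for (int i = 0; i < n; i++)
            fprintf(fh, "%.4f %s %.4f\n", logp[i], words[i], bo[i]);
        fprintf(fh, "\n\\end\\\n");
        fclose(fh);

        logmath_t *lmath = logmath_init(1.0001, 0, 0);
        ngram_model_t *lm = ngram_model_read(NULL, path, NGRAM_ARPA, lmath);
        /* ... hand 'lm' to the decoder ... */
        ngram_model_free(lm);
        logmath_free(lmath);
        return 0;
    }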

     
  • VINOD KUMAR

    VINOD KUMAR - 2018-04-13

    Hi. I have trained my own model using Sphinx with 8 kHz audio files. I am running the pocketsphinx decoder but getting very low accuracy. Can someone please suggest the best way to improve the accuracy? Thanks

     
