I'm using Pocketsphinx. I need to align a long audio file (~15 min) with a transcript.
I realize that Pocketsphinx doesn't have a built-in long audio aligner. So I must first split the transcript into smaller parts that match the utterances in the audio.
In a previous post, Nickolay said that this can be solved by constructing a grammar from the transcript.
How do I generate a grammar that allows me to split the transcript into utterances? Ideally, this grammar should also be capable of handling errors in the transcript.
You need to construct an FSG, something like "how are you doing today" ->

Then, to handle errors, you can add loops in the grammar or add a garbage word from every node. You can check Fig. 1 on page 2 of http://www.danielpovey.com/files/2015_icassp_librispeech.pdf for details.
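The arrow above presumably pointed at an FSG diagram that has been lost from this transcript. As an illustration only (mine, not from the original post): in JSGF form, such a grammar with an optional garbage word between nodes might look like the sketch below, assuming your dictionary contains a garbage entry such as ++GARBAGE++ mapped to a garbage phone.

    #JSGF V1.0;
    grammar transcript;

    // Optional garbage word between transcript words, so noise and
    // insertions in the audio cannot derail the alignment.
    <garbage> = ++GARBAGE++;
    public <utterance> = [<garbage>] how [<garbage>] are [<garbage>] you
                         [<garbage>] doing [<garbage>] today [<garbage>];

Pocketsphinx can load a JSGF grammar with the -jsgf option, or you can convert it to a plain FSG with the sphinx_jsgf2fsg tool.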
Thank you for the explanation and the link!

As I understand it, they first perform word recognition on the audio. Then they align the transcript to the recognized words using the Smith-Waterman alignment algorithm (based on phone similarity). They use this alignment to split the transcript into utterances corresponding with the recording. Only then do they construct a grammar to fine-align these short fragments of the transcript with recorded utterances.
So in their paper, they don't use the generated grammar for the long audio alignment, but only for the fine-alignment.
So I wonder: Is it really possible to directly use a grammar for long audio alignment?
Last edit: Daniel Wolf 2016-04-28
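As background: Smith-Waterman computes a best-scoring local alignment between two sequences by dynamic programming. A minimal sketch over word IDs follows; the match/mismatch/gap scores are placeholders, whereas the paper scores substitutions by phone similarity.

    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 64  /* sequences must be at most MAXLEN long */

    /* Smith-Waterman local alignment over integer word IDs.
     * Returns the best local alignment score; a real aligner would
     * also keep back-pointers to recover the aligned regions. */
    static int smith_waterman(const int *a, int na, const int *b, int nb)
    {
        static int h[MAXLEN + 1][MAXLEN + 1];
        const int match = 2, mismatch = -1, gap = -1;
        int best = 0;

        memset(h, 0, sizeof(h));
        for (int i = 1; i <= na; i++) {
            for (int j = 1; j <= nb; j++) {
                int s = h[i - 1][j - 1]
                      + (a[i - 1] == b[j - 1] ? match : mismatch);
                if (h[i - 1][j] + gap > s) s = h[i - 1][j] + gap;
                if (h[i][j - 1] + gap > s) s = h[i][j - 1] + gap;
                if (s < 0) s = 0;   /* local alignment: never go negative */
                h[i][j] = s;
                if (s > best) best = s;
            }
        }
        return best;
    }

    int main(void)
    {
        /* Transcript words vs. recognizer output, both mapped to IDs. */
        int transcript[] = { 1, 2, 3, 4, 5 };
        int recognized[] = { 1, 9, 3, 4, 5 };   /* one misrecognized word */
        printf("best local score: %d\n",
               smith_waterman(transcript, 5, recognized, 5));
        return 0;
    }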
So I wonder: Is it really possible to directly use a grammar for long audio alignment?
Grammars usually (that is, statistically) fail on longer files because they impose too strict a search space. Say you have a file of one minute: you will most likely have an error in the acoustics, the grammar will get confused, and the alignment will fail. Or you would have to use a very generic grammar so that it never gets confused. For that reason, if you have more than 30 seconds of speech it is better to use an ngram model for alignment; it is essentially a grammar too, just more relaxed, and it has the right balance between accuracy and grammar complexity. For smaller segments you can use grammars; they work well in that case since they are more strict.
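Concretely, decoding with such a transcript-derived n-gram model is ordinary pocketsphinx usage. A minimal sketch (the model paths and file names are placeholders, the calls follow the 5prealpha-era C API, and error handling is omitted):

    #include <pocketsphinx.h>
    #include <stdio.h>

    int main(void)
    {
        /* transcript.lm is the n-gram model built from the transcript. */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "model/en-us",
            "-lm", "transcript.lm",
            "-dict", "model/cmudict-en-us.dict",
            NULL);
        ps_decoder_t *ps = ps_init(config);
        FILE *fh = fopen("recording.raw", "rb");  /* 16 kHz, 16-bit mono PCM */
        int16 buf[512];
        size_t nsamp;
        int32 score;

        ps_start_utt(ps);
        while ((nsamp = fread(buf, sizeof(int16), 512, fh)) > 0)
            ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
        ps_end_utt(ps);

        printf("recognized: %s\n", ps_get_hyp(ps, &score));

        fclose(fh);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }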
So if I understand you correctly, I can generate a special-purpose language model based only on the transcript. Then I can perform simple word recognition using that language model and the full dictionary that comes with Pocketsphinx.
The result won't necessarily be identical with the transcript. Pocketsphinx will try to use the words, word pairs and triples from the transcript. On the other hand, if the transcript is incorrect in places, I will get reasonable word detection in those places, even if the words weren't part of the generated language model.
In other words, Pocketsphinx will still use the entire dictionary and will successfully recognize words that weren't in the transcript.
Is that correct?
Then I can perform simple word recognition using that language model and the full dictionary that comes with Pocketsphinx.
This is a first step in most alignment algorithms. But please also note that the decoder uses the language model exclusively to determine which words to look for. For that reason, for such biased decoding you need to build a biased model, i.e. take a specialized model and interpolate it with a generic large-vocabulary model, with the smaller weight assigned to the generic one. From the dictionary you only take the pronunciations.
Thanks Nickolay! I didn't fully understand the role of the language model. I've now done some research and things start to make sense.
I've experimented with the Sphinx Knowledge Base Tool and there are two concepts I don't understand yet: discount mass and the ratio method for backoffs. Maybe you can help me?
I've noticed that the 1-gram probabilities generated by the Sphinx Knowledge Base Tool add up to 0.5, not to 1. A comment says, 'The (fixed) discount mass is 0.5.', so my guess is that this is intentional. What is a discount mass and why is it used?
Another comment says, 'The backoffs are computed using the ratio method.' What is this ratio method?
It would be great if you could explain these concepts. Maybe you have a link?
You can read the comment at the beginning of quick_lm.pl here:

http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl

and also

http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf
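For orientation, here is my reading of what quick_lm.pl does with these two concepts; treat this as a sketch, with the script itself as the authoritative reference:

    With a fixed discount mass D = 0.5, every maximum-likelihood
    estimate is simply scaled by (1 - D):

        p*(w | h) = (1 - D) * c(h, w) / c(h)

    so the 1-gram probabilities sum to 0.5 by construction; the other
    half of the probability mass is held back for unseen words. The
    ratio method then computes the backoff weight of a history h as
    the ratio of the mass held back at the higher order to the mass
    not already claimed at the lower order:

        alpha(h) = (1 - sum over seen w of p*(w | h))
                   ----------------------------------
                   (1 - sum over seen w of p*(w | h'))

    where h' is h with its oldest word removed. With the fixed
    discount, the numerator is always D = 0.5.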
Thanks for the links! The 2nd one was great for understanding the theory, the 1st one for an actual working example.

Now I need to learn how to merge two existing language models into a single, biased one. Do you have any articles or actual code that I can look at?

You calculate the probability with one model, then calculate the probability with the other model, and then simply take the weighted average. Sphinxbase has the ngram_model_set class for that; see ngram_model_set_init.
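In code, that might look like the following sketch. The file names and the 0.9/0.1 weights are placeholders; ngram_model_set_init and ngram_model_read are the real sphinxbase entry points:

    #include <sphinxbase/cmd_ln.h>
    #include <sphinxbase/logmath.h>
    #include <sphinxbase/ngram_model.h>

    /* Interpolate a transcript-specific LM with a generic one,
     * giving most of the weight to the transcript model. */
    ngram_model_t *make_biased_lm(cmd_ln_t *config, logmath_t *lmath)
    {
        ngram_model_t *models[2];
        char *names[] = { "transcript", "generic" };
        float32 weights[] = { 0.9f, 0.1f };

        models[0] = ngram_model_read(config, "transcript.lm",
                                     NGRAM_ARPA, lmath);
        models[1] = ngram_model_read(config, "generic.lm",
                                     NGRAM_ARPA, lmath);

        /* The resulting set behaves like a single interpolated model. */
        return ngram_model_set_init(config, models, names, weights, 2);
    }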
Thanks -- I'll have a look at it!

I managed to calculate n-gram probabilities and backoff weights on-the-fly in C++. Now I'd like to create an ngram_model_t instance directly from this data (rather than writing it to a file and reading it back via ngram_model_read).
I've hit a little problem:
To initialize an ngram_model_t, I need to call ngram_model_init, which is declared in ngram_model_internal.h. This function takes an ngram_funcs_t* value as argument. So I need an instance of this type to pass along.
ngram_model_trie.c defines a static instance of this type, but I don't see a way to access this value.
I could try to define an identical value myself, but its definition uses the functions ngram_model_trie_free, trie_apply_weights and four others. All these functions are defined directly within ngram_model_trie.c and not declared in any header file.
So the only way I see is to declare these functions myself, have the linker use the definitions in ngram_model_trie.c, and define my own instance of type ngram_funcs_t*. Or is there a better way?

I just realized that these functions are static as well. So I cannot use them at all. Is there any way to create an ngram_model_t instance from code?
Unfortunately there is no way to do that yet; you are welcome to submit a patch. We'd also be interested in an ngram model that can be initialized from raw text.
I'll give it a try. I can't make any promises, though -- I'm more at home with C++ than with plain C.
One question in advance: ARPA models have all their n-grams in alphabetical order, so reading them automatically populates the ngram_model_t sub-structures in alphabetical order. Is this a requirement, or can I use any order?
Since your LM is small and you do not need very efficient storage, you can use an unsorted list of ngrams_raw structures; then you can simply sort it with qsort.
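For illustration, the sort could look like the sketch below. The record type is a simplified stand-in, not the actual ngrams_raw layout from sphinxbase's ngrams_raw.h:

    #include <stdlib.h>

    #define MAX_ORDER 3

    /* Simplified raw n-gram record (the real one lives in
     * sphinxbase's ngrams_raw.h and differs in detail). */
    typedef struct {
        unsigned words[MAX_ORDER];  /* word IDs, oldest first */
        float prob;
        float backoff;
    } raw_ngram_t;

    /* Order n-grams lexicographically by their word-ID sequence,
     * which is the order the trie-building code expects. */
    static int cmp_raw_ngram(const void *pa, const void *pb)
    {
        const raw_ngram_t *a = pa, *b = pb;
        for (int i = 0; i < MAX_ORDER; i++) {
            if (a->words[i] != b->words[i])
                return a->words[i] < b->words[i] ? -1 : 1;
        }
        return 0;
    }

    void sort_ngrams(raw_ngram_t *ngrams, size_t count)
    {
        qsort(ngrams, count, sizeof(*ngrams), cmp_raw_ngram);
    }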
I'll take that as a 'yes': they have to be sorted in the end?

Yes

I'm giving up. If I had a few spare days, I'd love to implement the clean solution: add a new function to ngram_model_trie.c that takes normalized text, extracts 1..n-grams, calculates probabilities and backoff weights, then creates an ngram_model_trie_t from them.

Sadly, I just don't have the time right now. I have already implemented all but the last step in C++, so I'm going to take the hacky route: export the LM to a temporary ARPA file (that's trivial), then read it back using ngram_model_read.
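The round trip itself is short. A sketch, with simplified temp-file handling and no error checks; the logmath parameters follow the usual sphinxbase defaults:

    #include <stdio.h>
    #include <sphinxbase/cmd_ln.h>
    #include <sphinxbase/logmath.h>
    #include <sphinxbase/ngram_model.h>

    /* Write the generated ARPA text to a temporary file and load it
     * back as an ngram_model_t. */
    ngram_model_t *lm_from_arpa_text(const char *arpa_text, cmd_ln_t *config)
    {
        const char *path = "temp.lm";     /* placeholder temp-file name */
        FILE *fh = fopen(path, "w");
        fputs(arpa_text, fh);
        fclose(fh);

        /* Base 1.0001, shift 0, no table: the usual sphinxbase setup. */
        logmath_t *lmath = logmath_init(1.0001, 0, 0);
        return ngram_model_read(config, path, NGRAM_ARPA, lmath);
    }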
Hi. I have trained my own model using Sphinx on 8 kHz audio files. When I run the pocketsphinx decoder, I get very low accuracy. Can someone please suggest the best way to improve the accuracy? Thanks