I have some amount of single-speaker untranscribed audio data in a language, and also a speaker-independent (SI) acoustic model in the same language, which will be used to obtain approximate transcriptions. These approximate transcriptions will then be improved by performing speaker adaptation (using the decode_fmllr.sh script). The goal is to build a synthesis system, which requires transcriptions and timestamps that are as accurate as possible.
I would like to ask what types of confidence measures I can readily use in Kaldi to prune data where the audio and labels don't match. Currently, I am using phone posterior probabilities, but these have a contribution from the LM too. For the synthesis task, I want to compute purely acoustic confidences. What are the ways/tricks to compute pure phone- or word-level acoustic confidences in Kaldi?
Thanks.
If you are concerned about the accuracy of the transcriptions, use
steps/cleanup/find_bad_utts.sh and look at the diagnostics it
produces. It's based on decoding the data using a unigram LM
containing the words in the transcript (plus some other common
words), and seeing whether the transcription is part of the lattice
that is produced.
Dan
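To make the unigram idea concrete, here is a toy, self-contained sketch (this is not the actual internals of find_bad_utts.sh, and the file name and format are made up): maximum-likelihood unigram probabilities estimated from the words in a Kaldi-style transcript file, which is the kind of transcript-biased LM the script decodes with.

```shell
# Mock Kaldi-style transcript file: <utt-id> <word> <word> ...
cat > text <<'EOF'
utt1 the cat sat
utt2 the dog sat
EOF

# Count each word (skipping the utterance id in field 1) and print
# relative frequencies -- the maximum-likelihood unigram estimates a
# transcript-biased LM would start from.
awk '{for (i = 2; i <= NF; i++) count[$i]++; total += NF - 1}
     END {for (w in count) printf "%s %.4f\n", w, count[w]/total}' text | sort
# prints:
# cat 0.1667
# dog 0.1667
# sat 0.3333
# the 0.3333
```

In the real script the unigram is built over the transcript vocabulary (plus filler words) so that a correct transcription survives in the decoded lattice while mismatched audio does not.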
I went through steps/cleanup/find_bad_utts.sh, and also noticed that nbest-to-linear gives the LM cost and acoustic cost, but at the utterance level. Can I get the acoustic cost at the word level?
Thanks.
Those are not the stats from nbest-to-linear that you should be
looking at; they are not very meaningful. What is more meaningful is
the number of word errors and the length of the corresponding
reference. The script creates a file with that information. Some
examples are printed out at the end; look at the script to see what
the format of the files is.
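For instance, supposing the per-utterance file had the hypothetical format `<utt-id> <word-errors> <ref-length>` (the real format produced by find_bad_utts.sh may differ; read the script to confirm), pruning by per-utterance word-error rate could look like:

```shell
# Hypothetical diagnostics file: <utt-id> <word-errors> <ref-length>.
# The actual output format of steps/cleanup/find_bad_utts.sh may differ.
cat > utt_stats <<'EOF'
utt1 0 10
utt2 6 12
utt3 1 8
EOF

# Keep utterances whose word-error rate ($2 / $3) is under 20%, an
# arbitrary threshold one might use when selecting data for synthesis.
awk '$2 / $3 < 0.2 { print $1 }' utt_stats
# prints:
# utt1
# utt3
```

The surviving utterance ids can then be used to subset the data directory (e.g. with utils/subset_data_dir.sh --utt-list).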
If you do want the per-word acoustic costs (and they won't help you!),
you could get them by getting the 1-best path from a lattice (lattice-1best),
aligning it at the word level (lattice-align-words), doing acoustic rescoring
(gmm-rescore-lattice), and then looking at the
lattice in text form. The per-word LM cost might still be shifted,
but again, it's not that meaningful for you anyway.
Dan
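Written out as a command sketch, that pipeline might look like the following (untested; the decode directory, model, word-boundary file, and feature pipeline are placeholders to substitute from your own setup):

```shell
# Hypothetical paths -- substitute your own decode directory, model,
# word-boundary file, and the exact feature pipeline your decode used.
dir=exp/tri3b/decode
model=exp/tri3b/final.mdl
wbound=data/lang/phones/word_boundary.int
feats="ark:... your feature pipeline here ..."

# 1-best path -> word-aligned arcs -> fresh acoustic scores -> text form.
lattice-1best "ark:gunzip -c $dir/lat.1.gz |" ark:- |
  lattice-align-words $wbound $model ark:- ark:- |
  gmm-rescore-lattice $model ark:- "$feats" ark:- |
  lattice-copy ark:- ark,t:- > lat_1best.txt
```

In the text-form output each arc carries a graph cost and an acoustic cost, and after lattice-align-words each arc corresponds to one word, so the acoustic field is the per-word acoustic cost being discussed.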
On Mon, Jul 6, 2015 at 6:53 AM, Tejas Godambe tejasg@users.sf.net wrote:
On Thu, Jul 9, 2015 at 5:14 AM, Tejas Godambe tejasg@users.sf.net wrote: