Hi,
I have a list of movie names and song names (billions). The list keeps getting updated with new names (songs and movie names) on a regular basis. I need to recognize movie/song names from user utterances. Is it possible to use Kaldi for such a requirement? Can open vocabulary models be built and used with Kaldi for recognizing OOV words?
Thanks
Srinidhi
The acoustic models are inherently open vocabulary; the lexicon would need
to be updated though (e.g. using g2p) and the decoding-graph recompiled.
It's definitely possible using Kaldi but it requires some understanding of
how speech recognition works, i.e. it might not be a suitable task for a
beginner.
Dan
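For concreteness, a minimal sketch of the lexicon-update step Dan describes, assuming Sequitur G2P and a standard egs-style directory layout (the paths and file names here are placeholders, not part of any recipe):
# train a g2p model on the existing lexicon, then generate pronunciations for the new names
g2p.py --train data/local/dict/lexicon.txt --devel 5% --write-model g2p.model
g2p.py --model g2p.model --apply new_names.txt > new_entries.txt
# merge into a copy of the dict dir; the lang dir and HCLG are then rebuilt from it
cp -r data/local/dict data/local/dict_new
cat data/local/dict/lexicon.txt new_entries.txt | sort -u > data/local/dict_new/lexicon.txt
The rebuild of the lang directory and decoding graph from such a dict directory (prepare_lang.sh followed by mkgraph.sh) is shown later in the thread.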
On Wed, Oct 1, 2014 at 5:33 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I can get my own corpus (recordings from multiple people and transcriptions) for the acoustic model. Can I build a flat language model which will only contain the names (of songs/movies) and keep updating the language model (G.fst) with additional names as and when new names become available? Then can I rebuild the decoding graph from the new language model and the new lexicon (containing the new names) and use it? Is this a viable option for my requirement? Is it possible to give me a step-by-step plan for building the models and using them for recognition?
That plan is workable, yes.
Probably instead of building a flat language model it would be better to
compute some kind of probabilities for how often different movies/songs
show up in various lists, and use those.
Regarding the steps involved in building models and using them for
recognition - you could probably look at any of the example scripts. I
would suggest the Voxforge or Librispeech setups because I'm assuming you
don't have access to LDC data.
Dan
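As an illustration of that count-weighted idea, one way to build such a unigram LM and plug it into a lang directory (ngram-count is SRILM; the file names and model directory are assumptions, not anything from this thread):
# names.txt: one name per line, repeated in proportion to how often it occurs in the lists
ngram-count -order 1 -text names.txt -lm names_unigram.arpa
gzip names_unigram.arpa
utils/format_lm.sh data/lang names_unigram.arpa.gz data/local/dict/lexicon.txt data/lang_test_names
utils/mkgraph.sh data/lang_test_names exp/your_model exp/your_model/graph_names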
On Mon, Oct 6, 2014 at 2:12 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I am planning to test with the acoustic model from the fisher_english model. Now if I want to recognize names, is it sufficient to just add the new names to the vocabulary and generate the HCLG decoding graph with the existing language model (in which the names do not appear), without modifying the language model?
I tried to build a small unigram language model with a city-name list (around 650 names) and a lexicon for the above name list (generated using g2p) and constructed the HCLG decoding graph. But recognition using the fisher_english acoustic model and the generated HCLG.fst is not giving the desired results.
I am using the following cmd :
online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=namelist.txt nnet_a_gpu_online/final.mdl namelist_HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 luck.wav|" ark:/dev/null
No, if the words don't appear in the language model they can never be recognized.
As for the city-name graph, that probably should have worked - perhaps something went wrong when constructing the HCLG.
Dan
How can I debug what went wrong in my HCLG.fst? Also, building HCLG.fst requires tree and model files. In the fisher model provided on kaldi-asr.org the tree file is not available, so I used the tree and model files from the voxforge tri3b model. Please provide some details on debugging this issue.
Oh - I think that gives us the answer. The tree files are not
interchangeable, you need to include the correct one. I'll have to modify
the prepare_online_decoding.sh script to copy the tree, which will make it
easier for others like you.
I just had a look at the online-nnet2 training script, and it looks like it
uses the tree from exp/tri5a. So you should navigate to that location in
the corresponding upload at kaldi-asr.org and download that tree.
Dan
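In practice that means keeping the downloaded tri5a tree together with a matching final.mdl when compiling the graph; a rough sketch with placeholder paths (data/lang_names stands for whatever lang directory, containing G.fst, was prepared for the name list):
# put the tree and final.mdl from the same kaldi-asr.org upload in one directory
mkdir -p exp/fisher_tri5a
cp /path/to/downloaded/tree /path/to/downloaded/final.mdl exp/fisher_tri5a/
utils/mkgraph.sh data/lang_names exp/fisher_tri5a exp/fisher_tri5a/graph_names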
On Mon, Oct 13, 2014 at 1:01 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I am still not getting the desired results, even after rebuilding HCLG.fst against the tree and final.mdl from fisher's /exp/tri5a. I am attaching the files and also the commands (cmds.sh) used in building the HCLG.fst. Please check, if possible, what is wrong. I am running the following command (with lucknow recorded in the utterance luck.wav):
kaldi-trunk/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=true --online=false --config=newgraph/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=newgraph/citiwords.txt newgraph/final.mdl newgraph/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 luck.wav|' ark:/dev/null
I am getting the following output instead of LUCKNOW:
LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars():ivector-extractor.cc:201) Done.
utterance-id1 RAJGARH
Why is it not getting recognized as LUCKNOW?
Is it "Luck now" or "laakh nu" or something else? What's the dictionary entry?
On Oct 14, 2014 6:48 AM, "K R Srinidhi" srinidhikrs@users.sf.net wrote:
The dictionary entry is as follows:
LUCKNOW l ah k n aw
Okay. It looks close. You will need to pronounce it the way the dictionary entry has it.
On Oct 14, 2014 7:26 AM, "K R Srinidhi" srinidhikrs@users.sf.net wrote:
But why is it recognizing something completely different, like RAJGARH or RAIPUR? There is no similarity between RAJGARH and LUCKNOW.
Where did you get the phones.txt file from? As far as I can see it's different from the file in http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/. You can't really mix and match files like that - the phones file is part of the acoustic model definition AFAIK.
You should be able to do better than RAJGARH. Just out of curiosity I upsampled the luck.wav to 16kHz and used the librispeech nnet2-online model (available for download) on it, and the result is "AND NOW", which is closer I think. Also I did this using a HCLG graph built with a very generic 3-gram LM trained on 14500 books, so with a lot smaller LM (like yours) you should be able to recognize this correctly.
Can you please provide the link to the lexicon.txt and phones.txt used in the fisher model, which can be used to train a g2p model for generating a lexicon for my word list?
You can browse to it at kaldi-asr.org - it will be in the same upload as
the models you downloaded, the phones.txt will be in data/lang/, and the
lexicon will be somewhere like data/local/dict/lexicon.txt
Dan
On Tue, Oct 14, 2014 at 1:26 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
The lexicon.txt in data/local/dict/lexicon.txt has different phonemes than the ones in http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/phones.txt
That version is before adding word-position dependency info; you can look
at data/local/lang/lexiconp.txt for an example that has the word-position
dependency info added.
In the run.sh you'll see this command:
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang
What I would recommend is to edit data/local/dict/lexicon.txt to add your
own words, then call something like
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang_morewords
and when you're done, verify that the phones.txt is identical to the one in
data/lang/, otherwise it will be incompatible with the tree.
Then use data/lang_morewords to build the graph.
Dan
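Putting those steps together as one sequence (directory names follow Dan's message; names.arpa.gz stands for whatever LM over the word list is being used, and the diff check is just one way to do the verification he mentions):
# after adding your words to data/local/dict/lexicon.txt
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang_morewords
diff data/lang/phones.txt data/lang_morewords/phones.txt   # must print nothing, i.e. identical
utils/format_lm.sh data/lang_morewords names.arpa.gz data/local/dict/lexicon.txt data/lang_morewords_test
utils/mkgraph.sh data/lang_morewords_test exp/tri5a exp/tri5a/graph_morewords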
On Tue, Oct 14, 2014 at 1:38 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
Thanks a lot for the help. Now I am able to get LUCKNOW recognized from the utterance after following Dan's instructions. Also thanks to Vassil for pointing out the phones.txt issue.
I have a list of 20k unique Hindi-language words for which I have a lexicon using the phonemes from the fisher_english acoustic model. I have created a unigram language model from the above 20k words and constructed the decoding graph (HCLG.fst).
When I try recognition with the fisher_english acoustic model and the above decoding graph (HCLG.fst), I find that recognition accuracy is not very good. For some words the recognition is fine, but for words starting with certain phonemes the results are very bad. 1) Is it possible to get above 95% accuracy with the fisher_english acoustic model and a decoding graph constructed as explained above? If so, what options should I look into for tuning to get more than 95% accuracy?
2) Is it necessary to build a new acoustic model for Hindi, with a new phoneme set covering all the sounds in the language, to get better accuracy?
Using the phone-set of one language to recognize another language is not
something that people normally do, and we don't expect the recognition
performance to be very good. You need to train on a Hindi dataset. I
don't know whether one exists.
Dan
On Wed, Oct 22, 2014 at 2:10 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I have got some training data with Hindi transcriptions. I also have a lexicon for Hindi with a Hindi phone set. I have trained a GMM acoustic model (tri3b).
Do LDA+MLLT+SAT, and decode.
steps/train_sat.sh 2000 11000 data/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
utils/mkgraph.sh data/lang exp/tri3b exp/tri3b/graph || exit 1;
Now when I run local/online/run_nnet2.sh using data/train and exp/tri3b, it fails at the nnet-combine-fast stage.
When I checked the script's debug output I found that num_iters (4) is less than mix_up_iters (6) and nnets_list[$idx] is not getting populated.
Please help me find the source of the problem.
I have attached the screen output from running run_nnet2.sh (with sh -x for train_pnorm_fast.sh).
I changed the parameters --num-epochs to 4 and --num-hidden-layers to 2 and got the nnet model final.mdl. What are the ideal values for these parameters to get a better model?
I was testing to check if I could build a deep neural net model with my training data. I used only 2500 recordings with transcriptions for acoustic model training. While testing with online decoding I was getting a near match for some utterances, while the majority were misrecognitions. Now I want to run the setup with 150-200 hours of recordings with transcriptions. Will I be able to get better recognition accuracy if I build an acoustic model with 150-200 hrs of training data?
What is the recommended hardware configuration for building a neural net acoustic model with that much training data?
How much time would it normally take on a single server (for 150-200 hrs of training data)?
Or is a grid engine setup recommended?
If there was an ideal value we would have baked it into the script.
There are some suggestions for tuning at http://kaldi.sourceforge.net/dnn2.html. I recommend using the train_pnorm_simple.sh script -- in the other scripts there is a --num-epochs-final number to configure as well, which can be confusing.
An important diagnostic is the final (train,valid) probs: do
grep LOG exp/your-dir/log/compute_prob_*.final.log
to see them.
They should differ by no more than 20%, or 50% at most; if more, then you
have too many parameters.
That is very little data to train a DNN.
With that much data you need GPUs, or it will take you a very long time
(e.g. a week at least, but depends how many cores you have).
Dan
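A hedged example of the kind of invocation and diagnostic described above (the alignment directory and the layer/epoch settings are placeholders to tune for your data, not recommendations):
steps/nnet2/train_pnorm_simple.sh --num-epochs 8 --num-hidden-layers 3 \
  --pnorm-input-dim 2000 --pnorm-output-dim 200 \
  data/train data/lang exp/tri3b_ali exp/nnet2_pnorm
# compare train vs. valid probabilities from the final iteration, as suggested above
grep LOG exp/nnet2_pnorm/log/compute_prob_*.final.log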
I was able to build Hindi acoustic models with our Hindi training data. But I am facing the following issues with recognition:
1) I built a 1-gram language model using all the words in the Hindi lexicon and constructed the graph (HCLG.fst). With the built acoustic model and the graph (HCLG.fst), what I observe is that recognition is decent for utterances containing a single word. But if an utterance contains more than one word (for example: the lord of the rings), recognition is poor. How can I get high accuracy for multi-word recognition?
2) When I was testing with online-gmm-decode-faster I found that I had to speak a little loudly and slowly for recognition to work properly. Also, sometimes the first attempt failed with a misrecognition and the second attempt gave the correct result. What could be the reason? (For example, I wanted to recognize NATWAR. In the first attempt, when I spoke NATWAR, it gave the wrong result; in the next attempt I spoke NATWAR the same way as the first time and it gave the correct result.)
Please provide some information on improving multi-word recognition accuracy, as most of the utterances will contain 2 to 5 words.
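One direction, in line with Dan's earlier suggestion to weight names by how often they occur, is to train the LM on the full multi-word titles rather than on isolated lexicon words, so that sequences like "the lord of the rings" carry probability mass; a sketch with SRILM (the file names and exp directory are placeholders):
# titles.txt: one full title per line, repeated in proportion to its count
ngram-count -order 3 -text titles.txt -lm titles_3gram.arpa
gzip titles_3gram.arpa
utils/format_lm.sh data/lang titles_3gram.arpa.gz data/local/dict/lexicon.txt data/lang_test_titles
utils/mkgraph.sh data/lang_test_titles exp/tri3b exp/tri3b/graph_titles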