Hi,
I have a list of movie names and song names (billions). The list keeps getting updated with new names (songs and movie names) on a regular basis. I need to recognize movie/song names from user utterances. Is it possible to use Kaldi for such a requirement? Can open vocabulary models be built and used with Kaldi for recognizing OOV words?
Thanks
Srinidhi
The acoustic models are inherently open vocabulary; the lexicon would need
to be updated though (e.g. using g2p) and the decoding-graph recompiled.
It's definitely possible using Kaldi but it requires some understanding of
how speech recognition works, i.e. it might not be a suitable task for a
beginner.
Dan
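For concreteness, a minimal sketch of the lexicon-update step Dan describes, assuming Sequitur G2P and a standard egs-style directory layout (the paths and file names here are placeholders, not part of any recipe):
# train a g2p model on the existing lexicon, then generate pronunciations for the new names
g2p.py --train data/local/dict/lexicon.txt --devel 5% --write-model g2p.model
g2p.py --model g2p.model --apply new_names.txt > new_entries.txt
# merge into a copy of the dict dir; the lang dir and HCLG are then rebuilt from it
cp -r data/local/dict data/local/dict_new
cat data/local/dict/lexicon.txt new_entries.txt | sort -u > data/local/dict_new/lexicon.txt
The rebuild of the lang directory and decoding graph from such a dict directory (prepare_lang.sh followed by mkgraph.sh) is shown later in the thread.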
On Wed, Oct 1, 2014 at 5:33 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I can get my own corpus (recordings from multiple people and transcriptions) for the acoustic model. Can I build a flat language model which will only contain the names (of songs/movies) and keep updating the language model (G.fst) with additional names as and when new names become available? Then can I rebuild the decoding graph from the new language model and the new lexicon (containing the new names) and use it? Is this a viable option for my requirement? Is it possible to give me a step-by-step plan for building the models and using them for recognition?
That plan is workable, yes.
Probably instead of building a flat language model it would be better to
compute some kind of probabilities for how often different movies/songs
show up in various lists, and use those.
Regarding the steps involved in building models and using them for
recognition - you could probably look at any of the example scripts. I
would suggest the Voxforge or Librispeech setups because I'm assuming you
don't have access to LDC data.
Dan
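As an illustration of that count-weighted idea, one way to build such a unigram LM and plug it into a lang directory (ngram-count is SRILM; the file names and model directory are assumptions, not anything from this thread):
# names.txt: one name per line, repeated in proportion to how often it occurs in the lists
ngram-count -order 1 -text names.txt -lm names_unigram.arpa
gzip names_unigram.arpa
utils/format_lm.sh data/lang names_unigram.arpa.gz data/local/dict/lexicon.txt data/lang_test_names
utils/mkgraph.sh data/lang_test_names exp/your_model exp/your_model/graph_names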
On Mon, Oct 6, 2014 at 2:12 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I am planning to test with the acoustic model from the fisher_english model. Now if I want to recognize names, is it sufficient to just add the new names to the vocabulary and generate the HCLG decoding graph with the existing language model (in which the names do not appear), without modifying the language model?
I tried to build a small unigram language model with a city-name list (around 650 names) and a lexicon for the above name list (generated using g2p) and constructed the HCLG decoding graph. But recognition using the fisher_english acoustic model and the generated HCLG.fst is not giving the desired results.
I am using the following cmd :
online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=namelist.txt nnet_a_gpu_online/final.mdl namelist_HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 luck.wav|" ark:/dev/null
No, if the words don't appear in the language model they can never be recognized.
As for the city-name graph, that probably should have worked - perhaps something went wrong when constructing the HCLG.
Dan
How can I debug what went wrong in my HCLG.fst? Also, building HCLG.fst requires tree and model files. In the fisher model provided on kaldi-asr.org the tree file is not available, so I used the tree and model files from the voxforge tri3b model. Please provide some details on debugging this issue.
Oh - I think that gives us the answer. The tree files are not
interchangeable, you need to include the correct one. I'll have to modify
the prepare_online_decoding.sh script to copy the tree, which will make it
easier for others like you.
I just had a look at the online-nnet2 training script, and it looks like it
uses the tree from exp/tri5a. So you should navigate to that location in
the corresponding upload at kaldi-asr.org and download that tree.
Dan
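In practice that means keeping the downloaded tri5a tree together with a matching final.mdl when compiling the graph; a rough sketch with placeholder paths (data/lang_names stands for whatever lang directory, containing G.fst, was prepared for the name list):
# put the tree and final.mdl from the same kaldi-asr.org upload in one directory
mkdir -p exp/fisher_tri5a
cp /path/to/downloaded/tree /path/to/downloaded/final.mdl exp/fisher_tri5a/
utils/mkgraph.sh data/lang_names exp/fisher_tri5a exp/fisher_tri5a/graph_names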
On Mon, Oct 13, 2014 at 1:01 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I am still not getting the desired results, even after rebuilding HCLG.fst against the tree and final.mdl from fisher's /exp/tri5a. I am attaching the files and also the commands (cmds.sh) used in building the HCLG.fst. Please check, if possible, what is wrong. I am running the following command (with lucknow recorded in the utterance luck.wav):
kaldi-trunk/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=true --online=false --config=newgraph/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=newgraph/citiwords.txt newgraph/final.mdl newgraph/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 luck.wav|' ark:/dev/null
I am getting the following output instead of LUCKNOW:
LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars():ivector-extractor.cc:201) Done.
utterance-id1 RAJGARH
Why is it not getting recognized as LUCKNOW?
Is it "Luck now" or "laakh nu" or something else? What's the dictionary entry?
On Oct 14, 2014 6:48 AM, "K R Srinidhi" srinidhikrs@users.sf.net wrote:
The dictionary entry is as follows:
LUCKNOW l ah k n aw
Okay. It looks close. You will need to pronounce it the way the dictionary entry has it.
On Oct 14, 2014 7:26 AM, "K R Srinidhi" srinidhikrs@users.sf.net wrote:
But why is it recognizing something completely different, like RAJGARH or RAIPUR? There is no similarity between RAJGARH and LUCKNOW.
Where did you get the phones.txt file from? As far as I can see it's different from the file in http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/. You can't really mix and match files like that - the phones file is part of the acoustic model definition AFAIK.
You should be able to do better than RAJGARH. Just out of curiosity I upsampled the luck.wav to 16kHz and used the librispeech nnet2-online model (available for download) on it, and the result is "AND NOW", which is closer I think. Also I did this using a HCLG graph built with a very generic 3-gram LM trained on 14500 books, so with a lot smaller LM (like yours) you should be able to recognize this correctly.
Can you please provide the link to the lexicon.txt and phones.txt used in the fisher model, which can be used to train a g2p model for generating a lexicon for my word list?
You can browse to it at kaldi-asr.org - it will be in the same upload as
the models you downloaded, the phones.txt will be in data/lang/, and the
lexicon will be somewhere like data/local/dict/lexicon.txt
Dan
On Tue, Oct 14, 2014 at 1:26 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
The lexicon.txt in data/local/dict/lexicon.txt has different phonemes than the ones in http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/phones.txt
That version is before adding word-position dependency info; you can look
at data/local/lang/lexiconp.txt for an example that has the word-position
dependency info added.
In the run.sh you'll see this command:
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang
What I would recommend is to edit data/local/dict/lexicon.txt to add your
own words, then call something like
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang_morewords
and when you're done, verify that the phones.txt is identical to the one in
data/lang/, otherwise it will be incompatible with the tree.
Then use data/lang_morewords to build the graph.
Dan
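Putting those steps together as one sequence (directory names follow Dan's message; names.arpa.gz stands for whatever LM over the word list is being used, and the diff check is just one way to do the verification he mentions):
# after adding your words to data/local/dict/lexicon.txt
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang_morewords
diff data/lang/phones.txt data/lang_morewords/phones.txt   # must print nothing, i.e. identical
utils/format_lm.sh data/lang_morewords names.arpa.gz data/local/dict/lexicon.txt data/lang_morewords_test
utils/mkgraph.sh data/lang_morewords_test exp/tri5a exp/tri5a/graph_morewords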
On Tue, Oct 14, 2014 at 1:38 PM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
Thanks a lot for the help. Now I am able to get LUCKNOW recognized from the utterance after following Dan's instructions. Also thanks to Vassil for pointing out the phones.txt issue.
I have a list of 20k unique Hindi-language words for which I have a lexicon using the phonemes from the fisher_english acoustic model. I have created a unigram language model from the above 20k words and constructed the decoding graph (HCLG.fst).
When I try recognition with the fisher_english acoustic model and the above decoding graph (HCLG.fst), I find that recognition accuracy is not very good. For some words the recognition is fine, but for words starting with certain phonemes the results are very bad. 1) Is it possible to get above 95% accuracy with the fisher_english acoustic model and a decoding graph constructed as explained above? If so, what options should I look into for tuning to get more than 95% accuracy?
2) Is it necessary to build a new acoustic model for Hindi, with a new phoneme set covering all the sounds in the language, to get better accuracy?
Using the phone-set of one language to recognize another language is not
something that people normally do, and we don't expect the recognition
performance to be very good. You need to train on a Hindi dataset. I
don't know whether one exists.
Dan
On Wed, Oct 22, 2014 at 2:10 AM, K R Srinidhi srinidhikrs@users.sf.net
wrote:
I have got some training data with Hindi transcriptions. I also have a lexicon for Hindi with a Hindi phone set. I have trained a GMM acoustic model (tri3b).
Do LDA+MLLT+SAT, and decode.
steps/train_sat.sh 2000 11000 data/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
utils/mkgraph.sh data/lang exp/tri3b exp/tri3b/graph || exit 1;
Now when I run local/online/run_nnet2.sh using data/train and exp/tri3b, it fails at the nnet-combine-fast stage.
When I checked the script's debug output I found that num_iters (4) is less than mix_up_iters (6) and nnets_list[$idx] is not getting populated.
Please help me find the source of the problem.
I have attached the screen output from running run_nnet2.sh (with sh -x for train_pnorm_fast.sh).
I changed the parameters --num-epochs to 4 and --num-hidden-layers to 2 and got the nnet model final.mdl. What are the ideal values for these parameters to get a better model?
I was testing to check if I could build a deep neural net model with my training data. I used only 2500 recordings with transcriptions for acoustic model training. While testing with online decoding I was getting a near match for some utterances, while the majority were misrecognitions. Now I want to run the setup with 150-200 hours of recordings with transcriptions. Will I be able to get better recognition accuracy if I build an acoustic model with 150-200 hrs of training data?
What is the recommended hardware configuration for building a neural net acoustic model with that much training data?
How much time would it normally take on a single server (for 150-200 hrs of training data)?
Or is a grid engine setup recommended?
If there was an ideal value we would have baked it into the script.
There are some suggestions for tuning at http://kaldi.sourceforge.net/dnn2.html. I recommend using the train_pnorm_simple.sh script -- in the other scripts there is a --num-epochs-final number to configure as well, which can be confusing.
An important diagnostic is the final (train,valid) probs: do
grep LOG exp/your-dir/log/compute_prob_*.final.log
to see them.
They should differ by no more than 20%, or 50% at most; if more, then you
have too many parameters.
That is very little data to train a DNN.
With that much data you need GPUs, or it will take you a very long time
(e.g. a week at least, but depends how many cores you have).
Dan
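A hedged example of the kind of invocation and diagnostic described above (the alignment directory and the layer/epoch settings are placeholders to tune for your data, not recommendations):
steps/nnet2/train_pnorm_simple.sh --num-epochs 8 --num-hidden-layers 3 \
  --pnorm-input-dim 2000 --pnorm-output-dim 200 \
  data/train data/lang exp/tri3b_ali exp/nnet2_pnorm
# compare train vs. valid probabilities from the final iteration, as suggested above
grep LOG exp/nnet2_pnorm/log/compute_prob_*.final.log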
I was able to build Hindi acoustic models with our Hindi training data. But I am facing the following issues with recognition:
1) I built a 1-gram language model using all the words in the Hindi lexicon and constructed the graph (HCLG.fst). With the built acoustic model and the graph (HCLG.fst), what I observe is that recognition is decent for utterances containing a single word. But if an utterance contains more than one word (for example: the lord of the rings), recognition is poor. How can I get high accuracy for multi-word recognition?
2) When I was testing with online-gmm-decode-faster I found that I had to speak a little loudly and slowly for recognition to work properly. Also, sometimes the first attempt failed with a misrecognition and the second attempt gave the correct result. What could be the reason? (For example, I wanted to recognize NATWAR. In the first attempt, when I spoke NATWAR, it gave the wrong result; in the next attempt I spoke NATWAR the same way as the first time and it gave the correct result.)
Please provide some information on improving multi-word recognition accuracy, as most of the utterances will contain 2 to 5 words.
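One direction, in line with Dan's earlier suggestion to weight names by how often they occur, is to train the LM on the full multi-word titles rather than on isolated lexicon words, so that sequences like "the lord of the rings" carry probability mass; a sketch with SRILM (the file names and exp directory are placeholders):
# titles.txt: one full title per line, repeated in proportion to its count
ngram-count -order 3 -text titles.txt -lm titles_3gram.arpa
gzip titles_3gram.arpa
utils/format_lm.sh data/lang titles_3gram.arpa.gz data/local/dict/lexicon.txt data/lang_test_titles
utils/mkgraph.sh data/lang_test_titles exp/tri3b exp/tri3b/graph_titles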