
Name recognition

1 2 > >> (Page 1 of 2)
  • K R Srinidhi

    K R Srinidhi - 2014-10-01

    I have a list of movie names and song names (billions). The list keeps getting updated with new names (songs and movie names) on a regular basis. It is required to recognize movie/song names from user utterances. Is it possible to use Kaldi for such a requirement? Can open vocabulary models be built and used with kaldi for recognizing OOV words?

    • Daniel Povey

      Daniel Povey - 2014-10-01

      The acoustic models are inherently open vocabulary; the lexicon would need
      to be updated though (e.g. using g2p) and the decoding-graph recompiled.
      It's definitely possible using Kaldi but it requires some understanding of
      how speech recognition works, i.e. it might not be a suitable task for a
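A minimal sketch of the first step mentioned above, finding which words in a name list are missing from the lexicon and therefore need pronunciations (e.g. from g2p). The helper names, the toy lexicon, and the choice to upper-case words are assumptions for illustration, not part of Kaldi.

```python
def load_lexicon_words(lines):
    """Return the set of words in a lexicon whose lines look like 'WORD phone phone ...'."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0])
    return words


def find_oov_words(names, lexicon_words):
    """Return the sorted words used in the names that the lexicon does not contain."""
    oov = set()
    for name in names:
        # Assumption: the lexicon stores words in upper case.
        for word in name.upper().split():
            if word not in lexicon_words:
                oov.add(word)
    return sorted(oov)


# Toy data, purely illustrative.
lexicon_lines = ["THE dh ah", "LORD l ao r d", "OF ah v"]
names = ["the lord of the rings", "lucknow"]

oov_words = find_oov_words(names, load_lexicon_words(lexicon_lines))
print(oov_words)
```

The resulting list is what would be passed to a g2p tool before the lexicon is extended and the decoding graph recompiled.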


  • K R Srinidhi

    K R Srinidhi - 2014-10-06

    I can get my own corpus (recordings from multiple people and transcriptions) for the acoustic model. Can I build a flat language model which will only contain the names (of songs/movies), and keep updating the language model (G.fst) with additional names as and when new names are available? Then can I rebuild the decoding graph from the new language model and new lexicon (containing the new names) and use it? Is this a viable option for my requirement? Is it possible to provide me details on a step-by-step plan for building models and using them for recognition?

    • Daniel Povey

      Daniel Povey - 2014-10-06

      That plan is workable, yes.
      Probably instead of building a flat language model it would be better to
      compute some kind of probabilities for how often different movies/songs
      show up in various lists, and use those.
      Regarding the steps involved in building models and using them for
      recognition - you could probably look at any of the example scripts. I
      would suggest the Voxforge or Librispeech setups because I'm assuming you
      don't have access to LDC data.
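A minimal sketch of the suggestion above: instead of a flat model, turn counts of how often each name appears in some lists into unigram probabilities and write them in ARPA format. The function name, the toy counts, and the treatment of each name as a single token followed by one end-of-sentence event are assumptions for illustration.

```python
import math


def unigram_arpa(name_counts):
    """Build ARPA-format unigram LM text from a {name: count} mapping."""
    total = sum(name_counts.values())
    # Assumption: every listed name is one utterance, so it contributes
    # one name token and one </s> event.
    events = 2 * total
    lines = ["\\data\\", f"ngram 1={len(name_counts) + 2}", "", "\\1-grams:"]
    # <s> is never predicted, so it gets the conventional -99 log-probability.
    lines.append("-99\t<s>")
    lines.append(f"{math.log10(total / events):.4f}\t</s>")
    for name, count in sorted(name_counts.items()):
        lines.append(f"{math.log10(count / events):.4f}\t{name}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


# Toy counts, purely illustrative.
counts = {"LUCKNOW": 6, "RAIPUR": 3, "RAJGARH": 1}
arpa_text = unigram_arpa(counts)
print(arpa_text)
```

A file like this would then be converted to G.fst and the decoding graph rebuilt, as in the example scripts; names that appear more often in the lists get proportionally higher prior probability.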



  • K R Srinidhi

    K R Srinidhi - 2014-10-13

    I am planning to test with the acoustic model from the fisher_english model. Now if I want to recognize names, is it sufficient to add the new names to the vocabulary and generate the HCLG decoding graph with the existing language model (where the names do not appear), without modifying the language model?

    I tried to build a small unigram language model from a city name list (around 650 names) and a lexicon for that name list (generated using g2p), and constructed the HCLG decoding graph. But recognition using the fisher_english acoustic model and the generated HCLG.fst is not giving the desired results.

    I am using the following cmd :

    online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=namelist.txt nnet_a_gpu_online/final.mdl namelist_HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 luck.wav|" ark:/dev/null

    • Daniel Povey

      Daniel Povey - 2014-10-13

      > I am planning to test with acoustic model from fisher_english model. Now
      > if I want to recognize names is it just sufficient to add new names to
      > vocabulary and generate HCLG decoding graph with existing language model
      > (where name is not appearing) without modifying the language model?

      No, if the words don't appear in the language model they can never be
      recognized.

      > I tried to build a small unigram language model with city name list
      > (around 650 names) and lexicon for the above name list (generated using
      > g2p) and constructed HCLG decoding graph. But recognition using
      > fisher_english acoustic model and the generated HCLG.fst is not giving the
      > desired results.
      >
      > I am using the following cmd :
      >
      > online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false
      > --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1
      > --word-symbol-table=namelist.txt nnet_a_gpu_online/final.mdl
      > namelist_HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo
      > utterance-id1 luck.wav|" ark:/dev/null

      That probably should have worked - perhaps something went wrong when
      constructing the HCLG.

  • K R Srinidhi

    K R Srinidhi - 2014-10-13

    How can I debug what went wrong in my HCLG.fst? Also, the tree and model files are required while making the graph. The fisher model provided does not include a tree file, so I used the tree and model files from the voxforge tri3b model. Please provide some details on debugging this issue.

    • Daniel Povey

      Daniel Povey - 2014-10-13

      Oh - I think that gives us the answer. The tree files are not
      interchangeable, you need to include the correct one. I'll have to modify
      the script to copy the tree, which will make it
      easier for others like you.
      I just had a look at the online-nnet2 training script, and it looks like it
      uses the tree from exp/tri5a. So you should navigate to that location in
      the corresponding upload at and download that tree.



  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    I am still not getting the desired results even after rebuilding HCLG.fst against the tree and final.mdl of fisher's /exp/tri5a. I am attaching the files and also the commands used in building the HCLG.fst. Please check, if possible, what is wrong. I am running the following command (with "lucknow" being the word recorded in the utterance luck.wav):

    kaldi-trunk/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=true --online=false --config=newgraph/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=newgraph/citiwords.txt newgraph/final.mdl newgraph/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 luck.wav|' ark:/dev/null

    I am getting the following output instead of LUCKNOW

    LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars() Done.
    utterance-id1 RAJGARH

    Why is it not being recognized as LUCKNOW?

    • Nagendra Kumar Goel

      Is it "Luck now" or 'laakh nu" or something else? What's the dictionary

  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    But why is it being recognized as something completely different, such as RAJGARH or RAIPUR?
    There is no similarity between RAJGARH and LUCKNOW.

  • Vassil Panayotov

    Where did you get the phones.txt file from? As far as I can see it's different from the file in You can't really mix and match files like that - the phones file is part of the acoustic model definition AFAIK.

    You should be able to do better than RAJGARH. Just out of curiosity I upsampled the luck.wav to 16kHz and used the librispeech nnet2-online model(available for download) on it and the result is "AND NOW", which is closer I think. Also I did this using a HCLG graph built with a very generic 3-gram LM trained on 14500 books, so with a lot smaller LM(like yours) you should be able to recognize this correctly.
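A minimal sketch of how a phones.txt mismatch like the one described above could be spotted: load the symbol table that belongs to the acoustic model and the one used to build the graph, and list every symbol whose integer id differs or is missing. The function names and the toy tables are assumptions for illustration; real tables would be read from the two phones.txt files.

```python
def read_symbol_table(lines):
    """Parse 'symbol id' lines (the phones.txt layout) into a dict."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = int(fields[1])
    return table


def symbol_table_mismatches(reference, candidate):
    """Return (symbol, reference_id, candidate_id) for every disagreement."""
    problems = []
    for symbol in sorted(set(reference) | set(candidate)):
        ref_id = reference.get(symbol)
        cand_id = candidate.get(symbol)
        if ref_id != cand_id:
            problems.append((symbol, ref_id, cand_id))
    return problems


# Toy tables, purely illustrative.
model_phones = read_symbol_table(["<eps> 0", "SIL 1", "AA_B 2", "AA_E 3"])
graph_phones = read_symbol_table(["<eps> 0", "SIL 1", "AA 2"])

mismatches = symbol_table_mismatches(model_phones, graph_phones)
for symbol, ref_id, cand_id in mismatches:
    print(f"{symbol}: model={ref_id} graph={cand_id}")
```

An empty result is necessary but not sufficient: the tree and final.mdl also have to come from the same model, as noted earlier in the thread.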

  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    Can you please provide a link to the lexicon.txt and phones.txt used in the fisher model, which can be used to train a g2p model for generating the lexicon for my word list?

  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    Thanks a lot for the help. Now I am able to get LUCKNOW recognized from the utterance after following instructions by Dan. Also thanks to Vassil for pointing out the phones.txt issue.

  • K R Srinidhi

    K R Srinidhi - 2014-10-22

    I have a list of 20k unique Hindi words for which I have a lexicon written with the phonemes used in building the fisher english acoustic model. I have created a unigram language model from these 20k words and constructed the decoding graph (HCLG.fst).
    When I try recognition with the fisher english acoustic model and the above decoding graph (HCLG.fst), I find that recognition accuracy is not very good. For some words the recognition is fine, but for words starting with certain phonemes the results are very bad.
    1) Is it possible to get above 95% accuracy with the fisher english acoustic model and a decoding graph constructed as explained above? If so, what options should be looked into for tuning to get more than 95% accuracy?
    2) Is it necessary to build a new acoustic model for Hindi, with a new phoneme set covering all the sounds in the language, to get better accuracy?

    • Daniel Povey

      Daniel Povey - 2014-10-22

      Using the phone-set of one language to recognize another language is not
      something that people normally do, and we don't expect the recognition
      performance to be very good. You need to train on a Hindi dataset. I
      don't know whether one exists.


  • K R Srinidhi

    K R Srinidhi - 2014-10-29

    I have got some training data with Hindi transcriptions. I also have a lexicon for Hindi with a Hindi phone set. I have trained a GMM acoustic model (tri3b):

    Do LDA+MLLT+SAT, and decode.

    steps/ 2000 11000 data/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
    utils/ data/lang exp/tri3b exp/tri3b/graph || exit 1;

    Now when I run local/online/ using data/train and exp/tri3b it is failing in nnet-combine-fast stage .
    When I checked the script debug output I found that num_iters (4) is less than mix_up_iters (6) and nnets_list[$idx] is not getting populated.

    Please help me in finding the source of the problem.

    I have attached the screen output while running (with sh -x for

  • K R Srinidhi

    K R Srinidhi - 2014-10-30

    I changed the parameters --num-epochs to 4 and --num-hidden-layers to 2 and got the nnet model final.mdl. What is the ideal value for those parameters for getting a better model?
    I was testing to check whether I could build a deep neural net model with my training data. I used only 2500 recordings with transcriptions for acoustic model training. While testing with online decoding I was getting a nearby match for some utterances, while the majority were misrecognitions. Now I want to run the setup with 150-200 hours of recordings with transcriptions. Will I be able to get better recognition accuracy if I build an acoustic model with 150-200 hours of training data?
    What is the recommended hardware configuration for building a neural net acoustic model with the above training data?
    How much time would it normally take on a single server (for 150-200 hours of training data)?
    Or is a grid engine setup recommended?

    • Daniel Povey

      Daniel Povey - 2014-10-30

      > I changed the parameters --num-epochs to 4 and --num-hidden-layers to 2
      > and got the nnet model final.mdl. What is the ideal value for those
      > parameters for getting a better model ?

      If there was an ideal value we would have baked it into the script.
      there are some suggestions for tuning. I recommend to use the script-- in the other scripts there is a
      --num-epochs-final number to configure also, which can be confusing.
      An important diagnostic is the final (train,valid) probs: do
      grep LOG exp/your-dir/log/compute_prob_*.final.log
      to see them.
      They should differ by no more than 20%, or 50% at most; if more, then you
      have too many parameters.
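A minimal sketch of the diagnostic described above, applied to the two numbers read off the final compute_prob logs. Reading "differ by no more than 20%" as the relative gap measured against the training value is an assumption of this sketch, as are the function names and the example values.

```python
def relative_gap(train_logprob, valid_logprob):
    """Relative difference between validation and training log-probs."""
    return abs(valid_logprob - train_logprob) / abs(train_logprob)


def judge_gap(train_logprob, valid_logprob, ok=0.20, limit=0.50):
    """Classify the train/valid gap using the 20% and 50% rules of thumb."""
    gap = relative_gap(train_logprob, valid_logprob)
    if gap <= ok:
        return "ok"
    if gap <= limit:
        return "borderline"
    return "too many parameters"


# Hypothetical values, not taken from any real log.
verdict = judge_gap(train_logprob=-1.0, valid_logprob=-1.3)
print(verdict)
```

A "too many parameters" verdict would point towards fewer or smaller hidden layers, or more training data.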

      > I was testing to check if I could build a deep neural net model with my
      > training data. I used only 2500 recordings with transcriptions for
      > acoustic model training. While testing with online decoding I was getting
      > nearby match for some utterances while majority were misrecognitions.

      That is very little data to train a DNN.

      > Now I want to run the setup with 150-200 hours of recordings with
      > transcriptions. Will I be able to get better recognition accuracy if I
      > build an acoustic model with 150-200 hrs of training data?
      > What is the recommended hardware configuration for building a neural net
      > acoustic model with the above training data.
      > How much time it would take on a single server normally (for 150-200 hrs
      > of training data)?
      > Or is grid engine setup recommended ?

      With that much data you need GPUs, or it will take you a very long time
      (e.g. a week at least, but depends how many cores you have).


  • K R Srinidhi

    K R Srinidhi - 2014-11-10

    I was able to build Hindi acoustic models with our Hindi training data. But I am facing the following issues with recognition:
    1) I built a 1-gram language model using all the words in the Hindi lexicon and constructed the graph (HCLG.fst). With the built acoustic model and the graph (HCLG.fst), I observe that recognition is decent for utterances containing a single word. But if an utterance contains more than one word (for example: the lord of the rings), recognition is poor. How can I get high accuracy for multi-word recognition?
    2) When testing with online-gmm-decode-faster, I found that I had to speak a little loudly and slowly for recognition to work properly. Also, sometimes the first attempt failed with a misrecognition and the second attempt gave the correct recognition. What could be the reason? (For example, I wanted to recognize NATWAR. On the first attempt, when I spoke NATWAR, it gave the wrong result; on the next attempt I spoke NATWAR in the same way, and it gave the correct result.)
    Please provide some information on improving multi-word recognition accuracy, as most of the utterances will contain 2 to 5 words.
