
Name recognition

2014-10-01 - 2015-05-14
  • K R Srinidhi

    K R Srinidhi - 2014-10-01

    Hi,
    I have a list of movie names and song names (billions of entries). The list keeps getting updated with new names (songs and movie names) on a regular basis. I need to recognize movie/song names from user utterances. Is it possible to use Kaldi for such a requirement? Can open-vocabulary models be built and used with Kaldi for recognizing OOV words?
    Thanks
    Srinidhi

     
    • Daniel Povey

      Daniel Povey - 2014-10-01

      The acoustic models are inherently open vocabulary; the lexicon would need
      to be updated though (e.g. using g2p) and the decoding-graph recompiled.
      It's definitely possible using Kaldi but it requires some understanding of
      how speech recognition works, i.e. it might not be a suitable task for a
      beginner.
      Dan
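
      As a minimal sketch of that workflow - assuming a trained Phonetisaurus
      g2p model (g2p.fst) and a standard egs-style directory layout; all paths
      and file names here are illustrative, not from this thread:

      # Generate pronunciations for the new names and merge them into the lexicon.
      phonetisaurus-apply --model g2p.fst --word_list new_names.txt > new_prons.txt
      cat data/local/dict/lexicon.txt new_prons.txt | sort -u > data/local/dict_new/lexicon.txt
      # (Copy the remaining dict files - phones, silence lists, etc. - unchanged.)
      utils/prepare_lang.sh data/local/dict_new "<unk>" data/local/lang_tmp data/lang_new
      # After rebuilding G.fst in data/lang_new, recompile the decoding graph:
      utils/mkgraph.sh data/lang_new exp/tri5a exp/tri5a/graph_new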

       
  • K R Srinidhi

    K R Srinidhi - 2014-10-06

    I can get my own corpus (recordings from multiple people, with transcriptions) for the acoustic model. Can I build a flat language model that contains only the names (of songs/movies), and keep updating that language model (G.fst) with additional names as and when new names become available? Could I then rebuild the decoding graph from the new language model and the new lexicon (containing the new names) and use it? Is this a viable option for my requirement? Could you provide details on a step-by-step plan for building the models and using them for recognition?

     
    • Daniel Povey

      Daniel Povey - 2014-10-06

      That plan is workable, yes.
      Probably instead of building a flat language model it would be better to
      compute some kind of probabilities for how often different movies/songs
      show up in various lists, and use those.
      Regarding the steps involved in building models and using them for
      recognition - you could probably look at any of the example scripts. I
      would suggest the Voxforge or Librispeech setups because I'm assuming you
      don't have access to LDC data.

      Dan
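
      As a sketch of such a weighted unigram LM - assuming a hypothetical
      name_counts.txt input file (count<TAB>NAME per line) and a version of
      arpa2fst that accepts --read-symbol-table:

      # Turn name counts into a unigram ARPA LM whose weights reflect how
      # often each name shows up, then compile it to G.fst.
      total=$(awk -F'\t' '{s+=$1} END{print s}' name_counts.txt)
      n=$(wc -l < name_counts.txt)
      {
        printf '\\data\\\nngram 1=%d\n\n\\1-grams:\n' $((n + 2))
        printf -- '-99\t<s>\n0\t</s>\n'
        awk -F'\t' -v t="$total" '{printf "%.6f\t%s\n", log($1/t)/log(10), $2}' name_counts.txt
        printf '\n\\end\\\n'
      } > names.arpa
      arpa2fst --disambig-symbol='#0' --read-symbol-table=data/lang/words.txt \
        names.arpa data/lang/G.fst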

       
  • K R Srinidhi

    K R Srinidhi - 2014-10-13

    I am planning to test with the acoustic model from the fisher_english recipe. Now, if I want to recognize names, is it sufficient just to add the new names to the vocabulary and generate the HCLG decoding graph with the existing language model (in which the names do not appear), without modifying the language model?

    I tried to build a small unigram language model from a city-name list (around 650 names) and a lexicon for that list (generated using g2p), and constructed the HCLG decoding graph. But recognition using the fisher_english acoustic model and the generated HCLG.fst is not giving the desired results.

    I am using the following command:

    online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=namelist.txt nnet_a_gpu_online/final.mdl namelist_HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 luck.wav|" ark:/dev/null

     
    • Daniel Povey

      Daniel Povey - 2014-10-13

      No - if the words don't appear in the language model they can never be
      recognized, so adding them to the vocabulary alone is not sufficient.

      As for the decoding command with the 650-name unigram graph: that
      probably should have worked - perhaps something went wrong when
      constructing the HCLG.

      Dan
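
      As a quick sanity check here (paths are illustrative), one can confirm
      that a word is present both in the word-symbol table used at decode time
      and in G.fst itself, since a word absent from G.fst can never be
      hypothesized:

      grep -w LUCKNOW namelist.txt
      fstprint --osymbols=namelist.txt G.fst | grep -w LUCKNOW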


       
  • K R Srinidhi

    K R Srinidhi - 2014-10-13

    How can I debug what went wrong in my HCLG.fst? Also, while making HCLG.fst, tree and model files are required. In the fisher model provided on kaldi-asr.org the tree file is not available, so I used the tree and model files from the voxforge tri3b model. Please provide some details on debugging this issue.

     
    • Daniel Povey

      Daniel Povey - 2014-10-13

      Oh - I think that gives us the answer. The tree files are not
      interchangeable; you need to include the correct one. I'll have to modify
      the prepare_online_decoding.sh script to copy the tree, which will make it
      easier for others like you.
      I just had a look at the online-nnet2 training script, and it looks like it
      uses the tree from exp/tri5a. So you should navigate to that location in
      the corresponding upload at kaldi-asr.org and download that tree.

      Dan
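
      A sketch of what that looks like in practice - the URL mirrors the
      kaldi-asr.org build path Vassil cites below, but the exact path and
      directory names here are illustrative:

      # Put the matching tree and model side by side, then rebuild the graph.
      mkdir -p exp/fisher_tri5a
      wget -O exp/fisher_tri5a/tree \
        http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/tree
      cp nnet_a_gpu_online/final.mdl exp/fisher_tri5a/
      utils/mkgraph.sh data/lang_names exp/fisher_tri5a exp/fisher_tri5a/graph_names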

       
  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    I am still not getting the desired results, even after rebuilding HCLG.fst against the tree and final.mdl from fisher's exp/tri5a. I am attaching the files and also the commands (cmds.sh) used to build the HCLG.fst. Please check, if possible, what is wrong. When I run the following command (with "lucknow" recorded in the utterance luck.wav):

    kaldi-trunk/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=true --online=false --config=newgraph/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=newgraph/citiwords.txt newgraph/final.mdl newgraph/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 luck.wav|' ark:/dev/null

    I am getting the following output instead of LUCKNOW:

    LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars():ivector-extractor.cc:201) Done.
    utterance-id1 RAJGARH

    Why is it not being recognized as LUCKNOW?

    Attachment: newgraph.zip (626.0 kB; application/zip)

     
    • Nagendra Kumar Goel

      Nagendra Kumar Goel - 2014-10-14

      Is it "Luck now" or "laakh nu" or something else? What's the dictionary
      entry?
       
  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    But why is it recognizing something completely different, like RAJGARH or RAIPUR? There is no similarity between RAJGARH and LUCKNOW.

     
  • Vassil Panayotov

    Vassil Panayotov - 2014-10-14

    Where did you get the phones.txt file from? As far as I can see, it's different from the file in http://www.kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/. You can't really mix and match files like that - the phones file is part of the acoustic model definition, AFAIK.

    You should be able to do better than RAJGARH. Just out of curiosity, I upsampled luck.wav to 16 kHz and ran the librispeech nnet2-online model (available for download) on it, and the result is "AND NOW", which is closer, I think. Also, I did this using an HCLG graph built with a very generic 3-gram LM trained on 14500 books, so with a much smaller LM (like yours) you should be able to recognize this correctly.
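
    For reference, the upsampling step and a quick phone-set consistency check
    might look like this (paths are illustrative):

    # Resample the 8 kHz recording to 16 kHz for a 16 kHz model.
    sox luck.wav -r 16000 luck_16k.wav
    # The decoding graph and the acoustic model must share one phone set:
    cmp data/lang/phones.txt exp/tri5a/graph/phones.txt && echo "phones match"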

     
  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    Can you please provide links to the lexicon.txt and phones.txt used in the fisher model, which I can use to train a g2p model for generating a lexicon for my word list?

     
  • K R Srinidhi

    K R Srinidhi - 2014-10-14

    Thanks a lot for the help. I am now able to get LUCKNOW recognized from the utterance after following Dan's instructions. Thanks also to Vassil for pointing out the phones.txt issue.

     
  • K R Srinidhi

    K R Srinidhi - 2014-10-22

    I have a list of 20k unique Hindi-language words for which I have a lexicon using the phonemes from the fisher_english acoustic model. I have created a unigram language model from these 20k words and constructed the decoding graph (HCLG.fst).
    When I try recognition with the fisher_english acoustic model and this decoding graph (HCLG.fst), I find that the recognition accuracy is not very good. For some words the recognition is fine, but for words starting with certain phonemes the results are very bad.
    1) Is it possible to get above 95% accuracy with the fisher_english acoustic model and a decoding graph constructed as explained above? If so, what options should be looked into for tuning to get more than 95% accuracy?
    2) Is it necessary to build a new acoustic model for Hindi, with a new phoneme set covering all the sounds in the language, to get better accuracy?

     
    • Daniel Povey

      Daniel Povey - 2014-10-22

      Using the phone-set of one language to recognize another language is not
      something that people normally do, and we don't expect the recognition
      performance to be very good. You need to train on a Hindi dataset. I
      don't know whether one exists.
      Dan

       
  • K R Srinidhi

    K R Srinidhi - 2014-10-29

    I have got some training data with Hindi transcriptions. I also have a lexicon for Hindi with a Hindi phone set. I have trained a GMM acoustic model (tri3b):

    # Do LDA+MLLT+SAT, and decode.
    steps/train_sat.sh 2000 11000 data/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
    utils/mkgraph.sh data/lang exp/tri3b exp/tri3b/graph || exit 1;

    Now when I run local/online/run_nnet2.sh using data/train and exp/tri3b, it fails at the nnet-combine-fast stage.
    When I checked the script's debug output, I found that num_iters (4) is less than mix_up_iters (6), and nnets_list[$idx] is not getting populated.

    Please help me find the source of the problem.

    I have attached the screen output from running run_nnet2.sh (with sh -x for train_pnorm_fast.sh).

     
  • K R Srinidhi

    K R Srinidhi - 2014-10-30

    I changed the parameters --num-epochs to 4 and --num-hidden-layers to 2 and got the nnet model final.mdl. What are the ideal values of those parameters for getting a better model?
    I was testing whether I could build a deep neural net model with my training data. I used only 2500 recordings with transcriptions for acoustic model training. While testing with online decoding, I was getting a near match for some utterances, while the majority were misrecognitions. Now I want to run the setup with 150-200 hours of recordings with transcriptions. Will I get better recognition accuracy if I build an acoustic model with 150-200 hours of training data?
    What is the recommended hardware configuration for building a neural net acoustic model with that much training data?
    How much time would it normally take on a single server (for 150-200 hours of training data)?
    Or is a grid engine setup recommended?

     
    • Daniel Povey

      Daniel Povey - 2014-10-30

      If there was an ideal value for --num-epochs and --num-hidden-layers, we
      would have baked it into the script. There are some suggestions for
      tuning here:
      http://kaldi.sourceforge.net/dnn2.html
      I recommend using the train_pnorm_simple.sh script - in the other
      scripts there is also a --num-epochs-final number to configure, which
      can be confusing.
      An important diagnostic is the final (train,valid) probs: do
      grep LOG exp/your-dir/log/compute_prob_*.final.log
      to see them. They should differ by no more than 20%, or 50% at most; if
      they differ by more, you have too many parameters.

      Using only 2500 recordings is very little data to train a DNN.

      As for moving to 150-200 hours: with that much data you need GPUs, or
      training will take a very long time (e.g. a week at least, but it
      depends how many cores you have).

      Dan
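
      As an illustration of the kind of invocation Dan suggests - assuming
      alignments in exp/tri3b_ali; the option values are illustrative starting
      points, not recommendations from this thread:

      # Train a p-norm DNN on top of the tri3b alignments.
      steps/nnet2/train_pnorm_simple.sh --num-epochs 8 --num-hidden-layers 3 \
        --pnorm-input-dim 2000 --pnorm-output-dim 200 \
        data/train data/lang exp/tri3b_ali exp/nnet2_pnorm
      # Compare the final train/valid probabilities from the log:
      grep LOG exp/nnet2_pnorm/log/compute_prob_*.final.log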


       
  • K R Srinidhi

    K R Srinidhi - 2014-11-10

    I was able to build Hindi acoustic models with our Hindi training data, but I am facing the following issues with recognition:
    1) I built a 1-gram language model using all the words in the Hindi lexicon and constructed the graph (HCLG.fst). With the acoustic model and this graph (HCLG.fst), what I am observing is that recognition is decent for utterances containing a single word, but if an utterance contains more than one word (for example: "the lord of the rings"), recognition is poor. How can I get high accuracy for multi-word recognition?
    2) When I was testing with online-gmm-decode-faster, I found that I had to speak a little loudly and slowly for recognition to work properly. Also, sometimes the first attempt failed with a misrecognition while the second attempt gave the correct recognition. What could be the reason? (For example, I wanted to recognize NATWAR. On the first attempt, when I spoke NATWAR, it gave the wrong result; on the next attempt I spoke NATWAR in the same way as the first, and it gave the correct result.)
    Please provide some information on improving multi-word recognition accuracy, as most of the utterances will contain 2 to 5 words.

     