how to get this accuracy with sphinx-3

CMOS
2008-03-28
2012-09-22
  • CMOS

    CMOS - 2008-03-28

    Hi,
    I listened to a sample dialog at
    http://www.speech.cs.cmu.edu/letsgo/example.html
    which is from the "LetsGo" project based on Sphinx.
    The recognition rate of that voice is great, and I would like to train a system that can do a similar job.
    However, I tried open-source acoustic models like WSJ1 and HUB4, and the results I got were very bad, with a WER of almost 100%. I'm not sure what I'm doing wrong here.
    For audio input I use a standard microphone that comes with a headset.

    Any guesses/advice are greatly appreciated.

     
    • CMOS

      CMOS - 2008-04-09

      Does anyone know a place where I can download such corpora (for call centers, etc.)?

       
    • Nickolay V. Shmyrev

      Upload a speech sample you want to recognize and we'll show you the options you should use.

      Were you able to recognize the simple commands we discussed before?

       
      • CMOS

        CMOS - 2008-03-28

        Hi,
        I tried the files you've uploaded (dictionary, language model, etc.) with the audio files I sent you, but the result was very poor (100% WER). For this I used the "WSJ1 (dictation) acoustic models - for wideband (16kHz) microphone speech" acoustic model.
        Please let me know if you have gotten any good results with the audio files I already sent you. If I can get good results with those files, then I might be able to go ahead.

        thanks

         
    • CMOS

      CMOS - 2008-03-29

      Hi,
      Thank you for the help.

      The following link has some samples I would like to recognize correctly:
      http://rapidshare.com/files/103263009/samples.zip.html
      (they are in WAV format).

      It would be best if the system could handle a large vocabulary, because my application requires recognizing a medium-to-large vocabulary.

      thanks

       
      • Nickolay V. Shmyrev

        Well, again they decode mostly fine; check my result here. I used the sphinx3 trunk and the new WSJ model available here:

        http://www.mediafire.com/?zkp03h9temd
        http://www.speech.cs.cmu.edu/sphinx/models/wsj_jan2008/wsj_all_mllt_4000_20080104.tar.gz

        The files are decoded mostly correctly, but there is a little problem - there are too many
        garbage "A" letters. The reason is simple - your files have been preprocessed somehow. They have long periods of complete silence, and the decoder fails to handle that. You have to add dither with -dither yes, but even dither doesn't help with subvector quantization.

        Another problem with your files - they have no initial silence. A file should start with around half a second of silence to be recognized correctly.

        Summarizing the above - don't preprocess the files if you'd like to get good quality. Use the recordings as is.
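
        The half-second-of-leading-silence fix above is easy to script. A minimal sketch with the Python stdlib (`prepend_silence` is a hypothetical helper name, not part of any Sphinx tool); the idea is that the decoder estimates its noise/CMN statistics from the leading samples, so files that start right at the speech can be misrecognized:

        ```python
        import wave

        def prepend_silence(src_path, dst_path, seconds=0.5):
            """Prepend `seconds` of digital silence to a PCM WAV file."""
            with wave.open(src_path, "rb") as src:
                params = src.getparams()
                frames = src.readframes(src.getnframes())
            # One frame = sampwidth bytes per channel; zero bytes are
            # silence for signed PCM.
            pad = b"\x00" * (int(params.framerate * seconds)
                             * params.sampwidth * params.nchannels)
            with wave.open(dst_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(pad + frames)
        ```

        At 16 kHz this prepends 8000 zero samples per channel, leaving the rest of the file untouched.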

         
    • CMOS

      CMOS - 2008-04-01

      When I ran it, it gave the following error:

      INFO: Word Insertion Penalty =0.700000
      INFO: Silence probability =0.100000
      INFO: Filler probability =0.100000
      INFO:
      INFO: dict2pid.c(577): Building PID tables for dictionary
      INFO: Initialization of dict2pid_t, report:
      INFO: Dict2pid is in composite triphone mode
      INFO: 267 composite states; 106 composite sseq
      INFO:
      INFO: kbcore.c(623): Inside kbcore: Verifying models consistency ......
      FATAL_ERROR: "kbcore.c", line 628: Feature streamlen(1) != mgau streamlen(30)

      I'm using the trunk.
      Any idea what this could be?
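
      (A hedged guess, not confirmed by the thread: a "Feature streamlen != mgau streamlen" fatal error in sphinx3 typically means the decoder's feature configuration doesn't match what the acoustic model was trained with, e.g. a multi-stream feature type against a single-stream model. sphinx3's `-feat` flag selects the feature type; the paths below are placeholders:)

      ```shell
      # Hypothetical invocation - model/dict/LM paths are placeholders.
      # 1s_c_d_dd = single-stream cepstra + deltas + double-deltas,
      # which is what the recent single-stream WSJ models expect.
      sphinx3_decode \
          -feat 1s_c_d_dd \
          -mdef model/mdef \
          -mean model/means \
          -var model/variances \
          -mixw model/mixture_weights \
          -tmat model/transition_matrices \
          -dict your.dic \
          -lm your.lm \
          -ctl files.ctl
      ```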

       
    • CMOS

      CMOS - 2008-04-01

      I finally managed to run it and got the result you got.
      Thank you very much for the support.

      I used the same setup to decode the following speech (just one word - "attention"), but it fails.
      I recorded it with a fairly good quality microphone.
      Please have a look at the sample and let me know what you think.

      Here is the link to the sample:
      http://rapidshare.com/files/104087869/test.zip.html

      thank you.

       
      • Nickolay V. Shmyrev

        Your file is 44.1 kHz stereo; convert it to 16 kHz mono and it will work fine.
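
        A rough sketch of that conversion using only the Python stdlib (`to_mono_16k` is a hypothetical helper): channels are averaged and the rate is converted by linear interpolation. For real use prefer a proper resampler such as sox or ffmpeg, which apply an anti-alias filter first:

        ```python
        import struct
        import wave

        def to_mono_16k(src_path, dst_path, target_rate=16000):
            """Downmix a 16-bit PCM WAV to mono and naively resample it."""
            with wave.open(src_path, "rb") as src:
                nch = src.getnchannels()
                rate = src.getframerate()
                n = src.getnframes()
                samples = struct.unpack("<%dh" % (n * nch),
                                        src.readframes(n))
            # Average the channels into one mono stream.
            mono = [sum(samples[i * nch:(i + 1) * nch]) // nch
                    for i in range(n)]
            # Linear-interpolation resampling from `rate` to `target_rate`.
            out_n = int(n * target_rate / rate)
            out = []
            for j in range(out_n):
                pos = j * rate / target_rate
                i = int(pos)
                frac = pos - i
                a = mono[i]
                b = mono[min(i + 1, n - 1)]
                out.append(int(a + (b - a) * frac))
            with wave.open(dst_path, "wb") as dst:
                dst.setnchannels(1)
                dst.setsampwidth(2)
                dst.setframerate(target_rate)
                dst.writeframes(struct.pack("<%dh" % out_n, *out))
        ```

        With sox installed, `sox in.wav -r 16000 -c 1 out.wav` does the same job with proper filtering.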

         
    • CMOS

      CMOS - 2008-04-08

      Hi again,
      I got good results with the language models you sent. However, with more general models that have a large number of words, the results I'm getting are quite poor, with almost 100% word error rates. Yet I've seen statistics about Sphinx-3 showing around 70% accuracy even with large vocabularies.
      My plan is to use this in call centers, which have fairly large vocabularies, and the input audio will come from a telephone line. Since the results I'm getting are poor even with good-quality microphones, I'm not sure whether I can achieve what I want. It would be great if someone could point me to a case study/project where Sphinx-3 is used in a similar application.
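
      (For reference, the WER figures quoted in this thread are computed as the word-level edit distance between reference and hypothesis, divided by the reference length; a minimal sketch, where `wer` is just an illustrative helper:)

      ```python
      def wer(ref, hyp):
          """Word error rate: word-level edit distance / reference length."""
          r, h = ref.split(), hyp.split()
          # Dynamic-programming Levenshtein distance over words.
          d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
          for i in range(len(r) + 1):
              d[i][0] = i  # all deletions
          for j in range(len(h) + 1):
              d[0][j] = j  # all insertions
          for i in range(1, len(r) + 1):
              for j in range(1, len(h) + 1):
                  cost = 0 if r[i - 1] == h[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1,          # deletion
                                d[i][j - 1] + 1,          # insertion
                                d[i - 1][j - 1] + cost)   # substitution
          return d[len(r)][len(h)] / len(r)
      ```

      So `wer("turn left", "turn right")` is 0.5, and a hypothesis sharing no words with the reference gives 100% (or more, if the decoder inserts extra words).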

       
      • Nickolay V. Shmyrev

        > which has fairly large vocabularies and the input audio will be coming from telephone line.

        Well, everything depends on the real vocabulary size; that's why we expect exact numbers from you. For a telephone line you can get around 95% accuracy with 2000 words, which is actually enough for simple speech. Of course, if your vocabulary is 40,000 words, you can't expect more than 70% even from advanced commercial systems.

        So first of all you must design your complete system - vocabulary, interaction and so on. Later, if you have recognition-rate problems, share your recordings and we'll try to optimize the decoder parameters.

         
      • Nagendra Kumar Goel

        Hello,
        I can only say that 70% is very much an achievable target with Sphinx. I don't
        know enough about your system to tell what's wrong.
        Cheers,
        Nagendra

         
