Get Pocketsphinx performance info for grammar mode? Understanding LM mode output?

Help
amcdonley
2016-05-22
2016-06-09
  • amcdonley

    amcdonley - 2016-05-22

    I am running pocketsphinx on a Raspberry Pi 3 (Raspbian Jessie-lite) and want to compare its recognition performance with that of a Raspberry Pi B+.

    I have set -verbose yes and -logfn psphinx.log, and I am getting lots of output, but in "grammar mode" nothing looks like what I get in "lm mode":

    INFO: ngram_search_fwdtree.c(1567): fwdtree 0.88 CPU 0.355 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 2.63 wall 1.062 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 36 words
    INFO: ngram_search_fwdflat.c(945):     1162 words recognized (5/fr)
    INFO: ngram_search_fwdflat.c(947):   105917 senones evaluated (427/fr)
    INFO: ngram_search_fwdflat.c(949):    83719 channels searched (337/fr)
    INFO: ngram_search_fwdflat.c(951):     4624 words searched (18/fr)
    INFO: ngram_search_fwdflat.c(954):     3438 word transitions (13/fr)
    INFO: ngram_search_fwdflat.c(957): fwdflat 0.47 CPU 0.190 xRT
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.47 wall 0.191 xRT
    INFO: ngram_search.c(1252): lattice start node <s>.0 end node </s>.213
    INFO: ngram_search.c(1278): Eliminated 1 nodes before end node
    INFO: ngram_search.c(1383): Lattice has 331 nodes, 499 links
    INFO: ps_lattice.c(1380): Bestpath score: -6142
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:213:246) = -414419
    INFO: ps_lattice.c(1441): Joint P(O,S) = -439496 P(S|O) = -25077
    INFO: ngram_search.c(874): bestpath 0.01 CPU 0.004 xRT
    INFO: ngram_search.c(877): bestpath 0.00 wall 0.001 xRT
    INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.88 CPU 0.356 xRT
    INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 2.63 wall 1.067 xRT
    INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.47 CPU 0.190 xRT
    INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.47 wall 0.191 xRT
    INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.004 xRT
    INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.001 xRT
    

    Q1) Is it possible to get timing for grammar mode?

    Q2) for "lm mode" how do I interpret how long was the utterance, and how long did the reco take?

    BTW: This is my reco listener loop: (from Neil Davenport https://github.com/bynds/makevoicedemo )

                        # Tell PocketSphinx that the user has finished speaking and that it
                        # should make its best guess as to what was said.
                        self.decoder.end_utt()
                        # Get a hypothesis object holding, amongst other things, the string
                        # of words that PocketSphinx thinks the user said.
                        self.hypothesis = self.decoder.hyp()
                        if self.hypothesis is not None:
                            bestGuess = self.hypothesis.hypstr
                            print('I just heard you say: "{}"'.format(bestGuess))
                            # We are done with the microphone for now, so close the stream.
                            self.stream.stop_stream()
                            self.stream.close()
                            # We have what we came for: a string representing what the user said.
                            # Return it to the runMain function so that it can be processed
                            # and some meaning can be gleaned from it.
                            return bestGuess
    
     
    • Nickolay V. Shmyrev

      Q1) Is it possible to get timing for grammar mode?

      Grammar mode should print similar numbers if you use the latest pocketsphinx:

      INFO: fsg_search.c(869): fsg 0.13 CPU 0.023 xRT
      INFO: fsg_search.c(871): fsg 0.16 wall 0.028 xRT
      INFO: fsg_search.c(265): TOTAL fsg 0.13 CPU 0.023 xRT
      INFO: fsg_search.c(268): TOTAL fsg 0.16 wall 0.028 xRT
      

      Q2) for "lm mode" how do I interpret how long was the utterance, and how long did the reco take?

      From these lines:

      INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.88 CPU 0.356 xRT
      INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 2.63 wall 1.067 xRT
      INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.47 CPU 0.190 xRT
      INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.47 wall 0.191 xRT
      INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.004 xRT
      INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.001 xRT
      

      The three stages (fwdtree, fwdflat and bestpath) took 0.88 + 0.47 + 0.01 = 1.36 seconds of CPU time. The decoding speed was 0.356 + 0.190 + 0.004 = 0.55 xRT, which means 1 second of speech was decoded in 0.55 seconds of CPU time. The total length of the audio is therefore 1.36 / 0.55 = 2.47 seconds, but that is probably irrelevant; the xRT number is what matters.
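The arithmetic above can be automated. The following is a minimal sketch, assuming the log contains `TOTAL <stage> <seconds> CPU <value> xRT` lines in exactly the format quoted in this thread; the function name `summarize_totals` and the `SAMPLE_LOG` constant are hypothetical helpers for illustration, not part of pocketsphinx.

```python
import re

# Matches lines such as:
#   INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.88 CPU 0.356 xRT
# The pattern is an assumption based on the log excerpt quoted in this thread.
TOTAL_RE = re.compile(r"TOTAL (\w+) ([\d.]+) (CPU|wall) ([\d.]+) xRT")

SAMPLE_LOG = """\
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.88 CPU 0.356 xRT
INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 2.63 wall 1.067 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.47 CPU 0.190 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.47 wall 0.191 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.004 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.001 xRT
"""


def summarize_totals(log_text):
    """Return (cpu_seconds, cpu_xrt, audio_seconds) summed over all stages."""
    cpu_seconds = 0.0
    cpu_xrt = 0.0
    for line in log_text.splitlines():
        match = TOTAL_RE.search(line)
        # Only the CPU lines are summed; the wall lines are skipped.
        if match and match.group(3) == "CPU":
            cpu_seconds += float(match.group(2))
            cpu_xrt += float(match.group(4))
    # Audio length follows from seconds of CPU / (seconds of CPU per second of audio).
    audio_seconds = cpu_seconds / cpu_xrt if cpu_xrt else float("nan")
    return cpu_seconds, cpu_xrt, audio_seconds


if __name__ == "__main__":
    cpu, xrt, audio = summarize_totals(SAMPLE_LOG)
    print("%.2f s CPU, %.2f xRT, about %.2f s of audio" % (cpu, xrt, audio))
```

The same pattern also matches the `TOTAL fsg ...` lines printed in grammar mode, since the stage name is captured generically.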

       
  • amcdonley

    amcdonley - 2016-05-23

    Superb - Thank You! Exactly what I needed.

     
  • amcdonley

    amcdonley - 2016-05-27

    xRT is a measure only "in-speech", correct?

    total-wall-time is the sum of times from start of speech to end of speech, including inter-word silences?

    Is there a term for total-cpu-time / total-wall-time?

    result_14.txt
    ::::::::::::::
    Utterances=14
    CpuTime=56.42 seconds
    CPU xRealTime=0.926 or 92.6% of one core
    Actual Speech=60.9287 seconds
    Utterances=92.89 seconds total
    66% of utterances were speech
    ::::::::::::::
    result_63.txt
    ::::::::::::::
    Utterances=63
    CpuTime=75.7 seconds
    CPU xRealTime=0.52 or 52% of one core
    Actual Speech=145.577 seconds
    Utterances=189.77 seconds total
    77% of utterances were speech

     
    • Nickolay V. Shmyrev

      xRT is a measure only "in-speech", correct?

      Yes

      total-wall-time is the sum of times from start of speech to end of speech, including inter-word silences?

      Well, not exactly with silence included. Silence is always filtered out during processing and is not counted in the performance computation.

      It is more about how much system time it took to process the speech. When you process from a microphone, yes, the decoder waits for input and the time it simply waits is included. When you process from a file, it is just the time taken to process the speech. Wall time also includes time the machine spends doing something else; for example, if you are running some other computation, that time will be included in the wall time.
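The difference between CPU time and wall time can be seen in a small standalone experiment. This is only an illustrative Python 3 sketch and does not use pocketsphinx: `time.sleep` stands in for waiting on microphone input and a short loop stands in for decoding work. The helper names `timed` and `wait_then_compute` are made up for this example.

```python
import time


def timed(work):
    """Run work() and return (cpu_seconds, wall_seconds) for the call."""
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed real (wall-clock) time
    work()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


def wait_then_compute():
    time.sleep(0.2)                      # stand-in for waiting on audio input
    sum(i * i for i in range(200000))    # stand-in for decoding work


if __name__ == "__main__":
    cpu, wall = timed(wait_then_compute)
    print("CPU %.3f s, wall %.3f s" % (cpu, wall))
```

The sleep adds to the wall time but not to the CPU time, which is the same effect that makes the wall xRT larger than the CPU xRT when decoding live from a microphone.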

      Is there a term for total-cpu-time / total-wall-time?

      Not really
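The percentages in result_14.txt and result_63.txt above follow from three ratios. The sketch below recomputes them, assuming that "Utterances=... seconds total" is the total wall-clock duration of the utterances; that reading, the function name `utterance_stats`, and the key names are assumptions made for illustration. The last ratio is the unnamed CPU-time / wall-time figure asked about.

```python
def utterance_stats(cpu_seconds, speech_seconds, utterance_seconds):
    """Derive the ratios reported in the result files from the raw totals."""
    return {
        # CPU seconds spent per second of actual speech ("CPU xRealTime").
        "cpu_xrt": cpu_seconds / speech_seconds,
        # Share of the utterance duration that was speech.
        "speech_fraction": speech_seconds / utterance_seconds,
        # CPU seconds per second of total utterance time (no standard name).
        "cpu_per_wall": cpu_seconds / utterance_seconds,
    }


if __name__ == "__main__":
    # Numbers copied from result_14.txt and result_63.txt in this thread.
    for name, args in [("result_14", (56.42, 60.9287, 92.89)),
                       ("result_63", (75.7, 145.577, 189.77))]:
        stats = utterance_stats(*args)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```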

       
  • amcdonley

    amcdonley - 2016-06-05

    Nickolay - you are credited in Alan's Pi 3 Road Test Using CMU PocketSphinx: https://goo.gl/RrGgCm

    and the video: https://vimeo.com/169445418

    Thanks for your help.

     
    • Nickolay V. Shmyrev

      Hey, Alan, many thanks, this is a very important publication. I shared it on our blog too:
      http://cmusphinx.sourceforge.net/2016/06/should-you-select-raspberry-pi-3-or-raspberry-pi-b-for-cmusphinx/

      Actually, it would be very interesting to evaluate keyword spotting mode too, which is supposed to be the primary operation mode for IoT. It might also be interesting to play with the decoding parameters for LVCSR; it might be reasonably fast and accurate after tuning.

       
