
Strange Crash With PocketSphinx

Help
Scorx Ion
2013-03-14
2013-03-17
  • Scorx Ion

    Scorx Ion - 2013-03-14

    Hello,

    I'm using PocketSphinx in a training simulator I am writing and I am getting a strange crash about 1/6 to 1/10 times I run the simulator. The crash always occurs on EXACTLY the second utterance, always gives the same call stack, and is always a read access violation at 0x0.

    In the simulation, I am trying to get as close to real-time results as possible. To do this, I capture the live audio into a collection of circular buffers. The end result is that every 3 seconds I send the previous 4 seconds to sphinx for processing. This means the first 1 second that is processed is the same as the last 1 second that was previously processed.
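A minimal sketch of the overlapping-window idea described above, assuming 16-bit mono samples. The `SampleRing` class and its method names are hypothetical and are not taken from the poster's code; the capacity is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: keeps only the most recent `capacity` samples.
// Assumes capacity > 0.
class SampleRing {
public:
    explicit SampleRing(std::size_t capacity)
        : buf_(capacity), head_(0), size_(0) {}

    // Append n samples, overwriting the oldest ones once the ring is full.
    void push(const std::int16_t* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            buf_[head_] = data[i];
            head_ = (head_ + 1) % buf_.size();
            if (size_ < buf_.size()) ++size_;
        }
    }

    // Copy of the stored samples in chronological order (oldest first).
    std::vector<std::int16_t> snapshot() const {
        std::vector<std::int16_t> out;
        out.reserve(size_);
        const std::size_t start = (head_ + buf_.size() - size_) % buf_.size();
        for (std::size_t i = 0; i < size_; ++i)
            out.push_back(buf_[(start + i) % buf_.size()]);
        return out;
    }

private:
    std::vector<std::int16_t> buf_;
    std::size_t head_;  // next write position
    std::size_t size_;  // number of valid samples
};
```

With a capacity of 4 seconds of audio and a snapshot taken after every 3 seconds of new input, each snapshot begins with the last second of the previous one, which is the overlap described in the post. Returning a copy keeps the decoder's input independent of the buffer that the capture code keeps writing to.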

    The reasoning for this approach is 2 parts:
    1) There is no guarantee that the user will have frequent enough breaks of silence. Waiting for silence to process each speech segment would likely make it no longer appear to be "real time" or may result in segments that are just too long for run-time processing.
    2) Since I process speech every 3 seconds, it is quite likely that a segment of audio I send in for processing ends in the middle of a word. Thus I include the last 1 second of the previously processed audio segment (1 second was chosen arbitrarily). I use the beginning and end timestamps of each word (from PocketSphinx) to eliminate word duplicates.
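One possible way to drop duplicates from the overlapping second is sketched below. It is not the poster's implementation: `TimedWord` and `acceptNewWords` are hypothetical names, and the acceptance rule (keep a word only if it starts after the last accepted word ended) is just one simple policy. Frame indices are assumed to be relative to the start of each chunk, so they are shifted by the chunk's offset before comparison.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one recognized word.
struct TimedWord {
    std::string text;
    int startFrame;  // chunk-relative on input, absolute on output
    int endFrame;
};

// Shift chunk-relative frames to absolute frames and keep only words that
// start after the end of the last word already accepted.
// lastAcceptedEndFrame is updated in place; start it at -1.
std::vector<TimedWord> acceptNewWords(const std::vector<TimedWord>& chunkWords,
                                      int chunkOffsetFrames,
                                      int& lastAcceptedEndFrame) {
    std::vector<TimedWord> accepted;
    for (const TimedWord& w : chunkWords) {
        const int absStart = w.startFrame + chunkOffsetFrames;
        const int absEnd = w.endFrame + chunkOffsetFrames;
        if (absStart > lastAcceptedEndFrame) {
            accepted.push_back({w.text, absStart, absEnd});
            lastAcceptedEndFrame = absEnd;
        }
    }
    return accepted;
}
```

At the posted frame rate of 100 frames per second, a second chunk covering seconds 2 to 6 would use an offset of 200 frames. A word that was cut off at the end of the previous chunk may have been misrecognized there, and this simple rule would keep the earlier, possibly wrong, result.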

    To facilitate this flow, every 3 seconds I make the following calls:

    int procError = ps_start_utt(speechDecoder, NULL);
    char const *utteranceID = ps_get_uttid(speechDecoder);
    procError = ps_process_raw(speechDecoder, (qint16 *)ptrToProcessBuffer, (bytesRecorded), 0, 0);
    procError = ps_end_utt(speechDecoder);
    ps_seg_t *recWord = ps_seg_iter(speechDecoder, &pathScore);

    // retrieve each word recognized by using this in a loop:
    ps_seg_frames(recWord, &startFrame, &endFrame);
    float confidenceScore = logmath_exp(ps_get_logmath(speechDecoder), postProb);
    QString wordString = ps_seg_word(recWord);
    recWord = ps_seg_next(recWord);
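For readers unfamiliar with the log-domain value passed to `logmath_exp` above: as far as I understand it, the call converts an integer log value back to a linear probability by raising the log base to that power. The sketch below reproduces that arithmetic with the posted `-logbase` of 1.0001. The function name `logToLinear` is hypothetical, and any shift the decoder's logmath object may apply is ignored. The snippet in the post does not show where `postProb` is assigned; presumably it comes from `ps_seg_prob`.

```cpp
#include <cmath>

// Linear probability for a value stored as an integer exponent of logBase.
// Assumption: no extra shift is applied by the log-math object.
double logToLinear(int logValue, double logBase = 1.0001) {
    return std::pow(logBase, static_cast<double>(logValue));
}
```

A log value of 0 maps to a probability of 1, and increasingly negative values map to probabilities closer to 0.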

    Since I do not have a previous recording buffer the first time audio is sent to PocketSphinx, it is sent as a 3 second chunk (48000 bytes) instead of a 4 second chunk (64000 bytes). The segments that are sent to PocketSphinx may at times contain no speech at all. Another miscellaneous note is that this whole recognition process runs in a separate thread from the GUI to prevent stuttering (PocketSphinx and all the systems I use directly with it are on the same thread; nothing outside this thread has access to PocketSphinx).
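One thing worth checking against the sizes quoted above: the third argument of `ps_process_raw` is a count of 16-bit samples, not a count of bytes. The post passes a variable named `bytesRecorded`, and the thread does not show whether it holds bytes or samples. If it holds bytes, the decoder would be asked to read twice as much data as the buffer contains. The sketch below only shows the conversion; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Number of 16-bit samples contained in a buffer of byteCount bytes.
constexpr std::size_t bytesToSamples(std::size_t byteCount) {
    return byteCount / sizeof(std::int16_t);
}

// Duration in seconds of sampleCount samples at sampleRateHz.
constexpr double samplesToSeconds(std::size_t sampleCount, unsigned sampleRateHz) {
    return static_cast<double>(sampleCount) / sampleRateHz;
}
```

At the posted `-samprate 8000`, the 48000-byte chunk is 24000 samples (3 seconds) and the 64000-byte chunk is 32000 samples (4 seconds), which agrees with the durations given in the post.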

    On the crash, the top of the call stack I am getting is always:
    0 _VEC_memcpy MSVCR100D 0x5f64be8d
    1 fe_shift_frame fe_sigproc.c 636 0x3a5393
    2 fe_process_frames fe_interface.c 435 0x3a3499
    3 acmod_process_raw acmod.c 674 0x1002a797
    4 ps_process_raw pocketsphinx.c 769 0x1005bcb8
    5 SpeechRecognizer::RecognizeSpeech SpeechRecognizer.cpp 85 0x2c32fb

    The output I got from PocketSphinx on the most recent crash is (I scrubbed the path to make it more readable):

    INFO: cmd_ln.c(691): Parsing command line:
    \
    -hmm PATH/sphinx_models/hub4wsj_sc_8k \
    -lm PATH/sphinx_models/SM1.lm \
    -dict PATH/sphinx_models/SM1.dic \
    -samprate 8000 \
    -fwdflat yes \
    -bestpath yes

    Current configuration:
    [NAME] [DEFLT] [VALUE]
    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -ascale 20.0 2.000000e+001
    -aw 1 1
    -backtrace no no
    -beam 1e-48 1.000000e-048
    -bestpath yes yes
    -bestpathlw 9.5 9.500000e+000
    -bghist no no
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 8.0
    -compallsen no no
    -debug 0
    -dict PATH/sphinx_models/SM1.dic
    -dictcase no no
    -dither no no
    -doublebw no no
    -ds 1 1
    -fdict
    -feat 1s_c_d_dd 1s_c_d_dd
    -featparams
    -fillprob 1e-8 1.000000e-008
    -frate 100 100
    -fsg
    -fsgusealtpron yes yes
    -fsgusefiller yes yes
    -fwdflat yes yes
    -fwdflatbeam 1e-64 1.000000e-064
    -fwdflatefwid 4 4
    -fwdflatlw 8.5 8.500000e+000
    -fwdflatsfwin 25 25
    -fwdflatwbeam 7e-29 7.000000e-029
    -fwdtree yes yes
    -hmm PATH/sphinx_models/hub4wsj_sc_8k
    -input_endian little little
    -jsgf
    -kdmaxbbi -1 -1
    -kdmaxdepth 0 0
    -kdtree
    -latsize 5000 5000
    -lda
    -ldadim 0 0
    -lextreedump 0 0
    -lifter 0 0
    -lm PATH/sphinx_models/SM1.lm
    -lmctl
    -lmname default default
    -logbase 1.0001 1.000100e+000
    -logfn
    -logspec no no
    -lowerf 133.33334 1.333333e+002
    -lpbeam 1e-40 1.000000e-040
    -lponlybeam 7e-29 7.000000e-029
    -lw 6.5 6.500000e+000
    -maxhmmpf -1 -1
    -maxnewoov 20 20
    -maxwpf -1 -1
    -mdef
    -mean
    -mfclogdir
    -min_endfr 0 0
    -mixw
    -mixwfloor 0.0000001 1.000000e-007
    -mllr
    -mmap yes yes
    -ncep 13 13
    -nfft 512 512
    -nfilt 40 40
    -nwpen 1.0 1.000000e+000
    -pbeam 1e-48 1.000000e-048
    -pip 1.0 1.000000e+000
    -pl_beam 1e-10 1.000000e-010
    -pl_pbeam 1e-5 1.000000e-005
    -pl_window 0 0
    -rawlogdir
    -remove_dc no no
    -round_filters yes yes
    -samprate 16000 8.000000e+003
    -seed -1 -1
    -sendump
    -senlogdir
    -senmgau
    -silprob 0.005 5.000000e-003
    -smoothspec no no
    -svspec
    -tmat
    -tmatfloor 0.0001 1.000000e-004
    -topn 4 4
    -topn_beam 0 0
    -toprule
    -transform legacy legacy
    -unit_area yes yes
    -upperf 6855.4976 6.855498e+003
    -usewdphones no no
    -uw 1.0 1.000000e+000
    -var
    -varfloor 0.0001 1.000000e-004
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wbeam 7e-29 7.000000e-029
    -wip 0.65 6.500000e-001
    -wlen 0.025625 2.562500e-002

    INFO: cmd_ln.c(691): Parsing command line:
    \
    -nfilt 20 \
    -lowerf 1 \
    -upperf 4000 \
    -wlen 0.025 \
    -transform dct \
    -round_filters no \
    -remove_dc yes \
    -svspec 0-12/13-25/26-38 \
    -feat 1s_c_d_dd \
    -agc none \
    -cmn current \
    -cmninit 56,-3,1 \
    -varnorm no

    Current configuration:
    [NAME] [DEFLT] [VALUE]
    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 56,-3,1
    -dither no no
    -doublebw no no
    -feat 1s_c_d_dd 1s_c_d_dd
    -frate 100 100
    -input_endian little little
    -lda
    -ldadim 0 0
    -lifter 0 0
    -logspec no no
    -lowerf 133.33334 1.000000e+000
    -ncep 13 13
    -nfft 512 512
    -nfilt 40 20
    -remove_dc no yes
    -round_filters yes no
    -samprate 16000 8.000000e+003
    -seed -1 -1
    -smoothspec no no
    -svspec 0-12/13-25/26-38
    -transform legacy dct
    -unit_area yes yes
    -upperf 6855.4976 4.000000e+003
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wlen 0.025625 2.500000e-002

    INFO: acmod.c(246): Parsed model-specific feature parameters from PATH/sphinx_models/hub4wsj_sc_8k/feat.params
    INFO: feat.c(713): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: acmod.c(167): Using subvector specification 0-12/13-25/26-38
    INFO: mdef.c(517): Reading model definition: PATH/sphinx_models/hub4wsj_sc_8k/mdef
    INFO: mdef.c(528): Found byte-order mark BMDF, assuming this is a binary mdef file
    INFO: bin_mdef.c(336): Reading binary model definition: PATH/sphinx_models/hub4wsj_sc_8k/mdef
    INFO: bin_mdef.c(513): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150 CI-sen, 5150 Sen, 27135 Sen-Seq
    INFO: tmat.c(205): Reading HMM transition probability matrices: PATH/sphinx_models/hub4wsj_sc_8k/transition_matrices
    INFO: acmod.c(121): Attempting to use SCHMM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: PATH/sphinx_models/hub4wsj_sc_8k/means
    INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: PATH/sphinx_models/hub4wsj_sc_8k/variances
    INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(294): 256x13
    INFO: ms_gauden.c(354): 0 variance values floored
    INFO: s2_semi_mgau.c(903): Loading senones from dump file PATH/sphinx_models/hub4wsj_sc_8k/sendump
    INFO: s2_semi_mgau.c(927): BEGIN FILE FORMAT DESCRIPTION
    INFO: s2_semi_mgau.c(1022): Using memory-mapped I/O for senones
    INFO: s2_semi_mgau.c(1296): Maximum top-N: 4 Top-N beams: 0 0 0
    INFO: dict.c(317): Allocating 4144 * 20 bytes (80 KiB) for word entries
    INFO: dict.c(332): Reading main dictionary: PATH/sphinx_models/SM1.dic
    INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(335): 37 words read
    INFO: dict.c(341): Reading filler dictionary: PATH/sphinx_models/hub4wsj_sc_8k/noisedict
    INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(344): 11 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial triphones
    INFO: dict2pid.c(131): Allocated 30200 bytes (29 KiB) for word-final triphones
    INFO: dict2pid.c(195): Allocated 30200 bytes (29 KiB) for single-phone word triphones
    INFO: ngram_model_arpa.c(477): ngrams 1=33, 2=114, 3=190
    INFO: ngram_model_arpa.c(135): Reading unigrams
    INFO: ngram_model_arpa.c(516): 33 = #unigrams created
    INFO: ngram_model_arpa.c(195): Reading bigrams
    INFO: ngram_model_arpa.c(533): 114 = #bigrams created
    INFO: ngram_model_arpa.c(534): 36 = #prob2 entries
    INFO: ngram_model_arpa.c(542): 29 = #bo_wt2 entries
    INFO: ngram_model_arpa.c(292): Reading trigrams
    INFO: ngram_model_arpa.c(555): 190 = #trigrams created
    INFO: ngram_model_arpa.c(556): 15 = #prob3 entries
    INFO: ngram_search_fwdtree.c(99): 25 unique initial diphones
    INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 14 single-phone words
    INFO: ngram_search_fwdtree.c(186): Creating search tree
    INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 14 single-phone words
    INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 185
    INFO: ngram_search_fwdtree.c(338): after: 25 root, 57 non-root channels, 13 single-phone words
    INFO: ngram_search_fwdflat.c(156): fwdflat: min_ef_width = 4, max_sf_win = 25

    Any ideas on what I could be doing to cause this crash?

    Thanks in advance!

     
  • Nickolay V. Shmyrev

    On the simulation, I am trying to achieve as close to real time results as possible. To do this, I am capturing the live audio into a collection of circular buffers. The end result being that every 3 seconds I send the previous 4 seconds to sphinx for processing. This means the first 1 second that is processed is the same as the last 1 second that was previously processed.

    This is not a good idea overall; you probably want to rethink the design of your application. I'm not sure what your system is, but your workarounds for "realtime" don't make much sense to me. With a vocabulary of your size, decoding should be very fast even on a very low-resource computer.

    If you want quick results during processing you just need to call ps_get_hyp without ps_end_utt in the middle of the utterances. It will return a valid partial hypothesis. This video demonstrates how it works:

    http://www.youtube.com/watch?v=OEUeJb6Pwt4

    The crash is likely caused by a memory violation in the external code, or by some other reason. On Linux you can use Valgrind to find the cause of errors like yours.

     
  • Scorx Ion

    Scorx Ion - 2013-03-16

    Thank you very much for taking the time to look over this crash information and for the recommendation. Unfortunately I do not think I will be able to use ps_get_hyp in my application. It is necessary to my application to get specific time stamps for each word, and ps_get_hyp does not appear to do this.

    I knew before designing my application in this way that my use of ps_start_utt and ps_end_utt was unorthodox. However, I had thought that in the worst case this would just lead to reduced accuracy. Is it possible that this usage is what is causing the crash?

    My application is a training simulator designed to allow Stage Managers to practice. I mentioned some of the specifics about this in my other thread about tuning PocketSphinx for performance. I am using PocketSphinx to allow grading of what the user says during a simulation session. The grading criteria for each word are:
    1) Must be spoken clearly
    2) Must be the correct word (out of several variants)
    3) Must be said at the right time
    4) Each word's score is averaged into a phrase of 6-10 words. The user is not told whether each word is correct, instead they are told if a given phrase is correct (determined by having >= 50% of the words correct).
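The phrase-level rule in item 4 can be sketched as follows. The function name `phraseIsCorrect` is hypothetical, and treating an empty phrase as incorrect is an assumption the post does not address.

```cpp
#include <cstddef>
#include <vector>

// A phrase passes when at least 50% of its words were judged correct.
// Assumption: an empty phrase is treated as incorrect.
bool phraseIsCorrect(const std::vector<bool>& wordCorrect) {
    if (wordCorrect.empty()) return false;
    std::size_t correct = 0;
    for (bool ok : wordCorrect) {
        if (ok) ++correct;
    }
    // Integer comparison avoids floating-point rounding at exactly 50%.
    return 2 * correct >= wordCorrect.size();
}
```

For a six-word phrase, three correct words are enough to pass; for a seven-word phrase, four are needed.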

    Since the correct words are averaged over a phrase, the accuracy of PocketSphinx can acceptably be in the 60% range, although 70-90% would be ideal. By far, the most important thing for the speech recognition is consistent real-time update intervals. These updates are used to give the user a rough indication of how well they are doing in their practice.

    To aid in tracking the crash bug, I have made my application save the recorded audio to a WAV file. Since I know the crash always occurs when processing the second utterance, it is trivial to close the file just before this utterance. I then ran the simulator enough times to get the crash to occur. I have attached a sample from one of the "crash runs". Unfortunately it doesn't appear to be anything special: just silence with some mechanical noise. This test did prove that the buffers themselves are not using invalid memory addresses.

    While testing, I did come across a new crash stack. It happens a lot less frequently than the one above, but perhaps it can reveal a pattern in the crashes:

    0 memcpy MSVCR100D 0x5f67c9c7
    1 fe_process_frames fe_interface.c 462 0x4635fe
    2 acmod_process_raw acmod.c 674 0x1002a797
    3 ps_process_raw pocketsphinx.c 769 0x1005bcb8
    4 SpeechRecognizer::RecognizeSpeech SpeechRecognizer.cpp 90 0x13e331b

    Thank you very much for taking the time to assist me with this. It has been a very frustrating issue and I'm not the type to ask for help very often. Does this information give you any ideas on what might be causing the crash?

     

    Last edit: Scorx Ion 2013-03-16
  • Nickolay V. Shmyrev

    Is it possible that this usage is what is causing the crash?

    No, a crash is usually caused by software logic, not by API usage patterns.

    Does this information give you any ideas on what might be causing the crash?

    This information is too limited. You need to provide sample code that reproduces the crash so that it can be debugged. Like I said, you can use memory checkers like Valgrind to investigate the cause of the crash.

     
