CMU Sphinx / Forums / Help: Forced alignment with a reference transcript using pocketsphinx

Hello,

I am trying to force-align Dutch audio with its accompanying transcript. However, the tool transcribes the audio on its own rather than using the reference transcript. How can I make it align against the transcript instead?

Edit: Ubuntu, pocketsphinx-5prealpha

I downloaded the Dutch files from here.

Contents of with-word.jsgf:

#JSGF V1.0;
grammar word;
public <s> = gezelschap die aan het eten is of die in een restaurant zit en iets willen gaan bestellen;
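
For reference, a grammar like the one above could be generated from the transcript text rather than typed by hand. The following is only a minimal sketch: the function name `transcript_to_jsgf` and its parameters are hypothetical, and it assumes the transcript is already free of punctuation and spelled exactly as the words appear in the dictionary.

```python
def transcript_to_jsgf(transcript: str, grammar_name: str = "word",
                       rule_name: str = "s") -> str:
    """Build a JSGF grammar whose single public rule is the transcript."""
    # Collapse all whitespace (including line breaks) into single spaces.
    words = transcript.split()
    if not words:
        raise ValueError("transcript is empty")
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <{rule_name}> = {' '.join(words)};\n"
    )


if __name__ == "__main__":
    text = ("gezelschap die aan het eten is of die in een restaurant zit "
            "en iets willen gaan bestellen")
    # Print the grammar; redirect to with-word.jsgf to save it.
    print(transcript_to_jsgf(text), end="")
```

With the default names this reproduces the three lines of with-word.jsgf shown above.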

The Dutch model seems to load and run without errors (although it misses some words during recognition), but the decoder is not using the actual transcript.

I have tried using the command below:

pocketsphinx_continuous \
  -lm nl-nl/voxforge_nl_sphinx.lm.bin \
  -dict nl-nl/voxforge_nl_sphinx.dic \
  -hmm nl-nl/nl-nl/ \
  -infile audio.wav \
  -jsgf with-word.jsgf \
  -time yes \
  -backtrace yes \
  -fsgusefiller no \
  -bestpath no \
  2>&1 > with-word.txt

Output of running this command:

INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from nl-nl/nl-nl//feat.params
Current configuration:
[NAME]          [DEFLT]     [VALUE]
-agc            none        none
-agcthresh      2.0     2.000000e+00
-allphone               
-allphone_ci        no      no
-alpha          0.97        9.700000e-01
-ascale         20.0        2.000000e+01
-aw         1       1
-backtrace      no      yes
-beam           1e-48       1.000000e-48
-bestpath       yes     no
-bestpathlw     9.5     9.500000e+00
-ceplen         13      13
-cmn            live        batch
-cmninit        40,3,-1     40,3,-1
-compallsen     no      no
-debug                  0
-dict                   nl-nl/voxforge_nl_sphinx.dic
-dictcase       no      no
-dither         no      no
-doublebw       no      no
-ds         1       1
-fdict                  
-feat           1s_c_d_dd   1s_c_d_dd
-featparams             
-fillprob       1e-8        1.000000e-08
-frate          100     100
-fsg                    
-fsgusealtpron      yes     yes
-fsgusefiller       yes     no
-fwdflat        yes     yes
-fwdflatbeam        1e-64       1.000000e-64
-fwdflatefwid       4       4
-fwdflatlw      8.5     8.500000e+00
-fwdflatsfwin       25      25
-fwdflatwbeam       7e-29       7.000000e-29
-fwdtree        yes     yes
-hmm                    nl-nl/nl-nl/
-input_endian       little      little
-jsgf                   with-word.jsgf
-keyphrase              
-kws                    
-kws_delay      10      10
-kws_plp        1e-1        1.000000e-01
-kws_threshold      1       1.000000e+00
-latsize        5000        5000
-lda                    
-ldadim         0       0
-lifter         0       22
-lm                 nl-nl/voxforge_nl_sphinx.lm.bin
-lmctl                  
-lmname                 
-logbase        1.0001      1.000100e+00
-logfn                  
-logspec        no      no
-lowerf         133.33334   1.300000e+02
-lpbeam         1e-40       1.000000e-40
-lponlybeam     7e-29       7.000000e-29
-lw         6.5     6.500000e+00
-maxhmmpf       30000       30000
-maxwpf         -1      -1
-mdef                   
-mean                   
-mfclogdir              
-min_endfr      0       0
-mixw                   
-mixwfloor      0.0000001   1.000000e-07
-mllr                   
-mmap           yes     yes
-ncep           13      13
-nfft           512     512
-nfilt          40      25
-nwpen          1.0     1.000000e+00
-pbeam          1e-48       1.000000e-48
-pip            1.0     1.000000e+00
-pl_beam        1e-10       1.000000e-10
-pl_pbeam       1e-10       1.000000e-10
-pl_pip         1.0     1.000000e+00
-pl_weight      3.0     3.000000e+00
-pl_window      5       5
-rawlogdir              
-remove_dc      no      no
-remove_noise       yes     yes
-remove_silence     yes     yes
-round_filters      yes     yes
-samprate       16000       1.600000e+04
-seed           -1      -1
-sendump                
-senlogdir              
-senmgau                
-silprob        0.005       5.000000e-03
-smoothspec     no      no
-svspec                 
-tmat                   
-tmatfloor      0.0001      1.000000e-04
-topn           4       4
-topn_beam      0       0
-toprule                
-transform      legacy      dct
-unit_area      yes     yes
-upperf         6855.4976   6.800000e+03
-uw         1.0     1.000000e+00
-vad_postspeech     50      50
-vad_prespeech      20      20
-vad_startspeech    10      10
-vad_threshold      2.0     2.000000e+00
-var                    
-varfloor       0.0001      1.000000e-04
-varnorm        no      no
-verbose        no      no
-warp_params                
-warp_type      inverse_linear  inverse_linear
-wbeam          7e-29       7.000000e-29
-wip            0.65        6.500000e-01
-wlen           0.025625    2.562500e-02

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: acmod.c(152): Reading linear feature transformation from nl-nl/nl-nl//feature_transform
INFO: mdef.c(518): Reading model definition: nl-nl/nl-nl//mdef
INFO: bin_mdef.c(181): Allocating 173395 * 8 bytes (1354 KiB) for CD tree
INFO: tmat.c(149): Reading HMM transition probability matrices: nl-nl/nl-nl//transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(304): 144 variance values floored
INFO: ptm_mgau.c(804): Number of codebooks exceeds 256: 2117
INFO: acmod.c(115): Attempting to use semi-continuous computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(304): 144 variance values floored
INFO: acmod.c(117): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
INFO: ms_gauden.c(244):  16x36
INFO: ms_gauden.c(304): 144 variance values floored
INFO: ms_senone.c(149): Reading senone mixture weights: nl-nl/nl-nl//mixture_weights
INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(207): Not transposing mixture weights in memory
INFO: ms_senone.c(268): Read mixture weights for 2117 senones: 1 features x 16 codewords
INFO: ms_senone.c(320): Mapping senones to individual codebooks
INFO: ms_mgau.c(144): The value of topn: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 1431496 * 32 bytes (44734 KiB) for word entries
WARN: "hash_table.c", line 150: Very large hash table requested (2147244 entries)
INFO: dict.c(333): Reading main dictionary: nl-nl/voxforge_nl_sphinx.dic
INFO: dict.c(213): Dictionary size 1427397, allocated 15981 KiB for strings, 28425 KiB for phones
INFO: dict.c(336): 1427397 words read
INFO: dict.c(358): Reading filler dictionary: nl-nl/nl-nl//noisedict
INFO: dict.c(213): Dictionary size 1427400, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 3 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 39^3 * 2 bytes (115 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 36816 bytes (35 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 36816 bytes (35 KiB) for single-phone word triphones
INFO: jsgf.c(706): Defined rule: PUBLIC <word.s>
INFO: fsg_model.c(208): Computing transitive closure for null transitions
INFO: fsg_model.c(270): 0 null transitions added
INFO: fsg_search.c(227): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26, pip: 0)
INFO: fsg_search.c(173): Added 0 alternate word transitions
INFO: fsg_lextree.c(110): Allocated 1440 bytes (1 KiB) for left and right context phones
INFO: fsg_lextree.c(256): 79 HMM nodes in lextree (33 leaves)
INFO: fsg_lextree.c(259): Allocated 11376 bytes (11 KiB) for all lextree nodes
INFO: fsg_lextree.c(262): Allocated 4752 bytes (4 KiB) for lextree leafnodes
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 1177 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 138548
INFO: ngram_search_fwdtree.c(333): Created 764 root, 138420 non-root channels, 72 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: fsg_search.c(265): TOTAL fsg 0.00 CPU -nan xRT
INFO: fsg_search.c(268): TOTAL fsg 0.00 wall -nan xRT
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 29 2019, AT: 17:32:51

INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
INFO: ngram_search.c(467): Resized score stack to 200000 entries
INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
INFO: ngram_search.c(467): Resized score stack to 400000 entries
INFO: ngram_search.c(459): Resized backpointer table to 40000 entries
INFO: ngram_search.c(467): Resized score stack to 800000 entries
INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_live.c(138): Update to   < 41.02 -3.13  6.26 19.42 -5.71  4.42 -9.63  1.27 -7.67 -0.53 -3.03 -3.69  5.36 >
INFO: ngram_search_fwdtree.c(1550):    23040 words recognized (43/fr)
INFO: ngram_search_fwdtree.c(1552):   716841 senones evaluated (1340/fr)
INFO: ngram_search_fwdtree.c(1556):  3886682 channels searched (7264/fr), 290681 1st, 420516 last
INFO: ngram_search_fwdtree.c(1559):    36434 words for which last channels evaluated (68/fr)
INFO: ngram_search_fwdtree.c(1561):   241436 candidate words for entering last phone (451/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 1.49 CPU 0.279 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.49 wall 0.279 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 1164 words
INFO: ngram_search_fwdflat.c(948):    18063 words recognized (34/fr)
INFO: ngram_search_fwdflat.c(950):   364238 senones evaluated (681/fr)
INFO: ngram_search_fwdflat.c(952):  1221580 channels searched (2283/fr)
INFO: ngram_search_fwdflat.c(954):   107022 words searched (200/fr)
INFO: ngram_search_fwdflat.c(957):    72062 word transitions (134/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.62 CPU 0.115 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.62 wall 0.115 xRT
INFO: pocketsphinx.c(1168): je zelfs in een die het deed die het wie restaurant het ritje instellen (-26835)
word                 start end   pprob ascr       lscr       lback
<s>                  49    87    1.000 -393       0          0  
je                   88    99    1.000 -197       -286       2  
zelfs                100   130   1.000 -829       -831       2  
in                   131   136   1.000 -375       -343       2  
een                  137   152   1.000 -757       -236       3  
die                  153   166   1.000 -437       -592       2  
het                  167   185   1.000 -769       -385       2  
deed                 186   207   1.000 -960       -393       3  
die                  208   262   1.000 -1706      -500       2  
het                  263   298   1.000 -1463      -436       2  
<s>                  299   304   1.000 18735      -19094     1  
wie                  305   341   1.000 -967       -444       2  
<s>                  342   366   1.000 18617      -19198     1  
restaurant           367   435   1.000 -3149      -921       2  
het                  436   451   1.000 -571       -75        3  
ritje                452   498   1.000 -2196      -805       2  
instellen            499   565   1.000 -2803      -1012      1  
</s>                 566   581   1.000 -728       -104       2  
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.49 CPU 0.279 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.49 wall 0.279 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.62 CPU 0.115 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.62 wall 0.115 xRT
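
To turn the backtrace table above into word timings, the rows can be pulled out of the log and the frame indices converted to seconds. This is a rough sketch, not part of pocketsphinx: `parse_backtrace` and the `ROW` pattern are hypothetical helpers, the frame rate of 100 per second is taken from the `-frate` value in the configuration dump, and the end frame is assumed to be inclusive because each row starts one frame after the previous row ends.

```python
import re

# One backtrace row: word, start frame, end frame, pprob, ascr, lscr, lback.
ROW = re.compile(
    r"^(\S+)\s+(\d+)\s+(\d+)\s+[\d.]+\s+-?\d+\s+-?\d+\s+\d+\s*$"
)


def parse_backtrace(log_text, frate=100):
    """Return (word, start_seconds, end_seconds) for each backtrace row."""
    segments = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            # Skip INFO lines, the table header, and anything else.
            continue
        word = match.group(1)
        start_frame = int(match.group(2))
        end_frame = int(match.group(3))
        # End frame is treated as inclusive, so the segment ends one frame later.
        segments.append((word, start_frame / frate, (end_frame + 1) / frate))
    return segments


if __name__ == "__main__":
    sample = (
        "word                 start end   pprob ascr       lscr       lback\n"
        "<s>                  49    87    1.000 -393       0          0  \n"
        "je                   88    99    1.000 -197       -286       2  \n"
    )
    for word, start, end in parse_backtrace(sample):
        print(f"{word}\t{start:.2f}\t{end:.2f}")
```

Silence and sentence markers such as `<s>` and `</s>` are kept in the result; they can be filtered out afterwards if only the spoken words are needed.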

Best regards.

Last edit: Ece T 2019-04-02
