
Forced alignment with a reference transcript using pocketsphinx_continuous

Ece T
2019-04-02
  • Ece T

    Ece T - 2019-04-02

    Hello,

    I am trying to force-align Dutch audio with its accompanying reference transcript. However, the tool transcribes the audio itself instead of using the reference transcript. How can I make it align against the transcript?

    Edit: Ubuntu, pocketsphinx-5prealpha

    I downloaded the Dutch files from here.

    Contents of with-word.jsgf:

    #JSGF V1.0;
    grammar word;
    public <s> = gezelschap die aan het eten is of die in een restaurant zit en iets willen gaan bestellen;
    

    I think the Dutch model works without an error (although it misses some words in speech-to-text), but it is not using the actual transcript.

    I have tried using the command below:

    pocketsphinx_continuous \
        -lm nl-nl/voxforge_nl_sphinx.lm.bin \
        -dict nl-nl/voxforge_nl_sphinx.dic \
        -hmm nl-nl/nl-nl/ \
        -infile audio.wav \
        -jsgf with-word.jsgf \
        -time yes \
        -backtrace yes \
        -fsgusefiller no \
        -bestpath no \
        2>&1 > with-word.txt
    

    Output of running this command:

    INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from nl-nl/nl-nl//feat.params
    Current configuration:
    [NAME]          [DEFLT]     [VALUE]
    -agc            none        none
    -agcthresh      2.0     2.000000e+00
    -allphone               
    -allphone_ci        no      no
    -alpha          0.97        9.700000e-01
    -ascale         20.0        2.000000e+01
    -aw         1       1
    -backtrace      no      yes
    -beam           1e-48       1.000000e-48
    -bestpath       yes     no
    -bestpathlw     9.5     9.500000e+00
    -ceplen         13      13
    -cmn            live        batch
    -cmninit        40,3,-1     40,3,-1
    -compallsen     no      no
    -debug                  0
    -dict                   nl-nl/voxforge_nl_sphinx.dic
    -dictcase       no      no
    -dither         no      no
    -doublebw       no      no
    -ds         1       1
    -fdict                  
    -feat           1s_c_d_dd   1s_c_d_dd
    -featparams             
    -fillprob       1e-8        1.000000e-08
    -frate          100     100
    -fsg                    
    -fsgusealtpron      yes     yes
    -fsgusefiller       yes     no
    -fwdflat        yes     yes
    -fwdflatbeam        1e-64       1.000000e-64
    -fwdflatefwid       4       4
    -fwdflatlw      8.5     8.500000e+00
    -fwdflatsfwin       25      25
    -fwdflatwbeam       7e-29       7.000000e-29
    -fwdtree        yes     yes
    -hmm                    nl-nl/nl-nl/
    -input_endian       little      little
    -jsgf                   with-word.jsgf
    -keyphrase              
    -kws                    
    -kws_delay      10      10
    -kws_plp        1e-1        1.000000e-01
    -kws_threshold      1       1.000000e+00
    -latsize        5000        5000
    -lda                    
    -ldadim         0       0
    -lifter         0       22
    -lm                 nl-nl/voxforge_nl_sphinx.lm.bin
    -lmctl                  
    -lmname                 
    -logbase        1.0001      1.000100e+00
    -logfn                  
    -logspec        no      no
    -lowerf         133.33334   1.300000e+02
    -lpbeam         1e-40       1.000000e-40
    -lponlybeam     7e-29       7.000000e-29
    -lw         6.5     6.500000e+00
    -maxhmmpf       30000       30000
    -maxwpf         -1      -1
    -mdef                   
    -mean                   
    -mfclogdir              
    -min_endfr      0       0
    -mixw                   
    -mixwfloor      0.0000001   1.000000e-07
    -mllr                   
    -mmap           yes     yes
    -ncep           13      13
    -nfft           512     512
    -nfilt          40      25
    -nwpen          1.0     1.000000e+00
    -pbeam          1e-48       1.000000e-48
    -pip            1.0     1.000000e+00
    -pl_beam        1e-10       1.000000e-10
    -pl_pbeam       1e-10       1.000000e-10
    -pl_pip         1.0     1.000000e+00
    -pl_weight      3.0     3.000000e+00
    -pl_window      5       5
    -rawlogdir              
    -remove_dc      no      no
    -remove_noise       yes     yes
    -remove_silence     yes     yes
    -round_filters      yes     yes
    -samprate       16000       1.600000e+04
    -seed           -1      -1
    -sendump                
    -senlogdir              
    -senmgau                
    -silprob        0.005       5.000000e-03
    -smoothspec     no      no
    -svspec                 
    -tmat                   
    -tmatfloor      0.0001      1.000000e-04
    -topn           4       4
    -topn_beam      0       0
    -toprule                
    -transform      legacy      dct
    -unit_area      yes     yes
    -upperf         6855.4976   6.800000e+03
    -uw         1.0     1.000000e+00
    -vad_postspeech     50      50
    -vad_prespeech      20      20
    -vad_startspeech    10      10
    -vad_threshold      2.0     2.000000e+00
    -var                    
    -varfloor       0.0001      1.000000e-04
    -varnorm        no      no
    -verbose        no      no
    -warp_params                
    -warp_type      inverse_linear  inverse_linear
    -wbeam          7e-29       7.000000e-29
    -wip            0.65        6.500000e-01
    -wlen           0.025625    2.562500e-02
    
    INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
    INFO: acmod.c(152): Reading linear feature transformation from nl-nl/nl-nl//feature_transform
    INFO: mdef.c(518): Reading model definition: nl-nl/nl-nl//mdef
    INFO: bin_mdef.c(181): Allocating 173395 * 8 bytes (1354 KiB) for CD tree
    INFO: tmat.c(149): Reading HMM transition probability matrices: nl-nl/nl-nl//transition_matrices
    INFO: acmod.c(113): Attempting to use PTM computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(304): 144 variance values floored
    INFO: ptm_mgau.c(804): Number of codebooks exceeds 256: 2117
    INFO: acmod.c(115): Attempting to use semi-continuous computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(304): 144 variance values floored
    INFO: acmod.c(117): Falling back to general multi-stream GMM computation
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances
    INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x36
    INFO: ms_gauden.c(304): 144 variance values floored
    INFO: ms_senone.c(149): Reading senone mixture weights: nl-nl/nl-nl//mixture_weights
    INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
    INFO: ms_senone.c(207): Not transposing mixture weights in memory
    INFO: ms_senone.c(268): Read mixture weights for 2117 senones: 1 features x 16 codewords
    INFO: ms_senone.c(320): Mapping senones to individual codebooks
    INFO: ms_mgau.c(144): The value of topn: 4
    INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    INFO: dict.c(320): Allocating 1431496 * 32 bytes (44734 KiB) for word entries
    WARN: "hash_table.c", line 150: Very large hash table requested (2147244 entries)
    INFO: dict.c(333): Reading main dictionary: nl-nl/voxforge_nl_sphinx.dic
    INFO: dict.c(213): Dictionary size 1427397, allocated 15981 KiB for strings, 28425 KiB for phones
    INFO: dict.c(336): 1427397 words read
    INFO: dict.c(358): Reading filler dictionary: nl-nl/nl-nl//noisedict
    INFO: dict.c(213): Dictionary size 1427400, allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(361): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(406): Allocating 39^3 * 2 bytes (115 KiB) for word-initial triphones
    INFO: dict2pid.c(132): Allocated 36816 bytes (35 KiB) for word-final triphones
    INFO: dict2pid.c(196): Allocated 36816 bytes (35 KiB) for single-phone word triphones
    INFO: jsgf.c(706): Defined rule: PUBLIC <word.s>
    INFO: fsg_model.c(208): Computing transitive closure for null transitions
    INFO: fsg_model.c(270): 0 null transitions added
    INFO: fsg_search.c(227): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26, pip: 0)
    INFO: fsg_search.c(173): Added 0 alternate word transitions
    INFO: fsg_lextree.c(110): Allocated 1440 bytes (1 KiB) for left and right context phones
    INFO: fsg_lextree.c(256): 79 HMM nodes in lextree (33 leaves)
    INFO: fsg_lextree.c(259): Allocated 11376 bytes (11 KiB) for all lextree nodes
    INFO: fsg_lextree.c(262): Allocated 4752 bytes (4 KiB) for lextree leafnodes
    INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
    INFO: ngram_search_fwdtree.c(74): Initializing search tree
    INFO: ngram_search_fwdtree.c(101): 1177 unique initial diphones
    INFO: ngram_search_fwdtree.c(186): Creating search channels
    INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 138548
    INFO: ngram_search_fwdtree.c(333): Created 764 root, 138420 non-root channels, 72 single-phone words
    INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: fsg_search.c(265): TOTAL fsg 0.00 CPU -nan xRT
    INFO: fsg_search.c(268): TOTAL fsg 0.00 wall -nan xRT
    INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 29 2019, AT: 17:32:51
    
    INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
    INFO: ngram_search.c(467): Resized score stack to 200000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
    INFO: ngram_search.c(467): Resized score stack to 400000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 40000 entries
    INFO: ngram_search.c(467): Resized score stack to 800000 entries
    INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
    INFO: cmn_live.c(138): Update to   < 41.02 -3.13  6.26 19.42 -5.71  4.42 -9.63  1.27 -7.67 -0.53 -3.03 -3.69  5.36 >
    INFO: ngram_search_fwdtree.c(1550):    23040 words recognized (43/fr)
    INFO: ngram_search_fwdtree.c(1552):   716841 senones evaluated (1340/fr)
    INFO: ngram_search_fwdtree.c(1556):  3886682 channels searched (7264/fr), 290681 1st, 420516 last
    INFO: ngram_search_fwdtree.c(1559):    36434 words for which last channels evaluated (68/fr)
    INFO: ngram_search_fwdtree.c(1561):   241436 candidate words for entering last phone (451/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 1.49 CPU 0.279 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 1.49 wall 0.279 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 1164 words
    INFO: ngram_search_fwdflat.c(948):    18063 words recognized (34/fr)
    INFO: ngram_search_fwdflat.c(950):   364238 senones evaluated (681/fr)
    INFO: ngram_search_fwdflat.c(952):  1221580 channels searched (2283/fr)
    INFO: ngram_search_fwdflat.c(954):   107022 words searched (200/fr)
    INFO: ngram_search_fwdflat.c(957):    72062 word transitions (134/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.62 CPU 0.115 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.62 wall 0.115 xRT
    INFO: pocketsphinx.c(1168): je zelfs in een die het deed die het wie restaurant het ritje instellen (-26835)
    word                 start end   pprob ascr       lscr       lback
    <s>                  49    87    1.000 -393       0          0  
    je                   88    99    1.000 -197       -286       2  
    zelfs                100   130   1.000 -829       -831       2  
    in                   131   136   1.000 -375       -343       2  
    een                  137   152   1.000 -757       -236       3  
    die                  153   166   1.000 -437       -592       2  
    het                  167   185   1.000 -769       -385       2  
    deed                 186   207   1.000 -960       -393       3  
    die                  208   262   1.000 -1706      -500       2  
    het                  263   298   1.000 -1463      -436       2  
    <s>                  299   304   1.000 18735      -19094     1  
    wie                  305   341   1.000 -967       -444       2  
    <s>                  342   366   1.000 18617      -19198     1  
    restaurant           367   435   1.000 -3149      -921       2  
    het                  436   451   1.000 -571       -75        3  
    ritje                452   498   1.000 -2196      -805       2  
    instellen            499   565   1.000 -2803      -1012      1  
    </s>                 566   581   1.000 -728       -104       2  
    INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.49 CPU 0.279 xRT
    INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.49 wall 0.279 xRT
    INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.62 CPU 0.115 xRT
    INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.62 wall 0.115 xRT
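
The start and end columns in the backtrace table are frame indices, and the configuration dump reports -frate 100, so dividing by 100 should give approximate times in seconds. A minimal sketch of that conversion follows. It assumes the frame indices count from the start of the audio file; the file names backtrace-sample.txt and word-times.txt are made up for illustration, and the two data rows are copied from the table above.

```shell
#!/bin/sh
# Two sample rows copied from the backtrace table (word, start frame, end frame, ...).
cat > backtrace-sample.txt <<'EOF'
word                 start end   pprob ascr       lscr       lback
je                   88    99    1.000 -197       -286       2
restaurant           367   435   1.000 -3149      -921       2
EOF

# -frate is 100 in the configuration dump, so one frame is taken as 1/100 s.
# NR > 1 skips the header row; LC_ALL=C keeps "." as the decimal separator.
LC_ALL=C awk -v frate=100 \
    'NR > 1 { printf "%s %.2f %.2f\n", $1, $2 / frate, $3 / frate }' \
    backtrace-sample.txt > word-times.txt

cat word-times.txt
```

For a full run, the same awk line could be pointed at the saved decoder output instead of the sample file, after filtering out the INFO lines.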
    

    Best regards.

     

    Last edit: Ece T 2019-04-02
  • Ece T

    Ece T - 2019-04-02

    I realized that removing the -lm parameter and passing only the JSGF file with -jsgf forces the tool to use the transcript in the grammar. It seems to work now!
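
For anyone landing here later, the fix described above can be sketched as a small script. With both -lm and -jsgf given, the log in the first post shows the n-gram search (ngram_search_fwdtree, ngram_search_fwdflat) doing the decoding, which matches the free transcription that came out. The sketch below writes the grammar from the transcript and then runs the same command without -lm. The paths and flags are the ones from the original post; the guard that checks for the binary and audio.wav is an addition for illustration, and the output redirection is written so that both stdout and stderr go to the file, which is an assumption about what was intended.

```shell
#!/bin/sh
# Reference transcript, as used in with-word.jsgf in the original post.
transcript="gezelschap die aan het eten is of die in een restaurant zit en iets willen gaan bestellen"

# Write a one-rule JSGF grammar that accepts only the transcript.
printf '#JSGF V1.0;\ngrammar word;\npublic <s> = %s;\n' "$transcript" > with-word.jsgf

# Decode with the grammar only (no -lm). Skipped if the decoder or audio is missing.
if command -v pocketsphinx_continuous >/dev/null 2>&1 && [ -f audio.wav ]; then
    pocketsphinx_continuous \
        -hmm nl-nl/nl-nl/ \
        -dict nl-nl/voxforge_nl_sphinx.dic \
        -infile audio.wav \
        -jsgf with-word.jsgf \
        -time yes \
        -backtrace yes \
        -fsgusefiller no \
        -bestpath no \
        > with-word.txt 2>&1
fi
```

Every word in the transcript still has to be present in the dictionary passed with -dict, since the grammar can only use words the decoder has pronunciations for.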

     
