Hello,
I am trying to force-align Dutch audio with its accompanying transcripts. However, the tool transcribes the audio itself instead of using the reference transcript. How can I make it align against the transcript?
Edit: Ubuntu, pocketsphinx-5prealpha
I downloaded the Dutch files from here.
Contents of with-word.jsgf:
#JSGF V1.0;
grammar word;
public <s> = gezelschap die aan het eten is of die in een restaurant zit en iets willen gaan bestellen;
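For other recordings, a grammar of this shape could be generated from the transcript text instead of being typed by hand. The following is only a sketch: the file name transcript.txt, the lowercasing, and the punctuation list are assumptions, not part of the original setup. Every word in the rule presumably still has to exist in the dictionary passed with -dict.

```shell
#!/bin/sh
# Sketch: wrap a one-line transcript in a single-rule JSGF grammar.
# transcript.txt is a hypothetical input file; with-word.jsgf is the name used in this thread.
transcript_file="transcript.txt"
grammar_file="with-word.jsgf"

# Create a sample transcript (the sentence from this thread) if none exists yet.
[ -f "$transcript_file" ] || printf '%s\n' \
  "Gezelschap die aan het eten is, of die in een restaurant zit en iets willen gaan bestellen." \
  > "$transcript_file"

# Assumption: dictionary entries are lowercase and carry no punctuation.
# Lowercase ASCII letters, drop common punctuation, collapse whitespace to single spaces.
words=$(tr '[:upper:]' '[:lower:]' < "$transcript_file" \
  | tr -d '.,;:!?"()' \
  | tr -s '[:space:]' ' ' \
  | sed 's/^ //; s/ $//')

# Write the three-line grammar: header, grammar name, one public rule.
{
  printf '#JSGF V1.0;\n'
  printf 'grammar word;\n'
  printf 'public <s> = %s;\n' "$words"
} > "$grammar_file"

cat "$grammar_file"
```

The single public rule lists the words in order, so the decoder can only follow the transcript's word sequence.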
The Dutch model seems to load and run without errors (although it misses some words in speech-to-text), but the output does not follow the actual transcript.
I have tried using the command below:
pocketsphinx_continuous -lm nl-nl/voxforge_nl_sphinx.lm.bin -dict nl-nl/voxforge_nl_sphinx.dic -hmm nl-nl/nl-nl/ -infile audio.wav -jsgf with-word.jsgf -time yes -backtrace yes -fsgusefiller no -bestpath no 2>&1 > with-word.txt
Output of running this command:
INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from nl-nl/nl-nl//feat.params Current configuration: [NAME] [DEFLT] [VALUE] -agc none none -agcthresh 2.0 2.000000e+00 -allphone -allphone_ci no no -alpha 0.97 9.700000e-01 -ascale 20.0 2.000000e+01 -aw 1 1 -backtrace no yes -beam 1e-48 1.000000e-48 -bestpath yes no -bestpathlw 9.5 9.500000e+00 -ceplen 13 13 -cmn live batch -cmninit 40,3,-1 40,3,-1 -compallsen no no -debug 0 -dict nl-nl/voxforge_nl_sphinx.dic -dictcase no no -dither no no -doublebw no no -ds 1 1 -fdict -feat 1s_c_d_dd 1s_c_d_dd -featparams -fillprob 1e-8 1.000000e-08 -frate 100 100 -fsg -fsgusealtpron yes yes -fsgusefiller yes no -fwdflat yes yes -fwdflatbeam 1e-64 1.000000e-64 -fwdflatefwid 4 4 -fwdflatlw 8.5 8.500000e+00 -fwdflatsfwin 25 25 -fwdflatwbeam 7e-29 7.000000e-29 -fwdtree yes yes -hmm nl-nl/nl-nl/ -input_endian little little -jsgf with-word.jsgf -keyphrase -kws -kws_delay 10 10 -kws_plp 1e-1 1.000000e-01 -kws_threshold 1 1.000000e+00 -latsize 5000 5000 -lda -ldadim 0 0 -lifter 0 22 -lm nl-nl/voxforge_nl_sphinx.lm.bin -lmctl -lmname -logbase 1.0001 1.000100e+00 -logfn -logspec no no -lowerf 133.33334 1.300000e+02 -lpbeam 1e-40 1.000000e-40 -lponlybeam 7e-29 7.000000e-29 -lw 6.5 6.500000e+00 -maxhmmpf 30000 30000 -maxwpf -1 -1 -mdef -mean -mfclogdir -min_endfr 0 0 -mixw -mixwfloor 0.0000001 1.000000e-07 -mllr -mmap yes yes -ncep 13 13 -nfft 512 512 -nfilt 40 25 -nwpen 1.0 1.000000e+00 -pbeam 1e-48 1.000000e-48 -pip 1.0 1.000000e+00 -pl_beam 1e-10 1.000000e-10 -pl_pbeam 1e-10 1.000000e-10 -pl_pip 1.0 1.000000e+00 -pl_weight 3.0 3.000000e+00 -pl_window 5 5 -rawlogdir -remove_dc no no -remove_noise yes yes -remove_silence yes yes -round_filters yes yes -samprate 16000 1.600000e+04 -seed -1 -1 -sendump -senlogdir -senmgau -silprob 0.005 5.000000e-03 -smoothspec no no -svspec -tmat -tmatfloor 0.0001 1.000000e-04 -topn 4 4 -topn_beam 0 0 -toprule -transform legacy dct -unit_area yes yes -upperf 6855.4976 6.800000e+03 -uw 1.0 
1.000000e+00 -vad_postspeech 50 50 -vad_prespeech 20 20 -vad_startspeech 10 10 -vad_threshold 2.0 2.000000e+00 -var -varfloor 0.0001 1.000000e-04 -varnorm no no -verbose no no -warp_params -warp_type inverse_linear inverse_linear -wbeam 7e-29 7.000000e-29 -wip 0.65 6.500000e-01 -wlen 0.025625 2.562500e-02 INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none' INFO: acmod.c(152): Reading linear feature transformation from nl-nl/nl-nl//feature_transform INFO: mdef.c(518): Reading model definition: nl-nl/nl-nl//mdef INFO: bin_mdef.c(181): Allocating 173395 * 8 bytes (1354 KiB) for CD tree INFO: tmat.c(149): Reading HMM transition probability matrices: nl-nl/nl-nl//transition_matrices INFO: acmod.c(113): Attempting to use PTM computation module INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(304): 144 variance values floored INFO: ptm_mgau.c(804): Number of codebooks exceeds 256: 2117 INFO: acmod.c(115): Attempting to use semi-continuous computation module INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//variances INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(304): 144 variance values floored INFO: acmod.c(117): Falling back to general multi-stream GMM computation INFO: ms_gauden.c(127): Reading mixture gaussian parameter: nl-nl/nl-nl//means INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(127): Reading mixture 
gaussian parameter: nl-nl/nl-nl//variances INFO: ms_gauden.c(242): 2117 codebook, 1 feature, size: INFO: ms_gauden.c(244): 16x36 INFO: ms_gauden.c(304): 144 variance values floored INFO: ms_senone.c(149): Reading senone mixture weights: nl-nl/nl-nl//mixture_weights INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits INFO: ms_senone.c(207): Not transposing mixture weights in memory INFO: ms_senone.c(268): Read mixture weights for 2117 senones: 1 features x 16 codewords INFO: ms_senone.c(320): Mapping senones to individual codebooks INFO: ms_mgau.c(144): The value of topn: 4 INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0 INFO: dict.c(320): Allocating 1431496 * 32 bytes (44734 KiB) for word entries WARN: "hash_table.c", line 150: Very large hash table requested (2147244 entries) INFO: dict.c(333): Reading main dictionary: nl-nl/voxforge_nl_sphinx.dic INFO: dict.c(213): Dictionary size 1427397, allocated 15981 KiB for strings, 28425 KiB for phones INFO: dict.c(336): 1427397 words read INFO: dict.c(358): Reading filler dictionary: nl-nl/nl-nl//noisedict INFO: dict.c(213): Dictionary size 1427400, allocated 0 KiB for strings, 0 KiB for phones INFO: dict.c(361): 3 words read INFO: dict2pid.c(396): Building PID tables for dictionary INFO: dict2pid.c(406): Allocating 39^3 * 2 bytes (115 KiB) for word-initial triphones INFO: dict2pid.c(132): Allocated 36816 bytes (35 KiB) for word-final triphones INFO: dict2pid.c(196): Allocated 36816 bytes (35 KiB) for single-phone word triphones INFO: jsgf.c(706): Defined rule: PUBLIC <word.s> INFO: fsg_model.c(208): Computing transitive closure for null transitions INFO: fsg_model.c(270): 0 null transitions added INFO: fsg_search.c(227): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26, pip: 0) INFO: fsg_search.c(173): Added 0 alternate word transitions INFO: fsg_lextree.c(110): Allocated 1440 bytes (1 KiB) for left and right context phones INFO: fsg_lextree.c(256): 79 HMM 
nodes in lextree (33 leaves) INFO: fsg_lextree.c(259): Allocated 11376 bytes (11 KiB) for all lextree nodes INFO: fsg_lextree.c(262): Allocated 4752 bytes (4 KiB) for lextree leafnodes INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format INFO: ngram_search_fwdtree.c(74): Initializing search tree INFO: ngram_search_fwdtree.c(101): 1177 unique initial diphones INFO: ngram_search_fwdtree.c(186): Creating search channels INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 138548 INFO: ngram_search_fwdtree.c(333): Created 764 root, 138420 non-root channels, 72 single-phone words INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25 INFO: fsg_search.c(265): TOTAL fsg 0.00 CPU -nan xRT INFO: fsg_search.c(268): TOTAL fsg 0.00 wall -nan xRT INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 29 2019, AT: 17:32:51 INFO: ngram_search.c(459): Resized backpointer table to 10000 entries INFO: ngram_search.c(467): Resized score stack to 200000 entries INFO: ngram_search.c(459): Resized backpointer table to 20000 entries INFO: ngram_search.c(467): Resized score stack to 400000 entries INFO: ngram_search.c(459): Resized backpointer table to 40000 entries INFO: ngram_search.c(467): Resized score stack to 800000 entries INFO: cmn_live.c(120): Update from < 40.00 3.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > INFO: cmn_live.c(138): Update to < 41.02 -3.13 6.26 19.42 -5.71 4.42 -9.63 1.27 -7.67 -0.53 -3.03 -3.69 5.36 > INFO: ngram_search_fwdtree.c(1550): 23040 words recognized (43/fr) INFO: ngram_search_fwdtree.c(1552): 716841 senones evaluated (1340/fr) INFO: ngram_search_fwdtree.c(1556): 3886682 channels searched (7264/fr), 290681 1st, 420516 last INFO: ngram_search_fwdtree.c(1559): 36434 words for which last channels evaluated (68/fr) INFO: ngram_search_fwdtree.c(1561): 241436 candidate words for entering last phone (451/fr) INFO: ngram_search_fwdtree.c(1564): fwdtree 1.49 CPU 0.279 xRT INFO: 
ngram_search_fwdtree.c(1567): fwdtree 1.49 wall 0.279 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 1164 words
INFO: ngram_search_fwdflat.c(948): 18063 words recognized (34/fr)
INFO: ngram_search_fwdflat.c(950): 364238 senones evaluated (681/fr)
INFO: ngram_search_fwdflat.c(952): 1221580 channels searched (2283/fr)
INFO: ngram_search_fwdflat.c(954): 107022 words searched (200/fr)
INFO: ngram_search_fwdflat.c(957): 72062 word transitions (134/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.62 CPU 0.115 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.62 wall 0.115 xRT
INFO: pocketsphinx.c(1168): je zelfs in een die het deed die het wie restaurant het ritje instellen (-26835)
word        start  end  pprob  ascr   lscr    lback
<s>         49     87   1.000  -393   0       0
je          88     99   1.000  -197   -286    2
zelfs       100    130  1.000  -829   -831    2
in          131    136  1.000  -375   -343    2
een         137    152  1.000  -757   -236    3
die         153    166  1.000  -437   -592    2
het         167    185  1.000  -769   -385    2
deed        186    207  1.000  -960   -393    3
die         208    262  1.000  -1706  -500    2
het         263    298  1.000  -1463  -436    2
<s>         299    304  1.000  18735  -19094  1
wie         305    341  1.000  -967   -444    2
<s>         342    366  1.000  18617  -19198  1
restaurant  367    435  1.000  -3149  -921    2
het         436    451  1.000  -571   -75     3
ritje       452    498  1.000  -2196  -805    2
instellen   499    565  1.000  -2803  -1012   1
</s>        566    581  1.000  -728   -104    2
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.49 CPU 0.279 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.49 wall 0.279 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.62 CPU 0.115 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.62 wall 0.115 xRT
Best regards.
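I realized that removing the -lm parameter and passing only the JSGF file with -jsgf forces the tool to use the transcript in the grammar. With both options given, the log above shows the n-gram search (ngram_search_fwdtree, ngram_search_fwdflat) doing the decoding, while the FSG search reports 0.00 CPU. It seems to work now!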
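For reference, the adjusted invocation might look like the sketch below. It is the command from the question with only -lm dropped; the paths are the ones used in this thread and may differ on other setups. The guard around the actual call is an addition for illustration, so the script can be read or dry-run without the tool or the audio file present.

```shell
#!/bin/sh
# Same options as in the question, minus -lm, so only the JSGF grammar drives the search.
# Paths are taken from this thread; adjust them to your own layout.
set -- \
  -dict nl-nl/voxforge_nl_sphinx.dic \
  -hmm nl-nl/nl-nl/ \
  -infile audio.wav \
  -jsgf with-word.jsgf \
  -time yes -backtrace yes \
  -fsgusefiller no -bestpath no

if command -v pocketsphinx_continuous >/dev/null 2>&1 && [ -f audio.wav ]; then
  # Standard output goes to with-word.txt; the log stays on stderr.
  pocketsphinx_continuous "$@" > with-word.txt
else
  # Dry run: show the command that would be executed.
  echo "would run: pocketsphinx_continuous $*"
fi
```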
Last edit: Ece T 2019-04-02