From: Daniel P. <dp...@gm...> - 2015-02-23 21:16:30
I notice quite a few people are signed up to the kaldi-developers list and it doesn't get very much traffic, so I thought I should send out some updates to let people know what's going on.

There has been some email traffic about LSTMs -- a guy called Jiayu Du from Baidu has been working with Karel and others to implement them in Karel's setup. As far as I know they are not yet working better than the baseline; it seems LSTMs need quite a bit of tuning to get them to work. Also the training is pretty slow, so we can't train them on a lot of data yet.

In the last few months, in the nnet2 setup we have been using an architecture we call "multi-splice", where splicing over time happens not just at the start of the network but at multiple layers. Typically we splice 2 frames, but not adjacent ones, and the separation in time increases for higher layers of the network (and the last layer or two typically don't have splicing). We modified the training code so it only evaluates the frames that it needs to evaluate (i.e. to allow gaps in time). This is usually giving improvements over our previous (p-norm) numbers, typically 5% relative or so. In time we plan to modify all the nnet2 recipes to use this.
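To make the "only evaluate the frames you need" point concrete, here is a small Python sketch (an illustration only, not Kaldi code; the per-layer splice offsets are made-up examples, not the offsets of any particular recipe). It works down from a single output frame and shows that the higher layers only need a sparse, gappy set of frames:

# Illustration only, not Kaldi code.  The splice offsets below are made-up
# examples in the spirit of the multi-splice setup: conventional splicing at
# the input, then 2 non-adjacent frames per layer with growing separation,
# and no splicing at the top.
splice_offsets = [
    [-2, -1, 0, 1, 2],   # layer 1 (input splicing)
    [-1, 2],             # layer 2
    [-3, 3],             # layer 3
    [-7, 2],             # layer 4
    [0],                 # layer 5 (no splicing)
]

def frames_needed(offsets_per_layer):
    """Working down from a single output frame, compute which frame offsets
    must be evaluated at the input of each layer.  The first element is the
    set of raw input frames, later elements are the hidden layers' inputs;
    frames in the gaps never need to be computed."""
    needed = [{0}]                       # the single output frame we want
    for offsets in reversed(offsets_per_layer):
        needed.append({t + o for t in needed[-1] for o in offsets})
    return [sorted(s) for s in reversed(needed)][:-1]

for layer, frames in enumerate(frames_needed(splice_offsets), start=1):
    span = frames[-1] - frames[0] + 1
    print("layer %d input: %2d frames needed out of a %2d-frame span: %s"
          % (layer, len(frames), span, frames))

With these example offsets the layer just above the input only needs 7 of the 19 frames it spans, and the fraction of frames that can be skipped grows with the separations used at the higher layers.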
Vijay Peddinti (who was also involved in the multi-splice work) submitted a Kaldi-based system to the ASPIRE challenge (ASPIRE is about reverberant speech in varying microphone conditions, and Fisher is the only training data you are allowed to use). We were just behind the top system (BBN), but ours was a single system and we believe theirs was a big combination. Our system was an online-nnet2 system, i.e. with iVectors as input as well as unadapted 40-dimensional MFCC features. Vijay trained the system on Fisher data perturbed with randomly chosen room responses and room noises that he got from Shinji Watanabe. He plans to release the recipe soon. We trained on 3 copies of the Fisher data (perturbed with different room responses). This took only 2 or 3 days for cross-entropy training, using up to 20 or so GPUs. We think the reason we did well in the evaluation is probably some combination of our setup being more scalable (so we can train on more data in a reasonable time -- see http://arxiv-web3.library.cornell.edu/abs/1410.7455), and maybe Shinji's "real" room responses being better than the simulation-based ones that we heard some others used.

While building this system we found some interesting things. Firstly, we found and fixed a bug whereby the "min-active" parameter was not being properly enforced by the decoders; this meant that long utterances would sometimes be decoded as a truncated transcript, due to the decoder getting stuck in a state which cannot reach the majority of the decoding graph.

Also, we found that the use of iVectors as an adaptation method was not nearly as robust as we would like. Firstly, it was not very robust to variations in volume -- we had to normalize the energy of the test data to fix this. Probably this was because we normalized the training-data volume too carefully; in future we plan to ensure the training data has variations in volume. Secondly, we found that it was very important to exclude silence from the iVector estimation. This was unexpected, as we included silence at training time and we normally include it at test time (to avoid the hassle of voice activity detection). Perhaps the issue was that the silences we encountered in test were too different from those we encountered in training.

In the end the way we resolved this was to do a first decoding pass, get the ctm with confidences (lattice-to-ctm-conf), filter out words with confidence <1, words with improbably long durations, mm, mhm, laughter, noise, and so on, and use only what remained for iVector estimation. We also found that it was important, for long utterances, to scale up the prior term in iVector estimation (we added the --max-count option to some programs to handle this; it's done by scaling down the counts).

Incidentally, the way we handled segmentation in this system was pretty brain-dead -- we just used overlapping segments of 10 seconds, shifted by 5 seconds each time, and spliced things together at the end, at the ctm level, using the idea that the transcript will only be disrupted close to the boundary of each segment and transcripts closer to the middle should be OK.
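Going back to the iVector business: the confidence-based filtering described above is simple enough to sketch. The following is an illustration only, not the actual ASPIRE scripts; the ctm field order, the filler-word list and the 2-second duration cutoff are assumptions made for the example.

# Illustration only: filter a first-pass ctm with confidences (e.g. as
# produced by lattice-to-ctm-conf) down to the words we trust, and keep only
# those regions for iVector estimation.
FILLERS = {"mm", "mhm", "[laughter]", "[noise]", "<unk>"}   # assumed filler set
MAX_WORD_DUR = 2.0   # "improbably long" cutoff in seconds; made-up value

def trusted_regions(ctm_lines):
    """Yield (utt, start, end) for the words we trust enough to use for
    iVector estimation.  Each ctm line is assumed to look like:
    <utt> <channel> <start> <duration> <word> <confidence>."""
    for line in ctm_lines:
        utt, _chan, start, dur, word, conf = line.split()
        start, dur, conf = float(start), float(dur), float(conf)
        if conf < 1.0:                 # keep only words with confidence 1
            continue
        if dur > MAX_WORD_DUR:         # drop improbably long words
            continue
        if word.lower() in FILLERS:    # drop mm, mhm, laughter, noise, etc.
            continue
        yield utt, start, start + dur

# Tiny example: only the confident content words survive.
example_ctm = [
    "utt1 1 0.25 0.50 the 1.00",
    "utt1 1 0.75 0.25 meeting 1.00",
    "utt1 1 1.00 3.50 uh 0.41",
    "utt1 1 4.50 0.25 mhm 1.00",
]
print(list(trusted_regions(example_ctm)))
# -> [('utt1', 0.25, 0.75), ('utt1', 0.75, 1.0)]

The surviving (start, end) regions are what would then be fed to the iVector estimation.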
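Similarly, here is a simplified sketch of the overlapping-segment splicing just described. Keeping each word only from the window whose center its midpoint is closest to is one way of realizing "transcripts closer to the middle should be OK"; it is not necessarily exactly what the released recipe will do.

# Illustration only: decode overlapping windows and splice the per-window
# ctms back together at the end.  Each word is kept only from the window
# whose center it is closest to, since words near a window boundary are the
# ones most likely to be disrupted.
WINDOW, SHIFT = 10.0, 5.0   # seconds, as in the scheme described above

def windows(total_dur):
    """Overlapping (start, end) decoding windows covering the recording."""
    t, out = 0.0, []
    while True:
        out.append((t, min(t + WINDOW, total_dur)))
        if t + WINDOW >= total_dur:
            break
        t += SHIFT
    return out

def splice_ctms(per_window_ctm):
    """per_window_ctm: list of ((win_start, win_end), words) in window order,
    where each word is (word, abs_start, dur) with times already converted to
    absolute time.  Returns the merged word sequence."""
    centers = [(ws + we) / 2.0 for (ws, we), _ in per_window_ctm]
    merged = []
    for i, (_window, words) in enumerate(per_window_ctm):
        for word, start, dur in words:
            midpoint = start + dur / 2.0
            nearest = min(range(len(centers)),
                          key=lambda j: abs(midpoint - centers[j]))
            if nearest == i:   # keep the word only from its "best" window
                merged.append((word, start, dur))
    merged.sort(key=lambda w: w[1])
    return merged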
We will try to incorporate some of the lessons learned from ASPIRE (e.g. about the importance of excluding silence for iVector estimation) into the online-nnet2 setup as soon as we can.

Something else that was added to Kaldi recently (Guoguo Chen did this) is a compact non-FST format for ARPA language models (search for carpa, or const-arpa) which enables memory-efficient rescoring of lattices with un-pruned language models. You'll see that in some of the scripts, such as Fisher English, we're already using this.

We have also begun adding pronunciation probabilities to some of the scripts (work with Guoguo Chen and Hainan Xu). Here is an excerpt from the Librispeech training script (run.sh):

steps/get_prons.sh --cmd "$train_cmd" data/train_clean_460 data/lang exp/tri4b_ali_clean_460
utils/dict_dir_add_pronprobs.sh data/local/dict exp/tri4b_ali_clean_460/pron_counts_nowb.txt data/local/dict_pp
utils/prepare_lang.sh data/local/dict_pp "<SPOKEN_NOISE>" data/local/lang_tmp_pp data/lang_pp

So basically, after we've built the system a few times, we get the training alignments, estimate pronunciation probabilities, re-estimate the dictionary with pronunciation probabilities, and re-build the "lang" directory as "lang_pp" (with pronunciation probabilities in the L.fst), and afterwards we use that. It gives a pretty small improvement, e.g. 0.2% is typical. This doesn't require any deep changes in Kaldi, as we have basically always supported pronunciation probabilities; we just hadn't done the scripting to support them in the standard setups. (Note: we normalize the probs so the max prob for each word is 1... we believe this is common, e.g. IBM does it.)

Right now Hainan and Guoguo are working on adding word-specific silence probabilities, estimated from data; this will be incorporated into the same workflow as we currently use for estimating pronunciation probabilities, so it would be no extra hassle for the user. It looks like we can generally get an extra 0.1% to 0.3% from this. It's not finalized yet, but when it is we'll start including it in the scripts.

Something that turned out to be important for getting the silence probabilities, mentioned above, to work is the use of a word insertion penalty in scoring. Previously we have mostly avoided the use of word insertion penalties in Kaldi. The reason we could get away with this is that the standard Kaldi scripts use a silence probability of 0.5 (meaning not-having-a-silence also has a probability of 0.5), and this acts as a fixed word insertion penalty with a reasonable value (in the L.fst each word picks up a cost of about -log(0.5) = 0.7, whether or not silence follows it). When we estimate the probabilities from data, the silence probability is lower (e.g. 0.1 or 0.2), so that penalty gets decreased, at least when there is no silence (-log(0.9) is only about 0.1). Jan Trmal has added a scoring script steps/score_kaldi.sh which can be used in cases where you don't need to do sclite scoring, and which supports searching over insertion penalties and also generates detailed statistics similar to sclite's. We're using this as the scoring script in the WSJ setup and will switch over other setups to use it in future.

Other recently added features and options: the alignment and training scripts now support a "--careful" option which should in theory improve alignment quality. It is a method designed to detect alignment errors where the alignment eats up the words in the transcript too soon. It doesn't normally seem to make much of a difference, but it might make a difference for setups with long segments and/or errorful transcriptions (but in those scenarios, see also steps/cleanup/find_bad_utts.sh and egs/wsj/s5/local/run_segmentation.sh).

The sMBR training (although not yet Karel's nnet1 version; he will add it soon) supports an option --one-silence-class. If you notice your sMBR training is producing too many insertions and too few deletions, you can try setting this to true and see if it helps; this fixes an asymmetry in the objective function whereby insertions were not penalized. This was important for us in the ASPIRE evaluation.

We have of course been improving the recipes. Minhua Wu, Guoguo Chen and others have been working on a new, improved version of the Switchboard recipe in egs/swbd/s5c/ (including more consistent data normalization) and a recipe where we train on Fisher and Switchboard together and test on eval2000 (egs/fisher_swbd/s5/). Tony Robinson, Karel and others have been improving the Tedlium recipe; Tony has released some much-improved language models that he built for that recipe, and we will soon incorporate these into the scripts. David Snyder has been improving the speaker-id setup, with the help of Daniel Garcia-Romero (e.g. looking at issues like whitening and length normalization), and hopes to add sre10 examples soon. We don't claim to be at the forefront of speaker-id research (yet), but it's sometimes convenient to have a "native" Kaldi implementation of speaker-id.

If you contributed something and were not mentioned here, apologies... what I said above was just what came immediately to my mind, and I'm excluding things that are more research-y and less immediately relevant to Kaldi users.

Dan