From: Daniel P. <dp...@gm...> - 2015-02-23 21:16:30
I notice quite a few people are signed up to the kaldi-developers list and it doesn't get very much traffic, so I thought I should send out some updates to let people know what's going on.

There has been some email traffic about LSTMs -- a guy called Jiayu Du from Baidu has been working with Karel and others to implement them in Karel's setup. As far as I know they are not yet working better than the baseline; it seems LSTMs need quite a bit of tuning to get them to work. Also the training is pretty slow, so we can't train them on a lot of data yet.

In the last few months, in the nnet2 setup we have been using an architecture we call "multi-splice", where splicing over time happens not just at the start of the network but at multiple layers. Typically we splice 2 frames, but not adjacent ones, and the separation in time increases for higher layers of the network (and the last layer or two typically don't have splicing). We modified the training code so it only evaluates the frames that it needs to evaluate (i.e. to allow gaps in time). This is usually giving improvements over our previous (p-norm) numbers, typically 5% relative or so. In time we plan to modify all the nnet2 recipes to use this.
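To make the "only evaluate the frames you need" point concrete, here is a small Python sketch (an illustration only, not Kaldi code; the per-layer splice offsets are made-up examples, not the offsets of any particular recipe). It works down from a single output frame and shows that the higher layers only need a sparse, gappy set of frames:

# Illustration only, not Kaldi code.  The splice offsets below are made-up
# examples in the spirit of the multi-splice setup: conventional splicing at
# the input, then 2 non-adjacent frames per layer with growing separation,
# and no splicing at the top.
splice_offsets = [
    [-2, -1, 0, 1, 2],   # layer 1 (input splicing)
    [-1, 2],             # layer 2
    [-3, 3],             # layer 3
    [-7, 2],             # layer 4
    [0],                 # layer 5 (no splicing)
]

def frames_needed(offsets_per_layer):
    """Working down from a single output frame, compute which frame offsets
    must be evaluated at the input of each layer.  The first element is the
    set of raw input frames, later elements are the hidden layers' inputs;
    frames in the gaps never need to be computed."""
    needed = [{0}]                       # the single output frame we want
    for offsets in reversed(offsets_per_layer):
        needed.append({t + o for t in needed[-1] for o in offsets})
    return [sorted(s) for s in reversed(needed)][:-1]

for layer, frames in enumerate(frames_needed(splice_offsets), start=1):
    span = frames[-1] - frames[0] + 1
    print("layer %d input: %2d frames needed out of a %2d-frame span: %s"
          % (layer, len(frames), span, frames))

With these example offsets the layer just above the input only needs 7 of the 19 frames it spans, and the fraction of frames that can be skipped grows with the separations used at the higher layers.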
Vijay Peddinti (who was also involved in the multi-splice work) submitted a Kaldi-based system to the ASPIRE challenge (ASPIRE is about reverberant speech in varying microphone conditions, and Fisher is the only training data you are allowed to use). We were just behind the top system (BBN), but ours was a single system and we believe theirs was a big combination. Our system was an online-nnet2 system, i.e. with iVectors as input as well as unadapted 40-dimensional MFCC features. Vijay trained the system on Fisher data perturbed with randomly chosen room responses and room noises that he got from Shinji Watanabe. He plans to release the recipe soon. We trained on 3 copies of the Fisher data (perturbed with different room responses). This took only 2 or 3 days for cross-entropy training, using up to 20 or so GPUs. We think the reason we did well in the evaluation is probably some combination of our setup being more scalable (so we can train on more data in a reasonable time -- see http://arxiv-web3.library.cornell.edu/abs/1410.7455), and maybe Shinji's "real" room responses being better than the simulation-based ones that we heard some others used.

While building this system we found some interesting things. Firstly, we found and fixed a bug whereby the "min-active" parameter was not being properly enforced by the decoders; this meant that long utterances would sometimes be decoded as a truncated transcript, due to the decoder getting stuck in a state which cannot reach the majority of the decoding graph.

Also, we found that the use of iVectors as an adaptation method was not nearly as robust as we would like. Firstly, it was not very robust to variations in volume -- we had to normalize the energy of the test data to fix this. Probably this was because we normalized the training-data volume too carefully; in future we plan to ensure the training data has variations in volume. Secondly, we found that it was very important to exclude silence from the iVector estimation. This was unexpected, as we included silence at training time and we normally include it at test time (to avoid the hassle of voice activity detection). Perhaps the issue was that the silences we encountered in test were too different from those we encountered in training.

In the end the way we resolved this was to do a first decoding pass, get the ctm with confidences (lattice-to-ctm-conf), filter out words with confidence <1, words with improbably long durations, mm, mhm, laughter, noise, and so on, and use only what remained for iVector estimation. We also found that it was important, for long utterances, to scale up the prior term in iVector estimation (we added the --max-count option to some programs to handle this; it's done by scaling down the counts).

Incidentally, the way we handled segmentation in this system was pretty brain-dead -- we just used overlapping segments of 10 seconds, shifted by 5 seconds each time, and spliced things together at the end, at the ctm level, using the idea that the transcript will only be disrupted close to the boundary of each segment and transcripts closer to the middle should be OK.
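Going back to the iVector business: the confidence-based filtering described above is simple enough to sketch. The following is an illustration only, not the actual ASPIRE scripts; the ctm field order, the filler-word list and the 2-second duration cutoff are assumptions made for the example.

# Illustration only: filter a first-pass ctm with confidences (e.g. as
# produced by lattice-to-ctm-conf) down to the words we trust, and keep only
# those regions for iVector estimation.
FILLERS = {"mm", "mhm", "[laughter]", "[noise]", "<unk>"}   # assumed filler set
MAX_WORD_DUR = 2.0   # "improbably long" cutoff in seconds; made-up value

def trusted_regions(ctm_lines):
    """Yield (utt, start, end) for the words we trust enough to use for
    iVector estimation.  Each ctm line is assumed to look like:
    <utt> <channel> <start> <duration> <word> <confidence>."""
    for line in ctm_lines:
        utt, _chan, start, dur, word, conf = line.split()
        start, dur, conf = float(start), float(dur), float(conf)
        if conf < 1.0:                 # keep only words with confidence 1
            continue
        if dur > MAX_WORD_DUR:         # drop improbably long words
            continue
        if word.lower() in FILLERS:    # drop mm, mhm, laughter, noise, etc.
            continue
        yield utt, start, start + dur

# Tiny example: only the confident content words survive.
example_ctm = [
    "utt1 1 0.25 0.50 the 1.00",
    "utt1 1 0.75 0.25 meeting 1.00",
    "utt1 1 1.00 3.50 uh 0.41",
    "utt1 1 4.50 0.25 mhm 1.00",
]
print(list(trusted_regions(example_ctm)))
# -> [('utt1', 0.25, 0.75), ('utt1', 0.75, 1.0)]

The surviving (start, end) regions are what would then be fed to the iVector estimation.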
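Similarly, here is a simplified sketch of the overlapping-segment splicing just described. Keeping each word only from the window whose center its midpoint is closest to is one way of realizing "transcripts closer to the middle should be OK"; it is not necessarily exactly what the released recipe will do.

# Illustration only: decode overlapping windows and splice the per-window
# ctms back together at the end.  Each word is kept only from the window
# whose center it is closest to, since words near a window boundary are the
# ones most likely to be disrupted.
WINDOW, SHIFT = 10.0, 5.0   # seconds, as in the scheme described above

def windows(total_dur):
    """Overlapping (start, end) decoding windows covering the recording."""
    t, out = 0.0, []
    while True:
        out.append((t, min(t + WINDOW, total_dur)))
        if t + WINDOW >= total_dur:
            break
        t += SHIFT
    return out

def splice_ctms(per_window_ctm):
    """per_window_ctm: list of ((win_start, win_end), words) in window order,
    where each word is (word, abs_start, dur) with times already converted to
    absolute time.  Returns the merged word sequence."""
    centers = [(ws + we) / 2.0 for (ws, we), _ in per_window_ctm]
    merged = []
    for i, (_window, words) in enumerate(per_window_ctm):
        for word, start, dur in words:
            midpoint = start + dur / 2.0
            nearest = min(range(len(centers)),
                          key=lambda j: abs(midpoint - centers[j]))
            if nearest == i:   # keep the word only from its "best" window
                merged.append((word, start, dur))
    merged.sort(key=lambda w: w[1])
    return merged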
We will try to incorporate some of the lessons learned from ASPIRE (e.g. about the importance of excluding silence for iVector estimation) into the online-nnet2 setup as soon as we can.

Something else that was added to Kaldi recently (Guoguo Chen did this) is a compact non-FST format for ARPA language models (search for carpa, or const-arpa) which enables memory-efficient rescoring of lattices with un-pruned language models. You'll see that in some of the scripts, such as Fisher English, we're already using this.

We have also begun adding pronunciation probabilities to some of the scripts (work with Guoguo Chen and Hainan Xu). Here is an excerpt from the Librispeech training script (run.sh):

steps/get_prons.sh --cmd "$train_cmd" data/train_clean_460 data/lang exp/tri4b_ali_clean_460
utils/dict_dir_add_pronprobs.sh data/local/dict exp/tri4b_ali_clean_460/pron_counts_nowb.txt data/local/dict_pp
utils/prepare_lang.sh data/local/dict_pp "<SPOKEN_NOISE>" data/local/lang_tmp_pp data/lang_pp

So basically, after we've built the system a few times, we get the training alignments, estimate pronunciation probabilities, re-estimate the dictionary with pronunciation probabilities, and re-build the "lang" directory as "lang_pp" (with pronunciation probabilities in the L.fst), and afterwards we use that. It gives a pretty small improvement, e.g. 0.2% is typical. This doesn't require any deep changes in Kaldi, as we have basically always supported pronunciation probabilities; we just hadn't done the scripting to support them in the standard setups. (Note: we normalize the probs so the max prob for each word is 1... we believe this is common, e.g. IBM does it.)

Right now Hainan and Guoguo are working on adding word-specific silence probabilities, estimated from data; this will be incorporated into the same workflow as we currently use for estimating pronunciation probabilities, so it would be no extra hassle for the user. It looks like we can generally get an extra 0.1% to 0.3% from this. It's not finalized yet, but when it is we'll start including it in the scripts.

Something that turned out to be important for getting the silence probabilities, mentioned above, to work is the use of a word insertion penalty in scoring. Previously we have mostly avoided the use of word insertion penalties in Kaldi. The reason we could get away with this is that the standard Kaldi scripts use a silence probability of 0.5 (meaning not-having-a-silence also has a probability of 0.5), and this acts as a fixed word insertion penalty with a reasonable value (in the L.fst each word picks up a cost of about -log(0.5) = 0.7, whether or not silence follows it). When we estimate the probabilities from data, the silence probability is lower (e.g. 0.1 or 0.2), so that penalty gets decreased, at least when there is no silence (-log(0.9) is only about 0.1). Jan Trmal has added a scoring script steps/score_kaldi.sh which can be used in cases where you don't need to do sclite scoring, and which supports searching over insertion penalties and also generates detailed statistics similar to sclite's. We're using this as the scoring script in the WSJ setup and will switch over other setups to use it in future.

Other recently added features and options: the alignment and training scripts now support a "--careful" option which should in theory improve alignment quality. It is a method designed to detect alignment errors where the alignment eats up the words in the transcript too soon. It doesn't normally seem to make much of a difference, but it might make a difference for setups with long segments and/or errorful transcriptions (but in those scenarios, see also steps/cleanup/find_bad_utts.sh and egs/wsj/s5/local/run_segmentation.sh).

The sMBR training (although not yet Karel's nnet1 version; he will add it soon) supports an option --one-silence-class. If you notice your sMBR training is producing too many insertions and too few deletions, you can try setting this to true and see if it helps; this fixes an asymmetry in the objective function whereby insertions were not penalized. This was important for us in the ASPIRE evaluation.

We have of course been improving the recipes. Minhua Wu, Guoguo Chen and others have been working on a new, improved version of the Switchboard recipe in egs/swbd/s5c/ (including more consistent data normalization) and a recipe where we train on Fisher and Switchboard together and test on eval2000 (egs/fisher_swbd/s5/). Tony Robinson, Karel and others have been improving the Tedlium recipe; Tony has released some much-improved language models that he built for that recipe, and we will soon incorporate these into the scripts. David Snyder has been improving the speaker-id setup, with the help of Daniel Garcia-Romero (e.g. looking at issues like whitening and length normalization), and hopes to add sre10 examples soon. We don't claim to be at the forefront of speaker-id research (yet), but it's sometimes convenient to have a "native" Kaldi implementation of speaker-id.

If you contributed something and were not mentioned here, apologies... what I said above was just what came immediately to my mind, and I'm excluding things that are more research-y and less immediately relevant to Kaldi users.

Dan