I tried searching the forum for any previous questions on training or adapting a network on all the utterances from a particular speaker -- in the spirit of LIN adaptation (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=687200, http://research.microsoft.com/pubs/230082/IS141354.pdf). I'm using nnet1 (Karel's DNN implementation).
I'm looking for an nnet1 binary with functionality similar to gmm-est-fmllr-gpost, which can take a --spk2utt option to estimate fMLLR matrices from all of a speaker's utterances after decoding with a speaker-independent (SI) model (unsupervised adaptation). Instead of fMLLR, I would like to estimate speaker-dependent transforms, as is done in LIN adaptation, using one of the nnet-train-frmshuff-style binaries. I don't see such an option in nnet-train-frmshuff, but I do see a binary called nnet-train-perutt, which, as I understand it, shuffles frames within an utterance.
If the kind of functionality I'm looking for doesn't exist, what might be the best way of going about implementing it?
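To make the request concrete, the workflow I have in mind would look roughly like the hypothetical sketch below (not an existing recipe). It assumes an SI net si.nnet whose components carry <LearnRateCoef> 0 so they stay frozen, a LIN lin.init holding an identity <AffineTransform> with <LearnRateCoef> 1, and per-utterance pdf posteriors post.ark from a first-pass SI decode; all file names are placeholders:

# Prepend the trainable LIN to the frozen SI network:
nnet-concat lin.init si.nnet lin_si.init

# Per-speaker fine-tuning: select each speaker's utterances from spk2utt
# (lines of the form "spk utt1 utt2 ..."); only the LIN gets updated,
# because the SI components have zero learn-rate coefficients.
while read spk utts; do
  echo $utts | tr ' ' '\n' > uttlist
  utils/filter_scp.pl uttlist feats.scp > feats_${spk}.scp
  nnet-train-frmshuff --learn-rate=0.0001 \
    scp:feats_${spk}.scp ark:post.ark lin_si.init lin_${spk}.nnet
done < spk2utt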
Any help would be appreciated.
That paper is very old and I doubt the results are applicable to modern systems.
Various people have published papers on adaptation techniques like
what you mention, but the results were always very disappointing. So
no, it's not supported, and probably won't be unless these methods
start to look promising.
(Papers that reported improvements often used supervised adaptation
data, which is an unusual scenario, and the improvements were often
still quite small).
Dan
Hi,
no, the adaptation of NN weights to individual speakers is not supported.
The current state-of-the-art approach is to use iVector-based features
and/or fMLLR features computed by an auxiliary GMM model.
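For example, the fMLLR route estimates one transform per speaker with the GMM tools and applies it to the features the DNN consumes. Roughly (a sketch with placeholder file names, assuming first-pass alignments ali.ark from an auxiliary GMM tri.mdl):

# Estimate one fMLLR matrix per speaker from all of that speaker's data:
ali-to-post ark:ali.ark ark:- | \
  gmm-est-fmllr --fmllr-update-type=full --spk2utt=ark:spk2utt \
    tri.mdl scp:feats.scp ark:- ark:trans.ark

# Apply the per-speaker transforms to produce fMLLR features for the DNN:
transform-feats --utt2spk=ark:utt2spk ark:trans.ark \
  scp:feats.scp ark:feats_fmllr.ark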
The functionality of the binaries is as follows:
nnet-train-frmshuff - SGD with frame shuffling; useful for feed-forward NNs.
nnet-train-perutt - SGD with per-utterance updates (i.e. without frame
shuffling); the utterance list is shuffled. It's useful for training
recurrent networks.
But neither binary does speaker adaptation...
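For illustration, typical invocations of the two trainers look roughly like this (learning rates and file names are placeholders):

# feed-forward DNN: frames shuffled across utterances in a large buffer
nnet-train-frmshuff --learn-rate=0.008 --minibatch-size=256 \
  scp:feats.scp ark:targets.post nnet.init nnet.iter1

# recurrent network: one update per utterance, utterance order shuffled
nnet-train-perutt --learn-rate=0.0001 \
  scp:feats.scp ark:targets.post nnet.init nnet.iter1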
Best regards,
Karel.