Hi all,
I have successfully trained the nnet2 online DNN model with MFCC features,
so I did the same thing again with fbank features.
However, I got an unexpectedly bad WER: almost every sentence is recognized incorrectly.
I have checked compute_prob_valid.*.log and it looks fine. With MFCC the final
value is 0.6065, and with fbank it is 0.573.
The command for the decoding is:
online2-wav-nnet2-latgen-faster --online=true --do-endpointing=false --config=online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=words.txt final.mdl HCLG.fst ...
I also checked online_nnet2_decoding.conf. It was generated correctly for fbank:
--feature-type=fbank
--fbank-config=...fbank.conf
--ivector-extraction-config=...ivector_extractor.conf
--endpoint.silence-phones=...
I would appreciate it if you could give me some hints for tracking down the problem!
Thank you,
Yours sincerely,
Truong Do
I don't think I have ever run the setup with the fbank features--
there is really no point, because we use MFCC without dimension
reduction, which are just a linearly transformed version of the fbank
features. It is possible that there is some bug somewhere. Decode
with a higher verbose level and look for the objective-function
changes reported for the iVectors. (You re-trained the iVector
extractor on top of fbank features, right?). That would narrow down
whether something is going wrong with the iVectors.
Dan
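To illustrate the point about MFCCs being a linear transform of the fbank features, here is a minimal sketch. It assumes the cepstra come from an orthonormal DCT-II applied to the log mel-filterbank energies with no truncation; the bin count of 40 and the random stand-in frame are illustrative assumptions, and Kaldi's actual MFCC pipeline has additional steps (such as liftering and the energy option) that this sketch ignores.

```python
import math
import random


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * k * (m + 0.5) / n)
                     for m in range(n)])
    return rows


def apply(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


NUM_BINS = 40  # assumed number of mel bins; not taken from the thread
random.seed(0)

# Stand-in for one frame of log mel-filterbank energies.
log_fbank = [random.gauss(0.0, 1.0) for _ in range(NUM_BINS)]

D = dct_matrix(NUM_BINS)
mfcc = apply(D, log_fbank)             # keep all NUM_BINS coefficients
recovered = apply(transpose(D), mfcc)  # the inverse of an orthonormal matrix is its transpose

max_err = max(abs(a - b) for a, b in zip(recovered, log_fbank))
print(f"max reconstruction error: {max_err:.2e}")
```

Because no coefficients are dropped, the fbank frame is recovered up to floating-point rounding, which is the sense in which the two feature types carry the same information.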
Thank you for your help.
> I don't think I have ever run the setup with the fbank features-- there is really no point, because we use MFCC without dimension reduction, which are just a linearly transformed version of the fbank features.
If so, why are the results from the fbank and MFCC features slightly different?
And when I combined the results from those two systems, I got some improvement (based on an earlier experiment).
> Decode with a higher verbose level
The objective function improvement from estimating the iVector looks correct:
it increases as we see more frames.
Do you think the problem is in the HCLG.fst graph?
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 65.8562
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 65.8981
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 66.0219
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 66.2964
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 66.6309
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 66.8939
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 67.1176
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 67.5152
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 68.459
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 69.3028
Those objective function improvements are too large; they should be
around 10. It could indicate a mismatch in the iVector extractor
(e.g. trained on wrong data? mismatch in cmvn?)
What were the objf improvements like in training? The averages should
have been printed in the log.
Dan
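As a rough way to apply this check to a decoding log, the sketch below pulls the iVector objective-function improvements out of the verbose output and compares their mean with the "around 10" figure mentioned above. The message format is copied from the log lines in this thread; the function names and the factor-of-3 threshold are assumptions made for illustration, not anything Kaldi defines.

```python
import re
import statistics

# Message text as it appears in the VLOG lines quoted in this thread.
OBJF_RE = re.compile(
    r"Objective function improvement from estimating the iVector "
    r"\(vs\. default value\) is (-?\d+(?:\.\d+)?)"
)


def ivector_objf_improvements(log_text):
    """Extract the reported improvements, tolerating line-wrapped logs."""
    flattened = re.sub(r"\s+", " ", log_text)
    return [float(m.group(1)) for m in OBJF_RE.finditer(flattened)]


def summarize(values, expected=10.0, factor=3.0):
    """Flag the run when the mean is far above the expected value.

    `expected` follows the "around 10" rule of thumb from the thread;
    `factor` is an arbitrary tolerance chosen for this sketch.
    """
    mean = statistics.mean(values)
    return {"count": len(values), "mean": mean,
            "suspicious": mean > expected * factor}


sample_log = """\
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default value) is 65.8562
VLOG[4] (online2-wav-nnet2-latgen-faster:GetIvector():ivector-extractor.cc:650) Objective function improvement from estimating the iVector (vs. default
value) is 69.3028
"""

values = ivector_objf_improvements(sample_log)
summary = summarize(values)
print(summary)
```

Running the same extraction over the iVector extractor training logs would give the averages to compare against, assuming those logs report the improvement in a similar form.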
On Tue, Jul 14, 2015 at 8:21 PM, Do Quoc Truong truongdq54@users.sf.net wrote:
Hi Dan,
I found the mistake: I was using the wrong iVector extractor.
Thank you so much for your advice.
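For anyone hitting the same problem, one way to catch a mismatched extractor is to list which files the decoding config actually points at. The sketch below parses a Kaldi-style config (one `--name=value` option per line, `#` for comments) and reports path-valued options that fall outside the experiment directory you expect. The directory names in the example are hypothetical placeholders, not the paths used in this thread, and `paths_outside` is a helper invented for this sketch.

```python
def parse_kaldi_config(text):
    """Parse '--name=value' lines into a dict, ignoring comments and blanks."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not line.startswith("--"):
            raise ValueError(f"unexpected config line: {raw!r}")
        name, _, value = line[2:].partition("=")
        options[name] = value
    return options


def paths_outside(options, expected_dir):
    """Return names of path-valued options that are not under expected_dir."""
    prefix = expected_dir.rstrip("/") + "/"
    return sorted(name for name, value in options.items()
                  if "/" in value and not value.startswith(prefix))


# Hypothetical config: the fbank system accidentally reuses the MFCC system's
# iVector extraction config. All paths here are made up for illustration.
example_conf = """\
--feature-type=fbank
--fbank-config=exp/fbank_system/conf/fbank.conf
--ivector-extraction-config=exp/mfcc_system/conf/ivector_extractor.conf
"""

options = parse_kaldi_config(example_conf)
mismatched = paths_outside(options, "exp/fbank_system")
print(mismatched)
```

The same check can be repeated on the file named by `--ivector-extraction-config`, since that file in turn references the extractor and related models.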