Dear Nickolay,
Could you tell me what the state of the art in noise robustness is right now?
Does it still make sense to include a Wiener filter in the pipeline?
Also, support for the PNCC features was announced almost a year ago:
http://nshmyrev.blogspot.de/2013/06/around-noise-robust-pncc-features.html
But I found only one mention of it, in the Denoise class.
I am not sure whether it is enabled by default or whether I must include it in the pipeline explicitly. Are you going to implement real PNCC feature extraction, or does MFCC+Denoise cover this?
Thanks in advance.
This is a complex subject. The state of the art is that you need to know the noise profile in order to cancel the noise effectively. Simple algorithms based on spectral subtraction are also popular and provide some improvement.
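To illustrate the spectral-subtraction idea, here is a minimal Java sketch (an illustration of the basic technique only, not the actual CMUSphinx Denoise code): estimate the noise power spectrum from frames known to contain no speech, subtract it from each noisy frame, and floor the result so energies never go negative.
~~~~~~~~~~~~
// Minimal sketch of power-domain spectral subtraction.
// Illustration only; NOT the CMUSphinx Denoise implementation.
public final class SpectralSubtraction {

    // noisy: power spectrum of the current frame
    // noise: estimated noise power spectrum (e.g. averaged over
    //        frames known to contain no speech)
    // floor: fraction of the noisy power kept as a minimum (e.g. 0.01),
    //        so the result never goes negative
    public static double[] subtract(double[] noisy, double[] noise, double floor) {
        double[] clean = new double[noisy.length];
        for (int i = 0; i < noisy.length; i++) {
            clean[i] = Math.max(noisy[i] - noise[i], floor * noisy[i]);
        }
        return clean;
    }
}
~~~~~~~~~~~~
The floor is what keeps the residual "musical noise" tolerable; practical implementations additionally smooth the noise estimate over time and frequency.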
No
Denoise is automatically enabled in AutoCepstrum if feat.params has -remove_noise yes, or if you add it to the processing pipeline yourself.
We are using denoised MFCC by default in CMUSphinx, we are not going to implement PNCC.
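For example, in a model trained with noise removal, the flag appears in the model's etc/feat.params like this (shown in isolation here; the file contains other flags too):
~~~~~~~~~~~~
-remove_noise yes
~~~~~~~~~~~~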
Could you give a hint on the explicit usage of Denoise in the Sphinx4 pipeline?
From what I understood here:
http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/frontend/AutoCepstrum.html
it looks like:
StreamDataSource
Preemphasizer
RaisedCosineWindower
DiscreteFourierTransform
MelFrequencyFilterBank
Denoise
DiscreteCosineTransform2
Lifter
BatchCMN
DeltasFeatureExtractor
FeatureTransform
Is it correct?
Can I use the voxforge2 acoustic models 'as is' with the new pipeline, or must they be retrained with -remove_noise yes?
I adapted the models using MAP; does the noise removal affect the adaptation process?
I would also appreciate some references on noise profiles and their use in noise cancellation. I have audio data recorded and grouped by user, so I believe I can use it for noise cancellation.
Yes
You can use the existing models, but it's better to retrain. The WER difference on clean data should be minor.
No
There is no such thing as a noise profile in the current implementation, though it seems a reasonable thing to have.
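For reference, the pipeline confirmed above would look roughly like this as a Sphinx4 XML configuration. This is only a sketch: the component names are mine, the package paths are taken from my reading of the sphinx4 sources and may differ between versions, and each component's properties are omitted.
~~~~~~~~~~~~
<!-- inside the <config> element of the sphinx4 configuration file -->
<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>streamDataSource</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>denoise</item>
        <item>dct2</item>
        <item>lifter</item>
        <item>batchCMN</item>
        <item>featureExtractor</item>
        <item>featureTransform</item>
    </propertylist>
</component>

<component name="streamDataSource" type="edu.cmu.sphinx.frontend.util.StreamDataSource"/>
<component name="preemphasizer"    type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"         type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>
<component name="fft"              type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank"    type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>
<component name="denoise"          type="edu.cmu.sphinx.frontend.denoise.Denoise"/>
<component name="dct2"             type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform2"/>
<component name="lifter"           type="edu.cmu.sphinx.frontend.transform.Lifter"/>
<component name="batchCMN"         type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>
<component name="featureExtractor" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
<component name="featureTransform" type="edu.cmu.sphinx.frontend.feature.FeatureTransform"/>
~~~~~~~~~~~~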
Well, I tried to retrain the models and to use the pipeline described above,
and got an absolute accuracy improvement of 4% (74.65 -> 78.72).
The only strange thing: I found that the recognizer does not work when using DiscreteCosineTransform2, as suggested here:
http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/frontend/AutoCepstrum.html
I get just empty output instead of hypotheses. However, it works with DiscreteCosineTransform.
Ok, great
You probably misconfigured something in training. The new trainer has updated properties, so you need to update etc/feat.params and sphinx_train.cfg. After that, in sphinx_train.cfg you should see the CFG_TRANSFORM configuration variable, which must be set to dct, and in the model's feat.params you should see -transform dct too.
Yeah, -transform in feat.params is set to legacy...
But I don't have CFG_TRANSFORM in my current config. Should I just add it?
Maybe you have a sample sphinx_train.cfg for the VoxForge corpus to share with me?
Just add this to your config:
~~~~~~~~~~~~
$CFG_WAVFILE_SRATE = 16000.0;
$CFG_NUM_FILT = 25; # For wideband speech it's 25; for telephone 8kHz speech a reasonable value is 15
$CFG_LO_FILT = 130; # For telephone 8kHz speech the value is 200
$CFG_HI_FILT = 6800; # For telephone 8kHz speech the value is 3500
$CFG_TRANSFORM = "dct"; # Previously the legacy transform was used, but dct is more accurate
$CFG_LIFTER = "22"; # Cepstral liftering smooths the cepstrum to improve recognition
$CFG_VECTOR_LENGTH = 13; # 13 is usually enough
~~~~~~~~~~~~
and it should be fine. See sphinxtrain/etc/sphinx_train.cfg as a template.
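The etc/feat.params of the resulting model should end up consistent with these values. As a sketch (the flag names are standard sphinxbase feature parameters; check the feat.params the trainer actually generates):
~~~~~~~~~~~~
-nfilt 25
-lowerf 130
-upperf 6800
-transform dct
-lifter 22
-remove_noise yes
~~~~~~~~~~~~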
Hi Dmytro,
Can you share both of your updated files, feat.params and sphinx_train.cfg? I would also like to see how this works and to optimize performance in the presence of noise.
Thanks
Here they are, plus the feature extraction script.
I have not tested them yet, but it should be the correct configuration.
Please note that the latest sphinxbase and sphinxtrain must be installed.
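For readers without the attachment: a feature extraction script of this kind typically boils down to a single sphinx_fe call along these lines (the paths, file names, and directory layout here are purely illustrative, not the actual attached script):
~~~~~~~~~~~~
sphinx_fe -argfile model/etc/feat.params \
          -samprate 16000 \
          -c etc/train.fileids \
          -mswav yes \
          -di wav -ei wav \
          -do feat -eo mfc
~~~~~~~~~~~~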
So, having run the experiments, I found that denoising helps a lot for my task: 74.65% -> 81.38%.
But when I adapted the voxforge models using MAP in a LOSO fashion, I got 72.93% accuracy... o_0
Previously I had a substantial improvement at that stage...
At the same time, adaptation on the test data without LOSO (the whole set of test recordings used both to adapt and to evaluate) gave me 97.50% accuracy, so the data itself is fine.
I just can't figure out what it might be... Instability with respect to unseen data after adaptation?
I'm not sure what you mean by "LOSO way".
That looks like an issue to me; you definitely should not get an improvement from 81% to 97%.
There could be many issues here, from a different language weight for the adapted model to wider beams or problems with feature extraction. It's hard to say what is going on.
Please note that MAP adaptation of a continuous model requires quite a lot of data, at least a few hours.
LOSO = leave one speaker out.
I use MAP to adapt the models to the channel and accent (heavy Russian) rather than for speaker adaptation. So, to predict the performance on new users, I split the test data into speaker-dependent folds and perform cross-validation over those folds: one fold for testing, the others for adaptation.
Such a configuration is referred to as MAP_LOSO below.
In contrast, in MAP_full I use all the available data both to adapt and to test the models.
Previously, I had something like:
No adapt - 74.65%
MAP_LOSO - 81.06%
MAP_full - 94.21%
Having introduced denoising (retrained models, new pipeline in config, new feature extraction), I got:
No adapt - 81.38%
MAP_LOSO - 72.93%
MAP_full - 97.50%
The config files are identical except for the ModelLoader location value and the frontend pipeline. I do not have much data (~1h), but so far it helped...
Absolutely mystical :)
For a new frontend you usually need to reevaluate all the other parameters (language weight, beams).
Usually there is a reason; however, it's hard to pinpoint it just by looking at the numbers. You can try MLLR adaptation instead of MAP, and you can also play with the tau parameter of map_adapt to control the interpolation between the adaptation data and the original models.
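For instance, a map_adapt invocation with a fixed tau might look like this. This is only a sketch: the model and accumulator paths are placeholders, some required flags (model definition, etc.) are omitted, and you should check map_adapt's own help output for the exact options in your sphinxtrain version.
~~~~~~~~~~~~
map_adapt -meanfn    voxforge/means \
          -varfn     voxforge/variances \
          -mixwfn    voxforge/mixture_weights \
          -tmatfn    voxforge/transition_matrices \
          -accumdir  ./bwaccumdir \
          -mapmeanfn adapted/means \
          -mapvarfn  adapted/variances \
          -mapmixwfn adapted/mixture_weights \
          -maptmatfn adapted/transition_matrices \
          -fixedtau  yes -tau 10
~~~~~~~~~~~~
As far as I understand, -fixedtau yes makes map_adapt use the given tau everywhere instead of estimating it per density, which is what "fixed tau instead of the Bayes mean" refers to in the reply below.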
You're right; however, I first wanted to see the effect of the new models and the pipeline only. Further fine-tuning is still to be done.
Your suggestion to play with tau is a good idea. I remember I could not find where to specify the parameter :)
I tried using a fixed tau instead of the Bayes mean, and it helped!
Now the accuracy for MAP_LOSO is around 84% for various values of tau.
So, I have trained and tested the VoxForge acoustic model with denoising. The next obvious thing is to share it. How can I do this? Can I upload it to SourceForge (https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/English%20Voxforge/)?
Dear Dmytro,
It's great that you created a new version of voxforge-en. Please upload it to Dropbox and post the link here, and I'll publish it in our downloads. I assume the model is trained on the latest data and has the same file structure (etc folder included).
Well, it might not be the very latest, but it is not outdated either: May 2014.
https://www.dropbox.com/sh/gmi65tjcz901llk/AABYhlRZ9QHnPdA6zjDenE_1a/voxforge-en
So, what is your judgement?
Sorry, I didn't have time to check the accuracy yet. I will check soon. Thank you.
You are welcome to discuss and contribute:
http://habrahabr.ru/post/227099/
Hello
Thank you for the nice post on a popular resource; it covers a few interesting points.
I'm looking at your model and see that you trained with -nfilt 40. This is not the optimal value; nfilt should be around 25. I would also train more senones, since the VoxForge data set is big these days.
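In sphinx_train.cfg terms that means something like the following (the senone count is only an illustration; tune it to your amount of data):
~~~~~~~~~~~~
$CFG_NUM_FILT = 25;        # back to the recommended filter count
$CFG_N_TIED_STATES = 4000; # more senones for a larger corpus (illustrative value)
~~~~~~~~~~~~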
There is also the issue that the accuracy of your model is lower than that of voxforge-en-0.4. This is actually a problem I encountered myself, and the reason I stopped updating the VoxForge models: somehow the quality of the VoxForge data has declined over time, so the accuracy of the model drops if you include the new data. This is a subject for research, though.
@Dmytro
For the state of the art, have a look at the following reference:
Li, J., Deng, L., Gong, Y., & Haeb-Umbach, R. (2014). An Overview of Noise-Robust Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(4), 745-777.
It has loads of pointers, but unfortunately no satisfactory numerical comparison among the methods.