Hi,
I'm working on a project where we take small clips, use them to train the
default acoustic model, and test the resulting accuracy. What we want to
show is that as the number of clips increases, overall accuracy tends to
increase. So we train on 1 clip, then 2 clips, then
3, etc., testing accuracy at each stage. Both the training and testing clips
are from the same audio segment (a lecture from MIT OpenCourseWare). The
problem is, after testing, some clips increase accuracy, some decrease it, and
even when we train with the "good" clips, we get wildly different results. So
my question is, firstly, is this claim feasible? Should we be getting steady
increases in accuracy as the training corpus increases? Secondly, if this is
feasible, are we doing anything wrong in the training/testing process?
Here's the (mostly) complete data we are using for training/testing. It
includes 29 example training clips, as well as a simple Java program that
trains a model automatically based on the input. The testing file is
"ELECtestN.wav".
https://www.dropbox.com/sh/0l3ua164hoqg0cb/O26fIS8KPB
Thank you for any help.
Typo: when I say "training" I mean "adapting." We are using the default WSJ 8k
model and adapting that.
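
To make the setup concrete, here is a rough sketch of the kind of loop described above. This is not the actual program; adaptModel and measureWer are hypothetical stand-ins for the adaptation and scoring steps (running the SphinxTrain tools, then pocketsphinx_continuous plus word_align.pl), and the utterance file names are placeholders.

import java.util.Arrays;
import java.util.List;

// Sketch of the incremental experiment: adapt on the first n utterances,
// then decode the held-out test file and measure WER, for n = 1..N.
public class IncrementalAdaptationExperiment {

    public static void main(String[] args) throws Exception {
        // Placeholder file names; in our case there are 29 training utterances.
        List<String> utterances = Arrays.asList("utt01.wav", "utt02.wav", "utt03.wav");
        String testFile = "ELECtestN.wav";

        for (int n = 1; n <= utterances.size(); n++) {
            // Adapt the default WSJ 8 kHz model on the first n utterances
            // (stand-in for the feature extraction and adaptation steps).
            String adaptedModelDir = adaptModel(utterances.subList(0, n));

            // Decode the test file with the adapted model and score it
            // (stand-in for pocketsphinx_continuous + word_align.pl).
            double wer = measureWer(adaptedModelDir, testFile);
            System.out.printf("utterances=%d  WER=%.1f%%%n", n, wer * 100);
        }
    }

    // Hypothetical helpers, not a real API.
    static String adaptModel(List<String> wavFiles) { return "adapted_model"; }

    static double measureWer(String modelDir, String testWav) { return 0.0; }
}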
These short chunks of audio are called utterances, not clips.
For MAP adaptation, reasonable improvement should start at about 5 minutes of
adaptation audio (far more than you have tried) and continue to increase up to
about 20 hours of adaptation audio. The data you are using is too small for
MAP adaptation.
You can try MLLR adaptation, but for that you need a continuous model.
There is also an issue with your utterances. They MUST have a small period of
silence (about 0.2 s) at the boundaries. You have cut them too tightly.
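
One way to fix that without re-cutting everything by hand is to pad each utterance with roughly 0.2 s of digital silence on both sides. A minimal sketch using the standard javax.sound.sampled API, assuming 16-bit PCM WAV utterances (where all-zero samples are silence) and placeholder file names:

import javax.sound.sampled.*;
import java.io.*;

// Pads a 16-bit PCM WAV utterance with ~0.2 s of silence at both ends.
public class PadSilence {
    public static void main(String[] args) throws Exception {
        AudioInputStream speech = AudioSystem.getAudioInputStream(new File("utt0001.wav"));
        AudioFormat fmt = speech.getFormat();

        // 0.2 s worth of frames; for signed PCM, all-zero bytes are silence.
        int padFrames = (int) (0.2 * fmt.getFrameRate());
        byte[] silence = new byte[padFrames * fmt.getFrameSize()];

        AudioInputStream lead = new AudioInputStream(
                new ByteArrayInputStream(silence), fmt, padFrames);
        AudioInputStream tail = new AudioInputStream(
                new ByteArrayInputStream(silence), fmt, padFrames);

        // silence + speech + silence, written back out as a WAV file
        AudioInputStream padded = new AudioInputStream(
                new SequenceInputStream(lead, new SequenceInputStream(speech, tail)),
                fmt, padFrames + speech.getFrameLength() + padFrames);
        AudioSystem.write(padded, AudioFileFormat.Type.WAVE, new File("utt0001_padded.wav"));
    }
}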
Thank you very much for your help. We have about 30 minutes of adaptation
audio to work with in total. It seems the only issue now is to correctly
segment that audio into utterances that can be used by the adaptation program.
How would you suggest doing this? Is manually segmenting the only feasible
way?
We have a long audio aligner tool which can assign timestamps to the words in
a long recording. You can then use those timestamps for segmentation.
http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/long-audio-aligner/
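
Once the aligner has produced word timestamps, cutting the long recording into utterances can be scripted. A minimal sketch, assuming you have reduced the aligner output to a list of (start, end) times in seconds; the segment values and file names below are only illustrative:

import javax.sound.sampled.*;
import java.io.*;

// Cuts a long recording into utterance WAV files, given start/end times
// (in seconds).  The segments array is an illustration; use whatever
// timestamps the aligner actually produces.
public class CutUtterances {
    public static void main(String[] args) throws Exception {
        double[][] segments = { {0.0, 7.3}, {7.3, 12.9}, {12.9, 20.1} };  // {start, end} in seconds

        for (int i = 0; i < segments.length; i++) {
            AudioInputStream in = AudioSystem.getAudioInputStream(new File("lecture.wav"));
            AudioFormat fmt = in.getFormat();

            long startFrame = (long) (segments[i][0] * fmt.getFrameRate());
            long numFrames  = (long) ((segments[i][1] - segments[i][0]) * fmt.getFrameRate());

            // Skip to the start of the segment, then expose exactly numFrames frames.
            long toSkip = startFrame * fmt.getFrameSize();
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) break;   // reached end of stream
                toSkip -= skipped;
            }
            AudioInputStream segment = new AudioInputStream(in, fmt, numFrames);
            AudioSystem.write(segment, AudioFileFormat.Type.WAVE,
                    new File(String.format("utt%04d.wav", i + 1)));
            in.close();
        }
    }
}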
Okay, we have been able to use the long audio aligner to segment audio, but we
still haven't been getting consistent improvement from adaptation. Here are
the utterances, from 10 minutes of audio:
https://www.dropbox.com/sh/hmqsij2195jdpck/4XxmSbk0LU
There was also a 6-minute test clip for testing accuracy (from the same
chemistry lecture). Do you think there is something wrong with these clips, or
should the adaptation be working?
Sorry, since you didn't provide information about your experiments (neither
the data that was used for testing nor the exact decoder configuration), it's
hard to give you a detailed answer.
The test clip we are using is called "sschemTEST.wav" and is in the "testing"
folder, with the accompanying transcription:
https://www.dropbox.com/sh/hmqsij2195jdpck/4XxmSbk0LU
This is decoded with pocketsphinx_continuous, using the adapted hubwsj 8k
model, the giga64k LM, and the cmu07a dictionary.
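
Concretely, the decode step is a single command along these lines (the model, LM, and dictionary names are placeholders for the files we use, and -infile assumes a pocketsphinx build that supports decoding from a file):

    pocketsphinx_continuous \
        -hmm hub4wsj_sc_8k_adapted \
        -lm lm_giga_64k_nvp_3gram.lm.DMP \
        -dict cmudict.0.7a \
        -infile sschemTEST.wav > hypothesis.txt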
And what is the WER before and after adaptation? How do you calculate it?
I used the word_align.pl script to calculate WER. It started off at about 45%
and was around 44% by the end of adaptation, but along the way, it went as low
as 39% and as high as 55%.
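
(For reference, word_align.pl aligns each hypothesis against the reference transcription and reports

    WER = (substitutions + deletions + insertions) / words in the reference

so, for example, 35 substitutions, 5 deletions, and 5 insertions against a 100-word reference would give 45% WER.)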
Sorry, it's not clear what "along the way" means. Are you doing something
during adaptation that is not described in the tutorial? There is nothing
about "along the way" in the tutorial. Let me repeat:
Since you didn't provide information about your experiments (neither the data
used for testing and adaptation, including all required files, nor the exact
decoder configuration and the command line), it's hard to give you a detailed
answer.