So, I would like to test how my acoustic model (a CD one) performs on two
different speakers.
Here is what I did, but I am not sure it is the right approach:
1. I had N recordings for speaker 1 and M for speaker 2, with their transcription files.
2. I ran SphinxAlign on the two sets of recordings and obtained two folders of wdseg/phseg files, one for each speaker.
3. I used a Perl script that, for each speaker folder, did the following:
3.1. For each wdseg file, it divided the total acoustic score by the number of frames, obtaining an average acoustic score per frame.
3.2. It summed these per-file averages over all wdseg files in the folder and divided by the number of files.
4. I therefore obtained two scores, one for each speaker, which I compared.
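For illustration, the averaging in steps 3.1 and 3.2 could be sketched roughly as below. This is a minimal sketch, not my actual Perl script; the parser assumes a simplified wdseg-like layout with one `start_frame end_frame acoustic_score word` entry per line, so it would need adapting to the real Sphinx3 output format.

```python
import glob
import os

def per_frame_score(wdseg_path):
    """Average acoustic score per frame for one wdseg file.

    Assumes each data line holds: start_frame end_frame acoustic_score word
    (a simplification of the real Sphinx3 wdseg layout).
    """
    total_score = 0.0
    total_frames = 0
    with open(wdseg_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip header/footer lines
            try:
                start, end, score = int(parts[0]), int(parts[1]), float(parts[2])
            except ValueError:
                continue  # skip non-data lines
            total_score += score
            total_frames += end - start + 1
    return total_score / total_frames if total_frames else 0.0

def speaker_score(folder):
    """Mean of the per-file averages over all wdseg files in a folder (step 3.2)."""
    files = glob.glob(os.path.join(folder, "*.wdseg"))
    if not files:
        return 0.0
    return sum(per_frame_score(p) for p in files) / len(files)
```

Running `speaker_score` on each speaker's folder would then give the two numbers being compared.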
The question is: is this a good measure of how an acoustic model behaves for the
two speakers?
If not, what other choices do I have?
Thank you
Question is : is this a good mesure of how an acoustic model reacts for the
2 speakers?
Sorry, an acoustic model is not a thing that reacts to speakers like a dog
reacts to a fart.
Maybe you wanted to measure something else; in that case, the first thing you
need to do is figure out what you actually want to measure. If you give a more
extended description, people will be able to suggest the right terms.
:) So, having two speakers, I want to know whether the acoustic model is good
for both of them.
If I run Sphinx decode, I get the WER, OK? I want something similar, but with
the acoustic model only, without mixing in the language model.
So, any input on that? Also, "Sorry, an acoustic model is not a thing that
reacts to speakers like a dog reacts to a fart." is not a valid scientific
answer.
I would prefer an explanation of why averaging the acoustic score doesn't work,
and perhaps a better explanation of what the acoustic score actually is.
Thank you.
:) So, having two speakers, I want to know whether the acoustic model is
good for both of them
What is "good"? Models aren't good or bad as such; they aren't film heroes.
A model can be good for some application, for example phonetic segmentation
or recognition. You just need to measure the performance of the model on that
application, not the performance of the model in the abstract.
If I run Sphinx decode, I get the WER, OK? I want something similar, but
with the acoustic model only, without mixing in the language model. So, any
input on that?
If you want to abstract away from the language model, you can measure the
phonetic recognizer error rate. This is a standard approach to measuring
acoustic model quality, used for example in TIMIT experiments.
Also, "Sorry, an acoustic model is not a thing that reacts to speakers like
a dog reacts to a fart." is not a valid scientific answer.
Neither is the question. If a question is not properly stated, you can't
receive a proper answer.
I would prefer an explanation of why averaging the acoustic score doesn't
work, and perhaps a better explanation of what the acoustic score actually is.
When you do forced alignment, you are trying to fit the audio to the
transcription, which is not always a good idea. For example, if there is a
mismatch between the ideal transcription and the real pronunciation, you will
get misalignments and other problems. Automatic segmentation is still very
error-prone; you can check the phonetic labels, for example, to see that.
Because of that, the score of the model on a segment is meaningless. It
doesn't show how your model will behave when it is less restricted by the
grammar. An average of meaningless scores is even more meaningless than the
scores themselves.
OK, thanks a lot for explaining this to me. I understand that my question
wasn't well chosen, but now everything is clear.
Anyway, about this: "If you want to abstract away from the language model, you
can measure the phonetic recognizer error rate. This is a standard approach to
measuring acoustic model quality, used for example in TIMIT experiments."
Can you tell me how to measure this, or point me to some online reading about
it for Sphinx3? I realize I have already taken up much of your time, so I apologize.
Thank you
To measure the error rate, you can just compare the phonetic recognizer output
with the reference phonetic transcription, the same way word error rate is
calculated.
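As a sketch of that comparison: the phone error rate uses the same edit-distance alignment as WER, just over phone sequences instead of word sequences. This is a generic illustration, not tied to any particular Sphinx tool, and the phone labels in the example are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions and deletions each cost 1), as used for WER/PER."""
    # prev[j] holds the distance between the ref prefix seen so far
    # and hyp[:j]; rows are rolled to keep memory at O(len(hyp)).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def phone_error_rate(ref_phones, hyp_phones):
    """PER = edit distance / number of reference phones."""
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)
```

For example, a reference `HH AH L OW` against a hypothesis `HH AX L OW` has one substitution over four reference phones, i.e. a PER of 0.25.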
You can find documentation on setting up a phonetic recognizer here:
http://cmusphinx.sourceforge.net/wiki/phonemerecognition
Ok, thank you. I guess this thread is closed.