There are other posts asking how to get additional feature sets (or remove the
MFCC process entirely), but I was a bit confused reading them, so I hoped a
more specific response would clear things up.
I have run the model using MFCC coefficients by following the tutorial:
http://www.speech.cs.cmu.edu/sphinx/tutorial.html
(By the way, is there a comprehensive wiki or other resource somewhere? Is it all
here: http://cmusphinx.sourceforge.net/wiki/ ? I wasn't entirely sure, and I expected there to be more
documentation.)
Anyway, on to the main topic:
how do I add my own feature set (a logarithmic semitone scale, so that I can
use features based on music theory in addition to the MFCC set)?
To add one more thing:
(For unimportant reasons) I am not interested in words, sentences, prosodic
features, or anything similar at the moment. All I care about are frequency
relationships.
So how exactly would I go about effectively streamlining it just for this
niche application?
The SourceForge wiki is the most recent and up-to-date source of CMUSphinx
documentation; other sources are often outdated.
You need to write feature files in CMUSphinx MFC format and train the model
using them. See the wiki page
http://cmusphinx.sourceforge.net/wiki/mfcformat
See also the training tutorial
http://cmusphinx.sourceforge.net/wiki/tutorialam
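For illustration, a minimal sketch of writing such a file might look like the
following. It assumes the usual Sphinx feature-file layout (a single 32-bit
integer holding the total number of float values, followed by the float32
feature vectors); check the mfcformat page above for the authoritative
description and the byte order your SphinxTrain setup expects.

```java
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: write frames of float features in the MFC layout
// (int32 count header, then all float32 values).
public class MfcWriter {
    public static void write(String path, float[][] frames) throws Exception {
        int nFloats = 0;
        for (float[] frame : frames) nFloats += frame.length;

        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * nFloats);
        buf.order(ByteOrder.LITTLE_ENDIAN); // adjust if your tools expect big-endian
        buf.putInt(nFloats);                // header: total number of float values
        for (float[] frame : frames)
            for (float v : frame)
                buf.putFloat(v);

        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write(buf.array());
        }
    }
}
```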
Sorry, it's not quite clear what you are talking about here. Please note that
CMUSphinx is quite tightly designed around speech features; it will not work
very well with arbitrary feature types.
Thank you for the fast response.
Let me try to clarify; my initial statement was vague.
I understand that you need to write feature files, train models with them, and
use those models for an efficient speech recognition system. I also understand
that this is the purpose of the Sphinx system, which is why it is designed to
work this way.
However, what I am doing at the moment is not trying to get the highest
recognition rate or run a rigorous test. My goal right now is to evaluate a new
proposal of my own for a science project, one that does not use only
conventional features. What I want to add to the feature set, in addition to
the MFCCs and so on, are so-called 'harmony' features. I want to use
statistical measures such as standard deviation and interval relationships to
try to improve the recognition rate, with the hypothesis that frequency
relationships, and the related concepts of consonance and dissonance built on
those relationships, carry emotional content.
I understand full well that Sphinx was certainly not designed for this purpose.
But what I was asking for in the original post was a way to extend it (or just
make a quick modification) so that I can augment the default feature set with a
feature set of my own based on a logarithmic, 12-semitone scale.
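Concretely, something like the sketch below is what I have in mind for the
extra features: collapse an FFT magnitude spectrum into semitone-spaced bands
with center frequencies f(n) = fRef * 2^(n/12). This is my own rough sketch,
not CMUSphinx code; fRef and the number of bands are placeholders I would still
have to tune.

```java
// Rough sketch (my own): energies in logarithmically spaced
// semitone bands computed from an FFT magnitude spectrum.
public class SemitoneBands {
    public static double[] bandEnergies(double[] magSpectrum, double sampleRate,
                                        double fRef, int numBands) {
        double[] energies = new double[numBands];
        int fftSize = (magSpectrum.length - 1) * 2; // spectrum has fftSize/2 + 1 bins
        for (int band = 0; band < numBands; band++) {
            // band edges a quarter tone below and above the semitone center
            double fLow  = fRef * Math.pow(2.0, (band - 0.5) / 12.0);
            double fHigh = fRef * Math.pow(2.0, (band + 0.5) / 12.0);
            int binLow  = Math.max((int) Math.floor(fLow * fftSize / sampleRate), 0);
            int binHigh = Math.min((int) Math.ceil(fHigh * fftSize / sampleRate),
                                   magSpectrum.length - 1);
            for (int bin = binLow; bin <= binHigh; bin++) {
                energies[band] += magSpectrum[bin] * magSpectrum[bin];
            }
        }
        return energies; // could be log-compressed and appended to the MFCC vector
    }
}
```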
Then, afterwards, I would proceed to do a lot of tweaking, formalization, and so
on. But right now that formalism is not necessary because, once again, I am not
using Sphinx at this point to gauge accuracy. I am just using it to see whether
my proposed feature set improves the (emotion) recognition rate at all, before
I even decide whether to do anything else.
If there is anything vague about that, or if I have misunderstood the
extensibility of the Sphinx system for what I am proposing, please let me know.
(The only contemporary study of the method is here:
https://docs.google.com/file/d/0B8BtbepxYJ4vNjQyNWYxMWMtOGZkMy00ZjYzLTk1MWEtZDI3MjI0MTFlZTg2/edit )
In another thread asking about emotion rather than plain speech recognition,
the poster was referred to the OpenEARS project, which I am checking out, but
it isn't as well documented as Sphinx, so I hope that what I am proposing is
not too difficult to get going as at least a rough implementation.
Sorry, I see a contradiction here: you say you are not trying to get the
highest recognition rate, yet you also want to improve it with these features.
Do you want to improve the recognition rate or not?
There is no quick modification for adding such a feature set; you need to write
your own code to do that.
Unfortunately, CMUSphinx is of little use for emotion recognition. It does not
provide many reusable parts for implementing a Gaussian classifier and/or
trainer.
Yes, the ultimate goal is to improve the recognition rate. But that is a long-
term goal.
My short-term goal is simply to get a working implementation of my theory. That
is why I was inquiring about the extensibility of the system and hoping to get
some information about it, which you have supplied perfectly, so thank you.
Essentially, what I gleaned from this thread is that if I want to accomplish
what I just outlined, I would need to rewrite the code pertaining to feature
sets in sphinx-4?
How feasible is it to do this, versus trying to find another toolkit entirely?
In other words, yes, the system is extensible, but can you direct me to some
resources for setting out to rewrite this portion? I anticipate numerous errors
in doing this, so hopefully it is well documented, unless I can simply find
some sort of plugin.
And lastly, are you saying that I would need a Gaussian classifier (to classify
the emotions), which would require an entirely separate section of code?
OK, if you need some help with that, please ask.
Yes, that is what I wrote: you would need to implement the Gaussian classifier
yourself, as separate code.
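For illustration only, a minimal diagonal-covariance Gaussian classifier of the
kind you would have to write might look like the sketch below. It is just a
sketch, not part of CMUSphinx, and the class and method names are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: one diagonal-covariance Gaussian per emotion label,
// maximum-likelihood classification of a feature vector.
public class DiagonalGaussianClassifier {
    private final Map<String, double[]> means = new HashMap<String, double[]>();
    private final Map<String, double[]> vars  = new HashMap<String, double[]>();

    // Estimate per-dimension mean and variance for each label.
    public void train(Map<String, List<double[]>> samplesByLabel) {
        for (Map.Entry<String, List<double[]>> entry : samplesByLabel.entrySet()) {
            List<double[]> samples = entry.getValue();
            int dim = samples.get(0).length;
            double[] mean = new double[dim];
            double[] var  = new double[dim];
            for (double[] x : samples)
                for (int d = 0; d < dim; d++) mean[d] += x[d] / samples.size();
            for (double[] x : samples)
                for (int d = 0; d < dim; d++) {
                    double diff = x[d] - mean[d];
                    var[d] += diff * diff / samples.size();
                }
            for (int d = 0; d < dim; d++) var[d] = Math.max(var[d], 1e-6); // variance floor
            means.put(entry.getKey(), mean);
            vars.put(entry.getKey(), var);
        }
    }

    // Return the label whose Gaussian gives the highest log-likelihood for x.
    public String classify(double[] x) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : means.keySet()) {
            double[] mean = means.get(label);
            double[] var  = vars.get(label);
            double logLik = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - mean[d];
                logLik -= 0.5 * (Math.log(2.0 * Math.PI * var[d]) + diff * diff / var[d]);
            }
            if (logLik > bestScore) { bestScore = logLik; best = label; }
        }
        return best;
    }
}
```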
Sorry about replying to an old thread, but I think I might have something that,
while not ideal, could help.
At work, I use openSMILE to extract features from the .raw files spit out by
the continuous listener.
We've run openSMILE in batch over all of the .raw files from our previous
experiments, and from the generated features (plus annotations done on the
corpus, as well as log files) we then used Weka (a Java machine learning tool)
to develop models for detecting affective states such as uncertainty. The
difference between what we're doing and what you're doing is that we're not
trying to improve the recognition; we just want to know whether, when the user
responded with "gravity", they sounded uncertain or not.
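In case it helps, the Weka side of that can be quite small. The sketch below
assumes openSMILE has produced an ARFF file (features.arff is a placeholder
name) with the class attribute, e.g. certain/uncertain, as the last column:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of an SVM on openSMILE features.
public class UncertaintyModel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("features.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);      // class label is the last attribute

        SMO classifier = new SMO();                        // any Weka classifier would do
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```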
By the time we run our models in real time, it's already post-recognition
(otherwise we wouldn't have a .raw file to analyze yet). I suppose that if you
stored the n-best hypotheses or the lattice, you might be able to better select
the best hypothesis once you've extracted features from the audio. Another
thing you could do, if time isn't an issue, is to re-run recognition on the
.raw file after analyzing the audio, rather than on microphone input, and use
whatever information you've obtained from the audio analysis on that second
pass to try to improve recognition.
Hope this helps,
-Scott