
Adding Arbitrary Features [frequency features]

  • Siphonblast

    Siphonblast - 2012-03-12

    There are other posts asking how to get additional feature sets (or remove
    the MFCC process entirely), but I was a bit confused reading them, so I
    hoped that a more specialized response would clear things up.

    I have run the model using MFCC coefficients by following the tutorial:
    http://www.speech.cs.cmu.edu/sphinx/tutorial.html

    (By the way, is there a comprehensive wiki or resource somewhere? Is it all
    here: http://cmusphinx.sourceforge.net/wiki/ ? I wasn't entirely sure, and I
    thought there would be more documentation.)

    Anyway, for the main topic -

    How do I add my own feature set (a logarithmic semitone scale, so that I
    can use feature sets based on music theory in addition to the MFCC sets)?

     
  • Siphonblast

    Siphonblast - 2012-03-12

    To add one more thing:

    (for unimportant reasons) I am not interested in words, sentences, or prosodic
    features or anything similar to that at the moment. All that I care about are
    frequency relationships.

    So how exactly would I go about effectively streamlining it just for this
    niche application?

     
  • Nickolay V. Shmyrev

    Is it all here: http://cmusphinx.sourceforge.net/wiki/ ? I wasn't entirely
    sure, and I thought there would be more documentation.

    The SourceForge wiki is the most recent and up-to-date source of CMUSphinx
    documentation; other sources are often outdated.

    How do I add my own feature set (a logarithmic semitone scale, so that I
    can use feature sets based on music theory in addition to the MFCC sets)?

    You need to write feature files in the CMUSphinx MFC format and train the
    model on them. See the wiki page:

    http://cmusphinx.sourceforge.net/wiki/mfcformat

    See also the training tutorial

    http://cmusphinx.sourceforge.net/wiki/tutorialam
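
    To make the format concrete, here is a rough Python sketch of writing such
    a file (the 25-coefficient layout and the file name are only placeholders;
    the mfcformat page above is the authoritative description):

        import struct
        import numpy as np

        def write_mfc(path, frames):
            """Write a (n_frames x n_coeffs) array as a Sphinx .mfc feature
            file: a 4-byte int holding the total number of float values,
            followed by the values as 4-byte floats, frame after frame."""
            data = np.asarray(frames, dtype=np.float32)
            with open(path, "wb") as f:
                # Header and data in native byte order; the Sphinx tools
                # usually detect the order from the file size and swap if
                # needed. If yours complain, write big-endian instead.
                f.write(struct.pack("=i", data.size))
                f.write(data.tobytes())

        # Hypothetical layout: 13 MFCCs plus 12 semitone-scale values = 25
        # coefficients per frame (numbers and file name are placeholders).
        feats = np.random.rand(100, 25).astype(np.float32)
        write_mfc("utt001.mfc", feats)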

    (for unimportant reasons) I am not interested in words, sentences, or
    prosodic features or anything similar to that at the moment. All that I care
    about are frequency relationships.

    Sorry, it's not quite clear what you are talking about here. Please note
    that CMUSphinx is designed quite tightly around speech features. It will
    not work well with arbitrary feature types.

     
  • Siphonblast

    Siphonblast - 2012-03-12

    Thank you for the fast response.

    Let me attempt to clarify better this time; my initial statement was vague.

    I understand that you need to write feature files, train models with them,
    and use those models for an efficient speech recognition system. I also
    understand that this is the purpose of the Sphinx system, which is why it
    is designed to work this way.

    However, what I am doing at the moment is not aimed at getting the highest
    recognition rate or running a legitimate test. My goal right now is to
    evaluate a new proposal of my own for a science project, one that does not
    use only conventional features. What I want to add to the feature set, in
    addition to the MFCCs (etc.), are so-called 'harmony' features. I want to
    use statistical measures like standard deviation and interval relationships
    to attempt to improve the recognition rate, with the hypothesis that
    frequency relationships, and the related concepts of consonance and
    dissonance built on those relationships, carry emotional content.

    I understand full well that Sphinx was certainly not designed for this
    purpose. But what I was asking for in the original post was a way to extend
    it (or simply make a quick modification) so that I can augment the default
    feature set with a feature set of my own using a logarithmic, 12-semitone
    scale.

    Then, afterwards, I'd proceed to do a lot of tweaking, formalizing, and so on.

    But right now, that formalism is not necessary because, once again, I am
    not using Sphinx at this moment to gauge accuracy. I am just using it to
    see if my invented feature set improves the (emotion) recognition rate at
    all, before I even decide whether or not to do anything else.

    If there is anything vague about that, or if I have misunderstood the
    extensibility of the Sphinx system for what I am proposing, or anything
    else, then please let me know. (The only contemporary study of the method
    is here:
    https://docs.google.com/file/d/0B8BtbepxYJ4vNjQyNWYxMWMtOGZkMy00ZjYzLTk1MWEtZDI3MjI0MTFlZTg2/edit )

    In another thread asking for emotion rather than simple speech recognition,
    the person was referred to the OpenEARS project, which I am checking out, but
    it isn't quite as well documented as Sphinx, so I hope that what I am
    proposing is not too difficult to at least get a rough implementation going.

     
  • Nickolay V. Shmyrev

    However, what I am doing at the moment is not aimed at getting the highest
    recognition rate or running a legitimate test

    and interval relationships to attempt to improve the recognition rate,

    Sorry, I see a contradiction here. Do you want to improve the recognition
    rate or not?

    But what I was asking for in the original post was a way to extend it (or
    simply make a quick modification) so that I can augment the default feature
    set with a feature set of my own using a logarithmic, 12-semitone scale.

    There is no quick modification; you need to write your own code to do that.
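
    To give a rough idea of the kind of code that means (purely an
    illustration; the frame size, hop, starting frequency, and number of bands
    below are arbitrary choices, not anything Sphinx provides): compute
    per-frame semitone-band log energies, append them to your MFCC frames, and
    write the result out in the MFC format mentioned above.

        import numpy as np

        def semitone_band_energies(signal, sr, n_fft=512, hop=160,
                                   f_min=110.0, n_bands=36):
            # Log energy in semitone-spaced bands; band k covers
            # [f_min * 2**(k/12), f_min * 2**((k+1)/12)) Hz.
            # A hop of 160 samples is 10 ms at 16 kHz, matching the usual
            # Sphinx frame rate.
            window = np.hanning(n_fft)
            edges = f_min * 2.0 ** (np.arange(n_bands + 1) / 12.0)
            freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
            feats = []
            for start in range(0, len(signal) - n_fft + 1, hop):
                frame = signal[start:start + n_fft] * window
                spec = np.abs(np.fft.rfft(frame)) ** 2
                bands = [spec[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in zip(edges[:-1], edges[1:])]
                feats.append(np.log(np.asarray(bands) + 1e-10))
            return np.asarray(feats, dtype=np.float32)  # (n_frames, n_bands)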

    In another thread asking for emotion rather than simple speech recognition,
    the person was referred to the OpenEARS project, which I am checking out, but
    it isn't quite as well documented as Sphinx, so I hope that what I am
    proposing is not too difficult to at least get a rough implementation going.

    Unfortunately, CMUSphinx is of little use for emotion recognition. It
    doesn't implement many reusable parts for building a Gaussian classifier
    and/or trainer.
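
    To illustrate what such a classifier amounts to (this sketch uses
    scikit-learn purely as an example; any GMM implementation would do, and
    the number of mixture components is an arbitrary choice): train one GMM
    per emotion on labelled feature frames, then pick the emotion whose model
    scores an utterance highest.

        from sklearn.mixture import GaussianMixture

        def train_gmms(features_by_emotion, n_components=8):
            # features_by_emotion: dict mapping an emotion label to an
            # (n_frames, n_dims) array of training feature frames.
            return {label: GaussianMixture(n_components,
                                           covariance_type="diag").fit(x)
                    for label, x in features_by_emotion.items()}

        def classify(gmms, utterance_frames):
            # Pick the emotion whose GMM assigns the highest total
            # log-likelihood to the utterance's feature frames.
            return max(gmms, key=lambda lbl:
                       gmms[lbl].score_samples(utterance_frames).sum())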

     
  • Siphonblast

    Siphonblast - 2012-03-12

    Yes, the ultimate goal is to improve the recognition rate. But that is a long-
    term goal.

    My short-term goal is simply to get a working implementation of my theory.
    And that is why I was inquiring about the extensibility of the system, and
    also hoping to get some information about it.

    Which you supplied perfectly, so thank you.

    Essentially, what I gleaned from this thread is that if I want to
    accomplish what I just outlined, I would need to rewrite the code
    pertaining to feature sets in sphinx-4?

    How feasible is it to do this versus attempting to find another program set
    entirely? As in, yes, the system is extensible, but can you direct me to some
    resources for setting out to rewrite this portion? I anticipate numerous
    errors in doing this, so hopefully it is well documented, unless I can simply
    find some sort of plugin.

    And lastly, are you saying that I would require a Gaussian classifier (to
    classify the emotions), which would require an entirely separate section
    of written code?

     
  • Nickolay V. Shmyrev

    My short-term goal is simply to get a working implementation of my theory.
    And that is why I was inquiring about the extensibility of the system, and
    also hoping to get some information about it.

    OK, if you need some help with that, please ask.

    And lastly, are you saying that I would require a Gaussian classifier (to
    classify the emotions), which would require an entirely separate section
    of written code?

    Yes, I wrote that.

     
  • Scott Silliman

    Scott Silliman - 2012-05-26

    Sorry about replying to an old thread, but I think I might have something
    that, while not ideal, could help.

    At work, I use openSMILE to extract features from the .raw files spit out by
    the continuous listener.

    We've run openSMILE in batch over all of the .raw files from our previous
    experiments, and from the generated features (plus annotations done on the
    corpus, as well as log files) we then used Weka (a Java machine learning
    tool) to develop models for detecting affective states such as uncertainty.
    The difference between what we're doing and what you're doing is that we're
    not trying to improve the recognition; we just want to know whether, when
    the user responded with "gravity", they sounded uncertain or not.
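
    In case it helps, a minimal sketch of that kind of batch run might look
    like this (the paths, sample rate, and choice of openSMILE config are only
    placeholders, not exactly what we use; adjust them for your setup):

        import glob
        import subprocess

        # Any feature config shipped with openSMILE; path is a placeholder.
        SMILE_CONF = "config/emobase.conf"

        for raw in glob.glob("logs/*.raw"):
            wav = raw[:-4] + ".wav"
            # The listener's .raw files are headerless PCM, so sox needs the
            # format spelled out (here: 16 kHz, 16-bit signed, mono).
            subprocess.run(["sox", "-r", "16000", "-e", "signed", "-b", "16",
                            "-c", "1", raw, wav], check=True)
            # Extract features into an ARFF file that Weka can load directly.
            subprocess.run(["SMILExtract", "-C", SMILE_CONF,
                            "-I", wav, "-O", raw[:-4] + ".arff"], check=True)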

    By the time we run our models in real time, it's already post-recognition
    (otherwise we wouldn't have a .raw file to analyze yet). I suppose that if
    you stored the n-best hypotheses or the lattice, you might be able to
    better select the best hypothesis once you've extracted features from the
    audio. Another thing you could do, if time isn't an issue, is to re-run
    recognition on the .raw file after analyzing the audio, rather than on
    microphone input, and use whatever information you've obtained from the
    audio analysis on a second pass to try to improve recognition.

    Hope this helps,

    -Scott

     
