Voice project idea

2008-10-01
2012-09-22
  • Johannes Buchner

    Hello dear developers and researchers!

    I would like to do a speech-related project and would welcome some
    input from you.
    I would like to build a system that lets the user issue voice commands,
    eventually with a simple command syntax that the user defines at runtime.
    For example, the user records the words "increase volume" and assigns
    it to a command. Then it can be recognised the next time and the
    command executed.

    I thought of the following approach:
    I don't want the system to understand speech as English words or
    sentences. I want to stay one level below that and treat it as a
    sequence of phonemes. I think this removes one obstacle in recognition,
    although I know that speech recognition software leverages the
    additional knowledge of words and syntax. Working with sequences of
    phonemes has the benefit that I don't have to ship any database, and
    the user can create their own from scratch for any language. (I expect
    not much more than 50 commands, used by a single user.)
    Of course the "sequence of phonemes" will really be a sequence of
    n-tuples, each holding the time and a probability value for every
    phoneme.

    The main work I will have to do is to write or train an algorithm for
    comparing two such recordings, to decide whether the recordings of such
    n-dimensional figures are similar enough.
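
    One common way to compare two sequences of different lengths is
    dynamic time warping (DTW). The sketch below is only an illustration,
    not a settled design: it assumes each recording is already a list of
    frames, that each frame is a list of per-phoneme probabilities, and
    that timing is implied by the frame order. The names frame_distance
    and dtw_distance are made up for this example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two per-phoneme probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Length-normalised DTW cost between two recordings.

    Each recording is a list of frames (assumed format, see lead-in).
    A lower value means the recordings are more similar.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both recordings need at least one frame")
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_b frame j is held while seq_a advances
                cost[i][j - 1],      # seq_a frame i is held while seq_b advances
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m] / (n + m)


# Two toy recordings over a two-phoneme alphabet: the second one simply
# holds the first phoneme for one frame longer.
fast = [[1.0, 0.0], [0.0, 1.0]]
slow = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reversed_order = [[0.0, 1.0], [1.0, 0.0]]
print(dtw_distance(fast, slow))
print(dtw_distance(fast, reversed_order))
```

    Because the warping absorbs differences in speaking speed, the slower
    recording still matches the faster one, while the recording with the
    phonemes in the opposite order does not. Deciding what counts as
    "similar enough" would still need a threshold tuned on real recordings.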

    I imagine the recording workflow like this: the user makes a recording
    and adds it to a box of, for example, "increase volume" recordings.
    Each time the system fails to detect a pattern, the user can improve
    the comparison quality by adding a new "pronunciation".

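    The "box of recordings" workflow could look roughly like the sketch
    below. Everything in it is hypothetical: the class name CommandBox,
    the threshold value, and especially toy_distance, which is a crude
    stand-in for a real comparison algorithm and only works for
    recordings with at least one frame.

```python
class CommandBox:
    """Stores several example recordings per command (hypothetical design)."""

    def __init__(self, distance, threshold):
        self.distance = distance    # function(recording, recording) -> float
        self.threshold = threshold  # largest distance still accepted as a match
        self.templates = {}         # command name -> list of recordings

    def add(self, command, recording):
        """Add one more "pronunciation" to the box for this command."""
        self.templates.setdefault(command, []).append(recording)

    def recognise(self, recording):
        """Return the closest command, or None if nothing is close enough."""
        best_command, best_score = None, float("inf")
        for command, recordings in self.templates.items():
            for template in recordings:
                score = self.distance(recording, template)
                if score < best_score:
                    best_command, best_score = command, score
        if best_score <= self.threshold:
            return best_command
        return None


def toy_distance(a, b):
    """Stand-in only: mean absolute difference over the overlapping frames."""
    pairs = list(zip(a, b))
    total = sum(abs(x - y) for fa, fb in pairs for x, y in zip(fa, fb))
    return total / len(pairs)


box = CommandBox(toy_distance, threshold=0.3)
box.add("increase volume", [[1.0, 0.0], [0.0, 1.0]])
box.add("decrease volume", [[0.0, 1.0], [1.0, 0.0]])

attempt = [[0.5, 0.5], [0.5, 0.5]]
if box.recognise(attempt) is None:
    # Not detected: the user assigns the recording to a command by hand,
    # so the same pronunciation is found next time.
    box.add("mute", attempt)
print(box.recognise(attempt))
```

    Keeping every recording as its own template is the simplest choice
    for about 50 commands and a single user. Averaging templates or
    training a model per command would be alternatives.
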
    I would be very interested what you think about the idea.

    To be honest, I do not have a lot of experience with voice systems, but
    I know it is a very complex area and that recognition is not very
    accurate. I also know I probably have an overly simplified view of it.
    I can't imagine I'm the first to think in this direction
    (away from spelled words), so I would be interested in similar
    projects or papers.

    I am looking for a system/project that delivers the phonemes (or
    something of a similar level) to me.

    Any input is appreciated.

    Best regards,
    Johannes

    PS: I would also be glad to know more places where I can ask this and
    gather more input (besides books and university courses). Feel free to
    forward.


    Things and links I have found so far that could be related:
    - CMU Sphinx, it provides "tokens" with scores. Maybe they can be
    separated from the dictionary module and their content accessed.
    - spectrogram labeling tool
    http://www.dcs.shef.ac.uk/~martin/MAD/slt/slt.htm
    - Some guy with the same idea in 2001, no answer.
    http://www.experts-exchange.com/Programming/Languages/Java/Q_20060231.html
    - CSLU Toolkit http://www.cslu.ogi.edu/toolkit/ - don't know if it
    helps me in any way.
    - http://project.uet.itgo.com/speech.htm "How Speech Recognition Works"

    --
    You can also mail to buchner.johannes@gmx.at

     
    • Anonymous

      Anonymous - 2008-12-01

      Most people don't talk in phonemes, they think and talk in words.

      I'm sure Sphinx has APIs for extracting just the phonemes; however, going from phonemes to words is not the simplest of tasks. From what I've seen so far, just modelling the phonemes is a task that an ordinary user would be loath to perform (and, in some cases, incapable of performing).

      Unless your target audience is downloading off of 56k modems, downloading a database of sounds shouldn't be an issue.

      Sphinx has a number of facilities for doing the parts that you want -- but it also does much, much more.

      For your "increase volume" example, if Sphinx can give you the actual text "increase volume" 99% of the time, why would you want to waste your time trying to match "in" + "cr" + "eas" + <pause> + "vol" + "ume" against your own ideas when there is an equivalent (and more powerful) mechanism already available?

       
