I would like to do a project that has to do with speech and would
welcome some input from you.
I would like to build a system that lets the user define voice
commands, possibly with a simple command syntax developed by the user
at runtime. For example, the user records the words "increase volume"
and assigns them to a command; the next time the phrase is spoken, it
is recognised and the command executed.
I thought of the following approach:
I don't want the system to understand voice as English words/sentences;
I want to stay one level below that and just look at it as a sequence
of phonemes. I think this removes one obstacle in recognition, although
I know that speech recognition software leverages the additional
knowledge of words and syntax. Working with sequences of phonemes has
the benefit that I don't have to ship any database, and the user can
create his own from scratch for any language. (I plan no more than
about 50 commands, used by a single user.)
Of course the "sequence of phonemes" will really be a sequence of
n-tuples of a timestamp and a probability value for each phoneme.
The main work will be writing or training an algorithm that compares
two such recordings and decides whether these n-dimensional figures
are similar enough.
For the recording workflow, I imagine the user making a recording and
adding it to a box of, for example, "increase volume" recordings, so
whenever the system fails to detect a pattern, the user can improve the
comparison quality by adding a new "pronunciation".
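The "box of recordings" idea could be sketched as a nearest-example matcher: each command keeps several reference recordings, and an utterance matches the command whose closest example falls under a threshold. A hypothetical sketch, parameterised over whatever sequence-distance function is used (class and method names are assumptions, not an existing API):

```python
class CommandMatcher:
    """Each command keeps a list of example recordings; an utterance
    matches the command whose closest example is within a threshold."""

    def __init__(self, distance, threshold):
        self.distance = distance    # function(recording, recording) -> float
        self.threshold = threshold
        self.examples = {}          # command name -> list of recordings

    def add_example(self, command, recording):
        # the user "drops a new pronunciation into the box"
        self.examples.setdefault(command, []).append(recording)

    def recognise(self, recording):
        best_cmd, best_dist = None, float("inf")
        for command, recs in self.examples.items():
            for ref in recs:
                d = self.distance(ref, recording)
                if d < best_dist:
                    best_cmd, best_dist = command, d
        if best_dist <= self.threshold:
            return best_cmd
        return None  # no match: prompt the user to add a new example
```

Returning None on a failed match is the hook for the workflow above: the system can then offer to store the utterance as a fresh example of whichever command the user intended.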
I would be very interested in what you think about the idea.
To be honest, I do not have much experience with voice systems, but I
know it is a very complex area and not very accurate, and I probably
have an overly simplified view of it.
I can't imagine I'm the first to think in this direction (away from
spelled words), so I would be interested in similar projects/papers.
I am looking for a system/project that delivers the phonemes (or
something at a similar level) to me.
Any input is appreciated.
Best regards,
Johannes
PS: I would also be glad to hear of more places where I can ask this
and gather input (besides books and university courses). Feel free to
forward.
Anonymous - 2008-12-01
Most people don't talk in phonemes, they think and talk in words.
I'm sure Sphinx has APIs for extracting just the phonemes; however, going from phonemes to words is not the simplest of tasks. From what I've seen so far, just trying to model the phonemes is a task that an ordinary user would be loath to perform (and in some cases, incapable of).
Unless your target audience is downloading off of 56k modems, downloading a database of sounds shouldn't be an issue.
Sphinx has a number of facilities for doing the parts that you want -- but it also does much, much more.
For your "increase volume" example, if Sphinx can give you the actual text "increase volume" 99% of the time, why would you want to waste your time trying to match "in" + "cr" + "eas" + <pause> + "vol" + "ume" against your own ideas when there is an equivalent (and more powerful) mechanism already available?
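To illustrate that point: if a word-level recogniser (such as Sphinx constrained to a small grammar covering only the ~50 command phrases) already hands back plain text, the remaining work is a table lookup. A minimal hypothetical sketch (the function names are illustrative, not any Sphinx API):

```python
# Assumes some recogniser has already turned audio into a text string;
# dispatching that string to an action is then a dictionary lookup,
# with no custom phoneme matching needed.

def make_dispatcher(commands):
    """commands: dict mapping spoken phrase -> callable action."""
    def dispatch(recognised_text):
        action = commands.get(recognised_text.strip().lower())
        if action is not None:
            action()
            return True
        return False  # unknown phrase; could ask the user to repeat
    return dispatch
```

Constraining the recogniser to a tiny fixed vocabulary (e.g. via a JSGF grammar, which Sphinx supports) is what makes high accuracy plausible here, since it only ever has to choose among the 50 known phrases.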
Things & links I found so far that could be related:
- CMU Sphinx: it provides "tokens" with scores. Maybe they can be
separated from the dictionary module and their content accessed.
- spectrogram labeling tool
http://www.dcs.shef.ac.uk/~martin/MAD/slt/slt.htm
- Some guy with the same idea in 2001, no answer.
http://www.experts-exchange.com/Programming/Languages/Java/Q_20060231.html
- CSLU Toolkit http://www.cslu.ogi.edu/toolkit/ - don't know if it
helps me in any way.
- http://project.uet.itgo.com/speech.htm "How Speech Recognition Works"
--
You can also mail to buchner.johannes@gmx.at