** the task **
I have RSI, which is currently preventing me from coding (using the keyboard is just too painful).
I want to construct a speech-based 'keyboard' that will allow me to code again.
I want to use a base of, say, 50 phoneme pairs:
bah bar beh bih bee boor boo burr
sah sar seh sih see soor soo surr
etc.
Maybe eight consonants and eight vowels; this would make a table of 64 phoneme pairs.
Each element in the table maps onto some key (or key combination), so e.g. 'sah' might map onto the letter 's', whereas 'soor' might map onto '!' and 'surr' might map onto my code editor's 'find in project' shortcut.
This way I could write code by speaking what sounds like a stream of gibberish.
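(The mapping side is the easy bit; just for illustration, here is a minimal Python sketch of the table, where the syllable spellings and the '<find-in-project>' token are made-up placeholders for whatever keystroke-injection layer would actually fire the keys.)

    # hypothetical syllable -> key / editor-action table (a tiny slice of the 8 x 8 grid)
    KEYMAP = {
        'sah':  's',                    # plain letter
        'soor': '!',                    # punctuation
        'surr': '<find-in-project>',    # stand-in for an editor shortcut / key chord
        'bah':  'b',
        'bee':  '}',
    }

    def syllables_to_keys(hypothesis):
        # turn a decoded string like 'sah bee soor' into a list of key events
        return [KEYMAP[s] for s in hypothesis.split() if s in KEYMAP]

    print(syllables_to_keys('sah bee soor'))   # -> ['s', '}', '!']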
** the solution **
I'm hoping to build on the OpenEars project, which wraps PocketSphinx for iOS, but I can't see how to start going about this.
I have been looking through the documentation on the Sphinx wiki, and I am experiencing information overload. I'm finding it hard to take the first bite of the apple.
Could someone detail which steps are needed in order to do this?
I would like to construct the dictionary using the correct phonetic symbols, and I would like each word to be identical to its phonetic representation, so if 'ʃa' (that is 'sha', if it doesn't come out on your screen) is spoken, the engine should return 'ʃa' as the word.
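In PocketSphinx terms I gather this boils down to two small text files: a phonetic dictionary and (as one option) a JSGF grammar. A rough sketch, assuming the stock US English acoustic model and its ARPAbet phone set -- the left-hand 'word' column can be any string I want echoed back (so it can be the syllable spelling itself), but the right-hand phones have to come from the acoustic model's phone set, so a literal 'ʃ' symbol isn't available there:

    syllables.dic
        sha   SH AA
        li    L IY
        bu    B UW
        na    N AA
        toh   T OW

    syllables.gram
        #JSGF V1.0;
        grammar syllables;
        public <utt> = ( sha | li | bu | na | toh )+;

(Whether OpenEars accepts a JSGF grammar directly or only a statistical language model is something I still need to check; the formats above are plain PocketSphinx, not anything OpenEars-specific.)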
If I can get to the point of receiving a string 'ʃa li bu na toh...', I can process this string and generate the appropriate keystrokes -- not a problem.
But how do I get to this point? Can someone help?
PS: I am putting this here as a reference: 'pocketsphinx can return partial hypothesis with ps_get_hyp before utterance is ended' (thanks to nshm on the cmusphinx IRC channel). This may be of use if I need to type as I go, rather than waiting to finish an utterance and then waiting for it all to appear on the screen together. However, for a first pass I'm not going to consider this complication.
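(For reference too, this is roughly what polling that partial hypothesis looks like through the desktop PocketSphinx Python bindings -- a hypothetical sketch, not OpenEars code; decoder.hyp() is the binding's wrapper around ps_get_hyp, and the model path and file names are placeholders.)

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm',  '/path/to/acoustic/model')   # placeholder path
    config.set_string('-dict', 'syllables.dic')             # dictionary sketched above
    config.set_string('-jsgf', 'syllables.gram')            # grammar sketched above
    decoder = Decoder(config)

    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:                  # 16 kHz, 16-bit mono raw audio
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
            hyp = decoder.hyp()                              # partial hypothesis mid-utterance
            if hyp is not None:
                print('partial:', hyp.hypstr)
    decoder.end_utt()

    final = decoder.hyp()
    print('final:', final.hypstr if final else '')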
Interesting idea.
Unfortunately, it won't work. With just one syllable type (CV, consonant-vowel) and 64 of them that can occur in any order in any context, you will not get acceptable accuracy. Voice "typing" with 26 letters barely works, and that's using a mix of CV, VV, and VC syllables.
By comparison, spoken English has an average of 9 (if memory serves) phonemes following any other (lower perplexity), and the average word is 2.4 syllables.
You're also dealing with the problem of some of your 64 CV pairs having never occurred, at all, in the training data used to build any common models (HUB4, Voxforge, etc.), so you'll get even lower performance.
A more tenable approach is to build a language model for a coding task. I believe there are already a few of these floating around, including a commercial configuration for Dragon NaturallySpeaking. Then you speak code, not "gibberish". Unless you're just trying to one-up the "happy hackers" with their all-black "no letters on the keys" keyboards.
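(If it helps: the usual CMU Sphinx route for that is a small plain-text corpus of the phrases you expect to say, one per line, which the online lmtool or the CMUCLMTK tools can turn into an ARPA language model plus dictionary. The phrases below are invented purely for illustration.)

    open paren
    close paren
    new line
    define function
    for loop
    find in project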
This is getting back to why I got into this field. I used to do assistive technology applications. First, check out the VoiceCoder forum.
And here are a few apps; ShortTalk is probably closest to what you're trying to do. And their codewords break the recognizer the same way I fear yours will...
Harmonia at Berkeley.
ShortTalk from Aarhus.
VoiceCode from the NRC Canada.
Personally, I prefer a Kinesis Advantage keyboard. Or are you too far gone for that?
Anonymous - 2011-11-09
Hi Wiz,
thanks for answering!
I have been looking through these links today. Unfortunately I am too far gone for any sort of keyboard solution. For sure I should get the most ergonomic keyboard, chair, desk, etc., but that will only be the tip of the iceberg.
I haven't seen anything I like the look of in terms of speech-assisted coding.
I am aware that typing letters by voice fails. But this is, as far as I can see, a straw-man argument: the reason it fails is that the 26 letter names are not suitably distinct from one another.
So 'TED' -> 'tee' 'ee' 'dee'; already there is a lot of scope for trouble. T and D belong to the same phonetic group: both are plosives, and exactly the same shape in the mouth is involved. As far as I can feel, the only difference is a contraction in the diaphragm that gives 'T' its explosion of air.
The difference is surely going to be bordering on imperceptible to any recognition algorithm... even over the phone people have to resort to 'Alpha Bravo Charlie'; pretty much everything is done by context.
And the 'ee' in the middle could easily get absorbed by the first letter, although I would have thought that a clear break would be enough to discriminate.
What I am proposing is to pick a small number, maybe a grid of 6 x 6: six consonant sounds that are as distinct from one another as possible, and six vowel sounds chosen similarly.
Let's say:
ba ta ra sha la ka ha nga ma
Well, that is nine, and they all involve a different physiological mechanism to produce the sound.
I would be quite surprised if an engine confined to these nine words failed to get close to 100% accuracy.
Now moving in the other direction: maybe there is less scope for creativity with the vowels, since there are fewer parameters that can be modified (shape of mouth, position of tongue, etc.), but each vowel generates a clear repeating waveform, so maybe the bonus of being able to use this waveform for detection outweighs that.
Let's say I arrive at five distinct vowels; again, I would be quite surprised if the engine couldn't discriminate between them with good accuracy.
So with this logic, my intuition is that it is at least worth performing the experiment and actually checking what accuracy I DO get, and how it improves as I reduce the phoneme base, i.e. find which phonemes are getting confused the most, eliminate one of them, and rinse and repeat...
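(And if I do run it, the 'rinse and repeat' loop is easy to mechanise. A hypothetical Python sketch -- the (spoken, recognised) pairs would come from decoding a labelled test recording with whatever setup I end up with:)

    from collections import Counter

    def most_confused(pairs):
        # pairs: (spoken, recognised) syllable pairs from a test run
        return Counter((ref, hyp) for ref, hyp in pairs if ref != hyp).most_common(5)

    def prune_worst(inventory, pairs):
        # drop the syllable involved in the most confusions, then rinse and repeat
        involvement = Counter()
        for ref, hyp in pairs:
            if ref != hyp:
                involvement[ref] += 1
                involvement[hyp] += 1
        worst, _ = involvement.most_common(1)[0]
        return [s for s in inventory if s != worst]

    pairs = [('ba', 'ba'), ('ta', 'ka'), ('ka', 'ta'), ('sha', 'sha'), ('ta', 'ta')]
    print(most_confused(pairs))                                  # the worst confusion pairs
    print(prune_worst(['ba', 'ta', 'ra', 'sha', 'ka'], pairs))   # drops 'ta' (or 'ka')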
I cannot find any indication that someone has performed this experiment...
'I cannot find any indication that someone has performed this experiment...'
This is proposed basically every month. It's possible to make a spelling system, but it's usually created for correction, not for full dictation. A single-user dictation system works without such big troubles. Proper spelling functionality in a dictation system requires some work on the decoder part too. It's not that easy.