Recognising one from 50 phoneme pairs

Anonymous
2011-11-06
2012-09-22
  • Anonymous

    Anonymous - 2011-11-06

** the task **
I have RSI, which is currently preventing me from coding (using the keyboard
is just too painful).

I want to construct a speech-based 'keyboard' that will allow me to code
again.

I want to use a base of, say, 50 phoneme pairs:

bah bar beh bih bee boor boo burr
sah sar seh sih see soor soo surr
etc.

maybe eight consonants and eight vowels; this would make a table of 64
phoneme pairs.

each element in the table maps onto some key (or key combination).

so e.g. 'sah' might map onto the letter 's', whereas 'soor' might map onto '!'
and 'surr' might map onto my code editor's 'find in project' shortcut.

    this way I could write code by speaking what sounds like a stream of gibberish
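A minimal sketch of the kind of mapping table described above, in Python. The
syllables, targets, and the `<find-in-project>` placeholder are all
illustrative assumptions, not a final design:

```python
# Hypothetical syllable -> keystroke table. Plain characters map directly;
# editor commands are represented here by placeholder tokens.
SYLLABLE_MAP = {
    "sah": "s",                   # plain character
    "soor": "!",                  # punctuation
    "surr": "<find-in-project>",  # editor-command placeholder
    "bah": "b",
    "lih": "(",
}

def syllables_to_keys(utterance):
    """Translate a space-separated stream of recognised syllables
    into the corresponding keystroke sequence."""
    out = []
    for syl in utterance.split():
        if syl not in SYLLABLE_MAP:
            raise KeyError("unmapped syllable: " + syl)
        out.append(SYLLABLE_MAP[syl])
    return "".join(out)
```

So a spoken stream like "sah bah soor" would come out as the keystrokes
`sb!`.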

    ** the solution **
    I'm hoping to build on the open-ears project, which wraps pocket sphinx for
    iOS.

    but I can't see how to start going about this.

I have been looking through the documentation on the Sphinx wiki, and I am
experiencing information overload. I'm finding it hard to get the first bite
of the bob-apple.

    could someone detail which steps are needed in order to do this?

I would like to construct the dictionary using the correct phonetic symbols,
and I would like each word to be identical to its phonetic representation: so
if '∫a' (that is 'sha', if it doesn't come out on your screen) is spoken, the
engine should return '∫a' as the word.

if I can get to the point of receiving a string '∫a li bu na toh...', I can
process this string and generate the appropriate keystrokes -- not a problem.
But how do I get to this point? Can someone help?
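For PocketSphinx, the usual ingredients are a pronunciation dictionary (.dic)
and a grammar or language model. A sketch of what the dictionary could look
like if each 'word' is spelled as its own syllable -- note this uses
CMU-style ARPAbet phones rather than IPA, which is an assumption: the stock
acoustic models (hub4, voxforge, etc.) are trained on ARPAbet phone sets, and
the word labels (SHAA, LIY, ...) are made-up names you would map back to your
own symbols in the application:

```
SHAA    SH AA
LIY     L IY
BUW     B UW
NAA     N AA
TOW     T OW
```

And a JSGF grammar that allows any sequence of these syllables:

```
#JSGF V1.0;
grammar syllables;
public <utt> = ( SHAA | LIY | BUW | NAA | TOW )+;
```

With something like this in place, the decoder's hypothesis string is already
the 'stream of gibberish' you want to post-process.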

PS I am putting this here as a reference: 'pocketsphinx can return a partial
hypothesis with ps_get_hyp before the utterance is ended' (thanks to nshm on
the cmusphinx IRC channel). This may be of use if I need to type as I go,
rather than waiting to finish an utterance and then waiting for it all to
appear on the screen together. However, for a first pass I'm not going to
consider this complication.
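If that type-as-you-go route were taken, the application side could diff
successive partial hypotheses and emit only the newly recognised tail. A
sketch in pure Python, independent of any particular recognizer API (the
function name and the list-of-syllables representation are assumptions):

```python
def new_syllables(previous_partial, current_partial):
    """Given two successive partial hypotheses (each a list of
    syllables), return only the newly recognised tail, so keystrokes
    can be emitted incrementally rather than all at once when the
    utterance ends. If the recogniser has revised earlier words,
    fall back to returning the whole current hypothesis (a real
    application would also undo the stale keystrokes)."""
    if current_partial[: len(previous_partial)] == previous_partial:
        return current_partial[len(previous_partial):]
    return current_partial
```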

     
  • Joseph S. Wisniewski

    Interesting idea.

Unfortunately, it won't work. With just one syllable type (CV,
consonant-vowel) and 64 of them that can occur in any order in any context,
you will not get acceptable accuracy. Voice "typing" with 26 letters barely
works, and that's using a mix of CV, VV, and VC syllables.

    By comparison, spoken English has an average of 9 (if memory serves) phonemes
    following any other (lower perplexity) and the average word is 2.4 syllables.

    You're also dealing with the problem of some of your 64 cv pairs having never
    occurred, at all, in the training data used to build any common models (hub4,
    voxforge, etc) so you'll get even lower performance.

A more tenable approach is to build a language model for a coding task. I
believe there are already a few of these floating around, including a
commercial configuration for Dragon NaturallySpeaking. Then you speak code,
not "gibberish". Unless you're just trying to one-up the "happy hackers" with
their all-black "no letters on the keys" keyboards.

     
  • Joseph S. Wisniewski

    This is getting back to why I got into this field. I used to do assistive
    technology applications. First, check out the
    VoiceCoder forum.

    And here's a few apps. ShortTalk is probably closest to what you're trying to
    do. And their codewords break the recognizer the same way I fear yours will...

    Harmonia at Berkeley.

    ShortTalk from Aarhus.

    VoiceCode from NRC Canada.

    Personally, I prefer a Kinesis Advantage keyboard. Or are you too far gone for that?

     
  • Anonymous

    Anonymous - 2011-11-09

    Hi Wiz,

    thanks for answering!

I have been looking through these links today. Unfortunately I am too far
gone for any sort of keyboard solution. For sure I should get the most
ergonomic keyboard, chair, desk, etc., but that would only be the tip of the
iceberg.

    I haven't seen anything I like the look of in terms of speech assisted coding.

I am aware that typing letters by voice fails. But this is, as far as I can
see, a strawman argument: the reason it fails is that we have 26 combinations
that are not suitably distinct.

so 'TED' -> 'tee' 'ee' 'dee'; already there is a lot of scope for trouble. T
and D belong to the same phonetic group -- both are plosives, only one is
voiced and the other unvoiced. Exactly the same shape in the mouth is
involved. As far as I can feel, the only difference is the voicing that
separates 'D' from 'T'.

the difference is surely going to be bordering on imperceptible to any
recognition algorithm... even over the phone people have to resort to 'Alpha
Bravo Charlie'; pretty much everything is done by context.

    and the 'eee' in the middle could easily get absorbed by the first letter,
    although I would have thought that a clear break would be enough to
    discriminate.

what I am proposing is to pick a small number, maybe a grid of 6 x 6: six
consonant sounds that are as distinct from one another as possible, and six
vowel sounds similarly.

    let's say
    ba ta ra sha la ka ha nga ma

well, that is nine, and they all involve a different physiological mechanism
to produce the sound.

    I would be quite surprised if the engine confined to these nine words would
    fail to get close to 100%.

now moving in the other direction: maybe there is less scope for creativity
with the vowels, since there are fewer parameters that can be modified (shape
of mouth, position of tongue, etc.), but each syllable generates a clear
repeating waveform, so maybe the bonus of being able to use this waveform for
detection outweighs that.

    let's say I arrive at five distinct vowels, again, I would be quite surprised
    if the engine couldn't discriminate with a good accuracy.

so with this logic, my intuition is that it is at least worth performing the
experiment and actually checking what accuracy I DO get, and how this
improves as I reduce the phoneme base, i.e. find which phonemes are getting
confused the most, and eliminate one of them. Rinse and repeat...

    I cannot find any indication that someone has performed this experiment...

     
  • Nickolay V. Shmyrev

    I cannot find any indication that someone has performed this experiment...

This is proposed basically every month. It's possible to make a spelling
system, but such systems are usually created for correction, not for full
dictation. A single-user dictation system works without such big troubles.
Proper spelling functionality in a dictation system also requires some work
on the decoder side. It's not that easy.

     
