
Help a beginner choose a direction to take.

cornelyus
2010-09-07
2012-09-22
(Page 1 of 3)
  • cornelyus

    cornelyus - 2010-09-07

    Good afternoon.
    I have recently developed an application for Windows 7 that uses speech
    recognition.
    I built it with the Windows SAPI engine, but lately I wasn't getting the
    accurate results I wanted, so I decided to try another engine, and
    Sphinx seems like a good fit. I just need some guidelines to choose
    which direction to take: deciding on the decoder, whether I need
    acoustic model training or not, etc.

    So, what I have is a command-and-control app with very well defined
    expressions. Other than these I use numbers, so I am guessing a grammar
    would be the way to go.

    I thought about using PocketSphinx because it is C, so I can wrap it in
    C# faster, but should I spend more time with Sphinx4? Would that be
    better in the long run? It seems better documented and more recently
    updated.

    Would creating or adapting an acoustic model with recordings of the
    people who will use it, and only the necessary expressions, be a valid
    solution?

    Dumb question maybe, but would the perfect solution for good speech
    recognition be an acoustic model for each user? Depending on the user
    running the app, his "profile" would be loaded. Or is it better to have
    one acoustic model with recordings of every person who will use the
    app? Or even to keep the default acoustic model (I think it's the WSJ
    one) and just adapt the language model / dictionary / grammar to the
    expressions I need?

    Thanks in advance for your time.

     
  • Nickolay V. Shmyrev

    Hello

    So, what I have is a command-and-control app with very well defined
    expressions. Other than these I use numbers, so I am guessing a grammar
    would be the way to go.

    Controlling the computer with speech actually has limited usability. In
    the long term you need to target dictation.

    I thought about using PocketSphinx because it is C, so I can wrap it in
    C# faster, but should I spend more time with Sphinx4? Would that be
    better in the long run? It seems better documented and more recently
    updated.

    I think you need to try PocketSphinx.

    Would creating or adapting an acoustic model with recordings of the
    people who will use it, and only the necessary expressions, be a valid
    solution?

    We think that adaptation is a good strategy to take.

    Dumb question maybe, but would the perfect solution for good speech
    recognition be an acoustic model for each user? Depending on the user
    running the app, his "profile" would be loaded. Or is it better to have
    one acoustic model with recordings of every person who will use the
    app? Or even to keep the default acoustic model (I think it's the WSJ
    one) and just adapt the language model / dictionary / grammar to the
    expressions I need?

    The optimal solution would be to store an adaptation profile for each
    user. That includes both an adapted language model and an adapted
    acoustic model. You can have an enrollment stage to create them.

     
  • cornelyus

    cornelyus - 2010-09-08

    Controlling the computer with speech actually has limited usability. In
    the long term you need to target dictation.

    I don't want to control the computer; what I meant was that my app has
    defined command inputs that the user learns to use. These are small,
    well defined expressions, enough for what we need. I don't think
    dictation will be easier on the users if they can say "whatever" they
    want. Or were you referring to something else?

    Thanks for your responses.

     
  • Nickolay V. Shmyrev

    These are small, well defined expressions, enough for what we need.

    Great, then an open language interface will be the feature for the next
    version of your application.

     
  • cornelyus

    cornelyus - 2010-09-09

    :)

    So I think I'll start with PocketSphinx and try to do an acoustic model
    adaptation.

    Just a couple more questions:
    - I will probably need number recognition up to 1 million. Should I
    record only the component numbers? Like 1, 2, 3... 10, 20, 30...
    100, 200, 300... 1000, 2000, 3000?

    - The grammars I use now are SAPI compliant, so they are GRXML files. I
    can access semantics in them too. Is there a feature like this for the
    JSGF grammar files PocketSphinx uses? I need this feature to divide the
    sentences into meaningful information.

    Example: <person> travels to <city>
    If it identifies "Ricardo travels to Lisboa", with semantics I can
    extract the name and the city... know what I mean?

    Thanks for your help
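    On the numbers question: a grammar composes large numbers out of a
    small set of component words, so neither the dictionary nor any
    enrollment recordings need every value up to 1 million. A rough sketch
    of that decomposition for English (an illustrative helper, not part of
    pocketsphinx):

```python
# Decompose an integer into the spoken words a grammar would chain
# together. Illustrates that roughly thirty component words cover
# 0..999999; this is an illustrative helper, not a pocketsphinx API.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_words(n):
    """Return the list of component words for 0 <= n <= 999999."""
    if n < 20:
        return [ONES[n]]
    if n < 100:
        return [TENS[n // 10]] + (number_words(n % 10) if n % 10 else [])
    if n < 1000:
        return [ONES[n // 100], "hundred"] + \
               (number_words(n % 100) if n % 100 else [])
    return number_words(n // 1000) + ["thousand"] + \
           (number_words(n % 1000) if n % 1000 else [])

print(number_words(1250))  # ['one', 'thousand', 'two', 'hundred', 'fifty']
```

    So with a vocabulary of about thirty words ("one" through "nineteen",
    the tens, "hundred", "thousand"), 1250 is just the chain "one thousand
    two hundred fifty", which a grammar rule can generate.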

     
  • cornelyus

    cornelyus - 2010-09-09

    I'm sorry, but what did you mean by

    "You can have an enrollment stage to create them."?

     
  • Nickolay V. Shmyrev

    Just a couple more questions:
    - I will probably need number recognition up to 1 million. Should I
    record only the component numbers? Like 1, 2, 3... 10, 20, 30...
    100, 200, 300... 1000, 2000, 3000?

    Sorry, I don't quite understand how "record" is applicable here.

    Example: <person> travels to <city>
    If it identifies "Ricardo travels to Lisboa", with semantics I can
    extract the name and the city... know what I mean?

    There is no sense extraction right now. You have to analyze the output
    from the recognizer yourself. But such a framework will probably appear
    soon. At least there is the Olympus project doing that:

    http://wiki.speech.cs.cmu.edu/olympus/index.php/Olympus
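    Since the decoder returns a flat hypothesis string, the semantics have
    to be pulled out in the application, as noted above. A minimal sketch
    that matches the hypothesis against the slots of the
    <person> travels to <city> rule (the names, word lists, and the regex
    approach are all illustrative, not a pocketsphinx API):

```python
import re

# Slot values mirror the alternatives the grammar would allow;
# illustrative only.
PERSONS = ["ricardo", "joao", "maria"]
CITIES = ["lisboa", "porto", "faro"]

PATTERN = re.compile(
    r"^(?P<person>%s) travels to (?P<city>%s)$"
    % ("|".join(PERSONS), "|".join(CITIES))
)

def extract(hypothesis):
    """Return {'person': ..., 'city': ...}, or None if the rule doesn't match."""
    m = PATTERN.match(hypothesis.lower().strip())
    return m.groupdict() if m else None

print(extract("Ricardo travels to Lisboa"))
# {'person': 'ricardo', 'city': 'lisboa'}
```

    One pattern per grammar rule is enough for a small command set; the
    match either fills the named slots or rejects the hypothesis.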

    I'm sorry, but what did you mean by "You can have an enrollment stage
    to create them."?

    I meant that you can give the user a small text to read and then use
    this small recording for acoustic model adaptation.

     
  • cornelyus

    cornelyus - 2010-09-10

    Just a couple more questions:
    - I will probably need number recognition up to 1 million. Should I
    record only the component numbers? Like 1, 2, 3... 10, 20, 30...
    100, 200, 300... 1000, 2000, 3000?

    Sorry, I don't quite understand how "record" is applicable here.

    I am talking about that "enrollment stage" for the acoustic model
    adaptation. I thought that for this, all the users I will have in the
    app could record their voice with the expressions my app needs. I was
    asking whether, when recording numbers, I just need those component
    expressions,
    e.g. record 1000, 200, 50, or just record 1250.

    Or maybe, as you said, just recording a small text from the user would
    be enough for acoustic model adaptation.

    In summary: for acoustic model adaptation, is it better to have the
    users record all the expressions I need, or just a portion of text?

     
  • Nickolay V. Shmyrev

    There is no need to read boring numbers. Give them an adaptation text
    that's easy and fun to read. Try to select a few variants and present
    them to a particular user randomly.

    Have you seen Dragon 10? They suggest the user read a paragraph from
    "Dogbert's Top Secret Management Handbook", for example.

     
  • cornelyus

    cornelyus - 2010-09-10

    There is no need to read boring numbers. Give them an adaptation text
    that's easy and fun to read. Try to select a few variants and present
    them to a particular user randomly.

    Have you seen Dragon 10? They suggest the user read a paragraph from
    "Dogbert's Top Secret Management Handbook", for example.

    Oh, OK... because I can train my users; they are specific ones, it's
    not a global app. I will look for that book.

    When I do acoustic adaptation, does this mean I will have a new
    acoustic model that will give better results for the person I collected
    the audio from?

    I am not quite clear whether I will have one acoustic model for each
    user, or keep building the acoustic model with adaptation from each
    user, i.e. only one acoustic model, but with adaptation for all the
    users.

    What about if a user I have never adapted for uses the app? The results
    will be less accurate, but a LOT less accurate?

    Again, thanks for your time and patience.

     
  • Nickolay V. Shmyrev

    Oh, OK... because I can train my users; they are specific ones, it's
    not a global app. I will look for that book.

    You still need to care about them.

    When I do acoustic adaptation, does this mean I will have a new
    acoustic model that will give better results for the person I collected
    the audio from?

    Yes.

    I am not quite clear whether I will have one acoustic model for each
    user, or keep building the acoustic model with adaptation from each
    user, i.e. only one acoustic model, but with adaptation for all the
    users.

    If you can identify users, it's better to keep separate models for
    them. You can also have a model adapted to all your users, but it will
    be less accurate than per-user models.

    What about if a user I have never adapted for uses the app? The results
    will be less accurate, but a LOT less accurate?

    I wouldn't use an adapted model for a new user. It's better to start
    with the default model.

     
  • cornelyus

    cornelyus - 2010-10-08

    Hey again...

    I started the tutorial for model adaptation
    (http://cmusphinx.sourceforge.net/wiki/tutorialadapt) and a question
    arose.
    The dic file (arctic20.dic) has all the words from the arctic20 text.

    If I give my users a text to read that is smaller than all the
    expressions I need (like you suggested previously), the dictionary
    should have ALL the expressions I need, right? Not just the ones that
    the users will record... is that it?

    Another thing, I don't know if you can help, but I already have a
    little C# wrapper, and I want pocketsphinx running in the background.
    This should involve threading so it doesn't "freeze" my user
    interface, right?

    Take care

     
  • cornelyus

    cornelyus - 2010-10-08

    Forgot to ask:

    do I have access in pocketsphinx to choosing the input/output device?
    Meaning, can I choose which microphone to use if I have more than one
    available? Or is it always the default one?

    Thank you

     
  • Nickolay V. Shmyrev

    If I give my users a text to read that is smaller than all the
    expressions I need (like you suggested previously), the dictionary
    should have ALL the expressions I need, right? Not just the ones that
    the users will record... is that it?

    If you want to recognize a word, it must be in the dictionary.

    Another thing, I don't know if you can help, but I already have a
    little C# wrapper, and I want pocketsphinx running in the background.
    This should involve threading so it doesn't "freeze" my user
    interface, right?

    Yes.

    Do I have access in pocketsphinx to choosing the input/output device?
    Meaning, can I choose which microphone to use if I have more than one
    available? Or is it always the default one?

    On Linux there is the -adcdev option that lets you choose the device.
    On Windows, only the default one is used.

     
  • cornelyus

    cornelyus - 2010-10-11

    Good morning!

    Not for now, but not being able to choose which microphone to use in
    my app could bring some trouble in the future.

    Let me rephrase my first question, just to be clear and not ask again.

    I have identified all the expressions I need in my recognition
    application. I have created a dictionary file that covers all of them.
    Now, you said I could use a random text for the "enrollment stage".
    This text has phrases with words that aren't covered by the dictionary
    file created earlier, meaning this text has expressions that won't need
    to be recognized in my application.

    So, for the adaptation stage, does the dictionary file have to have all
    the expressions I need to be identified plus the ones from the random
    training text? Or is the dictionary file only used / important in the
    recognition phase, so it should only have the words that will be
    recognized?

    Thank you

     
  • Nickolay V. Shmyrev

    So, for the adaptation stage, does the dictionary file have to have all
    the expressions I need to be identified plus the ones from the random
    training text?

    Yes.

     
  • cornelyus

    cornelyus - 2010-10-11

    Right...

    What if I present random sentences to the user at the enrollment stage?
    Will I have to create a dictionary for those sentences specifically to
    adapt the acoustic model?

    By the way, does the transcription file
    http://www.speech.cs.cmu.edu/cmusphinx/moindocs/arctic20.transcription
    have to have upper case letters?

     
  • cornelyus

    cornelyus - 2010-10-11

    And for adaptation you suggest a Linux environment, right?

     
  • Nickolay V. Shmyrev

    What if I present random sentences to the user at the enrollment
    stage? Will I have to create a dictionary for those sentences
    specifically to adapt the acoustic model?

    Yes.

    Sorry, I don't quite understand your problem. I didn't tell you to
    present random sentences; I told you to use interesting sentences, not
    random ones. You can choose 2-3 paragraphs from some nice book. There
    is no issue with adding those paragraphs to the dictionary.

    In case you want random ones, you can first of all check that all the
    words are present in cmudict. If not, move on to the next sample.
    cmudict is quite representative.
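    The check described above can be scripted per sentence: keep a
    candidate enrollment sentence only if every word has a pronunciation.
    A sketch using a tiny inline stand-in for cmudict (the sample entries
    are illustrative; a real run would load the full cmudict file, which
    also carries stress markers and comment lines):

```python
# Filter candidate enrollment sentences: keep only those fully covered
# by the dictionary. The tiny inline sample stands in for cmudict.
CMUDICT_SAMPLE = """\
HELLO HH AH L OW
WORLD W ER L D
TRAVELS T R AE V AH L Z
TO T UW
"""

def load_dict(text):
    """Map WORD -> pronunciation from cmudict-style 'WORD PHONES...' lines."""
    return {line.split()[0]: line.split(None, 1)[1]
            for line in text.splitlines() if line.strip()}

def fully_covered(sentence, pronunciations):
    """True if every word of the sentence has a dictionary entry."""
    return all(w.upper() in pronunciations for w in sentence.split())

prons = load_dict(CMUDICT_SAMPLE)
candidates = ["hello world", "hello xyzzy"]
usable = [s for s in candidates if fully_covered(s, prons)]
print(usable)  # ['hello world']
```

    Sentences that fail the check are simply skipped, as suggested above.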

    And for adaptation you suggest a Linux environment, right?

    No, adaptation works on Linux, Windows and Mac OS. There is no
    environment restriction.

     
  • cornelyus

    cornelyus - 2010-10-12

    Yes.

    Sorry, I don't quite understand your problem. I didn't tell you to
    present random sentences; I told you to use interesting sentences, not
    random ones. You can choose 2-3 paragraphs from some nice book. There
    is no issue with adding those paragraphs to the dictionary.

    In case you want random ones, you can first of all check that all the
    words are present in cmudict. If not, move on to the next sample.
    cmudict is quite representative.

    Sorry if I used "random". I will select 2-3 paragraphs of a nice book
    and add them to the dictionary.

    Maybe what I'm confused about is the dictionary file. I thought that
    the dictionary reflects ONLY the words I want to be recognized. If I
    create a dictionary file for the model adaptation with the words I
    want recognized, plus the ones from the "nice book", is that not a
    problem when I run my application doing recognition?

    No, adaptation works on Linux, Windows and Mac OS. There is no
    environment restriction.

    I was asking about doing the adaptation tutorial. Because I compiled
    sphinxbase, pocketsphinx and SphinxTrain on Windows, and I can't find
    pocketsphinx_mdef_convert anywhere.

    Thank you

     
  • cornelyus

    cornelyus - 2010-10-12

    I did a search on the forums but found nothing about this.

    On the adaptation tutorial, when I do this

    sphinx_fe `cat wsj1/feat.params` -samprate 16000 -c
    arctic20.listoffiles -di . -do . -ei raw -eo mfc -raw yes

    I get an error.

    I checked the sphinx_fe parameters, and in my feat.params file (in the
    hmm models of pocketsphinx 0.6.1) there are parameters not found in
    sphinx_fe, like -svspec. Should I not use this hmm model?

     
  • Nickolay V. Shmyrev

    Hello

    I thought that the dictionary reflects ONLY the words I want to be
    recognized.

    No, the dictionary can be bigger.

    Is that not a problem when I run my application doing recognition?

    It will just work.

    Because I compiled sphinxbase, pocketsphinx and SphinxTrain on
    Windows, and I can't find pocketsphinx_mdef_convert anywhere.

    Ah, sorry about that. We need to create a project to compile that
    binary. If you can submit one, that would be helpful!

    On the adaptation tutorial, when I do
    sphinx_fe `cat wsj1/feat.params` -samprate 16000 -c
    arctic20.listoffiles -di . -do . -ei raw -eo mfc -raw yes
    I get an error. In my feat.params file (in the hmm models of
    pocketsphinx 0.6.1) there are parameters not found in sphinx_fe, like
    -svspec. Should I not use this hmm model?

    Sorry, we will update the tutorial to match pocketsphinx 0.6.1 very
    soon. You only need to pass -svspec as an argument to the bw command
    in the next stage of the adaptation; you should filter it from the
    sphinx_fe options. You can use hub4_wsj; moreover, it's recommended to
    use this model. If you could update the tutorial or suggest a way to
    make it clearer, that would be very appreciated!
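    The filtering step can be scripted: read feat.params, withhold the
    options sphinx_fe doesn't accept (like -svspec), and keep them aside
    for bw. A sketch assuming the usual one-flag-and-value-per-line layout
    of feat.params (the sample content below is illustrative, not a
    verbatim copy of the model's file):

```python
# Split feat.params options into those for sphinx_fe and those (like
# -svspec) that belong to the later bw stage, as described above.
# Sample content is illustrative.
FEAT_PARAMS = """\
-lowerf 130
-upperf 6800
-nfilt 25
-transform dct
-svspec 0-12/13-25/26-38
"""

BW_ONLY = {"-svspec"}  # options to withhold from sphinx_fe

def split_options(text):
    """Return (sphinx_fe args, bw-only args) from feat.params content."""
    fe_args, bw_args = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        flag, value = line.split(None, 1)
        (bw_args if flag in BW_ONLY else fe_args).extend([flag, value])
    return fe_args, bw_args

fe_args, bw_args = split_options(FEAT_PARAMS)
print(bw_args)  # ['-svspec', '0-12/13-25/26-38']
```

    fe_args can then be appended to the sphinx_fe command line, and
    bw_args to the bw command line in the next adaptation step.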

     
  • cornelyus

    cornelyus - 2010-10-27

    Hello...

    I've been meaning to reply to you, but I was working on another
    project.

    Ah, sorry about that. We need to create a project to compile that
    binary. If you can submit one, that would be helpful!

    Before I saw this, I switched my work to Linux; if I get the time in
    the future I will try this on Windows.

    Sorry, we will update the tutorial to match pocketsphinx 0.6.1 very
    soon. You only need to pass -svspec as an argument to the bw command
    in the next stage of the adaptation; you should filter it from the
    sphinx_fe options. You can use hub4_wsj; moreover, it's recommended to
    use this model. If you could update the tutorial or suggest a way to
    make it clearer, that would be very appreciated!

    I saw someone (you?) updated the tutorial. Thanks.

    Now I have an adapted acoustic model after following the tutorial.

    What differences should I notice when running, for example,
    pocketsphinx_continuous?

    By the way, what's the best way to create grammars suitable for
    pocketsphinx?

    Thank you

     
  • Nickolay V. Shmyrev

    What differences should I notice when running, for example,
    pocketsphinx_continuous?

    Improved accuracy?

    By the way, what's the best way to create grammars suitable for
    pocketsphinx?

    If you want human-readable grammars, you can use the JSGF grammar
    format. For machine-generated grammars, use the fsg_* API in
    sphinxbase.
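    A small JSGF grammar for the travels-to example discussed earlier in
    the thread might look like this (the grammar name and the word lists
    are illustrative placeholders):

```jsgf
#JSGF V1.0;

grammar travel;

public <command> = <person> travels to <city>;

<person> = ricardo | joao | maria;
<city> = lisboa | porto | faro;
```

    The decoder can load a grammar like this via the -jsgf option; every
    word used in the rules still needs an entry in the dictionary.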

     
  • cornelyus

    cornelyus - 2010-10-28

    Improved accuracy?

    Yes, of course, but is there a way of measuring that? I saw something
    about a WER variable, but I think that's just for Sphinx4, right?
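    WER (word error rate) is not specific to Sphinx4; it is just a measure
    you can compute from any decoder's output against a reference
    transcript: (substitutions + deletions + insertions) divided by the
    number of reference words, via word-level edit distance. A minimal
    sketch (not a pocketsphinx tool):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("one thousand two hundred fifty",
          "one thousand two hundred sixty"))  # 0.2
```

    Running the same recordings through both the default and the adapted
    model and comparing WER against your transcripts gives a concrete
    before/after number.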

    If you want human-readable grammars, you can use the JSGF grammar
    format. For machine-generated grammars, use the fsg_* API in
    sphinxbase.

    After asking you the question, I saw another thread saying we could
    use JSGF grammars too. I wasn't aware of that; I thought FSG was the
    only type accepted.

     
