sphinx 0.7 improvements at vocally

2011-12-20
2012-09-22
  • Michael Betser

    Michael Betser - 2011-12-20

    Dear SPHINX developers,

    I have been working with sphinx for a year now, and I have made some
    modifications to it to suit my needs. I am eager to share them with you,
    with the blessing of my company. I have been working mainly on the s3decoder
    and the pocketsphinx decoder, mostly for phoneme transcription and
    alignment, and mainly under Mac OS X and Windows XP.

    I originally made the modifications against the following library versions:
    sphinx3-0.8
    pocketsphinx-0.6.1
    sphinxbase-0.6.1
    That was before you released version 0.7, and I have only just merged my
    changes with 0.7. I have tested the merged code, but not as extensively as
    the previous one, so there might be some problems left, and I apologize for
    that in advance.

    Now concerning the modifications I propose:

    The first modification I made concerns the length of the file processed (for
    the s3 alignment task), and the duration of live decoding using pocketsphinx
    or s3decodelive.
    - The first thing I did was to unify the frame counter declarations under a single type, frmcnt_t (defined in prim_types.h), throughout sphinxbase, s3, pocketsphinx and some parts of sphinxtrain; it is basically an int32 for now. I did the same for sample counters (smpcnt_t) and some others. The advantage is that you can check the type consistency of the program just by changing the definition: for example, after changing it from int32 to int64 there should be no errors or warnings. This modification touches a lot of files.
    - I have added a new fsg mode, fsg_delay, which does not store the complete decision history from the beginning of the file and forces the decision at a fixed time delay.
    - I solved a score saturation bug by decreasing all scores when they get too close to saturation (I think s3livedecode has the same mechanism).
    The result is that you can phone-align files of any length and do live phone
    recognition with (almost) no time limit.
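
The score rescaling in the last bullet can be sketched as follows. This is a minimal Python illustration, not the patch itself: the function name `renormalize`, the threshold and the sample scores are assumptions, and it only assumes that path scores are integer log-probabilities that grow more negative over time.

```python
INT32_MIN = -(2 ** 31)
# Hypothetical trigger point: rescale once the best score has used up
# half of the negative int32 range.
RENORM_THRESHOLD = INT32_MIN // 2


def renormalize(scores, threshold=RENORM_THRESHOLD):
    """Return (shifted_scores, offset).

    If the best (largest) score has fallen below `threshold`, shift every
    score so that the best one becomes 0. Otherwise return the scores
    unchanged with an offset of 0.
    """
    best = max(scores)
    if best >= threshold:
        return list(scores), 0
    return [s - best for s in scores], best


# Three active paths whose scores are getting close to the int32 floor.
active = [-1_500_000_000, -1_500_000_400, -1_500_001_000]
shifted, offset = renormalize(active)
print(shifted, offset)
```

Because every active path is shifted by the same amount, the differences between scores are preserved, so the ranking of hypotheses does not change. The returned offset can be accumulated separately if the absolute score is still needed.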

    One last thing: you mention in the doc that pocketsphinx has no phone
    recognition mode like the one in s3, which is not completely true. I got
    results similar to the s3 allphone mode by
    - giving a fake dictionary which simply maps each phone to itself,
    - giving a simple grammar which loops over all phones,
    - and deactivating some search options which seriously slowed down the recognition, namely: -fsgusealtpron, -fsgusefiller, -fwdtree, -fwdflat and -bestpath.
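
The first two steps of this recipe can be sketched with a small generator script. The post does not say which grammar format was used, so writing the loop as JSGF is an assumption here, and the phone list and function names are purely illustrative.

```python
def phone_dictionary(phones):
    """Build a 'fake' dictionary: each phone is a word pronounced as itself."""
    return "".join(f"{p} {p}\n" for p in phones)


def phone_loop_jsgf(phones, name="phoneloop"):
    """Build a JSGF grammar that accepts any sequence of one or more phones."""
    alternatives = " | ".join(phones)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <phones> = ( {alternatives} )+ ;\n"
    )


# Hypothetical phone subset; a real set would come from the acoustic model.
PHONES = ["AA", "B", "K", "S"]

print(phone_dictionary(PHONES))
print(phone_loop_jsgf(PHONES))
```

The two strings would be written to files and passed to the decoder as its dictionary and grammar, with the search options listed above switched off.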

    The second modification concerns cmd_ln: I have added some functions to
    manipulate and merge cmd_ln objects. The purpose is to use several "tools" in
    the same program. For example, you may want to use feature extraction and
    alignment (in sphinx3), or MAP adaptation and decoding. The idea is that each
    of these "tools" now has an interface and that its options are managed using
    cmd_ln_t objects, which need to be merged to obtain the global options for
    all the tools. I also made object interfaces for sphinx3 feature extraction,
    sphinx3 alignment, and the MLLR adaptation in sphinxtrain. I found the
    mechanism quite convenient for my own usage, and if this feature interests
    you I can also give you the interfaces I made for those programs. A
    consequence of this is that option names need to be consistent across all
    the tools (for example, -hmmdir should be the same everywhere).
    One example has been added to sphinxbase:
    files added:
    ./sphinxbase/include/sphinxbase
    fex.h -> defines a feature exchange object
    lab_file.h -> a set of basic functions to write htk headers
    sphinx_wave2feat.h -> defines an object wrapper around the feature extraction
    functions
    ./sphinxbase/src
    fe/fe_file.c -> where the wave2feat functions are implemented
    fe/fex.c
    util/lab_file.c
    It illustrates how I tried to wrap the high-level functionality of sphinx,
    currently defined in programs, into an object. The basic idea is to have at
    least:
    - one function to initialize a cmd_ln object with the module defaults
    - one function to create a module object from a cmd_ln object
    - one free function
    Between the first two, we can merge several cmd_ln objects from different
    modules using the cmd_ln_add function I have added. Hence the need for
    consistent option names across all the tools.
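
The merge step can be pictured with a language-neutral sketch. The Python below is only a conceptual analogue: it does not use the sphinxbase API or the cmd_ln_add function from the patch (whose signature is not given in this thread), and the option tables and the `merge_option_defs` helper are hypothetical.

```python
def merge_option_defs(*modules):
    """Merge per-module {option_name: default} tables into one table.

    An option shared by several modules is kept once. Conflicting defaults
    raise an error, which is why an option name must mean the same thing
    in every tool.
    """
    merged = {}
    for defs in modules:
        for name, default in defs.items():
            if name in merged and merged[name] != default:
                raise ValueError(f"conflicting defaults for {name}")
            merged[name] = default
    return merged


# Hypothetical option tables for two tools used in the same program.
FEATURE_EXTRACTION_DEFS = {"-samprate": 16000, "-nfilt": 40}
ALIGNMENT_DEFS = {"-hmm": None, "-dict": None, "-samprate": 16000}

config = merge_option_defs(FEATURE_EXTRACTION_DEFS, ALIGNMENT_DEFS)
print(sorted(config))
```

Each tool would then build its own object from the single merged table, so an option such as the sample rate is declared and parsed only once.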

    Other modifications

    fe_interface.c
    - added fe_copy (copy a feature extraction object)

    agc.c
    - solved a bug in agc_emax concerning initial condition
    - added function agc_reset to solve bug when processing multiple files.

    feat.c
    - added agc_reset to reinit when starting new phrase

    mdef.c
    - changed mdef number-of-model check to a warning, so we can remove unwanted filler models directly in the mdef file

    acmod_set.c
    - changed the storing mode of triphones names to a hash table instead of a tree

    itree.c, lexicon.c
    - added _free function

    model_def_io.c
    - solved memory leak problem caused by a static table returned out of its function
    - changed the free function so that it is better modularized and does not leak memory

    gauden.c, mllr.c, mod_inv.c
    - added free(->veclen) to solve memory leaks
    - changed some functions to use the re-entrant cmd_ln api

    feat.h
    - now feat_agc and feat_cmn are externals

    • changed all silence references from "sil" to "SIL" (through the S3 macro)
    • tried to unify option names, using the pocketsphinx options as the reference. Changed:
      "-hmmdir" to "-hmm", "-moddeffn", "-mdeffn" etc. to "-mdef", "-meanfn" to
      "-mean", "-varfn" to "-var",
      "-tmatfn" to "-tmat", "-sendumpffn" to "-sendump", "-featparamfn" to
      "-featparam", "-mixwfn" to "-mixw",
      "-lmctlfn" to "-lmctl", "-dictfn" to "-dict".
    • in sphinxtrain, changed most NO_ID to NO_ACMOD in order to avoid an unnecessary dependency on itree.h
    • changed the test script to use the new option names in sphinxtrain.
    • changed bw main.c to use the re-entrant cmd_ln API.

    There is a serious problem with read_seno_dtree.c: the structures used there
    seem to have changed a lot and the file is used nowhere -> to be removed?
    The same goes for s3_open.c.

    I have run the test functions over sphinxbase, sphinxtrain and pocketsphinx:
    - for sphinxbase and sphinxtrain, no problem. I have noticed that there is a FAIL with init_gau gau count generation, but it was already present in the release version.
    - for pocketsphinx, there are 2 failed tests: test_fwflat and test_senfh. The decoder finds "go forward ten years" instead of "go forward ten meters", whereas the release version finds "go forward and users", which doesn't seem better...

    Tell me if you are interested in those modifications, and which ones. I can
    send a zipped version of my modifications to sphinxbase, sphinxtrain and
    pocketsphinx if you want, so you can check the modifications for yourself.

    Michaël Betser.

     
  • Nickolay V. Shmyrev

    Hello Michaël

    I have been working with sphinx for a year now, and I have made some
    modifications to it to suit my needs. I am eager to share them with you,
    with the blessing of my company. I have been working mainly on the s3decoder
    and the pocketsphinx decoder, mostly for phoneme transcription and
    alignment, and mainly under Mac OS X and Windows XP.

    This is really great news, thanks a lot to you and your company. It will
    surely help all of us.

    Would you be interested in publishing more about the ways you are using
    CMUSphinx on your blog?

    The first modification I made concerns the length of the file processed (for
    the s3 alignment task), and the duration of live decoding using pocketsphinx
    or s3decodelive.

    This should already be done in pocketsphinx with frame_idx_t. You would need
    to give more arguments for why further changes of this type are needed. You
    are welcome to start a separate thread for this issue.

    One last thing: you mention in the doc that pocketsphinx has no phone
    recognition mode like the one in s3, which is not completely true. I got
    results similar to the s3 allphone mode by
    - giving a fake dictionary which simply maps each phone to itself,
    - giving a simple grammar which loops over all phones,
    - and deactivating some search options which seriously slowed down the recognition, namely: -fsgusealtpron, -fsgusefiller, -fwdtree, -fwdflat and -bestpath.

    This is similar but not really equivalent:
    - Speed and memory usage are significantly different because you are running
      more generic code. A proper phone loop is much simpler.
    - You can't use a phone language model (these are actually quite critical
      for accuracy).

    The second modification concerns cmd_ln: I have added some functions to
    manipulate and merge cmd_ln objects. The purpose is to use several "tools" in
    the same program.

    You want to merge argument definitions, not cmd_ln_t objects, don't you? If
    you mean something else, please elaborate.

    fe_interface.c
    - added fe_copy (copy a feature extraction object)

    That's a nice addition

    • solved a bug in agc_emax concerning initial condition
    • added function agc_reset to solve bug when processing multiple files.

    That would be great to have

    mdef.c
    - changed mdef number-of-model check to a warning, so we can remove unwanted filler models directly in the mdef file

    You need to remove unwanted fillers from noisedict, not from mdef. Is that
    correct?

    acmod_set.c
    - changed the storing mode of triphones names to a hash table instead of a tree

    That's good

    itree.c, lexicon.c
    - added _free function

    Lexicon should be fixed already. Itree should go away I think.

    model_def_io.c
    - solved memory leak problem caused by a static table returned out of its function
    - changed the free function so that it is better modularized and does not leak memory

    Would be nice to have this

    gauden.c, mllr.c, mod_inv.c
    - added free(->veclen) to solve memory leaks
    - changed some functions to use the re-entrant cmd_ln api

    This should be fixed already; if not, please submit it.

    feat.h
    - now feat_agc and feat_cmn are externals

    Good

    • tried to unify option names, using the pocketsphinx options as the
      reference. Changed:
      "-hmmdir" to "-hmm", "-moddeffn", "-mdeffn" etc. to "-mdef", "-meanfn" to
      "-mean", "-varfn" to "-var",
      "-tmatfn" to "-tmat", "-sendumpffn" to "-sendump", "-featparamfn" to
      "-featparam", "-mixwfn" to "-mixw",
      "-lmctlfn" to "-lmctl", "-dictfn" to "-dict".
    • in sphinxtrain, changed most NO_ID to NO_ACMOD in order to avoid an unnecessary dependency on itree.h
    • changed the test script to use the new option names in sphinxtrain.
    • changed bw main.c to use the re-entrant cmd_ln API.

    Good changes

    There is a serious problem with read_seno_dtree.c: the structures used there
    seem to have changed a lot and the file is used nowhere -> to be removed?
    The same goes for s3_open.c.

    Exactly

    sphinxtrain, no problem. I have noticed that there is a FAIL with init_gau
    gau count generation, but it was already present in the release version.

    That's expected, yes. We need some time to fix this.

    • for pocketsphinx, there are 2 failed tests: test_fwflat and test_senfh.
      The decoder finds "go forward ten years" instead of "go forward ten meters",
      whereas the release version finds "go forward and users", which doesn't
      seem better...

    Works here, maybe it's caused by your modifications?

    Tell me if you are interested in those modifications, and which ones.

    Sure we are!

    Thanks a lot in advance for your contribution!

     
  • Michael Betser

    Michael Betser - 2011-12-21

    Hello,

    Thanks for your reply! I'll talk to my boss, but I am sure that we can say
    something about what we do here with sphinx. We have a cool demonstration
    where we animate a 3D head in real time with the results of phone recognition
    from a microphone. We could maybe post a video of it.

    Here are some clarifications about my modifications:

    Concerning the frame_idx_t type, I couldn't find it in the 0.7 release. Is it
    in a newer version?
    For example, in pocketsphinx you have a lot of functions like this:
    POCKETSPHINX_EXPORT
    ps_nbest_t *ps_nbest(ps_decoder_t *ps, int sf, int ef,
                         char const *ctx1, char const *ctx2);
    where sf and ef should be frame counters, to ensure some consistency. The
    problem is that at other points in the code other types were used, like
    int16 or uint16. I tried to unify all these declarations using a single
    type. My motivation was to be sure that there were no more int16 counters
    that made the system crash after some time. I did the same for the sample
    counters, which are now all int64. Together with the two other changes, this
    allows continuous phoneme analysis from a microphone input, for example. I
    also made the same changes in sphinx3 to be able to align a file of any
    length.

    Concerning the allphone loop, I agree it would be cleaner to have a specific
    mode, but I insist that the accuracy was very similar to the sphinx3 allphone
    mode. One other thing I forgot to say is that, with the options mentioned in
    my first message deactivated, pocketsphinx was much faster than sphinx3 (by
    roughly a factor of 5). So I thought that this method could be added to the
    documentation until we have a proper allphone mode.

    For the cmd_ln_t modifications, the original goal was to wrap a sphinx tool,
    let's say the MLLR adaptation, so that I can use it in other programs. The
    first step is to create an object that stores all the data specific to the
    tool. The external options of the tool are then stored in a cmd_ln_t object
    and modified through it. But what happens if I need to combine MLLR with a
    second tool, and my new tool has its own specific options? The simplest way I
    found was to merge the argument definitions through cmd_ln_t, to avoid
    duplicate definitions of the same options (which would happen if we used
    macros). It is also more flexible than defining macros of macros.

    Do you have an email address or an FTP server where I can send the files?

    Michaël

     
  • Nickolay V. Shmyrev

    Hello

    Concerning the frame_idx_t type, I couldn't find it in the 0.7 release, is
    it in a newer version?

    You can check out the latest development version from the Subversion trunk:

    https://sourceforge.net/scm/?type=svn&group_id=1904

    For example, in pocketsphinx you have a lot of functions like this:
    POCKETSPHINX_EXPORT
    ps_nbest_t *ps_nbest(ps_decoder_t *ps, int sf, int ef,
                         char const *ctx1, char const *ctx2);
    where sf and ef should be frame counters, to ensure some consistency.

    Not necessary. The int type should just work here.

    avoid duplicate definitions of the same options (which would happen if we
    used macros). It is also more flexible than defining macros of macros.

    I'm sure there are other workarounds

    Do you have an email address or an FTP server where I can send the files?

    https://sourceforge.net/tracker/?group_id=1904&atid=351904

     
  • Nickolay V. Shmyrev

    We have a cool demonstration where we animate a 3D head in real time with the
    results of phone recognition from a microphone. We could maybe post a video of
    it.

    That would be very interesting to share. We are eager to see that :)

    Concerning the allphone loop, I agree it would be cleaner to have a specific
    mode, but I insist that the accuracy was very similar to the sphinx3 allphone
    mode. One other thing I forgot to say is that, with the options mentioned in
    my first message deactivated, pocketsphinx was much faster than sphinx3 (by
    roughly a factor of 5). So I thought that this method could be added to the
    documentation until we have a proper allphone mode.

    I think a proper allphone mode is very close. It is actually quite a complex
    thing to design and implement, but its performance advantages will be
    significant even compared to the current pocketsphinx.

     
  • Michael Betser

    Michael Betser - 2011-12-28

    Hello,

    I have downloaded the latest sphinx version and indeed I have found the
    frame_idx_t type, but surprisingly it is an int16, which is inconsistent
    with most pocketsphinx functions, which declare frame counters as int
    (itself ambiguous, as it is int32 on most systems but not necessarily all).
    int16 is surprising because it is quite limited: with a 0.01 s step, you can
    only process signals of about 5 min, which is quite short.
    I don't mind renaming my type to frame_idx_t, but I think it is useful to
    extend the use of this type, for at least three reasons. First, it makes the
    programmer aware that a frame counter can overflow and that a predefined
    type exists. Second, it makes it possible to check the consistency of the
    program with respect to this type (simply by changing the definition to
    another kind of int). Lastly, on systems with (very) low resources the type
    can easily be changed to reduce memory use and processing time.
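
The 5-minute figure follows directly from the range of a signed 16-bit counter. A quick check, assuming the 0.01 s frame step mentioned above (the helper name is illustrative):

```python
FRAME_STEP_S = 0.01  # 10 ms frame step, as stated in the post


def max_duration_seconds(bits, frame_step_s=FRAME_STEP_S):
    """Longest signal a signed frame counter of `bits` bits can index."""
    max_frames = 2 ** (bits - 1) - 1
    return max_frames * frame_step_s


for bits in (16, 32):
    seconds = max_duration_seconds(bits)
    print(f"int{bits}: {seconds:.2f} s = {seconds / 60:.1f} min")
```

A 16-bit counter tops out at 32767 frames, or about 5.5 minutes, while a 32-bit counter covers roughly 248 days of audio at the same frame step.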

    You did not say anything about the new fsg mode I have added, which makes it
    possible to force a decision at a fixed time delay. It is this new mode,
    together with the generalized use of int32 for all frame counters, that
    allowed me to animate my 3D head in real time. I think those two are the most
    useful additions I made to sphinx, but of course it is your decision whether
    to add them or not.
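
The idea of forcing a decision at a fixed delay can be illustrated with a generic fixed-lag Viterbi decoder. This is not the fsg_delay code from the patch, whose details are not given in this thread; it is a toy sketch over a small state space, and the function name and inputs are assumptions.

```python
from collections import deque


def fixed_lag_viterbi(log_emissions, log_trans, delay):
    """Return one state decision per frame, each taken `delay` frames late.

    log_emissions[t][s] is the log-likelihood of frame t in state s and
    log_trans[p][s] the log-probability of moving from state p to state s.
    Only the last `delay` backpointer columns are kept in memory.
    """
    assert delay >= 1
    n_states = len(log_trans)
    states = range(n_states)
    scores = list(log_emissions[0])
    backptrs = deque(maxlen=delay)  # newest column is backptrs[-1]
    decisions = []
    for t in range(1, len(log_emissions)):
        new_scores, column = [], []
        for s in states:
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            column.append(prev)
            new_scores.append(
                scores[prev] + log_trans[prev][s] + log_emissions[t][s]
            )
        scores = new_scores
        backptrs.append(column)
        if len(backptrs) == delay:
            # Trace back `delay` frames from the current best state and
            # commit the state found there (frame t - delay).
            state = max(states, key=lambda s: scores[s])
            for col in reversed(backptrs):
                state = col[state]
            decisions.append(state)
    # End of input: flush the frames that have not been decided yet.
    remaining = len(log_emissions) - len(decisions)
    tail = [max(states, key=lambda s: scores[s])]
    for col in reversed(backptrs):
        if len(tail) == remaining:
            break
        tail.append(col[tail[-1]])
    decisions.extend(reversed(tail))
    return decisions


# Two states, uniform transitions, frames that clearly favour one state.
emissions = [[0, -5], [0, -5], [-5, 0], [-5, 0], [0, -5]]
transitions = [[0, 0], [0, 0]]
print(fixed_lag_viterbi(emissions, transitions, delay=2))
```

Memory use is bounded by the delay rather than by the length of the input, which is what makes unbounded live decoding possible. The trade-off is that a committed decision can differ from the one a full traceback at the end of the file would have produced.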

    Otherwise, how would you prefer that I submit my modifications? Shall I make a
    branch in the svn repository (which would probably be easier if you accept my
    frame counter modifications)? Or shall I make one big patch of only the
    modified files and send them on the patch tracker (still tractable without the
    frame counter modifications)?

    Concerning the video I'll talk to my boss next week.

    Wish you and all the sphinx team a happy new year!

     
  • Michael Betser

    Michael Betser - 2012-01-17

    hello,

    As you did not answer my last message, I am just writing a reminder.

    Michaël.

     
  • Nickolay V. Shmyrev

    Sorry, I missed your question.

    Or shall I make one big patch of only the modified files and send them on
    the patch tracker (still tractable without the frame counter modifications)?

    Please submit the patch to the tracker.

     
  • Anonymous

    Anonymous - 2012-02-28

    Just wondering, what's the status on this patch?

     
