CMU Sphinx / Forums / Speech Recognition Theory: sphinx 0.7 improvements at vocally

Michael Betser - 2011-12-20

Dear SPHINX developers,

I have been working with sphinx for a year now and have made some
modifications to it to suit my needs. With the blessing of my company, I am
eager to share them with you. I have worked mainly on the s3decoder
and the pocketsphinx decoder, mostly for phoneme transcription and
alignment, under MACOSX and Windows wp systems.

I originally made the modifications against the following library versions:
sphinx3-0.8
pocketsphinx-0.6.1
sphinxbase-0.6.1
That was before you released the 0.7 version, and I have only just merged with 0.7. I have
tested the merge, but not as extensively as the previous version, so there might be some
problems left, and I apologize for that in advance.

Now concerning the modifications I propose:

The first modification I made concerns the length of the file processed (for
the s3 alignment task), and the duration of live decoding using pocketsphinx
or s3decodelive.
- The first thing I did was to unify the frame counter declarations under a single type, frmcnt_t (defined in prim_types.h, and basically an int32 now), throughout sphinxbase, s3, pocketsphinx and some parts of sphinxtrain. I did the same for sample counters (smpcnt_t) and some others. The advantage is that you can check the consistency of the program with respect to this type just by changing its definition: changing it from int32 to int64, for example, should produce no errors or warnings. This modification touches a lot of files.
- I added a new fsg mode, fsg_delay, which does not store the complete decision from the beginning of the file and instead forces the decision at a fixed time delay.
- I fixed a score saturation bug by decreasing all scores when they get too close to saturation (I think s3livedecode has the same mechanism).
The result is that you can phone-align files of any length and do
live phone recognition with (almost) no time limit.
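
As an illustration of the score-saturation fix (a minimal sketch, not the code from the patch), the idea can be written as follows. The function name `rescale_scores`, the threshold and the flat array of scores are assumptions made for this example. Log-domain path scores held in an int32 become more negative with every frame; subtracting the best score from all active scores keeps their differences intact, which is all that Viterbi comparisons depend on, while pulling the values back toward zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: rescale once the best score drops to this value. */
#define RESCALE_THRESHOLD (INT32_MIN / 2)

/*
 * If the best (largest) score is at or below the threshold, subtract it
 * from every score so that the best one becomes 0.
 * Returns the offset that was subtracted, or 0 if nothing was changed,
 * so the caller can keep track of the true total if it needs to.
 */
int32_t rescale_scores(int32_t *scores, size_t n)
{
    assert(scores != NULL || n == 0);
    if (n == 0)
        return 0;

    int32_t best = scores[0];
    for (size_t i = 1; i < n; i++)
        if (scores[i] > best)
            best = scores[i];

    if (best > RESCALE_THRESHOLD)
        return 0;                 /* still far from saturation */

    /* best is negative and every score is <= best, so no overflow here. */
    for (size_t i = 0; i < n; i++)
        scores[i] -= best;
    return best;
}
```

Calling this once per frame on the active scores is enough in this sketch; a real decoder would also have to apply the same offset to any other stored scores that are later compared with them.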

One last thing: you mention in the doc that there is no phone
recognition mode in pocketsphinx as in s3, which is not completely true. I got
results similar to the s3 allphone mode by
- giving a fake dictionary which simply maps each phone to itself,
- giving a simple grammar which loops over all the phones,
- and deactivating some search options which seriously slowed down the recognition, namely: -fsgusealtpron, -fsgusefiller, -fwdtree, -fwdflat and -bestpath.
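
To make the first two steps concrete, here is a small sketch that generates such a dictionary and a looping grammar as text. The function names, the three-phone set used in the usage note and the choice of JSGF as the grammar format are assumptions for illustration; the original post does not say which grammar format was used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to buf at *pos. Returns 0 on success, -1 if it does not fit. */
static int append_str(char *buf, size_t size, size_t *pos, const char *s)
{
    size_t len = strlen(s);
    if (len >= size - *pos)          /* keep room for the terminating NUL */
        return -1;
    memcpy(buf + *pos, s, len + 1);  /* copy including the NUL */
    *pos += len;
    return 0;
}

/* Write one "PHONE PHONE" line per phone: each "word" is its own phone. */
int make_phone_dict(char *buf, size_t size,
                    const char *const *phones, size_t n)
{
    size_t pos = 0;
    assert(buf != NULL && phones != NULL);
    if (size == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (append_str(buf, size, &pos, phones[i]) ||
            append_str(buf, size, &pos, " ") ||
            append_str(buf, size, &pos, phones[i]) ||
            append_str(buf, size, &pos, "\n"))
            return -1;
    }
    return 0;
}

/* Write a JSGF grammar accepting one or more phones in any order. */
int make_phone_loop_jsgf(char *buf, size_t size,
                         const char *const *phones, size_t n)
{
    size_t pos = 0;
    assert(buf != NULL && phones != NULL);
    if (size == 0 || n == 0)
        return -1;
    buf[0] = '\0';
    if (append_str(buf, size, &pos,
                   "#JSGF V1.0;\ngrammar phoneloop;\npublic <phoneloop> = ( "))
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && append_str(buf, size, &pos, " | "))
            return -1;
        if (append_str(buf, size, &pos, phones[i]))
            return -1;
    }
    return append_str(buf, size, &pos, " )+ ;\n");
}
```

For a hypothetical phone set AA, B, K the dictionary text is "AA AA", "B B", "K K" on separate lines, and the grammar rule is `public <phoneloop> = ( AA | B | K )+ ;`. The real phone list would come from the acoustic model.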

The second modification concerns cmd_ln: I have added some functions to
manipulate and merge cmd_ln objects. The purpose is to use several "tools" in
the same program. For example, you may want to use feature extraction and
alignment (in sphinx3), or to do MAP adaptation and decoding. The idea
is that each of these "tools" now has an interface, and that its options
are managed using cmd_ln_t objects, which need to be merged to obtain the global
options for all the tools. I also made object interfaces for sphinx3 feature
extraction, sphinx3 alignment, and the MLLR adaptation in sphinxtrain. I
found the mechanism quite convenient for my own usage, and if this feature
interests you I can also give you the interfaces I made for those programs. A
consequence of this is that option names need to be consistent across
all the tools (for example -hmmdir should be the same everywhere).
One example has been added to sphinxbase.
Files added:
./sphinxbase/include/sphinxbase
fex.h -> defines a feature exchange object
lab_file.h -> a set of basic functions to write HTK headers
sphinx_wave2feat.h -> defines an object wrapper around the feature extraction functions
./sphinxbase/src
fe/fe_file.c -> where the wave2feat functions are implemented
fe/fex.c
util/lab_file.c
It illustrates how I tried to wrap the high-level functionality of sphinx that is defined in programs into an object. The basic idea is to have at least:
- one function to initialize a cmd_ln object with the module defaults
- one function to create a module object from a cmd_ln object
- one free function
Between the first two, we can merge several cmd_ln objects from different modules using the cmd_ln_add function I have added. Hence the need for consistent option names across all the tools.
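
The defaults-then-merge part of this pattern can be sketched with a toy option table. Everything below is hypothetical: `optset_t` and its functions are stand-ins written for this example and are not the sphinxbase cmd_ln API, and the two "modules" only borrow option names that appear in this post. The point it shows is that an option shared by two modules is defined once after the merge.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 16  /* toy capacity (assumption) */

/* Hypothetical stand-in for an option set; NOT the sphinxbase cmd_ln_t. */
typedef struct {
    const char *name[MAX_OPTS];
    const char *value[MAX_OPTS];
    size_t n;
} optset_t;

void optset_init(optset_t *set) { set->n = 0; }

/* Look up an option by name; returns NULL if it is not defined. */
const char *optset_get(const optset_t *set, const char *name)
{
    for (size_t i = 0; i < set->n; i++)
        if (strcmp(set->name[i], name) == 0)
            return set->value[i];
    return NULL;
}

/* Define an option unless it already exists.
 * Returns 1 if added, 0 if already defined (old value kept), -1 if full. */
int optset_add(optset_t *set, const char *name, const char *value)
{
    assert(set != NULL && name != NULL && value != NULL);
    if (optset_get(set, name) != NULL)
        return 0;
    if (set->n == MAX_OPTS)
        return -1;
    set->name[set->n] = name;
    set->value[set->n] = value;
    set->n++;
    return 1;
}

/* Merge src into dst; names present in both are defined only once.
 * Returns the number of options added, or -1 if dst is full. */
int optset_merge(optset_t *dst, const optset_t *src)
{
    int added = 0;
    for (size_t i = 0; i < src->n; i++) {
        int r = optset_add(dst, src->name[i], src->value[i]);
        if (r < 0)
            return -1;
        added += r;
    }
    return added;
}

/* "Module defaults" functions for two hypothetical modules. */
void feat_module_defaults(optset_t *set)
{
    optset_add(set, "-hmm", "");
    optset_add(set, "-featparam", "");
}

void align_module_defaults(optset_t *set)
{
    optset_add(set, "-hmm", "");
    optset_add(set, "-dict", "");
}
```

Merging the alignment defaults into the feature-extraction defaults adds only "-dict", because "-hmm" is already defined. This is also why the option names have to agree across tools: if one tool said "-hmmdir" and the other "-hmm", the merge would treat them as two different options.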

Other modifications

fe_interface.c
- added fe_copy (copy a feature extraction object)

agc.c
- solved a bug in agc_emax concerning initial condition
- added function agc_reset to solve bug when processing multiple files.

feat.c
- added agc_reset to reinit when starting new phrase

mdef.c
- changed mdef number-of-model check to a warning, so we can remove unwanted fillers models directly in the mdef file

acmod_set.c
- changed the storing mode of triphones names to a hash table instead of a tree

itree.c, lexicon.c
- added _free function

model_def_io.c
- solved memory leak problem caused by a static table returned out of its function
- changed free function so that it is better modularized and no memory leaks

gauden.c, mllr.c, mod_inv.c
- added free(->veclen) to solve memory leaks
- changed some functions to use the re-entrant cmd_ln api

feat.h
- now feat_agc and feat_cmn are externals

changed all silence ref "sil" to "SIL",(through the S3 macro)

tried to unify the option names, using the pocketsphinx options as the reference. changed:
"-hmmdir" to "-hmm", "-moddeffn" "-mdeffn" etc to "-mdef", "-meanfn" to
"-mean", "-varfn" to "-var",
"-tmatfn" to "-tmat", "-sendumpffn" to "-sendump", "-featparamfn" to
"-featparam", "-mixwfn" to "-mixw",
"-lmctlfn" to "-lmctl", "-dictfn" to "-dict".

in sphinxtrain changed most NO_ID to NO_ACMOD, in order to avoid an unnecessary dependency on itree.h

changed test script to use the new option names in sphinxtrain.

changed bw main.c to use the reentrant cmd_ln API.

There is a serious problem with read_seno_dtree.c: the structure used there
seems to have changed a lot and the file is used nowhere -> to be removed?
The same goes for s3_open.c.

I have run the test suites of sphinxbase, sphinxtrain and pocketsphinx:
- for sphinxbase and sphinxtrain, no problem. I noticed that there is a FAIL with init_gau gau count generation, but it was already present in the release version.
- for pocketsphinx, there are 2 failed tests: test_fwflat and test_senfh. The decoder finds "go forward ten years" instead of "go forward ten meters", whereas in the release version it finds "go forward and users", which doesn't seem better...

Tell me if you are interested in these modifications, and in which ones. I can
send a zipped version of my modifications to sphinxbase, sphinxtrain and
pocketsphinx if you want, so you can check them for yourself.

Michaël Betser.

Nickolay V. Shmyrev - 2011-12-21

Hello Michaël

> I have been working with sphinx for a year now, and I have made some
> modifications to it to suit my needs, and I was eager to share them with you,
> with the blessing of my company. I have been working mainly on the s3decoder
> and the pocketsphinx decoder, and mainly to do phoneme transcription and
> alignment. I was mainly working under MACOSX and windows wp systems.

This is really great news. Thanks a lot to you and your company. It should
certainly help all of us.

Would you be interested in publishing more about the ways you are using CMUSphinx
on your blog?

> The first modification I made concerns the length of the file processed (for
> the s3 alignment task), and the duration of live decoding using pocketsphinx
> or s3decodelive.

This should already be done in pocketsphinx with frame_idx_t. You would need to
give more arguments for why further changes of this type are needed. You are welcome
to start a separate thread for this issue.

> One last thing is that you mention in the doc that there is no phone
> recognition mode in pocketsphinx as in s3, which is not completely true. I got
> similar results compared to s3 allphone mode by
> - giving a fake dictionary which simply maps the phone to itself,
> - giving a simple grammar, which loops all phones to themselves,
> - and deactivating some search options which seriously slowed down the recognition, namely: -fsgusealtpron, -fsgusefiller, -fwdtree, -fwdflat and -bestpath.

This is similar but not really equivalent:
- Speed and memory usage are significantly different because you are running more generic code. A proper phone loop is much simpler.
- You can't use a phone language model (they are actually quite critical for accuracy).

> The second modification concerns cmd_ln: I have added some functions to
> manipulate and merge cmd_ln objects. the purpose is to use several "tools" in
> a same program.

You want to merge argument definitions, not cmd_ln_t objects, don't you? If you
mean something else, please elaborate.

> fe_interface.c
> - added fe_copy (copy a feature extraction object)

That's a nice addition

> - solved a bug in agc_emax concerning initial condition
> - added function agc_reset to solve bug when processing multiple files.

That would be great to have

> mdef.c
> - changed mdef number-of-model check to a warning, so we can remove unwanted fillers models directly in the mdef file

You need to remove unwanted fillers from noisedict, not from mdef. Is it
correct?

> acmod_set.c
> - changed the storing mode of triphones names to a hash table instead of a tree

That's good

> itree.c, lexicon.c
> - added _free function

Lexicon should be fixed already. Itree should go away I think.

> model_def_io.c
> - solved memory leak problem caused by a static table returned out of its function
> - changed free function so that it is better modularized and no memory leaks

Would be nice to have this

> gauden.c, mllr.c, mod_inv.c
> - added free(->veclen) to solve memory leaks
> - changed some functions to use the re-entrant cmd_ln api

This should be fixed already; if not, please submit.

> feat.h
> - now feat_agc and feat_cmn are externals

Good

> tried to uniformize option names using the pocketsphinx options as ref.
> changed:
> "-hmmdir" to "-hmm", "-moddeffn" "-mdeffn" etc to "-mdef", "-meanfn" to
> "-mean", "-varfn" to "-var",
> "-tmatfn" to "-tmat", "-sendumpffn" to "-sendump", "-featparamfn" to
> "-featparam", "-mixwfn" to "-mixw",
> "-lmctlfn" to "-lmctl", "-dictfn" to "-dict".
>
> in sphinxtrain changed most NO_ID to NO_ACMOD, in order to avoid an unnecessary dependency on itree.h
>
> changed test script to use the new option names in sphinxtrain.
>
> changed bw main.c to use the reentrant cmd_ln API.

Good changes

> There is a serious problem with read_seno_dtree.c, the structure used there
> seems to have changed a lot and the file is used nowhere -> to be removed?
> idem with s3_open.c

Exactly

> sphinxtrain, no problem. I have noticed, that there is a FAIL with init_gau
> gau count generation, but it was already present in the release version.

That's expected, yes. We need some time to fix this.

> for pocketsphinx, there are 2 failed tests: test_fwflat and test_senfh.
> The decoder finds "go forward ten years" instead of "go forward ten meters",
> whereas in the release version it finds "go forward and users", which doesn't
> seem better...

Works here, maybe it's caused by your modifications?

> Tell me if you are interested in those modifications, and which ones.

Sure we are!

Thanks a lot for your contribution in advance!

Michael Betser - 2011-12-21

Hello,

Thanks for your reply! I'll talk to my boss, but I am sure that we can say
something about what we do here with sphinx. We have a cool demonstration
where we animate a 3D head in real time with the results of phone recognition
from a microphone. We could maybe post a video of it.

Here are some clarifications about my modifications:

Concerning the frame_idx_t type, I couldn't find it in the 0.7 release. Is it
in a newer version?
For example, in pocketsphinx you have a lot of functions like this:
POCKETSPHINX_EXPORT
ps_nbest_t *ps_nbest(ps_decoder_t *ps, int sf, int ef, char const *ctx1, char
const *ctx2);
where sf and ef should be frame counters, to ensure some consistency. The
problem is that at some other points in the code other types were used, such as
int16 or uint16. I tried to unify all these declarations using a single
type. My motivation was to be sure that there were no int16 counters left that made the
system crash after some time. I did the same for the sample counters, which are
now all int64. Together with the two other changes, this allows continuous
phoneme analysis from a microphone input, for example. I also made the same
changes in sphinx3 to be able to align a file of any length.

Concerning the allphone loop, I agree it would be cleaner to have a specific
mode, but I insist that the accuracy was very similar to the sphinx3 allphones
mode. One other thing I forgot to say is that, with the
options mentioned in my first message deactivated, pocketsphinx was much faster than
sphinx3 (by roughly a factor of 5). So I thought that this method could be added
to the documentation until we have a proper allphones mode.

For the cmd_ln_t modifications, the original goal was to wrap a sphinx tool,
let's say the MLLR adaptation, so that I can use it in other programs. The first step
is to create an object to store all the data specific to the tool. Then the
external options of the tool are stored in a cmd_ln_t object and are modified
through this object. But what happens if I need to combine MLLR with a
second tool, and my new tool has its own specific options? The
simplest way I found was to merge the argument definitions through cmd_ln_t,
to avoid having duplicate definitions of the same options (which would happen if
we used macros). It is also more flexible than defining macros of macros.

Do you have an email address or an FTP server where I can send the files?

Michaël

Nickolay V. Shmyrev - 2011-12-25

Hello

> Concerning the frame_idx_t type, I couldn't find it in the 0.7 release, is
> it in a newer version?

You can check out the latest development version from the Subversion trunk:

https://sourceforge.net/scm/?type=svn&group_id=1904

> For example in pocket sphinx you have a lot of functions like this
> POCKETSPHINX_EXPORT
> ps_nbest_t *ps_nbest(ps_decoder_t *ps, int sf, int ef, char const *ctx1, char
> const *ctx2);
> where sf and ef should be frame counters, to ensure some coherence.

Not necessary. The int type should just work here.

> avoid to have double definitions of the same options (which would happen if
> we use macros). It is also more flexible than defining macro of macros.

I'm sure there are other workarounds

> Do you have a mail or a ftp where I can send the files?

https://sourceforge.net/tracker/?group_id=1904&atid=351904

Nickolay V. Shmyrev - 2011-12-27

> We have a cool demonstration where we animate real time a 3D head with the
> results of phone recognition from a microphone. We could maybe post a video of
> it.

That would be very interesting to share. We are eager to see that :)

> Concerning the allphone loop, I agree it would be cleaner to have a specific
> mode, but I insist that the accuracy was very similar to the sphinx3 allphones
> mode, and one other thing I forgot to say is that when deactivating the
> options mentioned in my first message pocketsphinx was much faster than
> sphinx3 (by roughly a factor 5). So I thought that this method could be added
> in the documentation until we have a proper allphones mode

I think a proper allphone mode is very close. It is actually quite a complex
thing to design and implement, but its performance advantages will be
significant even compared to the current pocketsphinx.

Michael Betser - 2011-12-28

Hello,

I have downloaded the latest sphinx version and indeed I found the
frame_idx_t type, but surprisingly it is an int16, which is
inconsistent with most pocketsphinx functions, which declare frame counters
as int (which is itself ambiguous, as it is 32 bits on most systems but not
necessarily all). int16 is surprising because it is quite limited: with a 0.01 s
step, you can only process signals of about 5 minutes, which is quite short.
I don't mind renaming my type to frame_idx_t, but I think it is
useful to extend the use of this type, for at least three reasons. First, it
makes the programmer aware that a frame counter can overflow and that a
predefined type exists. Second, it makes it possible to check the consistency of the
program with respect to this type (by simply changing the definition to another
kind of int). Finally, on systems with (very) low resources the type can be
changed easily to reduce memory use and processing time.
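
The 5-minute figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are made up for this example, and `frmcnt_t` is shown with the 32-bit width mentioned earlier in the thread. At 100 frames per second (a 0.01 s step), a signed 16-bit counter tops out at 32767 frames, i.e. 327.67 s or roughly 5.5 minutes, while a signed 32-bit counter lasts about 248 days.

```c
#include <assert.h>
#include <stdint.h>

/* Frame counter type as proposed in the thread. Changing this one line
 * changes the width everywhere the type is used. */
typedef int32_t frmcnt_t;

/* Longest signal, in seconds, that a counter with the given maximum value
 * can index at the given frame rate. */
double max_duration_seconds(int64_t max_count, int frames_per_second)
{
    assert(frames_per_second > 0);
    return (double)max_count / (double)frames_per_second;
}

/* Returns 1 if a frmcnt_t counter can still be incremented without
 * overflowing, 0 if it is already at its maximum. */
int frmcnt_can_increment(frmcnt_t count)
{
    return count < INT32_MAX;
}
```

For example, `max_duration_seconds(INT16_MAX, 100)` gives 327.67 seconds, and `max_duration_seconds(INT32_MAX, 100) / 86400.0` gives a little over 248 days.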

You did not say anything about the new fsg mode I have added, which makes it possible to
force a decision at a fixed time delay. It is this new mode, together with the
generalized use of int32 for all frame counters, that allowed me to
animate my 3D head in real time. I think those two are the most useful
additions I made to sphinx, but of course it is your decision whether to add them or
not.
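
One way to read "forcing a decision at a fixed time delay" is as a fixed-lag Viterbi traceback: keep backpointers only for the last few frames and, at every frame, trace back from the current best state to decide the state a fixed number of frames in the past. The sketch below illustrates that general technique on a toy model. It is an interpretation, not the fsg_delay code from the patch; the type `lagdec_t`, the function names, the three-state model and the two-frame delay are all assumptions, and score saturation is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSTATES 3   /* toy model size (assumption) */
#define DELAY   2   /* decision lag in frames (assumption) */

typedef struct {
    int32_t score[NSTATES];      /* best log score ending in each state */
    int     bp[DELAY][NSTATES];  /* backpointers, last DELAY frames only */
    long    nframes;             /* frames consumed so far */
} lagdec_t;

void lagdec_init(lagdec_t *d)
{
    assert(d != NULL);
    d->nframes = 0;
    for (int s = 0; s < NSTATES; s++)
        d->score[s] = 0;
}

/*
 * Consume one frame of log scores. trans[p][s] is the log score of moving
 * from state p to state s, obs[s] the log score of the frame in state s.
 * Returns the state decided for the frame DELAY frames in the past, or -1
 * while fewer than DELAY + 1 frames have been seen.
 */
int lagdec_step(lagdec_t *d, int32_t trans[NSTATES][NSTATES],
                const int32_t *obs)
{
    int32_t next[NSTATES];
    int slot = (int)(d->nframes % DELAY);  /* ring-buffer slot to overwrite */

    for (int s = 0; s < NSTATES; s++) {
        int bestp = 0;
        for (int p = 1; p < NSTATES; p++)
            if (d->score[p] + trans[p][s] > d->score[bestp] + trans[bestp][s])
                bestp = p;
        next[s] = d->score[bestp] + trans[bestp][s] + obs[s];
        d->bp[slot][s] = bestp;
    }
    for (int s = 0; s < NSTATES; s++)
        d->score[s] = next[s];
    d->nframes++;

    if (d->nframes <= DELAY)
        return -1;

    /* Start from the current best state and follow DELAY backpointers. */
    int state = 0;
    for (int s = 1; s < NSTATES; s++)
        if (d->score[s] > d->score[state])
            state = s;
    for (long k = 1; k <= DELAY; k++)
        state = d->bp[(d->nframes - k) % DELAY][state];
    return state;
}
```

Memory use is constant in the length of the signal, at the cost of committing to a decision that later frames can no longer revise.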

Otherwise, how would you prefer me to submit my modifications? Shall I make a
branch in the svn repository (which would probably be easier if you accept my
frame counter modifications)? Or shall I make one big patch of only the
modified files and send it to the patch tracker (still tractable without the
frame counter modifications)?

Concerning the video I'll talk to my boss next week.

Wish you and all the sphinx team a happy new year!

Michael Betser - 2012-01-17

hello,

As you did not answer my last message, I am just writing a reminder.

Michaël.

Nickolay V. Shmyrev - 2012-01-18

Sorry, I missed your question.

> Or shall I make one big patch of only the modified files and send them on
> the patch tracker (still tractable without the frame counter modifications)?

Please submit the patch to the tracker.

Anonymous - 2012-02-28

Just wondering, what's the status on this patch?
