I have some doubts about training.
1. The Sphinx-3 FAQ (http://www.speech.cs.cmu.edu/sphinxman/FAQ.html) mentions some rule-of-thumb figures for setting the number of senones. My training set contains 200 sentences (~1 hour of data) from each of 15 speakers. So, should the amount of training data used for setting the number of SENONES be counted as 1 hour or 15 hours?
2. If I create a model for a command-and-control application, is there any need for composite triphones? How are these composite triphones trained if they do not appear in the training transcript?
Should the amount of training data used for setting the number of senones be counted as 1 hour or 15 hours?
The total amount of your training data is 15 hours, so you should choose the number of senones for 15 hours. 4000 would be a good guess.
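The arithmetic behind that answer can be made explicit; a minimal sketch (the 4000 figure is the guess from this reply, not a computed value):

```python
# Each of 15 speakers contributes ~200 sentences (~1 hour of audio),
# so the training set size is counted across all speakers.
speakers = 15
hours_per_speaker = 1.0

total_hours = speakers * hours_per_speaker
print(total_hours)  # 15.0 -> size the senone count for 15 hours of data

# 4000 senones is the rule-of-thumb guess given above for this much data.
n_senones = 4000
```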
If I create a model for a command-and-control application, is there any need for composite triphones? How are these composite triphones trained if they do not appear in the training transcript?
It's not clear which composite triphones you are asking about. My suggestion: if you don't know how to train such "composite triphones", don't train them.
I am confused about senones. I think a senone is a sub-phonetic unit.
If that is the case, why is the number of senones different for 1 hour and for 15 hours (same training data, so the same phonetic sentences)?
2. From the dictionary-to-triphones .c file:
\brief Building triphones for a dictionary.
This is one of the more complicated parts of a cross-word
triphone model decoder. The first and last phones of each word
get their left and right contexts, respectively, from other
words. For single-phone words, both its contexts are from other
words, simultaneously. As these words are not known beforehand,
life gets complicated. In this implementation, when we do not
wish to distinguish between distinct contexts, we use a COMPOSITE
triphone (a bit like BBN's fast-match implementation), by
clubbing together all possible contexts
I am confused about senones. I think a senone is a sub-phonetic unit.
If that is the case, why is the number of senones different for 1 hour and for 15 hours (same training data, so the same phonetic sentences)?
No, they aren't. A senone, like a triphone, is just a collection of probabilistic models for matching a specific phone in a specific context. They are different from phones or diphones, which correspond to actual audio chunks. The amount of context in your 15 hours of recordings is enough to train 4000 senones. Even if the phonetic content is the same across speakers, the amount of contexts in 1 hour is enough. The situation would of course be different if you had 1000 recordings, each 1 minute long, of the same small sentence being read; then the number of contexts to train would be far smaller.
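That counting argument can be illustrated with a toy sketch (hypothetical phone strings, not from this dataset): repeating the same material for more speakers adds audio, but no new contexts.

```python
def triphones(phones):
    """Set of distinct (left, base, right) contexts in a phone sequence."""
    return {(phones[i - 1], phones[i], phones[i + 1])
            for i in range(1, len(phones) - 1)}

# Hypothetical phone string for one recorded sentence.
sentence = ["SIL", "a", "b", "a", "k", "SIL"]

one_speaker = triphones(sentence)
# 15 speakers reading the exact same sentence: the union of their
# contexts is no larger than a single speaker's.
fifteen_speakers = set().union(*(triphones(sentence) for _ in range(15)))

assert one_speaker == fifteen_speakers  # more audio, no new contexts
```

Varied text grows the context inventory; many copies of one small sentence do not.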
2. From the dictionary-to-triphones .c file:
\brief Building triphones for a dictionary.
This is one of the more complicated parts of a cross-word
triphone model decoder. The first and last phones of each word
get their left and right contexts, respectively, from other
words. For single-phone words, both its contexts are from other
words, simultaneously. As these words are not known beforehand,
life gets complicated. In this implementation, when we do not
wish to distinguish between distinct contexts, we use a COMPOSITE
triphone (a bit like BBN's fast-match implementation), by
clubbing together all possible contexts
Those composite senones are internals of sphinx3's large-vocabulary decoding, used to optimize speed at word boundaries, where most lextree expansion happens. You can read about lextrees in an ASR textbook if you are interested, but such composite senones aren't visible to the user, and you shouldn't care about them.
The amount of context in your 15 hours of recordings is enough to train 4000 senones. Even if the phonetic content is the same across speakers, the amount of contexts in 1 hour is enough.
Then you are suggesting that I use 4000 senones...
but such composite senones aren't visible to the user, and you shouldn't care about them.
I see such composite triphones in my MDEF file (default training settings). Will they affect the performance of my system (a command-and-control app)?
No, you don't see them. The model definition file lists the known triphones and the senone sequences for them. It seems you have some issues with terminology. Sorry, I don't understand your question here.
Since my mdef file is huge, I am pasting a few lines:
a SIL v b n/a 2 390 626 682 N
a SIL y b n/a 2 393 629 767 N
a SIL yy b n/a 2 393 629 767 N
a SIL z b n/a 2 393 629 729 N
a a dd b n/a 2 360 485 797 N
a a h b n/a 2 350 485 838 N
a a j b n/a 2 360 485 797 N
a a k b n/a 2 360 503 777 N
a a l b n/a 2 353 502 685 N
a a m b n/a 2 360 485 667 N
a a n b n/a 2 360 554 667 N
a a n' b n/a 2 361 514 753 N
a a n1 b n/a 2 353 502 685 N
a a ng' b n/a 2 353 502 685 N
a a ng'ng' b n/a 2 353 502 685 N
a a nj' b n/a 2 353 502 685 N
These triphones (marked in bold) are not in my training dictionary...
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Triphones are taken from the transcription of the training prompts, not from
the dictionary. All triphones above are present in your prompts.
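For what it's worth, each triphone line in the mdef body follows a fixed column layout: base phone, left context, right context, word position, attribute, transition-matrix index, the tied-state (senone) IDs, and a terminating `N`. A small parsing sketch under that assumption:

```python
def parse_mdef_line(line):
    """Split one mdef triphone line into named fields (layout assumed above)."""
    fields = line.split()
    return {
        "base": fields[0],       # central phone
        "left": fields[1],       # left context
        "right": fields[2],      # right context
        "position": fields[3],   # b/i/e/s: begin/internal/end/single-word
        "attrib": fields[4],     # attribute, usually n/a for triphones
        "tmat": int(fields[5]),  # transition-matrix index
        "senones": [int(s) for s in fields[6:-1]],  # tied-state IDs
    }

entry = parse_mdef_line("a SIL v b n/a 2 390 626 682 N")
print(entry["senones"])  # [390, 626, 682]
```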
I checked with the dict2tri executable, which generates triphones from the dictionary. With the default option (-btwtri yes: compute the between-word triphone set), it lists all 10742 triphones that are seen in the MDEF file.
They aren't trained. First of all, in the untied stage they will simply be ignored. Later, in the cd stage, when states are tied, they will get the same senone sequence as word-internal triphones, and that tied senone sequence, like

A I CH e n/a 0 137 227 255 N
A I CH i n/a 0 137 227 255 N

will be trained from word-internal material. If there is no word-internal material either, you will get a warning in the norm log at stage 50:

if (wt_var_ < 0) {
    ...
    E_ERROR("Variance (mgau= %u, feat= %u, "
            "density=%u, component=%u) is less then 0. "
            "Most probably the number of senones is "
            "too high for such a small training "
            "database. Use smaller $CFG_N_TIED_STATES.\n",
    ...
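The tying described above can be sketched as grouping triphones by their senone sequence, so that training material for any member of a group updates the same shared states (senone IDs taken from the two example lines; everything else is a toy illustration):

```python
from collections import defaultdict

# Triphone -> tied senone sequence, as listed in the mdef (toy subset).
mdef = {
    ("A", "I", "CH", "e"): (137, 227, 255),  # cross-word variant
    ("A", "I", "CH", "i"): (137, 227, 255),  # word-internal variant
}

# Group triphones that share a senone sequence: observations for any
# member of a group train the same senones.
tied = defaultdict(list)
for triphone, senones in mdef.items():
    tied[senones].append(triphone)

print(tied[(137, 227, 255)])
# Both triphones share one senone sequence, so word-internal data
# effectively trains the cross-word entry too.
```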
I think:
1. Even if there is no word-internal triphone, it is tied with other similar triphones.
2. Then the maximum number of senones will be 3 × the number of triphones listed by dict2tri.exe.
3. Is it possible to train only the word-internal triphones and reduce the number of senones (for my command-and-control application, to increase speed and accuracy)?
Is it possible to train only the word-internal triphones and reduce the number of senones (for my command-and-control application, to increase speed and accuracy)?
Did you try changing the script to run dict2tri with -btwtri no?
Anyway, I think there is a much more effective way to reduce the number of senones: the N_TIED_STATES setting in sphinx_train.cfg. Why not set it properly and get the number of senones you want that way? I think that if cross-word triphones are not in the training transcription, the model will not have separate senones for them. Moreover, they will not be considered by the decoder if your grammar doesn't have self-loops.
If you have only a limited number of word-internal triphones, set the number of tied states so that only they end up in the final model. Yes, the documentation doesn't cover this in detail; we'll update it accordingly to explain this.
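In sphinx_train.cfg that is a single setting (the value 740 here is only illustrative; pick one that matches your word-internal triphone inventory):

```perl
# sphinx_train.cfg -- number of tied states (senones) in the final model
$CFG_N_TIED_STATES = 740;
```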
Did you try changing the script to run dict2tri with -btwtri no?
I tried my best, but I couldn't find where dict2tri is called...
(When I removed dict2tri from the training bin folder, training still worked...)
The header of my final mdef file:

0.3
80 n_base
10742 n_tri
43288 n_state_map
740 n_tied_state
240 n_tied_ci_state
80 n_tied_tmat

Note: my transcript contains only words, no sentences.
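As a sanity check, those header figures are internally consistent, assuming the usual topology of 3 emitting states plus 1 non-emitting final state per HMM:

```python
# Figures copied from the mdef header above.
n_base          = 80      # context-independent phones
n_tri           = 10742   # triphones
n_state_map     = 43288
n_tied_ci_state = 240

# Each HMM contributes 3 emitting states + 1 non-emitting state
# to the state map.
assert (n_base + n_tri) * 4 == n_state_map  # 10822 * 4 = 43288

# Each CI phone keeps 3 tied states of its own.
assert n_base * 3 == n_tied_ci_state        # 80 * 3 = 240
```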
Great, now we have found the truth, as well as the proper name for the triphones :)
Any other questions?
Hello,
Sorry for the confusion. I've checked the source, so let me try to state everything as it is.
Correct me if I'm wrong.
Sorry for the late response...
Everything said above is correct...
I think that in the final mdef, all the triphones from the dictionary are listed and clustered with the trained triphones.