Hi,
I've trained acoustic models with SphinxTrain and decoded with the livepretend command. Running livepretend produces a file called "log.txt", where the results look like the following (conlavenia01 is the name of the audio file tested):
Backtrace(conlavenia01)
FV:conlavenia01> WORD SFrm EFrm AScr(UnNorm) LMScore AScr+LScr AScale
fv:conlavenia01> <sil> 0 63 -21372439 -74100 -21446539 -3118963
FV:conlavenia01> TOTAL -21372439 -74100
FWDVIT: (conlavenia01)
FWDXCT: conlavenia01 S -3166552 T -21380540 A -21372439 L -8101 0 -21372439 -8101 <sil> 64
INFO: stat.c(154): 64 frm; 9 cdsen/fr, 54 cisen/fr, 10 cdgau/fr, 79 cigau/fr, Sen 0.01, CPU 0.01 Clk [Ovrhd 0.01 CPU 0.01 Clk]; 6 hmm/fr, 1 wd/fr, Search: -0.00 CPU 0.00 Clk (conlavenia01)
INFO: corpus.c(647): conlavenia01: 0.0 sec CPU, 0.0 sec Clk; TOT: 0.0 sec CPU, 0.0 sec Clk
INFO: main_livepretend.c(142): PARTIAL_HYP:
INFO: main_livepretend.c(142): PARTIAL_HYP:
INFO: main_livepretend.c(142): PARTIAL_HYP:
INFO: cmn_prior.c(121): cmn_prior_update: from < 19.87 -1.49 -0.44 -0.39 -0.28 -0.27 -0.24 -0.22 -0.20 -0.20 -0.19 -0.20 -0.19 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 19.87 -1.50 -0.44 -0.39 -0.28 -0.26 -0.24 -0.23 -0.21 -0.20 -0.19 -0.21 -0.19 >
INFO: agc.c(172): AGCEMax: obs= 0.27, new= 4.17
INFO: fast_algo_struct.c(398): HMMHist0..0: 71(100)
INFO: lm.c(951): 0 tg(), 0 tgcache, 0 bo; 0 fills, 0 in mem (0.0%)
INFO: lm.c(955): 5 bg(), 5 bo; 0 fills, 3 in mem (42.9%)
My problem is that I don't understand the meaning of the parameters I get, such as SFrm, EFrm, AScr(UnNorm), LMScore, AScr+LScr, AScale, FWDVIT, FWDXCT, ...
Anyway, I would also like to know whether there are other ways to perform recognition; if I want to build an application based on Sphinx, I guess that livepretend is not the best way to do so...
Thanks!!!
SFrm - start frame (time in 1/100 of a second)
EFrm - end frame
AScr - acoustic score (used during search)
LMScr - language model score
AScale - rescoring coefficient used during search
FWDVIT: (conlavenia01) - the hypothesis (silence in your case)
FWDXCT: conlavenia01 S -3166552 T -21380540 A -21372439 L -8101 0 -21372439 -8101 <sil> 64 - more information about the match
For more information check the FAQ:
http://www.speech.cs.cmu.edu/sphinxman/FAQ.html#21
Your problem is in how you trained the database and in the options you are using. For example, there is no sense in using -agc emax; it will only make recognition worse. For more details we would actually need your samples.
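To make the backtrace fields concrete, here is a minimal sketch in plain Python (not part of Sphinx; the field layout is simply read off the FV header line in your log). It parses one per-word row and converts frames to seconds. Note that AScr+LScr is literally the sum of the two scores, and likewise in the FWDXCT line T = A + L (-21372439 + -8101 = -21380540):

FRAME_RATE = 100  # frames per second: each frame is 1/100 s (10 ms)

def parse_backtrace_row(line):
    """Parse one 'fv:<utt>> WORD SFrm EFrm AScr LMScore AScr+LScr AScale' row."""
    fields = line.split()
    word = fields[1]  # fields[0] is the 'fv:conlavenia01>' tag
    sfrm, efrm, ascr, lmscr, total, ascale = (int(f) for f in fields[2:8])
    assert ascr + lmscr == total  # AScr+LScr is just the sum of both scores
    return {
        "word": word,
        "start_sec": sfrm / FRAME_RATE,      # SFrm: first frame of the word
        "end_sec": (efrm + 1) / FRAME_RATE,  # EFrm: last frame of the word
        "acoustic_score": ascr,
        "lm_score": lmscr,
        "total_score": total,
        "ascale": ascale,
    }

row = parse_backtrace_row(
    "fv:conlavenia01> <sil> 0 63 -21372439 -74100 -21446539 -3118963")
print(row)  # <sil> spans 0.00 s .. 0.64 s, matching the '64 frm' line in the log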
Hi everybody,
I still have some related questions:
1) What samples do you need?
2) Why is my database incorrectly trained? Could it be because there's only a small amount of data?
3) My purpose is to create a speech recognition system for the Castilian Spanish language. How can I know the correct transcription for every word in my dictionary? I'm not a phonetician...
4) I've read in the FAQ that I need to specify the size of my speech corpus in hours. Why do I have to do that, and where do I have to specify it?
5) Any other considerations?
Thanks a lot!!
> 1) What samples do you need?
The samples you are training the model on.
> 2) Why is my database incorrectly trained? Could it be because there's only a small amount of data?
No idea yet, but it doesn't recognize your test utterance, so most likely you made a mistake somewhere. A small amount of data is also a problem, of course.
> 3) My purpose is to create a speech recognition system for the Castilian Spanish language.
There are Spanish phonetic dictionaries as well as rules. Spanish LTS (letter-to-sound) rules are rather precise, too, unlike English ones (a toy sketch of such rules follows at the end of this reply).
> 4) I've read in the FAQ that I need to specify the size of my speech corpus in hours. Why do I have to do that, and where do I have to specify it?
There is no such statement in the FAQ; reread it and prove me wrong here.
> 5) Any other considerations?
That depends on the subject we must consider :)
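To give an idea of what letter-to-sound rules look like, here is a toy sketch in Python. The phone symbols and the rule list are hypothetical simplifications for illustration only; for a real system you would start from an existing Spanish phonetic dictionary or a complete Castilian rule set:

# Toy Spanish letter-to-sound (LTS) sketch. The phone set and the rule list
# are hypothetical simplifications, not a complete Castilian rule set.
# Longer graphemes are listed first so greedy matching picks them up.
RULES = [
    ("gue", ["G", "E"]), ("gui", ["G", "I"]),  # 'u' is silent in gue/gui
    ("ch", ["CH"]), ("ll", ["LL"]), ("rr", ["RR"]), ("qu", ["K"]),
    ("ce", ["TH", "E"]), ("ci", ["TH", "I"]),  # Castilian 'th' sound
    ("ge", ["X", "E"]), ("gi", ["X", "I"]),
    ("a", ["A"]), ("e", ["E"]), ("i", ["I"]), ("o", ["O"]), ("u", ["U"]),
    ("b", ["B"]), ("c", ["K"]), ("d", ["D"]), ("f", ["F"]), ("g", ["G"]),
    ("h", []),                                 # 'h' is silent in Spanish
    ("j", ["X"]), ("l", ["L"]), ("m", ["M"]), ("n", ["N"]), ("ñ", ["NY"]),
    ("p", ["P"]), ("r", ["R"]), ("s", ["S"]), ("t", ["T"]), ("v", ["B"]),
    ("z", ["TH"]),
]

def transcribe(word):
    """Greedy longest-match transcription of one lowercase word."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, phs in RULES:
            if word.startswith(grapheme, i):
                phones.extend(phs)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this character: skip it
    return phones

# Emit entries in the 'WORD PH PH ...' shape of a Sphinx dictionary entry.
for w in ["hola", "gracias", "chico", "guerra"]:
    print(w.upper(), " ".join(transcribe(w)))
# HOLA O L A / GRACIAS G R A TH I A S / CHICO CH I K O / GUERRA G E RR A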