From: <jen...@a2...> - 2014-03-29 01:38:27
Kaldi - Build # 508 - Failure: See the build log in attachment for the details.
From: Eamonn K. <Eam...@cs...> - 2014-03-26 15:55:07
On 26/03/14 15:48, Daniel Povey wrote:

>> Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?
>>
>> I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this but to no avail.
>>
>> Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.
>
> This should work. I suspect you have done something like giving it the wrong sample-rate audio. The features are not comparable between different sample rates. Check the log-likelihoods you get on decoding (caution: they may or may not be printed out multiplied by the acoustic scale) - if these are very different from your "matched" decoding, then likely the acoustics are wrong. Also see the fMLLR objective function improvement, if you're using fMLLR - if the acoustics are mismatched it will be very large, e.g. >5.

I used online-gmm-decode-faster, which is what I usually use. Giving the large grammar the same words, it recognises them some of the time; obviously my point in then using the small grammar is to improve the accuracy. I don't tend to use online-wav-gmm-decode-faster, and I've found that if I use a sample rate of 44100 it gets flagged as an error anyway.

>> I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.
>
> This is implemented, it's called "biglm" in the code and scripts; there is an example in the WSJ scripts, egs/wsj/s5/.

Thanks, I'll have a look at it again. I was looking for it in egs/wsj/s5 because I found an s3 version on bitbucket. Maybe I just missed it.

--
Best Regards,
Eamonn

+ + +
Eamonn Kenny B.A., M.Sc., Ph.D.
CNGL/Speech Communication Lab,   Tel: 00+353-1-8961797
Dept. of Computer Science,       Email: Eam...@sc...
F.34, O'Reilly Institute,        http://www.cs.tcd.ie/Eamonn.Kenny
Trinity College Dublin,          http://eamonnmkenny.wordpress.com
Dublin 2, Ireland.
+ + +
From: Daniel P. <dp...@gm...> - 2014-03-26 15:48:41
> Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?
>
> I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this but to no avail.
>
> Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.

This should work. I suspect you have done something like giving it the wrong sample-rate audio. The features are not comparable between different sample rates. Check the log-likelihoods you get on decoding (caution: they may or may not be printed out multiplied by the acoustic scale) - if these are very different from your "matched" decoding, then likely the acoustics are wrong. Also see the fMLLR objective function improvement, if you're using fMLLR - if the acoustics are mismatched it will be very large, e.g. >5.

> I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.

This is implemented; it's called "biglm" in the code and scripts. There is an example in the WSJ scripts, egs/wsj/s5/.

Dan
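For reference, a minimal sketch of how the "biglm" decoding Dan mentions is typically invoked in the WSJ-style recipes. The paths, LM names and number of jobs here are illustrative, and the exact argument order may differ between versions - check the usage message of steps/decode_biglm.sh before running it:

  # Decode with a graph built from the small grammar G_s, composing on the
  # fly with G_s^-1 o G_l so the effective LM scores come from the big LM G_l.
  steps/decode_biglm.sh --nj 10 --cmd "$decode_cmd" \
    exp/tri3/graph_small \
    data/lang_small/G.fst data/lang_big/G.fst \
    data/test exp/tri3/decode_test_biglm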
From: Eamonn K. <Eam...@cs...> - 2014-03-26 14:29:32
Dear Kaldi Developers,

Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?

I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this, but to no avail.

Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.

I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.

Any help or pointers would be greatly appreciated.

--
Best Regards,
Eamonn
From: Vesely K. <ive...@fi...> - 2014-03-25 16:03:39
Hi Feiteng,

Thanks for noticing; it's fixed now - yes, (vec - mean) should be used everywhere. I added another auxiliary vector, as needed to avoid restricting the skewness to positive values.

Surprisingly the variance values did not change, but after writing down the formulas it became clear why this is correct:

  E((x-mu)^2) = E(x^2 - 2*x*mu + mu^2)
              = E(x^2) - 2*mu*E(x) + mu^2
              = E(x^2) - 2*mu^2 + mu^2
              = E((x-mu)*x)

so the old (vec-mean)*vec product happens to give the correct second moment, even though the same trick breaks down for the higher moments.

Karel.

On 03/24/2014 04:29 PM, Daniel Povey wrote:
> I think you are right, because "vec" does not have the mean subtracted. This is Karel's code, so he'll decide the best way to proceed. It may be easiest for him to fix it himself. Thanks for noticing!
> Dan
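For concreteness, a sketch of the two-auxiliary-vector idea Karel describes, using only the Vector<Real> calls already shown in this thread (an illustration, not necessarily the committed fix). The point is that repeatedly multiplying by (x - mu) keeps the sign of the odd powers, whereas ApplyPow(3.0/2.0) applied to (x - mu)^2 would compute |x - mu|^3 and force the skewness to be non-negative:

  Vector<Real> vec_no_mean(vec);      // x - mu
  vec_no_mean.Add(-mean);
  Vector<Real> vec_pow(vec_no_mean);  // running power of (x - mu)
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^2
  Real variance = vec_pow.Sum() / vec.Dim();
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^3, sign preserved
  Real skewness = vec_pow.Sum() / pow(variance, 3.0/2.0) / vec.Dim();
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^4
  Real kurtosis = vec_pow.Sum() / (variance * variance) / vec.Dim() - 3.0;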
From: Daniel P. <dp...@gm...> - 2014-03-24 15:29:35
I think you are right, because "vec" does not have the mean subtracted. This is Karel's code, so he'll decide the best way to proceed. It may be easiest for him to fix it himself. Thanks for noticing!

Dan
From: 李飞腾 <fei...@yo...> - 2014-03-23 09:26:44
Hi:

I think the way variance is computed in MomentStatistics() is not right (variance-wiki: http://en.wikipedia.org/wiki/Variance). Use ApplyPow() to replace vec_aux.MulElements(vec):

  std::string MomentStatistics(const Vector<Real> &vec) {
    // we use an auxiliary vector for the higher order powers
    Vector<Real> vec_aux(vec);
    // mean
    Real mean = vec.Sum() / vec.Dim();
    // variance
    vec_aux.Add(-mean);
    vec_aux.ApplyPow(2.0);      // was: vec_aux.MulElements(vec);  // (vec-mean)^2
    Real variance = vec_aux.Sum() / vec.Dim();
    vec_aux.ApplyPow(3.0/2.0);  // was: vec_aux.MulElements(vec);  // (vec-mean)^3
    Real skewness = vec_aux.Sum() / pow(variance, 3.0/2.0) / vec.Dim();
    vec_aux.ApplyPow(4.0/3.0);  // was: vec_aux.MulElements(vec);  // (vec-mean)^4
    Real kurtosis = vec_aux.Sum() / (variance * variance) / vec.Dim() - 3.0;
    // send the statistics to stream,
    std::ostringstream ostr;
    ostr << " ( min " << vec.Min() << ", max " << vec.Max()
         << ", mean " << mean
         << ", variance " << variance
         << ", skewness " << skewness
         << ", kurtosis " << kurtosis
         << " ) ";
    return ostr.str();
  }

Am I right?

What is the easiest way to contribute to kaldi?

Best!
feiteng li
From: Nagendra G. <nag...@go...> - 2014-03-21 16:57:45
To slightly correct what Dan said - Kaldi does have the capability to adjust for vocal tract length, but needs scripts for estimating very precisely what the adjustment should be. Also, it does not matter whether we use DTW or the results of HMM forced alignments, but some work is needed to come up with that measure of "objective difference" once the alignment is done.

It looks like a good amount of work to make a Masters or even PhD thesis, even if Kaldi is used as a starting point. I am personally not aware of any other open source toolkit that would be a better starting point. You will need a signal processing graduate to do that work.

Nagendra
From: Daniel P. <dp...@gm...> - 2014-03-21 16:30:12
I think for these purposes, it might be easier to use Dynamic Time Warping (DTW) on speech features (e.g. MFCC features) computed from the signals. This is something that isn't really used for speech recognition any more, but it directly gives you a measure of distance between two signals. You could divide by the maximum length of the two utterances to get a normalized distance independent of length. Actually, it might be a good idea to do vocal tract length normalization (VTLN) to normalize the child's speech to be more adult-like, before doing that.

Kaldi is not really oriented towards this kind of use, though. It might be necessary to find someone who can help you in a more detailed way.

Dan
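For anyone who wants to experiment with this suggestion, a self-contained sketch of DTW over feature frames in plain C++ (the function names are hypothetical, and representing each utterance as a sequence of MFCC vectors is an assumption - this is not Kaldi code). The final division by the longer utterance's length gives the length-normalized distance Dan describes:

  #include <algorithm>
  #include <cmath>
  #include <limits>
  #include <vector>

  // Euclidean distance between two feature frames (e.g. MFCC vectors).
  static double FrameDist(const std::vector<double> &a,
                          const std::vector<double> &b) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return std::sqrt(sum);
  }

  // Classic DTW: cost[i][j] is the best cumulative cost of aligning the
  // first i frames of x with the first j frames of y.
  double DtwDistance(const std::vector<std::vector<double> > &x,
                     const std::vector<std::vector<double> > &y) {
    size_t nx = x.size(), ny = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double> > cost(nx + 1,
        std::vector<double>(ny + 1, inf));
    cost[0][0] = 0.0;
    for (size_t i = 1; i <= nx; i++) {
      for (size_t j = 1; j <= ny; j++) {
        double best = std::min(cost[i-1][j-1],
                               std::min(cost[i-1][j], cost[i][j-1]));
        cost[i][j] = FrameDist(x[i-1], y[j-1]) + best;
      }
    }
    // Normalize by the longer utterance so the score is length-independent.
    return cost[nx][ny] / std::max(nx, ny);
  }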
From: Nagendra G. <nag...@go...> - 2014-03-21 13:19:06
I am not sure how many in this audience are psychologically oriented, but I felt like giving one response. Others may add.

One difference between the brain and speech recognition systems is that the brain first develops a self-organized map of sounds. A male "apple", a female "apple" and a child's "apple" all sound different to the brain, but it knows how to differentiate between the pronunciation and the speaker.

This map is probably developed before the child starts to develop a vocabulary. On the other hand, although speech recognition systems do develop an "adaptation framework" that allows the system to find a closely sounding word given different kinds of voices, there is no simultaneous output of the "voice type" and the "spoken text" at the same time. In fact it is at the moment pretty hard for the system to differentiate between the "voice type" and the "channel type", which is the quality of the recording medium. Simultaneous output of these parameters is still a research question, and it's not known how well the system will adapt to a child's voice when trained on an adult voice.

So I am wondering how this study will help you, because if you are able to output any measure of closeness in your study, it may be far from a human subjective measure of closeness.

Nagendra
From: Jessica H. <je...@su...> - 2014-03-21 12:55:31
Dear Nagendra,

Thank you. I am actually looking for an objective measure of closeness. For example, I want to know whether, when a child says (for example) "apple" after hearing the adult say it 12 times, that token of "apple" is closer to the adult's than when the child said "apple" after only hearing the adult say it 4 times. The child is more likely to actually say something like AW-pull than app-UL. Other questions include: do children who know more words have better objective similarity than those who know fewer words? I know children (and adults) can learn words across multiple speakers, and even within the same speaker there is variability. I want an objective way of looking at how similar the tokens are as the words become more familiar to the child.

Does this make more sense? A colleague suggested I should be able to compare the spectrograms or something similar, but I don't know where to start.

~Jessica
From: <jen...@a2...> - 2014-03-20 21:56:14
Kaldi - Build # 500 - Failure: See the build log in attachment for the details.
From: Daniel P. <dp...@gm...> - 2014-03-20 19:56:16
Hi everyone,

I think it would be nice to have tools that would convert back and forth between my and Karel's neural nets, for testing purposes. [Note: it might not just be a question of converting the network itself, since I think we may use different conventions on how the splicing is done.] But anyway, converting the network would be a start. Does anyone want to help with this? Covering the "common cases", or at least the easy cases, would be sufficient - the two versions don't support exactly the same set of nonlinearities.

Dan
From: Daniel P. <dp...@gm...> - 2014-03-20 16:03:46
> First, I appreciate your contribution to this toolkit. It certainly makes it possible for people to build their tools/research on top of state-of-the-art algorithms. I am pretty new to the toolkit. I started running through the egs scripts for rm and timit. Perhaps the example scripts were created a long time ago - has anyone verified them again when other parts of the code are updated? Originally, I hoped to be able to run through run.sh by changing the data path only. However, it is easy to get stuck on some lines (e.g. without generating suitable output). For timit, I got stuck at "steps/make_mfcc.sh --cmd "$train_cmd" --nj 30 data/$x exp/make_mfcc/$x $mfccdir || exit 1;" with "Mal-formed spk2gender file".

I see what happened here. Recently I modified make_mfcc.sh to validate the input directory by calling utils/validate_data_dir.sh. It failed because there was an error in data preparation. We rarely use TIMIT so we didn't notice.

> The other question is about extracting fMLLR features. Although I spent some time on it, I couldn't find the right solution. I want to extract fMLLR features using timit; the DNN scripts in Kaldi are based on fMLLR features. Is the script below (found in local/run_dnn.sh) the right answer?
>
>   steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
>     --transform-dir $gmmdir/decode \
>     $dir data/test $gmmdir $dir/log $dir/data || exit 1
>
> where $gmmdir is obtained in the normal gmm-hmm framework?

I don't think you will have much luck training DNNs on timit. If this is for training data then the --transform-dir option should probably just be set to $gmmdir, where $gmmdir is probably tri3 in the TIMIT case. For test data it would be a subdirectory where you have decoded.

Dan
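To make the train/test distinction concrete, a hedged sketch based on the command quoted above (the data-fmllr/* target directories are illustrative, and tri3 is assumed to be the fMLLR-trained GMM system; check local/run_dnn.sh in your recipe for the exact paths):

  gmmdir=exp/tri3
  # training data: take the fMLLR transforms from the GMM dir itself
  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir \
    data-fmllr/train data/train $gmmdir data-fmllr/train/log data-fmllr/train/data
  # test data: take the transforms from the decoding directory
  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir/decode \
    data-fmllr/test data/test $gmmdir data-fmllr/test/log data-fmllr/test/data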
From: Xavier A. <xan...@gm...> - 2014-03-20 16:01:16
Thanks, I will try to make one.

X.
From: Daniel P. <dp...@gm...> - 2014-03-20 15:44:13
There is not a tool for this. I doubt it will make a difference after training is done, but it's possible...

Dan
From: Jessica H. <je...@su...> - 2014-03-20 12:51:20
Dear Kaldi team,

I am a faculty member at the University of Sussex. I am planning a study and would like to know if Kaldi is the right software to use.

I study child word learning. I am planning to record an adult and child talking about objects. (The adult will be a lab member and will know not to speak at the same time as the child, and there will be as little background noise as possible.) The child will be about 3 years old, but I can go up to 4 years if that will be better for the software. I would like to train the software on the adult input to the child (what the adult said) and then give it the child's speech. I would like an index of how similar the child's speech was to the adult's speech. For example, if the adult is teaching the child the word "apple" and says "apple" 12 times, when the child finally says "apple", how similar is that word to the adult speech the child heard?

My colleague told me that speech recognition software works by having a threshold of similarity. For example, when I tell my mobile phone "call home", the software compares what I said to what I have said before, and if it is similar enough (above threshold) it will recognise my speech. I'm hopeful that I could use the same kind of principle here (how similar is the child's speech to the adult speech, i.e. what was said before), but I would want a numerical value instead of just knowing if it was above or below threshold.

Can Kaldi handle child and adult speech in this way?

Thank you for your time.
~Jessica

********************************
Dr. Jessica S. Horst
Senior Lecturer in Psychology

University of Sussex
School of Psychology
Brighton BN1 9QH
United Kingdom

Email: je...@su...
Tel: +44 (0)1273 87 3084
Lab: http://www.sussex.ac.uk/wordlab
From: Xavier A. <xan...@gm...> - 2014-03-20 11:14:17
Hi,

I am following the swbd recipe. In the first training step (steps/train_mono.sh) I see that it calls align-equal-compiled to obtain a uniform alignment between the phoneme transcription and the audio. Would it be possible to insert my own phoneme alignment instead of using the uniform one? I have followed the code and it seems plausible to modify it to insert my alignment, but I wonder whether there is already a tool for this in kaldi?

Also, do you think it would potentially improve the final ASR results if I had access to high-quality alignments?

Thanks,

Xavier Anguera
From: <fe...@in...> - 2014-03-19 12:59:40
Hi Daniel,

Thanks, we are working on that.

Best,
Felipe Espic
From: Po-Sen H. <hua...@gm...> - 2014-03-19 05:37:32
Dear Kaldi developers,

First, I appreciate your contribution to this toolkit. It certainly makes it possible for people to build their tools/research on top of state-of-the-art algorithms.

I am pretty new to the toolkit. I started running through the egs scripts for rm and timit. Perhaps the example scripts were created a long time ago - has anyone verified them again when other parts of the code are updated? Originally, I hoped to be able to run through run.sh by changing the data path only. However, it is easy to get stuck on some lines (e.g. without generating suitable output). For timit, I got stuck at "steps/make_mfcc.sh --cmd "$train_cmd" --nj 30 data/$x exp/make_mfcc/$x $mfccdir || exit 1;" with "Mal-formed spk2gender file".

The other question is about extracting fMLLR features. Although I spent some time on it, I couldn't find the right solution. I want to extract fMLLR features using timit; the DNN scripts in Kaldi are based on fMLLR features. Is the script below (found in local/run_dnn.sh) the right answer?

  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir/decode \
    $dir data/test $gmmdir $dir/log $dir/data || exit 1

where $gmmdir is obtained in the normal gmm-hmm framework?

All the example scripts are nice, but they seem not to connect to each other, and I couldn't find the right answer by searching the forum. Thanks again for your time and help!

Best,
Po-Sen
From: Daniel P. <dp...@gm...> - 2014-03-17 17:46:07
The scoring is normally based on words, so the confusion matrix is output as a sequence of words. There are ways to do what you want, involving the program ali-to-phones: align the training data with steps/align_fmllr.sh or align_si.sh, compare with the best alignment from the decode, then put the result into compute-wer and ask it to output the detailed information. But I don't have time right now to explain it in detail.

Dan
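A sketch of the pipeline Dan outlines (directory and file names are illustrative, and exact options may vary by version; the idea is to turn both the forced alignment and the decoded best path into phone sequences and compare them):

  # forced alignment of the reference transcripts
  steps/align_fmllr.sh --nj 10 --cmd "$train_cmd" \
    data/test data/lang exp/tri3 exp/tri3_ali_test
  ali-to-phones exp/tri3/final.mdl \
    "ark:gunzip -c exp/tri3_ali_test/ali.*.gz |" ark,t:- | \
    utils/int2sym.pl -f 2- data/lang/phones.txt > ref_phones.txt

  # best path through the decoding lattices, then the same conversion
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/tri3/decode/lat.*.gz |" ark:/dev/null ark:- | \
    ali-to-phones exp/tri3/final.mdl ark:- ark,t:- | \
    utils/int2sym.pl -f 2- data/lang/phones.txt > hyp_phones.txt

  # per-utterance comparison of the two phone streams; see compute-wer's
  # options for the more detailed per-token output Dan mentions
  compute-wer --text --mode=present ark:ref_phones.txt ark:hyp_phones.txt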
From: <fe...@in...> - 2014-03-17 17:39:56
Hi Daniel,

Thanks for your quick reply.

We want to use confusion matrices to see which phonemes (or types of phonemes) are misclassified.

Is there any other way you can suggest to do this?

Thanks,

Felipe Espic

Quoting Daniel Povey <dp...@gm...>:

> Hi,
> There is no explicit support for multi-stream ASR in Kaldi; you'll have to try to understand the codebase and code something yourself [although if you build separate models with the same tree, you can use the DecodableSum class to help you decode with scores summed over the models; you'll need to write code for this though].
> Regarding a phone confusion matrix - if you build a system to decode phones, I think the program compute-wer has an option to output confusion data, but I doubt it is in the format you want. However, I would advise against this; phone confusion matrices are a little old-fashioned.
> Dan
>
> On Mon, Mar 17, 2014 at 9:20 AM, <fe...@in...> wrote:
>> Dear Sirs,
>>
>> I am with the Speech Processing and Transmission Lab at the University of Chile. We are working on multi-stream speech recognition in Kaldi, and we have a couple of questions:
>>
>> - We want to create a confusion matrix by phoneme to assess the performance of only the acoustic features. How could we address this in Kaldi? I think we have to make a phoneme recognizer (w/o word position dependency), so we read the posts http://sourceforge.net/p/kaldi/discussion/1355348/thread/51258bf4/ and http://sourceforge.net/p/kaldi/discussion/1355348/thread/2294d269/ from 2013, but we did not find any specific solution.
>>
>> - Is there any recipe for multi-stream ASR in Kaldi? Any help with this?
>>
>> Best Regards,
>>
>> Felipe Espic
From: Xavier A. <xan...@gm...> - 2014-03-17 16:59:05
Yes, I made sure not to have any. Thanks for the quick answer - I got worried about the message.

X.
From: Daniel P. <dp...@gm...> - 2014-03-17 16:57:28
Probably in your setup you had no OOV words in training, so nothing got mapped to OOV. I wouldn't worry about this.

Dan
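A quick way to check whether any training words would map to <unk> (a hedged sketch; it assumes the standard data/train/text and data/lang/words.txt layouts mentioned in this thread):

  # words in the training transcripts that are absent from the vocabulary
  cut -d' ' -f2- data/train/text | tr -s ' ' '\n' | sort -u > train_words.txt
  cut -d' ' -f1 data/lang/words.txt | sort -u > vocab.txt
  comm -23 train_words.txt vocab.txt | head

If this prints nothing, no word in training maps to <unk>, which explains the "no stats" warning for its pdf.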
From: Xavier A. <xan...@gm...> - 2014-03-17 16:54:36
Hi,

When training with a recipe adapted from switchboard I am getting the following warning:

  WARNING (gmm-init-model:InitAmGmm():gmm-init-model.cc:55) Tree has pdf-id 1 with no stats; corresponding phone list: 6 7 8 9 10
  This is a bad warning.

Checking in ./data/lang/phones.txt, I see that these correspond to the phoneme <unk> which, I guess, deals with the OOVs.

In my recipe I add <unk> to data/local/dict/lexicon.txt and to data/local/dict/silence_phones.txt, in addition to calling the prepare_lang script with:

  utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang

Should I worry about this warning? If so, what should I check/change?

Thanks,

Xavier Anguera