From: <jen...@a2...> - 2014-03-29 01:38:27
Kaldi - Build # 508 - Failure: See the build log in attachment for the details.
From: Eamonn K. <Eam...@cs...> - 2014-03-26 15:55:07
On 26/03/14 15:48, Daniel Povey wrote:

>> Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?
>>
>> I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this but to no avail.
>>
>> Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.
>
> This should work. I suspect you have done something like giving it the wrong sample-rate audio. The features are not comparable between different sample rates. Check the log-likelihoods you get on decoding (caution: they may or may not be printed out multiplied by the acoustic scale) - if these are very different from your "matched" decoding, then likely the acoustics are wrong. Also see the fMLLR objective function improvement, if you're using fMLLR - if the acoustics are mismatched it will be very large, e.g. >5.

I used online-gmm-decode-faster, which is what I usually use. Giving the large grammar the same words, it recognises them some of the time; obviously my point in then using the small grammar is to improve the accuracy. I don't tend to use online-wav-gmm-decode-faster, and I've found that if I use a sample rate of 44100 it gets flagged as an error anyway.

>> I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.
>
> This is implemented, it's called "biglm" in the code and scripts; there is an example in the WSJ scripts, egs/wsj/s5/.

Thanks, I'll have a look at it again. I was looking for it in egs/wsj/s5 because I found an s3 version on bitbucket. Maybe I just missed it.

--
Best Regards,
Eamonn

+ + +
Eamonn Kenny B.A., M.Sc., Ph.D.
CNGL/Speech Communication Lab,   Tel: 00+353-1-8961797
Dept. of Computer Science,       Email: Eam...@sc...
F.34, O'Reilly Institute,        http://www.cs.tcd.ie/Eamonn.Kenny
Trinity College Dublin,          http://eamonnmkenny.wordpress.com
Dublin 2, Ireland.
+ + +
From: Daniel P. <dp...@gm...> - 2014-03-26 15:48:41
> Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?
>
> I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this but to no avail.
>
> Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.

This should work. I suspect you have done something like giving it the wrong sample-rate audio. The features are not comparable between different sample rates. Check the log-likelihoods you get on decoding (caution: they may or may not be printed out multiplied by the acoustic scale) - if these are very different from your "matched" decoding, then likely the acoustics are wrong. Also see the fMLLR objective function improvement, if you're using fMLLR - if the acoustics are mismatched it will be very large, e.g. >5.

> I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.

This is implemented; it's called "biglm" in the code and scripts. There is an example in the WSJ scripts, egs/wsj/s5/.

Dan
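For reference, a minimal sketch of how the "biglm" decoding Dan mentions is typically invoked in the WSJ-style recipes. The paths, LM names and number of jobs here are illustrative, and the exact argument order may differ between versions - check the usage message of steps/decode_biglm.sh before running it:

  # Decode with a graph built from the small grammar G_s, composing on the
  # fly with G_s^-1 o G_l so the effective LM scores come from the big LM G_l.
  steps/decode_biglm.sh --nj 10 --cmd "$decode_cmd" \
    exp/tri3/graph_small \
    data/lang_small/G.fst data/lang_big/G.fst \
    data/test exp/tri3/decode_test_biglm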
From: Eamonn K. <Eam...@cs...> - 2014-03-26 14:29:32
Dear Kaldi Developers,

Is it possible to perform recognition confined to a small grammar given that you have trained on a large grammar that includes the small grammar as a subset?

I ask because I attempted to follow the recipe of http://vpanayotov.blogspot.ie/2012/06/kaldi-decoding-graph-construction.html to do this, but to no avail.

Then I attempted to take egs/voxforge/s5/run.sh and strip out the training section and change the corpus.txt file to obtain the small grammar. The idea was that I would generate L and G using the existing run script but then combine them with Ha and C to get the reduced fst. It all compiles and looks like it should work, but there must be a mismatch between the Ha of the existing large-grammar model and the path through the model that uses the smaller G. The recogniser will respond to the speaker but produces completely wrong results, and in many cases just produces the same word output every time.

I've also attempted to see how I might use HCLG o G_s^-1 o G_l, where G_s is the small grammar and G_l is the large grammar, but I see no documentation on how this is actually performed using a script.

Any help or pointers would be greatly appreciated.

--
Best Regards,
Eamonn
From: Vesely K. <ive...@fi...> - 2014-03-25 16:03:39
Hi Feiteng,

Thanks for noticing; it's fixed now - yes, (vec - mean) should be used everywhere. I added another auxiliary vector, as needed to avoid restricting the skewness to positive values.

Surprisingly the variance values did not change, but after writing down the formulas it became clear why this is correct:

  E((x-mu)^2) = E(x^2 - 2*x*mu + mu^2)
              = E(x^2) - 2*mu*E(x) + mu^2
              = E(x^2) - 2*mu^2 + mu^2
              = E((x-mu)*x)

so the old (vec-mean)*vec product happens to give the correct second moment, even though the same trick breaks down for the higher moments.

Karel.

On 03/24/2014 04:29 PM, Daniel Povey wrote:
> I think you are right, because "vec" does not have the mean subtracted. This is Karel's code, so he'll decide the best way to proceed. It may be easiest for him to fix it himself. Thanks for noticing!
> Dan
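For concreteness, a sketch of the two-auxiliary-vector idea Karel describes, using only the Vector<Real> calls already shown in this thread (an illustration, not necessarily the committed fix). The point is that repeatedly multiplying by (x - mu) keeps the sign of the odd powers, whereas ApplyPow(3.0/2.0) applied to (x - mu)^2 would compute |x - mu|^3 and force the skewness to be non-negative:

  Vector<Real> vec_no_mean(vec);      // x - mu
  vec_no_mean.Add(-mean);
  Vector<Real> vec_pow(vec_no_mean);  // running power of (x - mu)
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^2
  Real variance = vec_pow.Sum() / vec.Dim();
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^3, sign preserved
  Real skewness = vec_pow.Sum() / pow(variance, 3.0/2.0) / vec.Dim();
  vec_pow.MulElements(vec_no_mean);   // (x - mu)^4
  Real kurtosis = vec_pow.Sum() / (variance * variance) / vec.Dim() - 3.0;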
From: Daniel P. <dp...@gm...> - 2014-03-24 15:29:35
I think you are right, because "vec" does not have the mean subtracted. This is Karel's code, so he'll decide the best way to proceed. It may be easiest for him to fix it himself. Thanks for noticing!

Dan
From: 李飞腾 <fei...@yo...> - 2014-03-23 09:26:44
Hi:

I think the way variance is computed in MomentStatistics() is not right (variance-wiki: http://en.wikipedia.org/wiki/Variance). Use ApplyPow() to replace vec_aux.MulElements(vec):

  std::string MomentStatistics(const Vector<Real> &vec) {
    // we use an auxiliary vector for the higher order powers
    Vector<Real> vec_aux(vec);
    // mean
    Real mean = vec.Sum() / vec.Dim();
    // variance
    vec_aux.Add(-mean);
    vec_aux.ApplyPow(2.0);      // was: vec_aux.MulElements(vec);  // (vec-mean)^2
    Real variance = vec_aux.Sum() / vec.Dim();
    vec_aux.ApplyPow(3.0/2.0);  // was: vec_aux.MulElements(vec);  // (vec-mean)^3
    Real skewness = vec_aux.Sum() / pow(variance, 3.0/2.0) / vec.Dim();
    vec_aux.ApplyPow(4.0/3.0);  // was: vec_aux.MulElements(vec);  // (vec-mean)^4
    Real kurtosis = vec_aux.Sum() / (variance * variance) / vec.Dim() - 3.0;
    // send the statistics to stream,
    std::ostringstream ostr;
    ostr << " ( min " << vec.Min() << ", max " << vec.Max()
         << ", mean " << mean
         << ", variance " << variance
         << ", skewness " << skewness
         << ", kurtosis " << kurtosis
         << " ) ";
    return ostr.str();
  }

Am I right?

What is the easiest way to contribute to kaldi?

Best!
feiteng li
From: Nagendra G. <nag...@go...> - 2014-03-21 16:57:45
To slightly correct what Dan said - Kaldi does have the capability to adjust for vocal tract length, but needs scripts for estimating very precisely what the adjustment should be. Also, it does not matter whether we use DTW or the results of HMM forced alignments, but some work is needed to come up with that measure of "objective difference" once the alignment is done.

It looks like a good amount of work to make a Masters or even PhD thesis, even if Kaldi is used as a starting point. I am personally not aware of any other open source toolkit that would be a better starting point. You will need a signal processing graduate to do that work.

Nagendra
From: Daniel P. <dp...@gm...> - 2014-03-21 16:30:12
I think for these purposes, it might be easier to use Dynamic Time Warping (DTW) on speech features (e.g. MFCC features) computed from the signals. This is something that isn't really used for speech recognition any more, but it directly gives you a measure of distance between two signals. You could divide by the maximum length of the two utterances to get a normalized distance independent of length. Actually, it might be a good idea to do vocal tract length normalization (VTLN) to normalize the child's speech to be more adult-like, before doing that.

Kaldi is not really oriented towards this kind of use, though. It might be necessary to find someone who can help you in a more detailed way.

Dan
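For anyone who wants to experiment with this suggestion, a self-contained sketch of DTW over feature frames in plain C++ (the function names are hypothetical, and representing each utterance as a sequence of MFCC vectors is an assumption - this is not Kaldi code). The final division by the longer utterance's length gives the length-normalized distance Dan describes:

  #include <algorithm>
  #include <cmath>
  #include <limits>
  #include <vector>

  // Euclidean distance between two feature frames (e.g. MFCC vectors).
  static double FrameDist(const std::vector<double> &a,
                          const std::vector<double> &b) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return std::sqrt(sum);
  }

  // Classic DTW: cost[i][j] is the best cumulative cost of aligning the
  // first i frames of x with the first j frames of y.
  double DtwDistance(const std::vector<std::vector<double> > &x,
                     const std::vector<std::vector<double> > &y) {
    size_t nx = x.size(), ny = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double> > cost(nx + 1,
        std::vector<double>(ny + 1, inf));
    cost[0][0] = 0.0;
    for (size_t i = 1; i <= nx; i++) {
      for (size_t j = 1; j <= ny; j++) {
        double best = std::min(cost[i-1][j-1],
                               std::min(cost[i-1][j], cost[i][j-1]));
        cost[i][j] = FrameDist(x[i-1], y[j-1]) + best;
      }
    }
    // Normalize by the longer utterance so the score is length-independent.
    return cost[nx][ny] / std::max(nx, ny);
  }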
From: Nagendra G. <nag...@go...> - 2014-03-21 13:19:06
I am not sure how many in this audience are psychologically oriented, but I felt like giving one response. Others may add.

One difference between the brain and speech recognition systems is that the brain first develops a self-organized map of sounds. A male "apple", a female "apple" and a child's "apple" all sound different to the brain, but it knows how to differentiate between the pronunciation and the speaker.

This map is probably developed before the child starts to develop a vocabulary. On the other hand, although speech recognition systems do develop an "adaptation framework" that allows the system to find a closely sounding word given different kinds of voices, there is no simultaneous output of the "voice type" and the "spoken text" at the same time. In fact it is at the moment pretty hard for the system to differentiate between the "voice type" and the "channel type", which is the quality of the recording medium. Simultaneous output of these parameters is still a research question, and it's not known how well the system will adapt to a child's voice when trained on an adult voice.

So I am wondering how this study will help you, because if you are able to output any measure of closeness in your study, it may be far from a human subjective measure of closeness.

Nagendra
From: Jessica H. <je...@su...> - 2014-03-21 12:55:31
Dear Nagendra,

Thank you. I am actually looking for an objective measure of closeness. For example, I want to know whether, when a child says (for example) "apple" after hearing the adult say it 12 times, that token of "apple" is closer to the adult's than when the child said "apple" after only hearing the adult say it 4 times. The child is more likely to actually say something like AW-pull than app-UL. Other questions include: do children who know more words have better objective similarity than those who know fewer words? I know children (and adults) can learn words across multiple speakers, and even within the same speaker there is variability. I want an objective way of looking at how similar the tokens are as the words become more familiar to the child.

Does this make more sense? A colleague suggested I should be able to compare the spectrograms or something similar, but I don't know where to start.

~Jessica
From: <jen...@a2...> - 2014-03-20 21:56:14
Kaldi - Build # 500 - Failure: See the build log in attachment for the details.
From: Daniel P. <dp...@gm...> - 2014-03-20 19:56:16
Hi everyone,

I think it would be nice to have tools that would convert back and forth between my and Karel's neural nets, for testing purposes. [Note: it might not just be a question of converting the network itself, since I think we may use different conventions on how the splicing is done.] But anyway, converting the network would be a start. Does anyone want to help with this? Covering the "common cases", or at least the easy cases, would be sufficient - the two versions don't support exactly the same set of nonlinearities.

Dan
From: Daniel P. <dp...@gm...> - 2014-03-20 16:03:46
> First, I appreciate your contribution to this toolkit. It certainly makes it possible for people to build their tools/research on top of state-of-the-art algorithms. I am pretty new to the toolkit. I started running through the egs scripts for rm and timit. Perhaps the example scripts were created a long time ago - has anyone verified them again when other parts of the code are updated? Originally, I hoped to be able to run through run.sh by changing the data path only. However, it is easy to get stuck on some lines (e.g. without generating suitable output). For timit, I got stuck at "steps/make_mfcc.sh --cmd "$train_cmd" --nj 30 data/$x exp/make_mfcc/$x $mfccdir || exit 1;" with "Mal-formed spk2gender file".

I see what happened here. Recently I modified make_mfcc.sh to validate the input directory by calling utils/validate_data_dir.sh. It failed because there was an error in data preparation. We rarely use TIMIT so we didn't notice.

> The other question is about extracting fMLLR features. Although I spent some time on it, I couldn't find the right solution. I want to extract fMLLR features using timit; the DNN scripts in Kaldi are based on fMLLR features. Is the script below (found in local/run_dnn.sh) the right answer?
>
>   steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
>     --transform-dir $gmmdir/decode \
>     $dir data/test $gmmdir $dir/log $dir/data || exit 1
>
> where $gmmdir is obtained in the normal gmm-hmm framework?

I don't think you will have much luck training DNNs on timit. If this is for training data then the --transform-dir option should probably just be set to $gmmdir, where $gmmdir is probably tri3 in the TIMIT case. For test data it would be a subdirectory where you have decoded.

Dan
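To make the train/test distinction concrete, a hedged sketch based on the command quoted above (the data-fmllr/* target directories are illustrative, and tri3 is assumed to be the fMLLR-trained GMM system; check local/run_dnn.sh in your recipe for the exact paths):

  gmmdir=exp/tri3
  # training data: take the fMLLR transforms from the GMM dir itself
  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir \
    data-fmllr/train data/train $gmmdir data-fmllr/train/log data-fmllr/train/data
  # test data: take the transforms from the decoding directory
  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir/decode \
    data-fmllr/test data/test $gmmdir data-fmllr/test/log data-fmllr/test/data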
From: Xavier A. <xan...@gm...> - 2014-03-20 16:01:16
Thanks, I will try to make one.

X.
From: Daniel P. <dp...@gm...> - 2014-03-20 15:44:13
There is not a tool for this. I doubt it will make a difference after training is done, but it's possible...

Dan
From: Jessica H. <je...@su...> - 2014-03-20 12:51:20
Dear Kaldi team,

I am a faculty member at the University of Sussex. I am planning a study and would like to know if Kaldi is the right software to use.

I study child word learning. I am planning to record an adult and child talking about objects. (The adult will be a lab member and will know not to speak at the same time as the child, and there will be as little background noise as possible.) The child will be about 3 years old, but I can go up to 4 years if that will be better for the software. I would like to train the software on the adult input to the child (what the adult said) and then give it the child's speech. I would like an index of how similar the child's speech was to the adult's speech. For example, if the adult is teaching the child the word "apple" and says "apple" 12 times, when the child finally says "apple", how similar is that word to the adult speech the child heard?

My colleague told me that speech recognition software works by having a threshold of similarity. For example, when I tell my mobile phone "call home", the software compares what I said to what I have said before, and if it is similar enough (above threshold) it will recognise my speech. I'm hopeful that I could use the same kind of principle here (how similar is the child's speech to the adult speech, i.e. what was said before), but I would want a numerical value instead of just knowing if it was above or below threshold.

Can Kaldi handle child and adult speech in this way?

Thank you for your time.
~Jessica

********************************
Dr. Jessica S. Horst
Senior Lecturer in Psychology

University of Sussex
School of Psychology
Brighton BN1 9QH
United Kingdom

Email: je...@su...
Tel: +44 (0)1273 87 3084
Lab: http://www.sussex.ac.uk/wordlab
From: Xavier A. <xan...@gm...> - 2014-03-20 11:14:17
Hi,

I am following the swbd recipe. In the first training step (steps/train_mono.sh) I see that it calls align-equal-compiled to obtain a uniform alignment between the phoneme transcription and the audio. Would it be possible to insert my own phoneme alignment instead of using the uniform one? I have followed the code and it seems plausible to modify it to insert my alignment, but I wonder whether there is already a tool for this in kaldi?

Also, do you think it would potentially improve the final ASR results if I had access to high-quality alignments?

Thanks,

Xavier Anguera
From: <fe...@in...> - 2014-03-19 12:59:40
Hi Daniel,

Thanks, we are working on that.

Best,
Felipe Espic
From: Po-Sen H. <hua...@gm...> - 2014-03-19 05:37:32
Dear Kaldi developers,

First, I appreciate your contribution to this toolkit. It certainly makes it possible for people to build their tools/research on top of state-of-the-art algorithms.

I am pretty new to the toolkit. I started running through the egs scripts for rm and timit. Perhaps the example scripts were created a long time ago - has anyone verified them again when other parts of the code are updated? Originally, I hoped to be able to run through run.sh by changing the data path only. However, it is easy to get stuck on some lines (e.g. without generating suitable output). For timit, I got stuck at "steps/make_mfcc.sh --cmd "$train_cmd" --nj 30 data/$x exp/make_mfcc/$x $mfccdir || exit 1;" with "Mal-formed spk2gender file".

The other question is about extracting fMLLR features. Although I spent some time on it, I couldn't find the right solution. I want to extract fMLLR features using timit; the DNN scripts in Kaldi are based on fMLLR features. Is the script below (found in local/run_dnn.sh) the right answer?

  steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
    --transform-dir $gmmdir/decode \
    $dir data/test $gmmdir $dir/log $dir/data || exit 1

where $gmmdir is obtained in the normal gmm-hmm framework?

All the example scripts are nice, but they seem not to connect to each other, and I couldn't find the right answer by searching the forum. Thanks again for your time and help!

Best,
Po-Sen
From: Daniel P. <dp...@gm...> - 2014-03-17 17:46:07
The scoring is normally based on words, so the confusion matrix is output as a sequence of words. There are ways to do what you want, involving the program ali-to-phones: align the training data with steps/align_fmllr.sh or align_si.sh, compare with the best alignment from the decode, then put the result into compute-wer and ask it to output the detailed information. But I don't have time right now to explain it in detail.

Dan
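A sketch of the pipeline Dan outlines (directory and file names are illustrative, and exact options may vary by version; the idea is to turn both the forced alignment and the decoded best path into phone sequences and compare them):

  # forced alignment of the reference transcripts
  steps/align_fmllr.sh --nj 10 --cmd "$train_cmd" \
    data/test data/lang exp/tri3 exp/tri3_ali_test
  ali-to-phones exp/tri3/final.mdl \
    "ark:gunzip -c exp/tri3_ali_test/ali.*.gz |" ark,t:- | \
    utils/int2sym.pl -f 2- data/lang/phones.txt > ref_phones.txt

  # best path through the decoding lattices, then the same conversion
  lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/tri3/decode/lat.*.gz |" ark:/dev/null ark:- | \
    ali-to-phones exp/tri3/final.mdl ark:- ark,t:- | \
    utils/int2sym.pl -f 2- data/lang/phones.txt > hyp_phones.txt

  # per-utterance comparison of the two phone streams; see compute-wer's
  # options for the more detailed per-token output Dan mentions
  compute-wer --text --mode=present ark:ref_phones.txt ark:hyp_phones.txt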
From: <fe...@in...> - 2014-03-17 17:39:56
Hi Daniel,

Thanks for your quick reply.

We want to use confusion matrices to see which phonemes (or types of phonemes) are misclassified.

Is there any other way you can suggest to do this?

Thanks,

Felipe Espic

Quoting Daniel Povey <dp...@gm...>:

> Hi,
> There is no explicit support for multi-stream ASR in Kaldi; you'll have to try to understand the codebase and code something yourself [although if you build separate models with the same tree, you can use the DecodableSum class to help you decode with scores summed over the models; you'll need to write code for this though].
> Regarding a phone confusion matrix - if you build a system to decode phones, I think the program compute-wer has an option to output confusion data, but I doubt it is in the format you want. However, I would advise against this; phone confusion matrices are a little old-fashioned.
> Dan
>
> On Mon, Mar 17, 2014 at 9:20 AM, <fe...@in...> wrote:
>> Dear Sirs,
>>
>> I am with the Speech Processing and Transmission Lab at the University of Chile. We are working on multi-stream speech recognition in Kaldi, and we have a couple of questions:
>>
>> - We want to create a confusion matrix by phoneme to assess the performance of only the acoustic features. How could we address this in Kaldi? I think we have to make a phoneme recognizer (w/o word position dependency), so we read the posts http://sourceforge.net/p/kaldi/discussion/1355348/thread/51258bf4/ and http://sourceforge.net/p/kaldi/discussion/1355348/thread/2294d269/ from 2013, but we did not find any specific solution.
>>
>> - Is there any recipe for multi-stream ASR in Kaldi? Any help with this?
>>
>> Best Regards,
>>
>> Felipe Espic
From: Xavier A. <xan...@gm...> - 2014-03-17 16:59:05
Yes, I made sure not to have any. Thanks for the quick answer - I got worried about the message.

X.
From: Daniel P. <dp...@gm...> - 2014-03-17 16:57:28
Probably in your setup you had no OOV words in training, so nothing got mapped to OOV. I wouldn't worry about this.

Dan
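A quick way to check whether any training words would map to <unk> (a hedged sketch; it assumes the standard data/train/text and data/lang/words.txt layouts mentioned in this thread):

  # words in the training transcripts that are absent from the vocabulary
  cut -d' ' -f2- data/train/text | tr -s ' ' '\n' | sort -u > train_words.txt
  cut -d' ' -f1 data/lang/words.txt | sort -u > vocab.txt
  comm -23 train_words.txt vocab.txt | head

If this prints nothing, no word in training maps to <unk>, which explains the "no stats" warning for its pdf.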
From: Xavier A. <xan...@gm...> - 2014-03-17 16:54:36
Hi,

When training with a recipe adapted from switchboard I am getting the following warning:

  WARNING (gmm-init-model:InitAmGmm():gmm-init-model.cc:55) Tree has pdf-id 1 with no stats; corresponding phone list: 6 7 8 9 10
  This is a bad warning.

Checking in ./data/lang/phones.txt, I see that these correspond to the phoneme <unk> which, I guess, deals with the OOVs.

In my recipe I add <unk> to data/local/dict/lexicon.txt and to data/local/dict/silence_phones.txt, in addition to calling the prepare_lang script with:

  utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang data/lang

Should I worry about this warning? If so, what should I check/change?

Thanks,

Xavier Anguera