From: Daniel P. <dp...@gm...> - 2015-06-23 03:59:00
Usually, if there is a lot of acoustic context in your model, you will require a larger LM weight. Also, if for some reason there tend to be a lot of insertions in decoding (e.g. something weird went wrong in training, or there is some kind of normalization problem), a large LM weight can help reduce insertions and so improve the WER.

Dan
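A minimal sketch of the kind of LM-weight sweep being discussed here, using standard Kaldi lattice tools; the lattice directory, word-symbol table and reference text below are placeholders for whatever your own decoding run produced, and the usual word-insertion penalty is omitted for brevity:

    # Sweep the LM weight over existing decoding lattices and report the WER
    # for each value (placeholder paths; adapt to your experiment layout).
    lats="ark:gunzip -c exp/tri4/decode_test/lat.*.gz |"
    words=data/lang/words.txt
    ref=data/test/text        # reference transcripts, one utterance per line

    for lmwt in 8 10 12 14 16 18 20; do
      lattice-scale --inv-acoustic-scale=$lmwt "$lats" ark:- \
        | lattice-best-path --word-symbol-table=$words ark:- ark,t:- 2>/dev/null \
        | utils/int2sym.pl -f 2- $words \
        | compute-wer --text --mode=present ark:$ref ark,p:- \
        | grep WER | sed "s/^/LMWT=$lmwt /"
    done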
From: Kirill K. <kir...@sm...> - 2015-06-23 03:36:34
I am getting the same ratio with both a small, more targeted LM and a quite large general LM. I do not understand what to make of it!

-kkm
From: Nagendra G. <nag...@go...> - 2015-06-23 03:31:37
Or maybe your domain is limited and the LM is very nicely matched to the task at hand?
From: Kirill K. <kir...@sm...> - 2015-06-23 03:29:26
In my test sets I am getting the best WER at an LM/acoustic weight in the range of 18-19, with multiple LMs of different size and origin. I had always assumed the usual ballpark figure was about 10, give or take. In your experience, does this larger LM weight mean anything, and if so, what? I am guessing an inadequate acoustic model, requiring more LM "pull" -- am I making sense?

-kkm
From: Daniel P. <dp...@gm...> - 2015-06-18 21:08:31
The lack of length normalization is actually on purpose. It is the only way to make the system, in principle, completely invariant to data offsets. It also enables more robust backoff when you have no adaptation data at all, because the estimate smoothly approaches the zero iVector (due to the prior term in the iVector estimation objective function).

I think you should just not use the iVectors at all if your utterances are very short. For the CTS task, you can always use previous utterances of the same speaker in the iVector estimation. The setup that's checked in does that unless you decode with --per-utt.

Dan
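A minimal sketch of the two decoding modes Dan contrasts, assuming the standard online-nnet2 decoding script and the --per-utt option he mentions (all directory names are placeholders):

    graph=exp/tri4/graph
    data=data/test_cts
    srcdir=exp/nnet2_online/nnet_a_online

    # Per-speaker iVectors: earlier utterances of a speaker inform later ones.
    steps/online/nnet2/decode.sh --nj 8 --cmd run.pl \
      "$graph" "$data" "$srcdir/decode_test_cts"

    # Per-utterance iVectors: each (possibly very short) utterance stands alone.
    steps/online/nnet2/decode.sh --nj 8 --cmd run.pl --per-utt true \
      "$graph" "$data" "$srcdir/decode_test_cts_per_utt"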
From: Nagendra G. <nag...@go...> - 2015-06-18 14:00:21
I think it would make sense. Would you like to contribute that to the recipe?
From: David v. L. <dav...@gm...> - 2015-06-18 09:18:43
Hello,

We're using the nnet2-online setup on a CTS task. We have had good experience with the same setup on a BN task. However, on the CTS task, where utterances can be very short ("yes", "mmm", etc.), we observe a very strong dependence of the iVector length on duration (which makes sense) and a very strong dependence of ASR performance on iVector length (which also makes sense).

It seems that in the nnet2-online setup the iVectors are not normalized to length as is customary in speaker recognition. The nnet doesn't seem to like the duration dependence -- what would be an approach to deal with this? Would it make sense to train the nnet with length-normalized iVectors?

Cheers,

---david

--
David van Leeuwen
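For reference, the length normalization David mentions is done in the Kaldi speaker-recognition recipes with a dedicated binary; the sketch below only illustrates that step on an archive of per-utterance iVectors (placeholder paths), and is not a drop-in change for the online-decoding setup, which stores its iVectors differently:

    # Scale each iVector in an archive to unit Euclidean length.
    src=exp/ivectors_test     # placeholder: directory containing ivector.scp
    ivector-normalize-length scp:$src/ivector.scp \
      ark,scp:$src/ivector_lennorm.ark,$src/ivector_lennorm.scp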
From: Sandeep R. <san...@go...> - 2015-06-17 21:00:47
Ondrej,

I'll run the Vystadial recipe and see what opportunities are there. Did somebody already make a class LM on it, or at least define what the potential classes are? I hadn't looked into it earlier.

Thanks,
Nagendra

On Wed, Jun 17, 2015 at 3:42 AM, Ondrej Platek <ond...@gm...> wrote:
> Dear all,
>
> thanks to the reminder from Dimitris, I realized that the Vystadial dataset is very convenient for class-based LM / LM grafting. As the scripts for Vystadial Cs & En are already in Kaldi, it may be a convenient starting point, because the data contain transcriptions of user utterances from interactions with a spoken dialogue system where we have the classes defined.
>
> See scripts:
> https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_en
> https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_cz
>
> See data (scroll to the bottom to download the datasets):
> http://hdl.handle.net/11858/00-097C-0000-0023-4671-4 (en)
> http://hdl.handle.net/11858/00-097C-0000-0023-4670-6 (cs)
>
> We can probably recreate / find the list of words in the classes for English if there is interest. For Czech this should be no problem at all.
>
> Please let me know if you are interested in these datasets and the lists of classes and their members.
>
> Ondra
>
> PS: Currently we use a class-based (CB) LM which we later expand to a full LM in ARPA format, and then create G.fst as in the standard use case. It is not the optimal approach, but it works for us. If you want to know how we are modeling the CBLM, just let me know; I am working on a slight improvement of it right now, so I am interested in improving it.
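The route Ondra describes in his postscript needs no special support at decode time: once the class-based LM has been expanded into an ordinary ARPA n-gram model, it is compiled into G.fst the standard way. A minimal sketch of that last step, with placeholder filenames and assuming the class expansion has already been done by whatever tool produced the ARPA file:

    # Compile an already-expanded ARPA LM into G.fst for decoding.
    lang=data/lang            # placeholder lang directory
    gunzip -c lm.arpa.gz \
      | arpa2fst --disambig-symbol='#0' \
                 --read-symbol-table=$lang/words.txt - $lang/G.fst

    # Sanity check: G.fst should be (close to) stochastic.
    fstisstochastic $lang/G.fst || echo "warning: G.fst is not stochastic"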
From: Sandeep R. <san...@go...> - 2015-06-17 20:43:21
Does the Kaldi recipe do a class LM? Or can you add it to the recipe? That would make the whole process so much easier. I don't mind if the words are Czech.

On Wed, Jun 17, 2015 at 4:08 PM, Ondrej Platek <ond...@gm...> wrote:
> For the Czech data we are running the system live with Kaldi, and we use a class LM. For the English data I will give you a few examples off the top of my head:
>
> PRICE_RANGE - cheap, middle price-range, ...
> FOOD_TYPE - Indian, Chinese, ...
> LOCATION - city center, Chesterton area, ...
>
> We will try to find the class definitions, since we are not running that system.
>
> Ondrej
>>>> > >> >>>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>> > >> _______________________________________________ >>>> > >> Kaldi-users mailing list >>>> > >> Kal...@li... >>>> > >> >>>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>>> > >> >>>> > >>>> > >>>> > >>>> ----------------------------------------------------------------------- >>>> > - >>>> > ------ >>>> > One dashboard for servers and applications across >>>> Physical- >>>> > Virtual-Cloud >>>> > Widest out-of-the-box monitoring support with 50+ >>>> > applications >>>> > Performance metrics, stats and reports that give you >>>> > Actionable Insights >>>> > Deep dive visibility with transaction tracing using APM >>>> > Insight. >>>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>> > _______________________________________________ >>>> > Kaldi-users mailing list >>>> > Kal...@li... >>>> > >>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> ----------------------------------------------------------------------- >>>> > - >>>> > ------ >>>> > One dashboard for servers and applications across Physical- >>>> > Virtual-Cloud >>>> > Widest out-of-the-box monitoring support with 50+ applications >>>> > Performance metrics, stats and reports that give you Actionable >>>> > Insights >>>> > Deep dive visibility with transaction tracing using APM Insight. >>>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>> > _______________________________________________ >>>> > Kaldi-users mailing list >>>> > Kal...@li... >>>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>>> > >>>> > >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> One dashboard for servers and applications across Physical-Virtual-Cloud >>>> Widest out-of-the-box monitoring support with 50+ applications >>>> Performance metrics, stats and reports that give you Actionable Insights >>>> Deep dive visibility with transaction tracing using APM Insight. >>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>>> _______________________________________________ >>>> Kaldi-users mailing list >>>> Kal...@li... >>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >>>> >>> >>> >>> >>> -- >>> Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, >>> ond...@gm... >>> >>> >>> ------------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Kaldi-users mailing list >>> Kal...@li... >>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> >>> >> > > > -- > Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, > ond...@gm... > |
From: Ondrej P. <ond...@gm...> - 2015-06-17 20:25:36
|
We currently use the script below for creating an ARPA LM based on a class-based (CB) LM, mixed with LMs built from out-of-domain and in-domain data which are not class-based. Given the ARPA file we convert it with this script:
https://github.com/UFAL-DSG/alex/blob/master/alex/applications/PublicTransportInfoCS/lm/build.py

Note that this CB-model estimation has several drawbacks. The biggest one is that we do not compute bigram (or higher n-gram) estimates when there are two classes in one bigram, e.g. "I want connection CITY CITY". I am working on improving this as a side project. Another important problem is that we need to expand the LM with instances of the classes, which significantly increases the size of the lexicon and also the number of higher-order n-grams in the LM. I was not sure whether you want to do it this way in Kaldi or whether you want to do it on the FST level.

PS: I attached the Czech class file classes.txt.zip
<https://drive.google.com/file/d/0B_cd-iN3UhaVOFpIWGlic0F5cUU/edit?usp=drive_web>

On Wed, Jun 17, 2015 at 10:11 PM, Sandeep Reddy <san...@go...> wrote:
> Does the kaldi recipe do Class LM? Or can you add it to the recipe? That
> would make the whole process so much easier. I don't mind if the words
> are Czech.
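A minimal, illustrative sketch of the class-expansion step described at the top of this message (this is not the actual build.py from the Alex repository; the class names, their members, and the brute-force enumeration of all combinations are assumptions made only for illustration):

```python
# Sketch of expanding class tokens into member words for word-level LM training.
# NOT the Alex build.py; classes and members below are made up for illustration.
import itertools

CLASSES = {
    "CITY": ["boston", "prague", "seattle"],
    "FOOD_TYPE": ["indian", "chinese", "thai"],
}

def expand_sentence(tokens):
    """Yield every word-level variant of a sentence containing class tokens."""
    slots = [CLASSES.get(t, [t]) for t in tokens]   # non-class tokens stay as-is
    for combo in itertools.product(*slots):
        yield " ".join(combo)

if __name__ == "__main__":
    for variant in expand_sentence("i want connection from CITY to CITY".split()):
        print(variant)
```

With only three members per class, the two CITY slots in this toy sentence already expand to nine variants, which is exactly the lexicon and higher-order n-gram blow-up mentioned above.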
--
Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, ond...@gm...
|
From: Ondrej P. <ond...@gm...> - 2015-06-17 20:08:21
|
For the Czech data we are running the system live with Kaldi and we use a class LM. For the English data I will give you a few examples off the top of my head:

PRICE_RANGE - cheap, middle price-range, ...
FOOD_TYPE - Indian, Chinese, ...
LOCATION - city center, Chesterton area, ...

We will try to find the class definitions, since we are not running that system.

Ondrej

On Wed, Jun 17, 2015 at 10:01 PM, Sandeep Reddy <san...@go...> wrote:
> Ondrej,
> I'll run the Vystadial recipe and see what opportunities are there. Did
> somebody already make a class LM on it or at least define what the
> potential classes are? I hadn't looked into it earlier.
> Thanks
> Nagendra
>
> On Wed, Jun 17, 2015 at 3:42 AM, Ondrej Platek <ond...@gm...> wrote:
>> Dear all,
>>
>> thanks to the reminder from Dimitris, I realized that the Vystadial
>> dataset is very convenient for class-based LM / LM grafting. As the
>> scripts for Vystadial CS & EN are already in Kaldi, it may be a
>> convenient starting dataset, because it contains transcriptions of user
>> utterances from communication with a spoken dialogue system where we
>> have the classes defined.
>>
>> See scripts:
>> https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_en
>> https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_cz
>>
>> See data (scroll to the bottom to download the datasets):
>> http://hdl.handle.net/11858/00-097C-0000-0023-4671-4 (en)
>> http://hdl.handle.net/11858/00-097C-0000-0023-4670-6 (cs)
>>
>> We can probably recreate / find the list of words in the classes for
>> English if there is interest. For Czech this should be no problem at all.
>>
>> Please let me know if you are interested in these datasets and the
>> lists of classes and their members.
>>
>> Ondra
>>
>> PS: Currently we use a class-based (CB) LM which we later expand to a
>> full LM in ARPA format and then create G.fst as in the standard use
>> case. It is not an optimal approach, but it works for us. If you want to
>> know how we are modeling the CB LM just let me know; I am working on a
>> slight improvement of it right now, so I am interested in improving it.
>>
>> On Tue, May 26, 2015 at 8:11 PM, Kirill Katsnelson
>> <kir...@sm...> wrote:
>>> Speaking about data set preprocessing only, will the Stanford NLP POS
>>> tagger pull the trick?
>>>
>>> -kkm
>>>
>>> > -----Original Message-----
>>> > From: Nagendra Goel [mailto:nag...@go...]
>>> > Sent: 2015-05-24 1511
>>> > Subject: Re: [Kaldi-users] LM grafting
>>> >
>>> > A systematic way of identifying special elements in text would be
>>> > very useful. Currently NSW-EXPAND from Festival conflicts with this
>>> > sub-grammar approach, although otherwise it's a good LM
>>> > pre-processing step.
>>> >
>>> > Nagendra Kumar Goel
>>> >
>>> > On May 24, 2015 4:45 PM, "Matthew Aylett" <mat...@gm...> wrote:
>>> >
>>> > Not sure if this is relevant to this thread, but in the speech
>>> > synthesis system branch we have a very early text normaliser which
>>> > (when complete) will detect things like phone numbers, addresses,
>>> > currencies etc. The output from this could then be used to inform
>>> > language model building. Currently it deals with symbols and
>>> > tokenisations in English.
>>> >
>>> > Potentially (although I wasn't currently planning on this), the text
>>> > normaliser could be written in Thrax (based on OpenFst, authored by
>>> > Richard Sproat, I believe).
>>> > However, if this approach would benefit ASR as well, then it might
>>> > be worth doing it this way rather than my plan of a simple greedy
>>> > normaliser.
>>> >
>>> > v best
>>> >
>>> > Matthew Aylett
>>> >
>>> > On Sun, May 24, 2015 at 8:34 AM, Dimitris Vassos
>>> > <dva...@gm...> wrote:
>>> >
>>> > We have access to several corpora and we are trying to put together
>>> > something appropriate. In the next couple of days, we will also
>>> > volunteer a server to set it all up and run the tests.
>>> >
>>> > Dimitris
>>> >
>>> > > On 24 May 2015, at 02:06, Daniel Povey <dp...@gm...> wrote:
>>> > >
>>> > > One possibility is to use a completely open-source setup, e.g.
>>> > > Voxforge, and forget about the "has a clear advantage" requirement.
>>> > > E.g. target anything that looks like a year, and make a grammar
>>> > > for years.
>>> > > Dan
>>> > >
>>> > >> On Fri, May 22, 2015 at 6:32 AM, Nagendra Goel
>>> > >> <nag...@go...> wrote:
>>> > >> Since I cannot volunteer my environment, do you recommend another
>>> > >> environment where this can be prototyped and where you can check
>>> > >> in some class LM recipe that has an advantage?
>>> > >>
>>> > >> Nagendra Kumar Goel
>>> > >>
>>> > >>> On May 21, 2015 11:01 PM, "Dimitris Vassos"
>>> > >>> <dva...@gm...> wrote:
>>> > >>>
>>> > >>> +1 for the class-based LMs. I have also been interested in this
>>> > >>> functionality for some time now, so I will be more than happy to
>>> > >>> try out the current implementation, if possible.
>>> > >>>
>>> > >>> Thanks
>>> > >>> Dimitris
>>> > >>>
>>> > >>>> On 22 May 2015, at 01:34, kal...@li... wrote
>>> > >>>> (Kaldi-users Digest, Vol 29, Issue 15):
>>> > >>>>
>>> > >>>> Message: 1
>>> > >>>> Date: Thu, 21 May 2015 15:04:04 -0400
>>> > >>>> From: Daniel Povey <dp...@gm...>
>>> > >>>> Subject: Re: [Kaldi-users] LM grafting
>>> > >>>>
>>> > >>>> The general approach is to create an FST for the little
>>> > >>>> language model, and then to use fstreplace to replace instances
>>> > >>>> of a particular symbol in the top-level language model with
>>> > >>>> that FST. The tricky part is ensuring that the result is
>>> > >>>> determinizable after composing with the lexicon. In general our
>>> > >>>> solution is to add special disambiguation symbols at the
>>> > >>>> beginning and end of each of the sub-FSTs, and of course making
>>> > >>>> sure that the sub-FSTs are themselves determinizable.
>>> > >>>> Dan
>>> > >>>>
>>> > >>>>> On Thu, May 21, 2015 at 3:01 PM, Sean True
>>> > >>>>> <se...@se...> wrote:
>>> > >>>>> That's a subject of some general interest. Is there a
>>> > >>>>> discussion of the general approach that was taken somewhere?
>>> > >>>>>
>>> > >>>>> -- Sean True, Semantic Machines
>>> > >>>>>
>>> > >>>>>> On Thu, May 21, 2015 at 2:14 PM, Daniel Povey
>>> > >>>>>> <dp...@gm...> wrote:
>>> > >>>>>> Nagendra Goel has worked on some example scripts for this
>>> > >>>>>> type of thing, and with Hainan we were working on trying to
>>> > >>>>>> get it cleaned up and checked in, but he's going for an
>>> > >>>>>> internship so it will have to wait. But Nagendra might be
>>> > >>>>>> willing to share it with you.
>>> > >>>>>> Dan
>>> > >>>>>>
>>> > >>>>>>> On Thu, May 21, 2015 at 2:10 PM, Kirill Katsnelson
>>> > >>>>>>> <kir...@sm...> wrote:
>>> > >>>>>>>> Suppose I have a language model where one token (a "word")
>>> > >>>>>>>> is a pointer to a whole another LM. This is a practical
>>> > >>>>>>>> case when you expect an abrupt change in model, a clear
>>> > >>>>>>>> example being "my phone number is..." and then you'd expect
>>> > >>>>>>>> them rattling a string of digits. Is there any support in
>>> > >>>>>>>> kaldi for this?
>>> > >>>>>>>>
>>> > >>>>>>>> Thanks,
>>> > >>>>>>>> -kkm
>>> > >>>>
>>> > >>>> Message: 2
>>> > >>>> Date: Thu, 21 May 2015 19:24:38 +0000
>>> > >>>> From: Kirill Katsnelson <kir...@sm...>
>>> > >>>> Subject: Re: [Kaldi-users] LM grafting
>>> > >>>>
>>> > >>>> Also, from the practical standpoint, backoff/discounting
>>> > >>>> weights usually need to be massaged. Otherwise, when the
>>> > >>>> grafted LM is small and the main LM is large, the little model
>>> > >>>> will tend to shoehorn an utterance into itself rather than let
>>> > >>>> go of it. In my phone number example, everything becomes digits
>>> > >>>> once the phone number starts.
>>> > >>>>
>>> > >>>> -kkm
>>> > >>>>
>>> > >>>> Message: 3
>>> > >>>> Date: Thu, 21 May 2015 15:29:54 -0400
>>> > >>>> From: Hainan Xu <hai...@gm...>
>>> > >>>> Subject: Re: [Kaldi-users] LM grafting
>>> > >>>>
>>> > >>>> There is a paper in ICASSP 2015 that described some very
>>> > >>>> similar idea: "Improved recognition of contact names in voice
>>> > >>>> commands".
>>> > >>>>
>>> > >>>> -- Hainan
>>> > >>>> >>> > >>>> ------------------------------ >>> > >>>> >>> > >>>> >>> > >>>> >>> > ----------------------------------------------------------------------- >>> > - >>> > ------ >>> > >>>> One dashboard for servers and applications across >>> > Physical-Virtual-Cloud >>> > >>>> Widest out-of-the-box monitoring support with 50+ >>> > applications >>> > >>>> Performance metrics, stats and reports that give you >>> > Actionable Insights >>> > >>>> Deep dive visibility with transaction tracing using >>> > APM Insight. >>> > >>>> >>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> > >>>> >>> > >>>> ------------------------------ >>> > >>>> >>> > >>>> _______________________________________________ >>> > >>>> Kaldi-users mailing list >>> > >>>> Kal...@li... >>> > >>>> >>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> > >>>> >>> > >>>> >>> > >>>> End of Kaldi-users Digest, Vol 29, Issue 15 >>> > >>>> ******************************************* >>> > >>> >>> > >>> >>> > >>> >>> > ----------------------------------------------------------------------- >>> > - >>> > ------ >>> > >>> One dashboard for servers and applications across >>> > Physical-Virtual-Cloud >>> > >>> Widest out-of-the-box monitoring support with 50+ >>> > applications >>> > >>> Performance metrics, stats and reports that give you >>> > Actionable Insights >>> > >>> Deep dive visibility with transaction tracing using >>> APM >>> > Insight. >>> > >>> >>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> > >>> _______________________________________________ >>> > >>> Kaldi-users mailing list >>> > >>> Kal...@li... >>> > >>> >>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> > >> >>> > >> >>> > >> >>> > ----------------------------------------------------------------------- >>> > - >>> > ------ >>> > >> One dashboard for servers and applications across >>> > Physical-Virtual-Cloud >>> > >> Widest out-of-the-box monitoring support with 50+ >>> > applications >>> > >> Performance metrics, stats and reports that give you >>> > Actionable Insights >>> > >> Deep dive visibility with transaction tracing using >>> APM >>> > Insight. >>> > >> >>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> > >> _______________________________________________ >>> > >> Kaldi-users mailing list >>> > >> Kal...@li... >>> > >> >>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> > >> >>> > >>> > >>> > ----------------------------------------------------------------------- >>> > - >>> > ------ >>> > One dashboard for servers and applications across >>> Physical- >>> > Virtual-Cloud >>> > Widest out-of-the-box monitoring support with 50+ >>> > applications >>> > Performance metrics, stats and reports that give you >>> > Actionable Insights >>> > Deep dive visibility with transaction tracing using APM >>> > Insight. >>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> > _______________________________________________ >>> > Kaldi-users mailing list >>> > Kal...@li... 
>>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> > >>> > >>> > >>> > >>> > >>> > ----------------------------------------------------------------------- >>> > - >>> > ------ >>> > One dashboard for servers and applications across Physical- >>> > Virtual-Cloud >>> > Widest out-of-the-box monitoring support with 50+ applications >>> > Performance metrics, stats and reports that give you Actionable >>> > Insights >>> > Deep dive visibility with transaction tracing using APM Insight. >>> > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> > _______________________________________________ >>> > Kaldi-users mailing list >>> > Kal...@li... >>> > https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> > >>> > >>> >>> >>> ------------------------------------------------------------------------------ >>> One dashboard for servers and applications across Physical-Virtual-Cloud >>> Widest out-of-the-box monitoring support with 50+ applications >>> Performance metrics, stats and reports that give you Actionable Insights >>> Deep dive visibility with transaction tracing using APM Insight. >>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y >>> _______________________________________________ >>> Kaldi-users mailing list >>> Kal...@li... >>> https://lists.sourceforge.net/lists/listinfo/kaldi-users >>> >> >> >> >> -- >> Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, >> ond...@gm... >> >> >> ------------------------------------------------------------------------------ >> >> _______________________________________________ >> Kaldi-users mailing list >> Kal...@li... >> https://lists.sourceforge.net/lists/listinfo/kaldi-users >> >> > -- Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, ond...@gm... |
From: Kirill K. <kir...@sm...> - 2015-06-17 17:53:33
|
> From: David Warde-Farley [mailto:d.w...@gm...] > Sent: 2015-06-17 0028 > Subject: Re: [Kaldi-users] non-cluster usage of Librispeech s5 recipe? > > Many thanks for the pointers. On your setup, how long does the entire > recipe take without decoding? A few hours to train the tri5 model (10 to 15 hours I guess, on a 6-core CPU), then maybe 4-5 days to train the nnet2 on the 460 hour data set on the 980 GPU board. I did not go any further than that. Guess it would take at least twice that time to process the 1000 hour set. > For the life of me I can't figure out where num_jobs_nnet is being set > (it's being written in the egs_dir as 4, I've changed it everywhere I > could find it.) I did not have to change anything in this regard, except for the number of jobs argument to train_multisplice_accel2 in run_nnet2_ms.sh. What file the number of jobs was saved into? Some steps rely on the number of jobs in previous steps. Sometimes the number of jobs sticks in the file which is not recreated. It may be easier to start clean. Do you run the discriminative training script (run_nnet2_ms_disc.sh)? I did not. -kkm > On Fri, Jun 12, 2015 at 7:00 PM, Kirill Katsnelson > <kir...@sm...> wrote: > >> From: David Warde-Farley [mailto:d.w...@gm...] > >> Subject: [Kaldi-users] non-cluster usage of Librispeech s5 recipe? > >> > >> I'm trying to > >> use the s5 recipe for LibriSpeech on a single machine with a single > >> GPU. I've modified cmd.sh to use run.pl. > > > > I ran it on a single machine, it requires a few modifications. Note > that it took almost a week on a 6-core 4.1GHz overclocked i7-5930K CPU > and GeForce 980 to train on the 500 hour set. > > > >> After about a day, I see a lot of background processes like > >> gmm-latgen- faster, lattice-add-penalty, lattice-scale, etc. that > >> have been launched in the background (the terminal is actually free, > >> which suggests the run.sh script has terminated...). I'm not totally > >> sure what's going on, or how to find out. > > > > In librispeech/s5/run.sh, look for decode commands in subshells, like > > > > ( > > utils/mkgraph.sh data/lang_nosp_test_tgsmall \ > > exp/tri4b exp/tri4b/graph_nosp_tgsmall || exit 1; > > for test in test_clean test_other dev_clean dev_other; do > > steps/decode_fmllr.sh --nj 20 --cmd "$decode_cmd" \ > > . . . > > )& > > > > These decodes are quite slow, if you run them on your machine. They > are slower than other part of the script. In the end, they are > accumulating, eating CPU and blowing up out of memory. They are not > essential for NN training, except possibly for the mkgraph script. The > results are useful to check if you are getting expected WER, but really > not essential. You may either disable these decode blocks completely > (except mkgraph invocations) or remove the '&' at the end to run them > synchronously. NB they will take the most preparation time prior to NN > training step. Dunno about your machine but give it an extra couple > days to complete with these. > > > >> One thing I noticed earlier is that the script was trying to spawn > >> multiple GPU jobs, but this GPU is configured (by administrators) to > >> permit at most one CUDA process, and so I saw "3 of 4 jobs failed" > >> messages. Would these jobs have been retried? > > > > They will not, but you can restart NN training from the last step. > Modify local/online/run_nnet2_ms.sh so that > steps/nnet2/train_multisplice_accel2.sh is invoked with switches "-- > num-jobs-initial 1 --num-jobs-final 1" (the defaults are larger). 
When > running local/online/run_nnet2_ms.sh, pass it "--stage 7" (this is the > default) and "--train_stage N", where N is the iteration you are > restarting from. > > > > Even without the 1-job limit, you probably won't benefit from running > more than 1 at a time. > > > > -kkm |
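To make the restart procedure above concrete, here is a minimal sketch. The experiment directory name (exp/nnet2_online/nnet_ms_a) and the <iteration>.mdl naming are assumptions about a typical online-nnet2 setup, not something stated in the thread, so check what your copy of local/online/run_nnet2_ms.sh actually writes; the --stage and --train_stage switches are the ones described above.

  cd egs/librispeech/s5

  # 1. Edit local/online/run_nnet2_ms.sh so that
  #    steps/nnet2/train_multisplice_accel2.sh is invoked with
  #    --num-jobs-initial 1 --num-jobs-final 1 (a single CUDA job at a time).

  # 2. Find the last training iteration that completed (path is hypothetical):
  dir=exp/nnet2_online/nnet_ms_a
  n=$(ls $dir/[0-9]*.mdl 2>/dev/null | sed 's|.*/||; s|\.mdl$||' | sort -n | tail -n1)

  # 3. Resume: --stage 7 skips the already-finished preparation stages, and
  #    --train_stage resumes neural-net training from iteration $n.
  #    (If no iteration model exists yet, run without --train_stage.)
  local/online/run_nnet2_ms.sh --stage 7 --train_stage "$n"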
From: Ondrej P. <ond...@gm...> - 2015-06-17 07:42:28
|
Dear all, thanks to reminder of Dimitris, I realized that the Vystadial dataset is very convenient for Class based LM/ LM grafting. As the scripts for Vystadial Cs & En are already in Kaldi it may be convenient starting data because they contain transcription of user utterances from communication with spoken dialogue system where we have the classes defined. See scritps: https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_en https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_cz See data (scroll to the bottom to download the datasets): http://hdl.handle.net/11858/00-097C-0000-0023-4671-4 (en) http://hdl.handle.net/11858/00-097C-0000-0023-4670-6 (cs) We can probably recreate / find the list of words in the classes for English if there is interest. For Czech this should be no problem at all. Please, let me know if you are interested in these datasets and the lists of classes and their members. Ondra PS: Currently, we used classed based (CB) LM which we later expand to full LM in arpa format than create G.fst as in standard use case. It is not optimal attitude but it works for us. If you want to know how we are modeling the CBLM just let me know, I am working on slight improvement of it right now, so I am interested in improving it. On Tue, May 26, 2015 at 8:11 PM, Kirill Katsnelson < kir...@sm...> wrote: > Speaking about data set preprocessing only, will Stanford NLP POS tagger > pull the trick? > > -kkm > > > -----Original Message----- > > From: Nagendra Goel [mailto:nag...@go...] > > Sent: 2015-05-24 1511 > > To: Matthew Aylett > > Cc: Dimitris Vassos; kal...@li... > > Subject: Re: [Kaldi-users] LM grafting > > > > A systematic way for identifying special elements in text will be very > > useful. Currently NSW-EXPAND from festival conflicts with this sub- > > grammar approach although otherwise it's a good lm pre-processing step. > > > > Nagendra Kumar Goel > > > > On May 24, 2015 4:45 PM, "Matthew Aylett" <mat...@gm...> > > wrote: > > > > > > Not sure if this is relevant to this thread. But in the speech > > synthesis system branch we have a very early text normaliser which > > (when > > complete) will detect things like phone numbers addresses, currencies > > etc. The output form this could then be used to inform language model > > building. Currently it deals with symbols and tokenisations in English. > > > > Potentially `(although I wasn't currently planning on this), the > > text normaliser could be written in thrax - based on openfst - authored > > by Richard Sproat I believe). However if this approach would benefit > > ASR as well then it might be worth doing it this way rather than my > > plan of a simple greedy normaliser. > > > > > > v best > > > > Matthew Aylett > > > > > > On Sun, May 24, 2015 at 8:34 AM, Dimitris Vassos > > <dva...@gm...> wrote: > > > > > > We have access to several corpora and we are trying to put > > together something appropriate. > > > > In the next couple of days, we will also volunteer a server > > to set it all up and run the tests. > > > > Dimitris > > > > > On 24 Μαΐ 2015, at 02:06, Daniel Povey <dp...@gm... > > > > wrote: > > > > > > One possibility is to use a completely open-source setup, > > e.g. > > > Voxforge, and forget about the "has a clear advantage" > > requirement. > > > E.g. target anything that looks like a year, and make a > > grammar for > > > years. 
> > > Dan > > > > > > > > > On Fri, May 22, 2015 at 6:32 AM, Nagendra Goel > > > <nag...@go...> wrote: > > >> Since I cannot volunteer my enviornment, do you > > recommend another > > >> enviornment where this can be prototyped and where you > > can check in some > > >> class lm recipe that has advantage. > > >> > > >> Nagendra > > >> > > >> Nagendra Kumar Goel > > >> > > >>> On May 21, 2015 11:01 PM, "Dimitris Vassos" > > <dva...@gm...> wrote: > > >>> > > >>> +1 for the class-based LMs. I have also been interested > > in this > > >>> functionality for some time now, so will be more than > > happy to try out the > > >>> current implementation, if possible. > > >>> > > >>> Thanks > > >>> Dimitris > > >>> > > >>>> On 22 Μαΐ 2015, at 01:34, > > kal...@li... > > >>>> wrote: > > >>>> > > >>>> Send Kaldi-users mailing list submissions to > > >>>> kal...@li... > > >>>> > > >>>> To subscribe or unsubscribe via the World Wide Web, > > visit > > >>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> or, via email, send a message with subject or body > > 'help' to > > >>>> kal...@li... > > >>>> > > >>>> You can reach the person managing the list at > > >>>> kal...@li... > > >>>> > > >>>> When replying, please edit your Subject line so it is > > more specific > > >>>> than "Re: Contents of Kaldi-users digest..." > > >>>> > > >>>> > > >>>> Today's Topics: > > >>>> > > >>>> 1. Re: LM grafting (Daniel Povey) > > >>>> 2. Re: LM grafting (Kirill Katsnelson) > > >>>> 3. Re: LM grafting (Hainan Xu) > > >>>> 4. Re: LM grafting (Sean True) > > >>>> > > >>>> > > >>>> > > ---------------------------------------------------------------------- > > >>>> > > >>>> Message: 1 > > >>>> Date: Thu, 21 May 2015 15:04:04 -0400 > > >>>> From: Daniel Povey <dp...@gm...> > > >>>> Subject: Re: [Kaldi-users] LM grafting > > >>>> To: Sean True <se...@se...> > > >>>> Cc: Hainan Xu <hai...@gm...>, > > >>>> "kal...@li..." > > >>>> <kal...@li...>, Kirill > > Katsnelson > > >>>> <kir...@sm...> > > >>>> Message-ID: > > >>>> > > <CAEWAuySHaXwdNJZAoL6CanzHth=k4Y...@ma... > > <mailto:k4YJVsBiAfEuFDFMvY%2B...@ma...> > > > >>>> Content-Type: text/plain; charset=UTF-8 > > >>>> > > >>>> The general approach is to create an FST for the > > little language > > >>>> model, and then to use fstreplace to replace instances > > of a particular > > >>>> symbol in the top-level language model, with that FST. > > >>>> The tricky part is ensuring that the result is > > determinizable after > > >>>> composing with the lexicon. In general our solution > > is to add special > > >>>> disambiguation symbols at the beginning and end of > > each of the > > >>>> sub-FSTs, and of course making sure that the sub-FSTs > > are themselves > > >>>> determinizable. > > >>>> Dan > > >>>> > > >>>> > > >>>>> On Thu, May 21, 2015 at 3:01 PM, Sean True > > <se...@se...> > > >>>>> wrote: > > >>>>> That's a subject of some general interest. Is there a > > discussion of the > > >>>>> general approach that was taken somewhere? > > >>>>> > > >>>>> -- Sean > > >>>>> > > >>>>> Sean True > > >>>>> Semantic Machines > > >>>>> > > >>>>>> On Thu, May 21, 2015 at 2:14 PM, Daniel Povey > > <dp...@gm...> > > >>>>>> wrote: > > >>>>>> > > >>>>>> Nagendra Goel has worked on some example scripts for > > this type of > > >>>>>> thing, and with Hainan we were working on trying to > > get it cleaned up > > >>>>>> and checked in, but he's going for an internship so > > it will have to > > >>>>>> wait. But Nagendra might be willing to share it > > with you. 
> > >>>>>> Dan > > >>>>>> > > >>>>>> > > >>>>>> On Thu, May 21, 2015 at 2:10 PM, Kirill Katsnelson > > >>>>>> <kir...@sm...> wrote: > > >>>>>>> Suppose I have a language model where one token (a > > "word") is a > > >>>>>>> pointer > > >>>>>>> to a whole another LM. This is a practical case > > when you expect an > > >>>>>>> abrupt > > >>>>>>> change in model, a clear example being "my phone > > number is..." and > > >>>>>>> then > > >>>>>>> you'd expect them rattling a string of digits. > > Is there any support > > >>>>>>> in kaldi > > >>>>>>> for this? > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> > > >>>>>>> -kkm > > >>>>>>> > > >>>>>>> > > >>>>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>>>> One dashboard for servers and applications across > > >>>>>>> Physical-Virtual-Cloud > > >>>>>>> Widest out-of-the-box monitoring support with > > 50+ applications > > >>>>>>> Performance metrics, stats and reports that give > > you Actionable > > >>>>>>> Insights > > >>>>>>> Deep dive visibility with transaction tracing using > > APM Insight. > > >>>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>>> _______________________________________________ > > >>>>>>> Kaldi-users mailing list > > >>>>>>> Kal...@li... > > >>>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>>> One dashboard for servers and applications across > > >>>>>> Physical-Virtual-Cloud > > >>>>>> Widest out-of-the-box monitoring support with 50+ > > applications > > >>>>>> Performance metrics, stats and reports that give you > > Actionable > > >>>>>> Insights > > >>>>>> Deep dive visibility with transaction tracing using > > APM Insight. > > >>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>> _______________________________________________ > > >>>>>> Kaldi-users mailing list > > >>>>>> Kal...@li... > > >>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> > > >>>> > > >>>> > > >>>> ------------------------------ > > >>>> > > >>>> Message: 2 > > >>>> Date: Thu, 21 May 2015 19:24:38 +0000 > > >>>> From: Kirill Katsnelson > > <kir...@sm...> > > >>>> Subject: Re: [Kaldi-users] LM grafting > > >>>> To: "dp...@gm..." <dp...@gm...>, Sean True > > >>>> <se...@se...> > > >>>> Cc: Hainan Xu <hai...@gm...>, > > >>>> "kal...@li..." > > >>>> <kal...@li...> > > >>>> Message-ID: > > >>>> > > >>>> > > <CY1...@CY...d.out > > l > > ook.com> > > >>>> > > >>>> Content-Type: text/plain; charset="utf-8" > > >>>> > > >>>> Also, from the practical standpoint, > > backoff/discounting weights usually > > >>>> need to be massaged. Otherwise when the grafted LM is > > small and the main LM > > >>>> is large, the little model will tend to shoehorn an > > utterance into itself > > >>>> rather than let go of it. In my phone number example, > > everything becomes > > >>>> digits once the phone number starts. > > >>>> > > >>>> -kkm > > >>>> > > >>>>> -----Original Message----- > > >>>>> From: Daniel Povey [mailto:dp...@gm...] > > >>>>> Sent: 2015-05-21 1204 > > >>>>> To: Sean True > > >>>>> Cc: Kirill Katsnelson; Nagendra Goel; Hainan Xu; > > kaldi- > > >>>>> us...@li... 
> > >>>>> Subject: Re: [Kaldi-users] LM grafting > > >>>>> > > >>>>> The general approach is to create an FST for the > > little language model, > > >>>>> and then to use fstreplace to replace instances of a > > particular symbol > > >>>>> in the top-level language model, with that FST. > > >>>>> The tricky part is ensuring that the result is > > determinizable after > > >>>>> composing with the lexicon. In general our solution > > is to add special > > >>>>> disambiguation symbols at the beginning and end of > > each of the sub- > > >>>>> FSTs, and of course making sure that the sub-FSTs are > > themselves > > >>>>> determinizable. > > >>>>> Dan > > >>>>> > > >>>>> > > >>>>> On Thu, May 21, 2015 at 3:01 PM, Sean True > > <se...@se...> > > >>>>> wrote: > > >>>>>> That's a subject of some general interest. Is there > > a discussion of > > >>>>>> the general approach that was taken somewhere? > > >>>>>> > > >>>>>> -- Sean > > >>>>>> > > >>>>>> Sean True > > >>>>>> Semantic Machines > > >>>>>> > > >>>>>> On Thu, May 21, 2015 at 2:14 PM, Daniel Povey > > <dp...@gm...> > > >>>>> wrote: > > >>>>>>> > > >>>>>>> Nagendra Goel has worked on some example scripts > > for this type of > > >>>>>>> thing, and with Hainan we were working on trying to > > get it cleaned > > >>>>> up > > >>>>>>> and checked in, but he's going for an internship so > > it will have to > > >>>>>>> wait. But Nagendra might be willing to share it > > with you. > > >>>>>>> Dan > > >>>>>>> > > >>>>>>> > > >>>>>>> On Thu, May 21, 2015 at 2:10 PM, Kirill Katsnelson > > >>>>>>> <kir...@sm...> wrote: > > >>>>>>>> Suppose I have a language model where one token (a > > "word") is a > > >>>>>>>> pointer to a whole another LM. This is a practical > > case when you > > >>>>>>>> expect an abrupt change in model, a clear example > > being "my phone > > >>>>>>>> number is..." and then you'd expect them rattling > > a string of > > >>>>>>>> digits. Is there any support in kaldi for this? > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> > > >>>>>>>> -kkm > > >>>>>>>> > > >>>>>>>> > > ------------------------------------------------------------------ > > >>>>> - > > >>>>>>>> ----------- One dashboard for servers and > > applications across > > >>>>>>>> Physical-Virtual-Cloud Widest out-of-the-box > > monitoring support > > >>>>>>>> with 50+ applications Performance metrics, stats > > and reports that > > >>>>>>>> give you Actionable Insights Deep dive visibility > > with transaction > > >>>>>>>> tracing using APM Insight. > > >>>>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>>>> _______________________________________________ > > >>>>>>>> Kaldi-users mailing list > > >>>>>>>> Kal...@li... > > >>>>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>>>>> > > >>>>>>> > > >>>>>>> > > -------------------------------------------------------------------- > > >>>>> - > > >>>>>>> --------- One dashboard for servers and > > applications across > > >>>>>>> Physical-Virtual-Cloud Widest out-of-the-box > > monitoring support with > > >>>>>>> 50+ applications Performance metrics, stats and > > reports that give > > >>>>> you > > >>>>>>> Actionable Insights Deep dive visibility with > > transaction tracing > > >>>>>>> using APM Insight. > > >>>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>>> _______________________________________________ > > >>>>>>> Kaldi-users mailing list > > >>>>>>> Kal...@li... 
> > >>>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> > > >>>> ------------------------------ > > >>>> > > >>>> Message: 3 > > >>>> Date: Thu, 21 May 2015 15:29:54 -0400 > > >>>> From: Hainan Xu <hai...@gm...> > > >>>> Subject: Re: [Kaldi-users] LM grafting > > >>>> To: Daniel Povey <dp...@gm...> > > >>>> Cc: Sean True <se...@se...>, > > >>>> "kal...@li..." > > >>>> <kal...@li...>, Kirill > > Katsnelson > > >>>> <kir...@sm...> > > >>>> Message-ID: > > >>>> > > <CALP+BDZvJP-2cZ+fEJEXaMaVWzgy63mtc=J1E...@ma...> > > >>>> Content-Type: text/plain; charset="utf-8" > > >>>> > > >>>> There is a paper in ICASSP 2015 that described some > > very similar idea: > > >>>> > > >>>> Improved recognition of contact names in voice > > commands > > >>>> > > >>>>> On Thu, May 21, 2015 at 3:04 PM, Daniel Povey > > <dp...@gm...> wrote: > > >>>>> > > >>>>> The general approach is to create an FST for the > > little language > > >>>>> model, and then to use fstreplace to replace > > instances of a particular > > >>>>> symbol in the top-level language model, with that > > FST. > > >>>>> The tricky part is ensuring that the result is > > determinizable after > > >>>>> composing with the lexicon. In general our solution > > is to add special > > >>>>> disambiguation symbols at the beginning and end of > > each of the > > >>>>> sub-FSTs, and of course making sure that the sub-FSTs > > are themselves > > >>>>> determinizable. > > >>>>> Dan > > >>>>> > > >>>>> > > >>>>> On Thu, May 21, 2015 at 3:01 PM, Sean True > > <se...@se...> > > >>>>> wrote: > > >>>>>> That's a subject of some general interest. Is there > > a discussion of > > >>>>>> the > > >>>>>> general approach that was taken somewhere? > > >>>>>> > > >>>>>> -- Sean > > >>>>>> > > >>>>>> Sean True > > >>>>>> Semantic Machines > > >>>>>> > > >>>>>>> On Thu, May 21, 2015 at 2:14 PM, Daniel Povey > > <dp...@gm...> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>> Nagendra Goel has worked on some example scripts > > for this type of > > >>>>>>> thing, and with Hainan we were working on trying to > > get it cleaned up > > >>>>>>> and checked in, but he's going for an internship so > > it will have to > > >>>>>>> wait. But Nagendra might be willing to share it > > with you. > > >>>>>>> Dan > > >>>>>>> > > >>>>>>> > > >>>>>>> On Thu, May 21, 2015 at 2:10 PM, Kirill Katsnelson > > >>>>>>> <kir...@sm...> wrote: > > >>>>>>>> Suppose I have a language model where one token (a > > "word") is a > > >>>>> pointer > > >>>>>>>> to a whole another LM. This is a practical case > > when you expect an > > >>>>> abrupt > > >>>>>>>> change in model, a clear example being "my phone > > number is..." and > > >>>>> then > > >>>>>>>> you'd expect them rattling a string of digits. > > Is there any support > > >>>>> in kaldi > > >>>>>>>> for this? > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> > > >>>>>>>> -kkm > > >>>>> > > >>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>>>>> One dashboard for servers and applications across > > >>>>> Physical-Virtual-Cloud > > >>>>>>>> Widest out-of-the-box monitoring support with > > 50+ applications > > >>>>>>>> Performance metrics, stats and reports that give > > you Actionable > > >>>>> Insights > > >>>>>>>> Deep dive visibility with transaction tracing > > using APM Insight. 
> > >>>>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>>>> _______________________________________________ > > >>>>>>>> Kaldi-users mailing list > > >>>>>>>> Kal...@li... > > >>>>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>>> > > >>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>>>> One dashboard for servers and applications across > > >>>>>>> Physical-Virtual-Cloud > > >>>>>>> Widest out-of-the-box monitoring support with > > 50+ applications > > >>>>>>> Performance metrics, stats and reports that give > > you Actionable > > >>>>>>> Insights > > >>>>>>> Deep dive visibility with transaction tracing using > > APM Insight. > > >>>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>>> _______________________________________________ > > >>>>>>> Kaldi-users mailing list > > >>>>>>> Kal...@li... > > >>>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> > > >>>> > > >>>> > > >>>> -- > > >>>> - Hainan > > >>>> -------------- next part -------------- > > >>>> An HTML attachment was scrubbed... > > >>>> > > >>>> ------------------------------ > > >>>> > > >>>> Message: 4 > > >>>> Date: Thu, 21 May 2015 15:01:51 -0400 > > >>>> From: Sean True <se...@se...> > > >>>> Subject: Re: [Kaldi-users] LM grafting > > >>>> To: Daniel Povey <dp...@gm...> > > >>>> Cc: Hainan Xu <hai...@gm...>, > > >>>> "kal...@li..." > > >>>> <kal...@li...>, Kirill > > Katsnelson > > >>>> <kir...@sm...> > > >>>> Message-ID: > > >>>> > > <CALtEaHntdAcmO_Ji5dxsPnT8i9M_LVuGnY0UjkJUPp=pY...@ma...> > > >>>> Content-Type: text/plain; charset="utf-8" > > >>>> > > >>>> That's a subject of some general interest. Is there a > > discussion of the > > >>>> general approach that was taken somewhere? > > >>>> > > >>>> -- Sean > > >>>> > > >>>> Sean True > > >>>> Semantic Machines > > >>>> > > >>>>> On Thu, May 21, 2015 at 2:14 PM, Daniel Povey > > <dp...@gm...> wrote: > > >>>>> > > >>>>> Nagendra Goel has worked on some example scripts for > > this type of > > >>>>> thing, and with Hainan we were working on trying to > > get it cleaned up > > >>>>> and checked in, but he's going for an internship so > > it will have to > > >>>>> wait. But Nagendra might be willing to share it with > > you. > > >>>>> Dan > > >>>>> > > >>>>> > > >>>>> On Thu, May 21, 2015 at 2:10 PM, Kirill Katsnelson > > >>>>> <kir...@sm...> wrote: > > >>>>>> Suppose I have a language model where one token (a > > "word") is a > > >>>>>> pointer > > >>>>> to a whole another LM. This is a practical case when > > you expect an > > >>>>> abrupt > > >>>>> change in model, a clear example being "my phone > > number is..." and then > > >>>>> you'd expect them rattling a string of digits. Is > > there any support in > > >>>>> kaldi for this? > > >>>>>> > > >>>>>> Thanks, > > >>>>>> > > >>>>>> -kkm > > >>>>> > > >>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>>> One dashboard for servers and applications across > > >>>>>> Physical-Virtual-Cloud > > >>>>>> Widest out-of-the-box monitoring support with 50+ > > applications > > >>>>>> Performance metrics, stats and reports that give you > > Actionable > > >>>>>> Insights > > >>>>>> Deep dive visibility with transaction tracing using > > APM Insight. 
> > >>>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>>> _______________________________________________ > > >>>>>> Kaldi-users mailing list > > >>>>>> Kal...@li... > > >>>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>>> One dashboard for servers and applications across > > >>>>> Physical-Virtual-Cloud > > >>>>> Widest out-of-the-box monitoring support with 50+ > > applications > > >>>>> Performance metrics, stats and reports that give you > > Actionable > > >>>>> Insights > > >>>>> Deep dive visibility with transaction tracing using > > APM Insight. > > >>>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>>> _______________________________________________ > > >>>>> Kaldi-users mailing list > > >>>>> Kal...@li... > > >>>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> -------------- next part -------------- > > >>>> An HTML attachment was scrubbed... > > >>>> > > >>>> ------------------------------ > > >>>> > > >>>> > > >>>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>>> One dashboard for servers and applications across > > Physical-Virtual-Cloud > > >>>> Widest out-of-the-box monitoring support with 50+ > > applications > > >>>> Performance metrics, stats and reports that give you > > Actionable Insights > > >>>> Deep dive visibility with transaction tracing using > > APM Insight. > > >>>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>>> > > >>>> ------------------------------ > > >>>> > > >>>> _______________________________________________ > > >>>> Kaldi-users mailing list > > >>>> Kal...@li... > > >>>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >>>> > > >>>> > > >>>> End of Kaldi-users Digest, Vol 29, Issue 15 > > >>>> ******************************************* > > >>> > > >>> > > >>> > > ----------------------------------------------------------------------- > > - > > ------ > > >>> One dashboard for servers and applications across > > Physical-Virtual-Cloud > > >>> Widest out-of-the-box monitoring support with 50+ > > applications > > >>> Performance metrics, stats and reports that give you > > Actionable Insights > > >>> Deep dive visibility with transaction tracing using APM > > Insight. > > >>> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >>> _______________________________________________ > > >>> Kaldi-users mailing list > > >>> Kal...@li... > > >>> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users > > >> > > >> > > >> > > ----------------------------------------------------------------------- > > - > > ------ > > >> One dashboard for servers and applications across > > Physical-Virtual-Cloud > > >> Widest out-of-the-box monitoring support with 50+ > > applications > > >> Performance metrics, stats and reports that give you > > Actionable Insights > > >> Deep dive visibility with transaction tracing using APM > > Insight. > > >> > > http://ad.doubleclick.net/ddm/clk/290420510;117567292;y > > >> _______________________________________________ > > >> Kaldi-users mailing list > > >> Kal...@li... 
> > >> > > https://lists.sourceforge.net/lists/listinfo/kaldi-users -- Ondřej Plátek, +420 737 758 650, skype:ondrejplatek, ond...@gm... |
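As an aside on the grafting mechanics described in the quoted thread, the fstreplace step can be sketched roughly as follows. Every name and label id here is hypothetical (the placeholder word $PHONENUM, its id 12345, the unused root label 1000000, the file names), and the sketch deliberately leaves out the part Dan calls tricky, namely adding disambiguation symbols at the start and end of the sub-FST and massaging its backoff weights so that composition with the lexicon stays determinizable.

  # G_top.fst    : top-level LM over words.txt, with a placeholder word
  #                "$PHONENUM" (word id 12345 in words.txt) wherever a
  #                phone number may occur.
  # G_digits.fst : the small digit-string LM, compiled over the same words.txt.
  # 1000000      : an otherwise unused label standing for the root FST.
  fstreplace G_top.fst 1000000 G_digits.fst 12345 | \
    fstrmepsilon | fstarcsort --sort_type=ilabel > G_grafted.fst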
From: David Warde-F. <d.w...@gm...> - 2015-06-17 07:27:49
|
Kirill, Many thanks for the pointers. On your setup, how long does the entire recipe take without decoding? For the life of me I can't figure out where num_jobs_nnet is being set (it's being written in the egs_dir as 4, I've changed it everywhere I could find it.) |
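For reference, the background decode blocks discussed above (the subshells in egs/librispeech/s5/run.sh that end in '&') can simply be made synchronous; the sketch below shows the edited block, with the elided decode_fmllr.sh arguments filled in with illustrative directory names that may not match the actual script.

  # Same block as in run.sh, but without the trailing '&', so the decodes run
  # one at a time instead of piling up in the background. Alternatively, keep
  # only the mkgraph.sh call (later stages still need the graph) and comment
  # out the decode loop entirely.
  (
    utils/mkgraph.sh data/lang_nosp_test_tgsmall \
      exp/tri4b exp/tri4b/graph_nosp_tgsmall || exit 1;
    for test in test_clean test_other dev_clean dev_other; do
      steps/decode_fmllr.sh --nj 20 --cmd "$decode_cmd" \
        exp/tri4b/graph_nosp_tgsmall data/$test \
        exp/tri4b/decode_nosp_tgsmall_$test   # decode dir name is illustrative
    done
  )   # <-- no '&' here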
From: Daniel P. <dp...@gm...> - 2015-06-16 22:59:09
|
Guoguo is going to fix arpa2fst tonight so that it will detect that. Later when we rewrite it we'll include that feature. Dan On Tue, Jun 16, 2015 at 6:58 PM, Kirill Katsnelson <kir...@sm...> wrote: > Holy guacamole! That was it. Thank you very very much. > > Perhaps arpa2fst v2.0 would detect such bloopers. > >> -----Original Message----- >> From: Daniel Povey [mailto:dp...@gm...] >> Sent: 2015-06-16 1526 >> To: Kirill Katsnelson >> Cc: kal...@li... >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes >> >> It turns out the problem was probably caused by the end of-sentence >> symbol </s> appearing in inappropriate places in the LM, at the start >> of n-grams rather than the end. Probably the training data was >> contaminated somehow by </s>. >> Dan >> >> >> On Tue, Jun 16, 2015 at 2:07 PM, Daniel Povey <dp...@gm...> wrote: >> >> I am currently trying to get a minimal reproduction with a script. >> Let it run for a while. I'll send you what remains of it, and hope it >> might give me an idea too. >> >> >> >> Looks like that fstdeterminize may have completed on this grammar >> >> (how do you call the thing symbolized as $G$? "grammar" sounded >> >> confusing, as I understand, but I have no other word not exceeding 2 >> >> syllables :)) >> > >> > I would call it an LM. >> > >> >>> I have left one running by mistake before going to sleep, and it >> was done. I am running one again with the time command to make sure >> this is not a fluke. So it is possible that it is not exactly non- >> determinizable, but instead takes enormous time (hours on one LM, < 1 >> sec on another). Which is the same thing from the engineering >> standpoint, close enough, as those engineering vs mathematics jokes go. >> But jokes aside, I want something more bounded for a production system, >> so I need to understand what throws it off so badly. >> > >> > I would still call it a problem. Check if your ARPA contains <eps> >> or >> > #0. I may need to add checks for this into arpa2fst (which we will >> > rewrite at some point anyway). Another problem could be weird things >> > like stray \r's which make one word seem like two in some >> > circumstances. >> > If I saw the output of arpa2fst I could probably figure out fairly >> > quickly what the problem was. The way I would debug this is to trace >> > through your LM FST from the start and follow those symbols (or >> > epsilons) on that trace from the determinization failure, and see how >> > there are two different paths. >> > It's better if you share a couple different traces, not just one, so >> > we can see what's in common. >> > >> >> Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the >> latter with the kaldi patch)? >> > >> > No, it should be faster. fstrmepsilon ∘ fstdeterminize should fail >> too. >> > >> >> Ah, and this is a Linux machine. So everything looks very very >> standard (oops. Did I just create an infinite loop by repeating a >> word?). >> > >> > I am considering changing the way the LM disambig symbols are used to >> > make this kind of problem less likely to happen in future, by having >> > several disambig symbols for the LM, one per order, instead of just >> > one. >> > >> > Dan >> > >> > >> > >> >>> -----Original Message----- >> >>> From: Daniel Povey [mailto:dp...@gm...] >> >>> Sent: 2015-06-15 2340 >> >>> To: Kirill Katsnelson >> >>> Cc: kal...@li... 
>> >>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>>
>> >>> In general SRILM language models are OK, but something weird could have
>> >>> happened, especially on an unusual platform like Windows.
>> >>> Look for duplicate lines with apparently the same n-gram on, and also
>> >>> send to me (but not to kaldi-user) the arpa LM.
>> >>> Dan
|
From: Kirill K. <kir...@sm...> - 2015-06-16 22:58:36
|
Holy guacamole! That was it. Thank you very very much. Perhaps arpa2fst v2.0 would detect such bloopers.

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-16 1526
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>
> It turns out the problem was probably caused by the end-of-sentence
> symbol </s> appearing in inappropriate places in the LM, at the start
> of n-grams rather than the end. Probably the training data was
> contaminated somehow by </s>.
> Dan
|
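On the graph-building side, the guards that the WSJ recipe puts in front of arpa2fst can be bolted onto the same kind of pipeline. This is only a sketch with placeholder names (lm.arpa, lang/), with the OOV filtering step (find_arpa_oovs.pl / remove_oovs.pl) left out for brevity; note that the greps drop only the adjacent boundary-symbol pairs, so an n-gram that merely starts with </s> before an ordinary word still needs a separate check like the one sketched earlier.

# Guard greps as in the WSJ recipe; OOV filtering omitted here.
cat lm.arpa | tr -d '\r' | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/eps2disambig.pl | utils/s2eps.pl | \
  fstcompile --isymbols=lang/words.txt --osymbols=lang/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > lang/G.fst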
From: Daniel P. <dp...@gm...> - 2015-06-16 22:26:29
|
It turns out the problem was probably caused by the end-of-sentence symbol </s> appearing in inappropriate places in the LM, at the start of n-grams rather than the end. Probably the training data was contaminated somehow by </s>.
Dan
|
From: Daniel P. <dp...@gm...> - 2015-06-16 18:07:29
|
> I am currently trying to get a minimal reproduction with a script. Let it run for a while. I'll send you what remains of it, and hope it might give me an idea too.
>
> Looks like that fstdeterminize may have completed on this grammar (how do you call the thing symbolized as $G$? "grammar" sounded confusing, as I understand, but I have no other word not exceeding 2 syllables :))

I would call it an LM.

>> I have left one running by mistake before going to sleep, and it was done. I am running one again with the time command to make sure this is not a fluke. So it is possible that it is not exactly non-determinizable, but instead takes enormous time (hours on one LM, < 1 sec on another). Which is the same thing from the engineering standpoint, close enough, as those engineering vs mathematics jokes go. But jokes aside, I want something more bounded for a production system, so I need to understand what throws it off so badly.

I would still call it a problem. Check if your ARPA contains <eps> or #0. I may need to add checks for this into arpa2fst (which we will rewrite at some point anyway). Another problem could be weird things like stray \r's which make one word seem like two in some circumstances.
If I saw the output of arpa2fst I could probably figure out fairly quickly what the problem was. The way I would debug this is to trace through your LM FST from the start and follow those symbols (or epsilons) on that trace from the determinization failure, and see how there are two different paths.
It's better if you share a couple different traces, not just one, so we can see what's in common.

> Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the latter with the kaldi patch)?

No, it should be faster. fstrmepsilon ∘ fstdeterminize should fail too.

> Ah, and this is a Linux machine. So everything looks very very standard (oops. Did I just create an infinite loop by repeating a word?).

I am considering changing the way the LM disambig symbols are used to make this kind of problem less likely to happen in future, by having several disambig symbols for the LM, one per order, instead of just one.

Dan
>> >> >> >> >> >> Dan >> >> >> >> >> > -kkm |
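A minimal sketch of the check discussed above — running fstdeterminizestar on G.fst by itself and, if it does not return, asking it for the symbol-sequence dump. It assumes a built $lang/G.fst and the Kaldi tools on PATH; the five-minute wait is only illustrative, since a healthy trigram G normally determinizes in seconds.

  # Run determinization of G.fst alone in the background.
  fstdeterminizestar --use-log=true $lang/G.fst > /dev/null &
  pid=$!

  sleep 300    # illustrative; a healthy G.fst is usually done within seconds

  if kill -0 $pid 2>/dev/null; then
    # Still running: request the symbol-sequence dump described above,
    # give it a moment to print, then stop it.
    kill -SIGUSR1 $pid
    sleep 5
    kill $pid
  else
    echo "G.fst determinized on its own; look at the L o G composition instead"
  fi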
From: Kirill K. <kir...@sm...> - 2015-06-16 16:15:17
|
I am currently trying to get a minimal reproduction with a script. Let it run for a while. I'll send you what remains of it, and hope it might give me an idea too. Looks like that fstdeterminize may have completed on this grammar (how do you call the thing symbolized as $G$? "grammar" sounded confusing, as I understand, but I have no other word not exceeding 2 syllables :)). I have left one running by mistake before going to sleep, and it was done. I am running one again with the time command to make sure this is not a fluke. So it is possible that it is not exactly non-determinizable, but instead takes enormous time (hours on one LM, < 1 sec on another). Which is the same thing from the engineering standpoint, close enough, as those engineering vs mathematics jokes go. But jokes aside, I want something more bounded for a production system, so I need to understand what throws it off so badly. Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the latter with the kaldi patch)? Ah, and this is a Linux machine. So everything looks very very standard (oops. Did I just create an infinite loop by repeating a word?). -kkm > -----Original Message----- > From: Daniel Povey [mailto:dp...@gm...] > Sent: 2015-06-15 2340 > To: Kirill Katsnelson > Cc: kal...@li... > Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > > In general SRILM language models are OK, but something weird could have > happened, especially on an unusual platform like Windows. > Look for duplicate lines with apparently the same n-gram on, and also > send to me (but not to kaldi-user) the arpa LM. > Dan > > > On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson > <kir...@sm...> wrote: > > Nope. The only thing I am thinking of doing is to bisect it somehow, > to get a minimal grammar that still refuses to determinize. I tried > different smoothing and played with other switches to ngram_count, but > it still does loop. Are there any known problems with srilm-generated > models? > > > > -kkm > > > >> -----Original Message----- > >> From: Daniel Povey [mailto:dp...@gm...] > >> Sent: 2015-06-15 2248 > >> To: Kirill Katsnelson > >> Cc: kal...@li... > >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > >> > >> OOVs should be OK. > >> Make sure there are no n-grams with things like <s> <s> > >> > >> e.g. see the lines > >> grep -v '<s> <s>' | \ > >> grep -v '</s> <s>' | \ > >> grep -v '</s> </s>' | \ > >> > >> in the WSJ script: > >> > >> gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \ > >> grep -v '<s> <s>' | \ > >> grep -v '</s> <s>' | \ > >> grep -v '</s> </s>' | \ > >> arpa2fst - | fstprint | \ > >> utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \ > >> utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- > >> isymbols=$test/words.txt \ > >> --osymbols=$test/words.txt --keep_isymbols=false -- > >> keep_osymbols=false | \ > >> fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst > >> > >> Dan > >> > >> > >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson > >> <kir...@sm...> wrote: > >> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a > >> > second to determinize). And the bad one loops at the word "zero" > >> > like this > >> > > >> > #0 > >> > unsure unsure > >> > #0 > >> > of of > >> > #0 > >> > yours yours > >> > #0 > >> > is is > >> > #0 > >> > your your > >> > #0 > >> > zip zip > >> > #0 > >> > wrong wrong > >> > #0 > >> > with with > >> > #0 > >> > zero zero > >> > #0 > >> > zero zero > >> > .... 
> >> > > >> > I am taking the LM straight from ngram_counts to the standard > >> pipeline, nothing fancy. The only thing is it has a lot of OOVs: > >> > > >> > remove_oovs.pl: removed 4646 lines. > >> > > >> > Is this generally a problem? So does my "good" arpa LM. I grepped > >> both for the word zero, but could not spot anything outrageous. Can > >> you think of anything I can look for? > >> > > >> > My source is no longer than 10 days old. Here's the pipeline, just > >> > in > >> case. > >> > > >> > cat $src/$arpalm | tr -d '\r' | \ > >> > utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt > >> > > >> > cat $src/$arpalm | tr -d '\r' | \ > >> > arpa2fst - | fstprint | \ > >> > utils/remove_oovs.pl $lang/lm_oovs.txt | \ > >> > utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- > >> isymbols=$lang/words.txt \ > >> > --osymbols=$lang/words.txt --keep_isymbols=false -- > >> keep_osymbols=false | \ > >> > fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst > >> > > >> > -kkm > >> > > >> > > >> >> -----Original Message----- > >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> Sent: 2015-06-15 2206 > >> >> To: Kirill Katsnelson > >> >> Cc: kal...@li... > >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> completes > >> >> > >> >> I don't recommend to look at the fstdeterminizestar algorithm > >> itself- > >> >> it's very complicated. Instead focus on the definition of > >> >> "determinizable" and the twins property, and figure out what path > >> you > >> >> are taking through L.fst and G.fst. Trying to fstdeterminizestar > >> >> G.fst directly, and seeing whether it terminates or not, may tell > >> you > >> >> something; if it fails, send the signal and see what happens. > >> >> fstdeterminizestar does care about the weights, but only to the > >> >> extent that they are the same or different from each other; and > if > >> >> your G.fst is generated from arpa2fst the pipeline should work > for > >> >> any ARPA-format language model- make sure you are using an up-to- > >> date > >> >> Kaldi though, there have been fixes as recently as a few months > ago. > >> >> The presence of SIL is not surprising, it is the optional-silence > >> >> added by the lexicon. I think that script is adding #16 if it > >> >> does > >> >> *not* take the optional silence, otherwise it adds the phone SIL. > >> >> Since you are calling your FST a "grammar" I'm wondering whether > >> >> you have done something fancy with mapping words to FSTs or > >> >> something like that, which is causing the result to not be > determinizable. > >> >> > >> >> Dan > >> >> > >> >> > >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson > >> >> <kir...@sm...> wrote: > >> >> > Thank you very much for your help Dan, but I am still stuck. > >> >> > > >> >> > First of all, a question: does the fstdeterminizestar algorithm > >> >> depend on actual backoff and n-gram probabilities, i.e. will it > >> >> behave differently if the numbers in arpa model file are > different? > >> >> Or does it depend only on arc labels but not weights? I am > looking > >> at > >> >> the code but certainly I am far from being able to understand it. > >> >> I cheated by looking at all if conditions in it, and this one in > >> >> EpsilonClosure is seemingly the only one dealing with weights: > >> >> > > >> >> > if (! ApproxEqual(weight, iter->second.weight, > >> delta_)) > >> >> > { > >> >> // add extra part of weight to queue. 
> >> >> > > >> >> > (In ProcessFinal it also has "if (this_final_weight != > >> >> > Weight::Zero())" but I do not believe it is relevant?) > >> >> > > >> >> > I am trying to understand how to dig into the problem--are > >> >> > weights in > >> >> the picture actually. > >> >> > > >> >> > Also, just for a test, I ran the grammar trough a "grep -v > 'real > >> >> real'", and indeed got a similar loop on the word "very" which is > >> >> also often repeated. But the "real real" 2- and 3-grams are there > >> >> in the "good" grammar too. > >> >> > > >> >> > Another thing I do not understand is the presence of the SIL > >> ilabel > >> >> in the backtrace. Here's the beginning of the trace that leads to > >> the > >> >> infinite loop as decoded with a little script I wrote (format is > >> >> ilabel [ TAB olabel ]: > >> >> > > >> >> > #16 > >> >> > #0 > >> >> > V_B > >> >> > Y_I > >> >> > UW1_I > >> >> > Z_E views > >> >> > #2 > >> >> > SIL > >> >> > #0 > >> >> > AH0_B > >> >> > N_I > >> >> > SH_I unsure > >> >> > UH1_I > >> >> > R_E > >> >> > > >> >> > Note the presence of SIL at line 8. This is not in lexicon: > >> >> > > >> >> > $ grep SIL > >> >> data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt > >> >> > !SIL 1 0.20 1.00 1.00 SIL_S > >> >> > $ > >> >> > > >> >> > Is this a hint? How did it get there at all? I am using a > >> >> > standard > >> >> script to build the L_disambig.fst: > >> >> > > >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' > >> >> > $lang/phones.txt) word_disambig_symbol=$(awk '$1=="#0"{print > >> >> > $2}' $lang/words.txt) utils/make_lexicon_fst_silprob.pl > >> >> $lang/dict/lexiconp_silprob_disambig.txt \ > >> >> > data/local/dict/silprob.txt $silphone > >> >> > '#'$ndisambig > >> | \ > >> >> > fstcompile --isymbols=$lang/phones.txt -- > >> >> osymbols=$lang/words.txt \ > >> >> > --keep_isymbols=false --keep_osymbols=false | \ > >> >> > fstaddselfloops "echo $phone_disambig_symbol |" "echo > >> >> $word_disambig_symbol |" | \ > >> >> > fstarcsort --sort_type=olabel > $lang/L_disambig.fst || > >> >> > exit 1; > >> >> > > >> >> > I checked the lexicon, and there are indeed only real phones at > >> the > >> >> beginning of each word, no empty positions and no #N symbols. > >> >> > > >> >> > -kkm > >> >> > > >> >> >> -----Original Message----- > >> >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> >> Sent: 2015-06-15 1944 > >> >> >> To: Kirill Katsnelson > >> >> >> Cc: kal...@li... > >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> >> completes > >> >> >> > >> >> >> I think the confusion is probably between two loops with > "real" > >> on > >> >> >> them in G.fst: one loop where you always take the bigram > >> >> probability, > >> >> >> and one where you always take the unigram probability. Or > >> >> >> maybe > >> a > >> >> >> similar confusion between a loop where you use the trigram > >> >> >> "real > >> >> real > >> >> >> real" and the bigram "real real". Those loops are expected to > >> >> exist. > >> >> >> Probably the issue is that something happened at the start of > >> >> >> the sequence which caused the FST to be confused about which > of > >> >> >> those > >> >> two > >> >> >> states it was in. If you have any empty words (words with > >> >> >> empty > >> >> >> pronunciation) in your lexicon this could possibly happen, as > >> >> >> it would be confused between taking a normal word, then the > >> >> >> backoff > >> >> symbol, vs. > >> >> >> taking a normal word, then the empty word, then the backoff > >> symbol. 
> >> >> >> I think the current Kaldi graph-creation script check for > empty > >> >> words > >> >> >> in the lexicon, for this reason. > >> >> >> > >> >> >> Dan > >> >> >> > >> >> >> > >> >> >> > >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 > ( > >> >> >> > ) > >> >> >> generally almost makes sense, given that #16 is the last one > in > >> >> >> table, the silence disambiguation symbol. (Not sure why "real" > >> >> >> is emitted at L_E--I would rather expect it to be emitted at > >> >> >> #1.) What > >> >> I > >> >> >> do not understand is what exactly the debug trace represents, > >> >> >> and what should I make out if it. It is a path through the FST > >> >> >> graph, > >> >> but > >> >> >> I do not understand what is this path exactly, and what does > >> >> >> this endless walk of this loop mean. > >> >> >> > > >> >> >> > -kkm > >> >> >> > > >> >> >> >> -----Original Message----- > >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> >> >> Sent: 2015-06-15 1858 > >> >> >> >> To: Kirill Katsnelson > >> >> >> >> Cc: kal...@li... > >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> >> >> completes > >> >> >> >> > >> >> >> >> Look into the "backoff disambiguation symbol", normally > >> >> >> >> called > >> >> #0. > >> >> >> >> The reason why it is needed should be explained in the > >> hbka.pdf > >> >> >> paper. > >> >> >> >> Dan > >> >> >> >> > >> >> >> >> > >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson > >> >> >> >> <kir...@sm...> wrote: > >> >> >> >> > Thank you! The output consists of some sequences as you > >> >> >> >> > described, > >> >> >> >> quickly falling into a short ever repeated loop. > >> >> >> >> > > >> >> >> >> > The non-repeated section ends up with osymbols (excluding > >> >> >> epsilons) > >> >> >> >> "whatsoever on vacation up", and then the repeated part > >> >> >> >> looks > >> >> like " > >> >> >> >> #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The > >> >> >> >> word > >> >> "real" > >> >> >> >> is spelled "R_B IY1_I L_E #1" in L_disambig. > >> >> >> >> > > >> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram > >> >> >> "vacation > >> >> >> >> up there". "up real" is a bigram in both, with 3-grams "up > >> real > >> >> >> quick" > >> >> >> >> and "up real quickly". "up real" is also a tail of a few > >> >> >> >> other 3-grams, but these are also same in both models (up > to > >> >> >> >> their > >> >> >> weights). > >> >> >> >> > > >> >> >> >> > It looks I do not understand what should I make in the > end > >> >> >> >> > out of > >> >> >> >> this > >> >> >> >> > debug data :( > >> >> >> >> > > >> >> >> >> > -kkm > >> >> >> >> > > >> >> >> >> >> -----Original Message----- > >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> >> >> >> Sent: 2015-06-15 1821 > >> >> >> >> >> To: Kirill Katsnelson > >> >> >> >> >> Cc: kal...@li... > >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) > never > >> >> >> >> >> completes > >> >> >> >> >> > >> >> >> >> >> > I have a small set of sentences with repeat counts, > and > >> >> >> >> >> > generating an > >> >> >> >> >> LM out of it. One is generated by a horrible local tool > I > >> >> >> >> >> have trouble tracing exactly how. For this one L*G > >> >> >> >> >> composition > >> >> takes > >> >> >> >> about > >> >> >> >> >> 20 seconds on my CPU. Another LM I just generated out of > >> the > >> >> >> >> >> same files with srilm 1.7.1 ngram-count. 
This one has > >> >> >> >> >> been sitting in mkgraphs.sh on L_disambig*G composition > >> >> >> >> >> step for about 30 > >> >> >> minutes, > >> >> >> >> >> and still churning. fstdeterminizestar --use-log=true is > >> >> >> >> >> running at > >> >> >> >> 100%. > >> >> >> >> >> L_disambig.fst is the same file in both cases. Looks > like > >> >> >> >> >> the > >> >> G > >> >> >> >> >> making it not determinizable, although I have no idea > how > >> it > >> >> >> >> >> came to > >> >> >> >> be. > >> >> >> >> >> > > >> >> >> >> >> > Anyone could share an advice on tracking down the > >> problem? > >> >> >> Thanks. > >> >> >> >> >> > >> >> >> >> >> You can send a signal to that program like kill - > SIGUSR1 > >> >> >> >> >> process-id and it will print out some info about the > >> >> >> >> >> symbol sequences involved, I think it is like > >> >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on. > >> >> >> >> >> Usually there is a particular word sequence that is > >> >> problematic. > >> >> >> >> >> Dan > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> > > >> >> >> >> >> > -kkm > >> >> >> >> >> > > >> >> >> >> >> > ------------------------------------------------------ > - > >> >> >> >> >> > -- > >> - > >> >> >> >> >> > -- > >> >> - > >> >> >> >> >> > -- > >> >> >> - > >> >> >> >> >> > -- > >> >> >> >> - > >> >> >> >> >> > -- > >> >> >> >> >> - > >> >> >> >> >> > -------- > >> >> >> >> >> > _______________________________________________ > >> >> >> >> >> > Kaldi-users mailing list > >> >> >> >> >> > Kal...@li... > >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi- > user > >> >> >> >> >> > s |
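A rough way to turn "hours on one LM, under a second on another" into numbers, and to compare the combined Kaldi tool with the two-step fstrmepsilon-then-fstdeterminize route asked about above. The directory names, the temporary file and the ten-minute limit are placeholders.

  for g in lang_good/G.fst lang_bad/G.fst; do
    echo "=== $g ==="
    # Kaldi's combined epsilon-removal + determinization
    time timeout 600 fstdeterminizestar --use-log=true "$g" > /dev/null
    # two-step OpenFst route, for comparison
    time timeout 600 fstrmepsilon "$g" rmeps.fst
    time timeout 600 fstdeterminize rmeps.fst > /dev/null
  done
  # timeout exits with status 124 when it had to kill the tool, which is
  # how a "never finishes" case shows up in the output.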
From: Daniel P. <dp...@gm...> - 2015-06-16 06:40:06
|
In general SRILM language models are OK, but something weird could have happened, especially on an unusual platform like Windows. Look for duplicate lines with apparently the same n-gram on, and also send to me (but not to kaldi-user) the arpa LM. Dan On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson <kir...@sm...> wrote: > Nope. The only thing I am thinking of doing is to bisect it somehow, to get a minimal grammar that still refuses to determinize. I tried different smoothing and played with other switches to ngram_count, but it still does loop. Are there any known problems with srilm-generated models? > > -kkm > >> -----Original Message----- >> From: Daniel Povey [mailto:dp...@gm...] >> Sent: 2015-06-15 2248 >> To: Kirill Katsnelson >> Cc: kal...@li... >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes >> >> OOVs should be OK. >> Make sure there are no n-grams with things like <s> <s> >> >> e.g. see the lines >> grep -v '<s> <s>' | \ >> grep -v '</s> <s>' | \ >> grep -v '</s> </s>' | \ >> >> in the WSJ script: >> >> gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \ >> grep -v '<s> <s>' | \ >> grep -v '</s> <s>' | \ >> grep -v '</s> </s>' | \ >> arpa2fst - | fstprint | \ >> utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \ >> utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- >> isymbols=$test/words.txt \ >> --osymbols=$test/words.txt --keep_isymbols=false -- >> keep_osymbols=false | \ >> fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst >> >> Dan >> >> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson >> <kir...@sm...> wrote: >> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a >> > second to determinize). And the bad one loops at the word "zero" like >> > this >> > >> > #0 >> > unsure unsure >> > #0 >> > of of >> > #0 >> > yours yours >> > #0 >> > is is >> > #0 >> > your your >> > #0 >> > zip zip >> > #0 >> > wrong wrong >> > #0 >> > with with >> > #0 >> > zero zero >> > #0 >> > zero zero >> > .... >> > >> > I am taking the LM straight from ngram_counts to the standard >> pipeline, nothing fancy. The only thing is it has a lot of OOVs: >> > >> > remove_oovs.pl: removed 4646 lines. >> > >> > Is this generally a problem? So does my "good" arpa LM. I grepped >> both for the word zero, but could not spot anything outrageous. Can you >> think of anything I can look for? >> > >> > My source is no longer than 10 days old. Here's the pipeline, just in >> case. >> > >> > cat $src/$arpalm | tr -d '\r' | \ >> > utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt >> > >> > cat $src/$arpalm | tr -d '\r' | \ >> > arpa2fst - | fstprint | \ >> > utils/remove_oovs.pl $lang/lm_oovs.txt | \ >> > utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- >> isymbols=$lang/words.txt \ >> > --osymbols=$lang/words.txt --keep_isymbols=false -- >> keep_osymbols=false | \ >> > fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst >> > >> > -kkm >> > >> > >> >> -----Original Message----- >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> Sent: 2015-06-15 2206 >> >> To: Kirill Katsnelson >> >> Cc: kal...@li... >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes >> >> >> >> I don't recommend to look at the fstdeterminizestar algorithm >> itself- >> >> it's very complicated. Instead focus on the definition of >> >> "determinizable" and the twins property, and figure out what path >> you >> >> are taking through L.fst and G.fst. 
Trying to fstdeterminizestar >> >> G.fst directly, and seeing whether it terminates or not, may tell >> you >> >> something; if it fails, send the signal and see what happens. >> >> fstdeterminizestar does care about the weights, but only to the >> >> extent that they are the same or different from each other; and if >> >> your G.fst is generated from arpa2fst the pipeline should work for >> >> any ARPA-format language model- make sure you are using an up-to- >> date >> >> Kaldi though, there have been fixes as recently as a few months ago. >> >> The presence of SIL is not surprising, it is the optional-silence >> >> added by the lexicon. I think that script is adding #16 if it does >> >> *not* take the optional silence, otherwise it adds the phone SIL. >> >> Since you are calling your FST a "grammar" I'm wondering whether you >> >> have done something fancy with mapping words to FSTs or something >> >> like that, which is causing the result to not be determinizable. >> >> >> >> Dan >> >> >> >> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson >> >> <kir...@sm...> wrote: >> >> > Thank you very much for your help Dan, but I am still stuck. >> >> > >> >> > First of all, a question: does the fstdeterminizestar algorithm >> >> depend on actual backoff and n-gram probabilities, i.e. will it >> >> behave differently if the numbers in arpa model file are different? >> >> Or does it depend only on arc labels but not weights? I am looking >> at >> >> the code but certainly I am far from being able to understand it. I >> >> cheated by looking at all if conditions in it, and this one in >> >> EpsilonClosure is seemingly the only one dealing with weights: >> >> > >> >> > if (! ApproxEqual(weight, iter->second.weight, >> delta_)) >> >> > { >> >> // add extra part of weight to queue. >> >> > >> >> > (In ProcessFinal it also has "if (this_final_weight != >> >> > Weight::Zero())" but I do not believe it is relevant?) >> >> > >> >> > I am trying to understand how to dig into the problem--are weights >> >> > in >> >> the picture actually. >> >> > >> >> > Also, just for a test, I ran the grammar trough a "grep -v 'real >> >> real'", and indeed got a similar loop on the word "very" which is >> >> also often repeated. But the "real real" 2- and 3-grams are there in >> >> the "good" grammar too. >> >> > >> >> > Another thing I do not understand is the presence of the SIL >> ilabel >> >> in the backtrace. Here's the beginning of the trace that leads to >> the >> >> infinite loop as decoded with a little script I wrote (format is >> >> ilabel [ TAB olabel ]: >> >> > >> >> > #16 >> >> > #0 >> >> > V_B >> >> > Y_I >> >> > UW1_I >> >> > Z_E views >> >> > #2 >> >> > SIL >> >> > #0 >> >> > AH0_B >> >> > N_I >> >> > SH_I unsure >> >> > UH1_I >> >> > R_E >> >> > >> >> > Note the presence of SIL at line 8. This is not in lexicon: >> >> > >> >> > $ grep SIL >> >> data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt >> >> > !SIL 1 0.20 1.00 1.00 SIL_S >> >> > $ >> >> > >> >> > Is this a hint? How did it get there at all? 
I am using a standard >> >> script to build the L_disambig.fst: >> >> > >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt) >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt) >> >> > utils/make_lexicon_fst_silprob.pl >> >> $lang/dict/lexiconp_silprob_disambig.txt \ >> >> > data/local/dict/silprob.txt $silphone '#'$ndisambig >> | \ >> >> > fstcompile --isymbols=$lang/phones.txt -- >> >> osymbols=$lang/words.txt \ >> >> > --keep_isymbols=false --keep_osymbols=false | \ >> >> > fstaddselfloops "echo $phone_disambig_symbol |" "echo >> >> $word_disambig_symbol |" | \ >> >> > fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit >> >> > 1; >> >> > >> >> > I checked the lexicon, and there are indeed only real phones at >> the >> >> beginning of each word, no empty positions and no #N symbols. >> >> > >> >> > -kkm >> >> > >> >> >> -----Original Message----- >> >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> >> Sent: 2015-06-15 1944 >> >> >> To: Kirill Katsnelson >> >> >> Cc: kal...@li... >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never >> >> >> completes >> >> >> >> >> >> I think the confusion is probably between two loops with "real" >> on >> >> >> them in G.fst: one loop where you always take the bigram >> >> probability, >> >> >> and one where you always take the unigram probability. Or maybe >> a >> >> >> similar confusion between a loop where you use the trigram "real >> >> real >> >> >> real" and the bigram "real real". Those loops are expected to >> >> exist. >> >> >> Probably the issue is that something happened at the start of the >> >> >> sequence which caused the FST to be confused about which of those >> >> two >> >> >> states it was in. If you have any empty words (words with empty >> >> >> pronunciation) in your lexicon this could possibly happen, as it >> >> >> would be confused between taking a normal word, then the backoff >> >> symbol, vs. >> >> >> taking a normal word, then the empty word, then the backoff >> symbol. >> >> >> I think the current Kaldi graph-creation script check for empty >> >> words >> >> >> in the lexicon, for this reason. >> >> >> >> >> >> Dan >> >> >> >> >> >> >> >> >> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) >> >> >> generally almost makes sense, given that #16 is the last one in >> >> >> table, the silence disambiguation symbol. (Not sure why "real" is >> >> >> emitted at L_E--I would rather expect it to be emitted at #1.) >> >> >> What >> >> I >> >> >> do not understand is what exactly the debug trace represents, and >> >> >> what should I make out if it. It is a path through the FST graph, >> >> but >> >> >> I do not understand what is this path exactly, and what does this >> >> >> endless walk of this loop mean. >> >> >> > >> >> >> > -kkm >> >> >> > >> >> >> >> -----Original Message----- >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> >> >> Sent: 2015-06-15 1858 >> >> >> >> To: Kirill Katsnelson >> >> >> >> Cc: kal...@li... >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never >> >> >> >> completes >> >> >> >> >> >> >> >> Look into the "backoff disambiguation symbol", normally called >> >> #0. >> >> >> >> The reason why it is needed should be explained in the >> hbka.pdf >> >> >> paper. >> >> >> >> Dan >> >> >> >> >> >> >> >> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson >> >> >> >> <kir...@sm...> wrote: >> >> >> >> > Thank you! 
The output consists of some sequences as you >> >> >> >> > described, >> >> >> >> quickly falling into a short ever repeated loop. >> >> >> >> > >> >> >> >> > The non-repeated section ends up with osymbols (excluding >> >> >> epsilons) >> >> >> >> "whatsoever on vacation up", and then the repeated part looks >> >> like " >> >> >> >> #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word >> >> "real" >> >> >> >> is spelled "R_B IY1_I L_E #1" in L_disambig. >> >> >> >> > >> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram >> >> >> "vacation >> >> >> >> up there". "up real" is a bigram in both, with 3-grams "up >> real >> >> >> quick" >> >> >> >> and "up real quickly". "up real" is also a tail of a few other >> >> >> >> 3-grams, but these are also same in both models (up to their >> >> >> weights). >> >> >> >> > >> >> >> >> > It looks I do not understand what should I make in the end >> >> >> >> > out of >> >> >> >> this >> >> >> >> > debug data :( >> >> >> >> > >> >> >> >> > -kkm >> >> >> >> > >> >> >> >> >> -----Original Message----- >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> >> >> >> Sent: 2015-06-15 1821 >> >> >> >> >> To: Kirill Katsnelson >> >> >> >> >> Cc: kal...@li... >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never >> >> >> >> >> completes >> >> >> >> >> >> >> >> >> >> > I have a small set of sentences with repeat counts, and >> >> >> >> >> > generating an >> >> >> >> >> LM out of it. One is generated by a horrible local tool I >> >> >> >> >> have trouble tracing exactly how. For this one L*G >> >> >> >> >> composition >> >> takes >> >> >> >> about >> >> >> >> >> 20 seconds on my CPU. Another LM I just generated out of >> the >> >> >> >> >> same files with srilm 1.7.1 ngram-count. This one has been >> >> >> >> >> sitting in mkgraphs.sh on L_disambig*G composition step for >> >> >> >> >> about 30 >> >> >> minutes, >> >> >> >> >> and still churning. fstdeterminizestar --use-log=true is >> >> >> >> >> running at >> >> >> >> 100%. >> >> >> >> >> L_disambig.fst is the same file in both cases. Looks like >> >> >> >> >> the >> >> G >> >> >> >> >> making it not determinizable, although I have no idea how >> it >> >> >> >> >> came to >> >> >> >> be. >> >> >> >> >> > >> >> >> >> >> > Anyone could share an advice on tracking down the >> problem? >> >> >> Thanks. >> >> >> >> >> >> >> >> >> >> You can send a signal to that program like kill -SIGUSR1 >> >> >> >> >> process-id and it will print out some info about the symbol >> >> >> >> >> sequences involved, I think it is like >> >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on. >> >> >> >> >> Usually there is a particular word sequence that is >> >> problematic. >> >> >> >> >> Dan >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > >> >> >> >> >> > -kkm >> >> >> >> >> > >> >> >> >> >> > --------------------------------------------------------- >> - >> >> >> >> >> > -- >> >> - >> >> >> >> >> > -- >> >> >> - >> >> >> >> >> > -- >> >> >> >> - >> >> >> >> >> > -- >> >> >> >> >> - >> >> >> >> >> > -------- _______________________________________________ >> >> >> >> >> > Kaldi-users mailing list >> >> >> >> >> > Kal...@li... >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users |
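A small sketch of the duplicate-n-gram check suggested above, run on the ARPA text itself (the file name lm.arpa is made up). It drops the log-probability column and flags any n-gram line that then repeats, and also lists n-grams with sentence-boundary symbols in orders that should not occur.

  # N-grams that appear more than once.  Entries differing only in the
  # back-off column are not caught by this crude first pass.
  # (Pipe through tr -d '\r' first if the file has Windows line endings.)
  awk 'NF > 1 && $1 ~ /^-?[0-9.]+$/ { $1 = ""; print }' lm.arpa \
    | sort | uniq -d | head

  # N-grams with impossible sentence-boundary sequences.
  grep -E '<s> <s>|</s> <s>|</s> </s>' lm.arpa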
From: Kirill K. <kir...@sm...> - 2015-06-16 06:03:59
|
Nope. The only thing I am thinking of doing is to bisect it somehow, to get a minimal grammar that still refuses to determinize. I tried different smoothing and played with other switches to ngram_count, but it still does loop. Are there any known problems with srilm-generated models? -kkm > -----Original Message----- > From: Daniel Povey [mailto:dp...@gm...] > Sent: 2015-06-15 2248 > To: Kirill Katsnelson > Cc: kal...@li... > Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > > OOVs should be OK. > Make sure there are no n-grams with things like <s> <s> > > e.g. see the lines > grep -v '<s> <s>' | \ > grep -v '</s> <s>' | \ > grep -v '</s> </s>' | \ > > in the WSJ script: > > gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \ > grep -v '<s> <s>' | \ > grep -v '</s> <s>' | \ > grep -v '</s> </s>' | \ > arpa2fst - | fstprint | \ > utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \ > utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- > isymbols=$test/words.txt \ > --osymbols=$test/words.txt --keep_isymbols=false -- > keep_osymbols=false | \ > fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst > > Dan > > > On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson > <kir...@sm...> wrote: > > Bingo. G.fst is not determinizable (the "good" G.fst takes under a > > second to determinize). And the bad one loops at the word "zero" like > > this > > > > #0 > > unsure unsure > > #0 > > of of > > #0 > > yours yours > > #0 > > is is > > #0 > > your your > > #0 > > zip zip > > #0 > > wrong wrong > > #0 > > with with > > #0 > > zero zero > > #0 > > zero zero > > .... > > > > I am taking the LM straight from ngram_counts to the standard > pipeline, nothing fancy. The only thing is it has a lot of OOVs: > > > > remove_oovs.pl: removed 4646 lines. > > > > Is this generally a problem? So does my "good" arpa LM. I grepped > both for the word zero, but could not spot anything outrageous. Can you > think of anything I can look for? > > > > My source is no longer than 10 days old. Here's the pipeline, just in > case. > > > > cat $src/$arpalm | tr -d '\r' | \ > > utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt > > > > cat $src/$arpalm | tr -d '\r' | \ > > arpa2fst - | fstprint | \ > > utils/remove_oovs.pl $lang/lm_oovs.txt | \ > > utils/eps2disambig.pl | utils/s2eps.pl | fstcompile -- > isymbols=$lang/words.txt \ > > --osymbols=$lang/words.txt --keep_isymbols=false -- > keep_osymbols=false | \ > > fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst > > > > -kkm > > > > > >> -----Original Message----- > >> From: Daniel Povey [mailto:dp...@gm...] > >> Sent: 2015-06-15 2206 > >> To: Kirill Katsnelson > >> Cc: kal...@li... > >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > >> > >> I don't recommend to look at the fstdeterminizestar algorithm > itself- > >> it's very complicated. Instead focus on the definition of > >> "determinizable" and the twins property, and figure out what path > you > >> are taking through L.fst and G.fst. Trying to fstdeterminizestar > >> G.fst directly, and seeing whether it terminates or not, may tell > you > >> something; if it fails, send the signal and see what happens. 
> >> fstdeterminizestar does care about the weights, but only to the > >> extent that they are the same or different from each other; and if > >> your G.fst is generated from arpa2fst the pipeline should work for > >> any ARPA-format language model- make sure you are using an up-to- > date > >> Kaldi though, there have been fixes as recently as a few months ago. > >> The presence of SIL is not surprising, it is the optional-silence > >> added by the lexicon. I think that script is adding #16 if it does > >> *not* take the optional silence, otherwise it adds the phone SIL. > >> Since you are calling your FST a "grammar" I'm wondering whether you > >> have done something fancy with mapping words to FSTs or something > >> like that, which is causing the result to not be determinizable. > >> > >> Dan > >> > >> > >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson > >> <kir...@sm...> wrote: > >> > Thank you very much for your help Dan, but I am still stuck. > >> > > >> > First of all, a question: does the fstdeterminizestar algorithm > >> depend on actual backoff and n-gram probabilities, i.e. will it > >> behave differently if the numbers in arpa model file are different? > >> Or does it depend only on arc labels but not weights? I am looking > at > >> the code but certainly I am far from being able to understand it. I > >> cheated by looking at all if conditions in it, and this one in > >> EpsilonClosure is seemingly the only one dealing with weights: > >> > > >> > if (! ApproxEqual(weight, iter->second.weight, > delta_)) > >> > { > >> // add extra part of weight to queue. > >> > > >> > (In ProcessFinal it also has "if (this_final_weight != > >> > Weight::Zero())" but I do not believe it is relevant?) > >> > > >> > I am trying to understand how to dig into the problem--are weights > >> > in > >> the picture actually. > >> > > >> > Also, just for a test, I ran the grammar trough a "grep -v 'real > >> real'", and indeed got a similar loop on the word "very" which is > >> also often repeated. But the "real real" 2- and 3-grams are there in > >> the "good" grammar too. > >> > > >> > Another thing I do not understand is the presence of the SIL > ilabel > >> in the backtrace. Here's the beginning of the trace that leads to > the > >> infinite loop as decoded with a little script I wrote (format is > >> ilabel [ TAB olabel ]: > >> > > >> > #16 > >> > #0 > >> > V_B > >> > Y_I > >> > UW1_I > >> > Z_E views > >> > #2 > >> > SIL > >> > #0 > >> > AH0_B > >> > N_I > >> > SH_I unsure > >> > UH1_I > >> > R_E > >> > > >> > Note the presence of SIL at line 8. This is not in lexicon: > >> > > >> > $ grep SIL > >> data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt > >> > !SIL 1 0.20 1.00 1.00 SIL_S > >> > $ > >> > > >> > Is this a hint? How did it get there at all? 
I am using a standard > >> script to build the L_disambig.fst: > >> > > >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt) > >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt) > >> > utils/make_lexicon_fst_silprob.pl > >> $lang/dict/lexiconp_silprob_disambig.txt \ > >> > data/local/dict/silprob.txt $silphone '#'$ndisambig > | \ > >> > fstcompile --isymbols=$lang/phones.txt -- > >> osymbols=$lang/words.txt \ > >> > --keep_isymbols=false --keep_osymbols=false | \ > >> > fstaddselfloops "echo $phone_disambig_symbol |" "echo > >> $word_disambig_symbol |" | \ > >> > fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit > >> > 1; > >> > > >> > I checked the lexicon, and there are indeed only real phones at > the > >> beginning of each word, no empty positions and no #N symbols. > >> > > >> > -kkm > >> > > >> >> -----Original Message----- > >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> Sent: 2015-06-15 1944 > >> >> To: Kirill Katsnelson > >> >> Cc: kal...@li... > >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> completes > >> >> > >> >> I think the confusion is probably between two loops with "real" > on > >> >> them in G.fst: one loop where you always take the bigram > >> probability, > >> >> and one where you always take the unigram probability. Or maybe > a > >> >> similar confusion between a loop where you use the trigram "real > >> real > >> >> real" and the bigram "real real". Those loops are expected to > >> exist. > >> >> Probably the issue is that something happened at the start of the > >> >> sequence which caused the FST to be confused about which of those > >> two > >> >> states it was in. If you have any empty words (words with empty > >> >> pronunciation) in your lexicon this could possibly happen, as it > >> >> would be confused between taking a normal word, then the backoff > >> symbol, vs. > >> >> taking a normal word, then the empty word, then the backoff > symbol. > >> >> I think the current Kaldi graph-creation script check for empty > >> words > >> >> in the lexicon, for this reason. > >> >> > >> >> Dan > >> >> > >> >> > >> >> > >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) > >> >> generally almost makes sense, given that #16 is the last one in > >> >> table, the silence disambiguation symbol. (Not sure why "real" is > >> >> emitted at L_E--I would rather expect it to be emitted at #1.) > >> >> What > >> I > >> >> do not understand is what exactly the debug trace represents, and > >> >> what should I make out if it. It is a path through the FST graph, > >> but > >> >> I do not understand what is this path exactly, and what does this > >> >> endless walk of this loop mean. > >> >> > > >> >> > -kkm > >> >> > > >> >> >> -----Original Message----- > >> >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> >> Sent: 2015-06-15 1858 > >> >> >> To: Kirill Katsnelson > >> >> >> Cc: kal...@li... > >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> >> completes > >> >> >> > >> >> >> Look into the "backoff disambiguation symbol", normally called > >> #0. > >> >> >> The reason why it is needed should be explained in the > hbka.pdf > >> >> paper. > >> >> >> Dan > >> >> >> > >> >> >> > >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson > >> >> >> <kir...@sm...> wrote: > >> >> >> > Thank you! The output consists of some sequences as you > >> >> >> > described, > >> >> >> quickly falling into a short ever repeated loop. 
> >> >> >> > > >> >> >> > The non-repeated section ends up with osymbols (excluding > >> >> epsilons) > >> >> >> "whatsoever on vacation up", and then the repeated part looks > >> like " > >> >> >> #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word > >> "real" > >> >> >> is spelled "R_B IY1_I L_E #1" in L_disambig. > >> >> >> > > >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram > >> >> "vacation > >> >> >> up there". "up real" is a bigram in both, with 3-grams "up > real > >> >> quick" > >> >> >> and "up real quickly". "up real" is also a tail of a few other > >> >> >> 3-grams, but these are also same in both models (up to their > >> >> weights). > >> >> >> > > >> >> >> > It looks I do not understand what should I make in the end > >> >> >> > out of > >> >> >> this > >> >> >> > debug data :( > >> >> >> > > >> >> >> > -kkm > >> >> >> > > >> >> >> >> -----Original Message----- > >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> >> >> Sent: 2015-06-15 1821 > >> >> >> >> To: Kirill Katsnelson > >> >> >> >> Cc: kal...@li... > >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> >> >> completes > >> >> >> >> > >> >> >> >> > I have a small set of sentences with repeat counts, and > >> >> >> >> > generating an > >> >> >> >> LM out of it. One is generated by a horrible local tool I > >> >> >> >> have trouble tracing exactly how. For this one L*G > >> >> >> >> composition > >> takes > >> >> >> about > >> >> >> >> 20 seconds on my CPU. Another LM I just generated out of > the > >> >> >> >> same files with srilm 1.7.1 ngram-count. This one has been > >> >> >> >> sitting in mkgraphs.sh on L_disambig*G composition step for > >> >> >> >> about 30 > >> >> minutes, > >> >> >> >> and still churning. fstdeterminizestar --use-log=true is > >> >> >> >> running at > >> >> >> 100%. > >> >> >> >> L_disambig.fst is the same file in both cases. Looks like > >> >> >> >> the > >> G > >> >> >> >> making it not determinizable, although I have no idea how > it > >> >> >> >> came to > >> >> >> be. > >> >> >> >> > > >> >> >> >> > Anyone could share an advice on tracking down the > problem? > >> >> Thanks. > >> >> >> >> > >> >> >> >> You can send a signal to that program like kill -SIGUSR1 > >> >> >> >> process-id and it will print out some info about the symbol > >> >> >> >> sequences involved, I think it is like > >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on. > >> >> >> >> Usually there is a particular word sequence that is > >> problematic. > >> >> >> >> Dan > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > > >> >> >> >> > -kkm > >> >> >> >> > > >> >> >> >> > --------------------------------------------------------- > - > >> >> >> >> > -- > >> - > >> >> >> >> > -- > >> >> - > >> >> >> >> > -- > >> >> >> - > >> >> >> >> > -- > >> >> >> >> - > >> >> >> >> > -------- _______________________________________________ > >> >> >> >> > Kaldi-users mailing list > >> >> >> >> > Kal...@li... > >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users |
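One way to mechanize the bisection idea mentioned above, as a sketch only: the training sentence list is halved for as long as one half still reproduces the hang. make_G.sh stands in for the arpa2fst pipeline quoted earlier in the thread and does not exist as such; the ngram-count options and the timeout are likewise placeholders.

  cp sentences.txt bad.txt
  while [ "$(wc -l < bad.txt)" -gt 1 ]; do
    half=$(( ( $(wc -l < bad.txt) + 1 ) / 2 ))
    head -n "$half" bad.txt > top.txt
    tail -n +"$(( half + 1 ))" bad.txt > bottom.txt
    for part in top.txt bottom.txt; do
      ngram-count -order 3 -text "$part" -lm part.arpa   # illustrative options
      ./make_G.sh part.arpa lang                         # hypothetical wrapper around the arpa2fst pipeline
      if ! timeout 300 fstdeterminizestar --use-log=true lang/G.fst > /dev/null; then
        cp "$part" bad.txt      # this half still reproduces the problem
        continue 2
      fi
    done
    break    # neither half reproduces it alone; the interaction needs both
  done
  wc -l bad.txt    # minimal failing sentence set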
From: Daniel P. <dp...@gm...> - 2015-06-16 05:48:28
|
OOVs should be OK. Make sure there are no n-grams with things like <s> <s> e.g. see the lines grep -v '<s> <s>' | \ grep -v '</s> <s>' | \ grep -v '</s> </s>' | \ in the WSJ script: gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \ grep -v '<s> <s>' | \ grep -v '</s> <s>' | \ grep -v '</s> </s>' | \ arpa2fst - | fstprint | \ utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \ utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \ --osymbols=$test/words.txt --keep_isymbols=false --keep_osymbols=false | \ fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst Dan On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson <kir...@sm...> wrote: > Bingo. G.fst is not determinizable (the "good" G.fst takes under a second to determinize). And the bad one loops at the word "zero" like this > > #0 > unsure unsure > #0 > of of > #0 > yours yours > #0 > is is > #0 > your your > #0 > zip zip > #0 > wrong wrong > #0 > with with > #0 > zero zero > #0 > zero zero > .... > > I am taking the LM straight from ngram_counts to the standard pipeline, nothing fancy. The only thing is it has a lot of OOVs: > > remove_oovs.pl: removed 4646 lines. > > Is this generally a problem? So does my "good" arpa LM. I grepped both for the word zero, but could not spot anything outrageous. Can you think of anything I can look for? > > My source is no longer than 10 days old. Here's the pipeline, just in case. > > cat $src/$arpalm | tr -d '\r' | \ > utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt > > cat $src/$arpalm | tr -d '\r' | \ > arpa2fst - | fstprint | \ > utils/remove_oovs.pl $lang/lm_oovs.txt | \ > utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \ > --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false | \ > fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst > > -kkm > > >> -----Original Message----- >> From: Daniel Povey [mailto:dp...@gm...] >> Sent: 2015-06-15 2206 >> To: Kirill Katsnelson >> Cc: kal...@li... >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes >> >> I don't recommend to look at the fstdeterminizestar algorithm itself- >> it's very complicated. Instead focus on the definition of >> "determinizable" and the twins property, and figure out what path you >> are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst >> directly, and seeing whether it terminates or not, may tell you >> something; if it fails, send the signal and see what happens. >> fstdeterminizestar does care about the weights, but only to the extent >> that they are the same or different from each other; and if your G.fst >> is generated from arpa2fst the pipeline should work for any ARPA-format >> language model- make sure you are using an up-to-date Kaldi though, >> there have been fixes as recently as a few months ago. >> The presence of SIL is not surprising, it is the optional-silence added >> by the lexicon. I think that script is adding #16 if it does >> *not* take the optional silence, otherwise it adds the phone SIL. >> Since you are calling your FST a "grammar" I'm wondering whether you >> have done something fancy with mapping words to FSTs or something like >> that, which is causing the result to not be determinizable. >> >> Dan >> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson >> <kir...@sm...> wrote: >> > Thank you very much for your help Dan, but I am still stuck. 
>> > >> > First of all, a question: does the fstdeterminizestar algorithm >> depend on actual backoff and n-gram probabilities, i.e. will it behave >> differently if the numbers in arpa model file are different? Or does it >> depend only on arc labels but not weights? I am looking at the code but >> certainly I am far from being able to understand it. I cheated by >> looking at all if conditions in it, and this one in EpsilonClosure is >> seemingly the only one dealing with weights: >> > >> > if (! ApproxEqual(weight, iter->second.weight, delta_)) { >> // add extra part of weight to queue. >> > >> > (In ProcessFinal it also has "if (this_final_weight != >> > Weight::Zero())" but I do not believe it is relevant?) >> > >> > I am trying to understand how to dig into the problem--are weights in >> the picture actually. >> > >> > Also, just for a test, I ran the grammar trough a "grep -v 'real >> real'", and indeed got a similar loop on the word "very" which is also >> often repeated. But the "real real" 2- and 3-grams are there in the >> "good" grammar too. >> > >> > Another thing I do not understand is the presence of the SIL ilabel >> in the backtrace. Here's the beginning of the trace that leads to the >> infinite loop as decoded with a little script I wrote (format is ilabel >> [ TAB olabel ]: >> > >> > #16 >> > #0 >> > V_B >> > Y_I >> > UW1_I >> > Z_E views >> > #2 >> > SIL >> > #0 >> > AH0_B >> > N_I >> > SH_I unsure >> > UH1_I >> > R_E >> > >> > Note the presence of SIL at line 8. This is not in lexicon: >> > >> > $ grep SIL >> data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt >> > !SIL 1 0.20 1.00 1.00 SIL_S >> > $ >> > >> > Is this a hint? How did it get there at all? I am using a standard >> script to build the L_disambig.fst: >> > >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt) >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt) >> > utils/make_lexicon_fst_silprob.pl >> $lang/dict/lexiconp_silprob_disambig.txt \ >> > data/local/dict/silprob.txt $silphone '#'$ndisambig | \ >> > fstcompile --isymbols=$lang/phones.txt -- >> osymbols=$lang/words.txt \ >> > --keep_isymbols=false --keep_osymbols=false | \ >> > fstaddselfloops "echo $phone_disambig_symbol |" "echo >> $word_disambig_symbol |" | \ >> > fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1; >> > >> > I checked the lexicon, and there are indeed only real phones at the >> beginning of each word, no empty positions and no #N symbols. >> > >> > -kkm >> > >> >> -----Original Message----- >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> Sent: 2015-06-15 1944 >> >> To: Kirill Katsnelson >> >> Cc: kal...@li... >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes >> >> >> >> I think the confusion is probably between two loops with "real" on >> >> them in G.fst: one loop where you always take the bigram >> probability, >> >> and one where you always take the unigram probability. Or maybe a >> >> similar confusion between a loop where you use the trigram "real >> real >> >> real" and the bigram "real real". Those loops are expected to >> exist. >> >> Probably the issue is that something happened at the start of the >> >> sequence which caused the FST to be confused about which of those >> two >> >> states it was in. If you have any empty words (words with empty >> >> pronunciation) in your lexicon this could possibly happen, as it >> >> would be confused between taking a normal word, then the backoff >> symbol, vs. 
>> >> taking a normal word, then the empty word, then the backoff symbol. >> >> I think the current Kaldi graph-creation script check for empty >> words >> >> in the lexicon, for this reason. >> >> >> >> Dan >> >> >> >> >> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) >> >> generally almost makes sense, given that #16 is the last one in >> >> table, the silence disambiguation symbol. (Not sure why "real" is >> >> emitted at L_E--I would rather expect it to be emitted at #1.) What >> I >> >> do not understand is what exactly the debug trace represents, and >> >> what should I make out if it. It is a path through the FST graph, >> but >> >> I do not understand what is this path exactly, and what does this >> >> endless walk of this loop mean. >> >> > >> >> > -kkm >> >> > >> >> >> -----Original Message----- >> >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> >> Sent: 2015-06-15 1858 >> >> >> To: Kirill Katsnelson >> >> >> Cc: kal...@li... >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never >> >> >> completes >> >> >> >> >> >> Look into the "backoff disambiguation symbol", normally called >> #0. >> >> >> The reason why it is needed should be explained in the hbka.pdf >> >> paper. >> >> >> Dan >> >> >> >> >> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson >> >> >> <kir...@sm...> wrote: >> >> >> > Thank you! The output consists of some sequences as you >> >> >> > described, >> >> >> quickly falling into a short ever repeated loop. >> >> >> > >> >> >> > The non-repeated section ends up with osymbols (excluding >> >> epsilons) >> >> >> "whatsoever on vacation up", and then the repeated part looks >> like " >> >> >> #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word >> "real" >> >> >> is spelled "R_B IY1_I L_E #1" in L_disambig. >> >> >> > >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram >> >> "vacation >> >> >> up there". "up real" is a bigram in both, with 3-grams "up real >> >> quick" >> >> >> and "up real quickly". "up real" is also a tail of a few other >> >> >> 3-grams, but these are also same in both models (up to their >> >> weights). >> >> >> > >> >> >> > It looks I do not understand what should I make in the end out >> >> >> > of >> >> >> this >> >> >> > debug data :( >> >> >> > >> >> >> > -kkm >> >> >> > >> >> >> >> -----Original Message----- >> >> >> >> From: Daniel Povey [mailto:dp...@gm...] >> >> >> >> Sent: 2015-06-15 1821 >> >> >> >> To: Kirill Katsnelson >> >> >> >> Cc: kal...@li... >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never >> >> >> >> completes >> >> >> >> >> >> >> >> > I have a small set of sentences with repeat counts, and >> >> >> >> > generating an >> >> >> >> LM out of it. One is generated by a horrible local tool I have >> >> >> >> trouble tracing exactly how. For this one L*G composition >> takes >> >> >> about >> >> >> >> 20 seconds on my CPU. Another LM I just generated out of the >> >> >> >> same files with srilm 1.7.1 ngram-count. This one has been >> >> >> >> sitting in mkgraphs.sh on L_disambig*G composition step for >> >> >> >> about 30 >> >> minutes, >> >> >> >> and still churning. fstdeterminizestar --use-log=true is >> >> >> >> running at >> >> >> 100%. >> >> >> >> L_disambig.fst is the same file in both cases. Looks like the >> G >> >> >> >> making it not determinizable, although I have no idea how it >> >> >> >> came to >> >> >> be. >> >> >> >> > >> >> >> >> > Anyone could share an advice on tracking down the problem? >> >> Thanks. 
>> >> >> >> >> >> >> >> You can send a signal to that program like kill -SIGUSR1 >> >> >> >> process-id and it will print out some info about the symbol >> >> >> >> sequences involved, I think it is like >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on. >> >> >> >> Usually there is a particular word sequence that is >> problematic. >> >> >> >> Dan >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > >> >> >> >> > -kkm >> >> >> >> > >> >> >> >> > ------------------------------------------------------------ >> - >> >> >> >> > -- >> >> - >> >> >> >> > -- >> >> >> - >> >> >> >> > -- >> >> >> >> - >> >> >> >> > -------- _______________________________________________ >> >> >> >> > Kaldi-users mailing list >> >> >> >> > Kal...@li... >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users |
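For anyone following along, here is a minimal sketch of the debugging procedure described above: run fstdeterminizestar on G.fst alone, and if it spins, send it SIGUSR1 so it dumps the symbol sequence it is stuck on. The --use-log=true option and the signal are the ones mentioned in this thread; the graph path and the timeout are placeholders.

  fstdeterminizestar --use-log=true graph/G.fst /dev/null &
  pid=$!
  sleep 300    # give it far longer than a healthy G.fst should need
  if kill -0 "$pid" 2>/dev/null; then
    # still running: ask it to print the ilabel/olabel sequence it is cycling on
    kill -SIGUSR1 "$pid"
  fi

If this loops on G.fst by itself, the problem is already in the grammar, before the lexicon is ever composed in.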
From: Kirill K. <kir...@sm...> - 2015-06-16 05:42:30
|
Bingo. G.fst is not determinizable (the "good" G.fst takes under a second to determinize). The bad one loops at the word "zero", like this:

  #0 unsure unsure #0 of of #0 yours yours #0 is is #0 your your #0 zip zip #0 wrong wrong #0 with with #0 zero zero #0 zero zero ....

I am taking the LM straight from ngram-count into the standard pipeline, nothing fancy. The only thing is that it has a lot of OOVs ("remove_oovs.pl: removed 4646 lines")--is that generally a problem? My "good" ARPA LM has a lot of OOVs too. I grepped both for the word "zero" but could not spot anything outrageous. Can you think of anything I can look for? My Kaldi source is no more than 10 days old.

Here's the pipeline, just in case:

cat $src/$arpalm | tr -d '\r' | \
  utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
cat $src/$arpalm | tr -d '\r' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl $lang/lm_oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | \
  fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst

-kkm
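One way to make the grep mentioned above more systematic (a sketch, not from the thread; the ARPA file names are placeholders for the "good" and "bad" models) is to pull out every n-gram line that mentions the looping word from both models and diff them:

  for lm in good.arpa bad.arpa; do
    grep -w -- 'zero' "$lm" | sort > "${lm%.arpa}.zero-ngrams.txt"
  done
  diff good.zero-ngrams.txt bad.zero-ngrams.txt | head -40

Any n-gram present in one model but not the other is a natural place to start looking.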
From: Daniel P. <dp...@gm...> - 2015-06-16 05:06:16
|
I don't recommend looking at the fstdeterminizestar algorithm itself--it's very complicated. Instead, focus on the definition of "determinizable" and the twins property, and figure out what path you are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst directly, and seeing whether it terminates or not, may tell you something; if it fails, send the signal and see what happens.

fstdeterminizestar does care about the weights, but only to the extent that they are the same as or different from each other; and if your G.fst is generated from arpa2fst, the pipeline should work for any ARPA-format language model--but make sure you are using an up-to-date Kaldi, as there have been fixes as recently as a few months ago.

The presence of SIL is not surprising: it is the optional silence added by the lexicon. I think that script adds #16 if it does *not* take the optional silence, and otherwise adds the phone SIL.

Since you are calling your FST a "grammar", I'm wondering whether you have done something fancy with mapping words to FSTs or something like that, which is causing the result to not be determinizable.

Dan
From: Kirill K. <kir...@sm...> - 2015-06-16 04:55:25
|
Thank you very much for your help Dan, but I am still stuck.

First of all, a question: does the fstdeterminizestar algorithm depend on the actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in the ARPA model file are different? Or does it depend only on arc labels, not weights? I am looking at the code, but I am certainly far from being able to understand it. I cheated by looking at all the if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:

  if (! ApproxEqual(weight, iter->second.weight, delta_)) {  // add extra part of weight to queue.

(In ProcessFinal it also has "if (this_final_weight != Weight::Zero())", but I do not believe that is relevant?) I am trying to understand how to dig into the problem--whether weights are actually in the picture.

Also, just for a test, I ran the grammar through a "grep -v 'real real'", and indeed got a similar loop on the word "very", which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.

Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here's the beginning of the trace that leads to the infinite loop, as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):

#16
#0
V_B
Y_I
UW1_I
Z_E	views
#2
SIL
#0
AH0_B
N_I
SH_I	unsure
UH1_I
R_E

Note the presence of SIL at line 8. This is not in the lexicon:

$ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
!SIL 1 0.20 1.00 1.00 SIL_S
$

Is this a hint? How did it get there at all? I am using a standard script to build the L_disambig.fst:

phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
  data/local/dict/silprob.txt $silphone '#'$ndisambig | \
  fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
  fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;

I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.

-kkm

> -----Original Message----- > From: Daniel Povey [mailto:dp...@gm...] > Sent: 2015-06-15 1944 > To: Kirill Katsnelson > Cc: kal...@li... > Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > > I think the confusion is probably between two loops with "real" on them > in G.fst: one loop where you always take the bigram probability, and > one where you always take the unigram probability. Or maybe a similar > confusion between a loop where you use the trigram "real real real" and > the bigram "real real". Those loops are expected to exist. > Probably the issue is that something happened at the start of the > sequence which caused the FST to be confused about which of those two > states it was in. If you have any empty words (words with empty > pronunciation) in your lexicon this could possibly happen, as it would > be confused between taking a normal word, then the backoff symbol, vs. > taking a normal word, then the empty word, then the backoff symbol. > I think the current Kaldi graph-creation script check for empty words > in the lexicon, for this reason. > > Dan > > > > > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) > generally almost makes sense, given that #16 is the last one in table, > the silence disambiguation symbol. 
(Not sure why "real" is emitted at > L_E--I would rather expect it to be emitted at #1.) What I do not > understand is what exactly the debug trace represents, and what should > I make out if it. It is a path through the FST graph, but I do not > understand what is this path exactly, and what does this endless walk > of this loop mean. > > > > -kkm > > > >> -----Original Message----- > >> From: Daniel Povey [mailto:dp...@gm...] > >> Sent: 2015-06-15 1858 > >> To: Kirill Katsnelson > >> Cc: kal...@li... > >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes > >> > >> Look into the "backoff disambiguation symbol", normally called #0. > >> The reason why it is needed should be explained in the hbka.pdf > paper. > >> Dan > >> > >> > >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson > >> <kir...@sm...> wrote: > >> > Thank you! The output consists of some sequences as you described, > >> quickly falling into a short ever repeated loop. > >> > > >> > The non-repeated section ends up with osymbols (excluding > epsilons) > >> "whatsoever on vacation up", and then the repeated part looks like " > >> #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" > >> is spelled "R_B IY1_I L_E #1" in L_disambig. > >> > > >> > Both LMs contain a bigram for "vacation up" and a trigram > "vacation > >> up there". "up real" is a bigram in both, with 3-grams "up real > quick" > >> and "up real quickly". "up real" is also a tail of a few other > >> 3-grams, but these are also same in both models (up to their > weights). > >> > > >> > It looks I do not understand what should I make in the end out of > >> this > >> > debug data :( > >> > > >> > -kkm > >> > > >> >> -----Original Message----- > >> >> From: Daniel Povey [mailto:dp...@gm...] > >> >> Sent: 2015-06-15 1821 > >> >> To: Kirill Katsnelson > >> >> Cc: kal...@li... > >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never > >> >> completes > >> >> > >> >> > I have a small set of sentences with repeat counts, and > >> >> > generating an > >> >> LM out of it. One is generated by a horrible local tool I have > >> >> trouble tracing exactly how. For this one L*G composition takes > >> about > >> >> 20 seconds on my CPU. Another LM I just generated out of the same > >> >> files with srilm 1.7.1 ngram-count. This one has been sitting in > >> >> mkgraphs.sh on L_disambig*G composition step for about 30 > minutes, > >> >> and still churning. fstdeterminizestar --use-log=true is running > >> >> at > >> 100%. > >> >> L_disambig.fst is the same file in both cases. Looks like the G > >> >> making it not determinizable, although I have no idea how it came > >> >> to > >> be. > >> >> > > >> >> > Anyone could share an advice on tracking down the problem? > Thanks. > >> >> > >> >> You can send a signal to that program like kill -SIGUSR1 > >> >> process-id and it will print out some info about the symbol > >> >> sequences involved, I think it is like > >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on. > >> >> Usually there is a particular word sequence that is problematic. > >> >> Dan > >> >> > >> >> > >> >> > >> >> > >> >> > > >> >> > -kkm > >> >> > > >> >> > --------------------------------------------------------------- > - > >> >> > -- > >> - > >> >> > -- > >> >> - > >> >> > -------- _______________________________________________ > >> >> > Kaldi-users mailing list > >> >> > Kal...@li... > >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users |