From: Mirko H. <mir...@gm...> - 2015-02-03 23:43:05
Hi!

I am trying to compile Kaldi on Cygwin with the newer OpenFST 1.4.1. All the tools have been compiled (OpenFST with --enable-static --disable-shared, g++ version 4.8.3), but there seems to be a bug in the file trunk/src/makefiles/cygwin.mk: instead of -enable-auto-import, it should be --enable-auto-import (mentioned twice in the file).

After fixing that, there seem to be some problems with the new C++11 standard library, or some compile-time symbols are not being set correctly:

  kaldi-math.cc:52:33: error: ‘rand_r’ was not declared in this scope
      return rand_r(&(state->seed));

  kaldi-matrix.cc:1327:49: error: there are no arguments to ‘strcasecmp’ that depend on a template parameter, so a declaration of ‘strcasecmp’ must be available [-fpermissive]
      if (!KALDI_STRCASECMP(str.c_str(), "inf") ||

Do you know which flags need to be set for Kaldi to compile? From previous runs I have already found that I have to use the -O2 option for Kaldi to compile, otherwise I get "too many sections" errors from the linker.

Thanks,
Mirko
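A minimal sketch of the fixes discussed here. The linker-flag typo is the one described in the message; switching a strict -std=c++0x/-std=c++11 to the GNU dialect (which keeps POSIX declarations such as rand_r and strcasecmp visible) is only a guess at the missing compile-time symbols, not a confirmed fix, and whether cygwin.mk actually contains such a flag needs to be checked locally:

  # Fix the --enable-auto-import typo reported above (two occurrences);
  # the leading space keeps already-correct "--enable-auto-import" untouched.
  sed -i 's/ -enable-auto-import/ --enable-auto-import/g' src/makefiles/cygwin.mk

  # Guess at the rand_r/strcasecmp errors: if the makefile passes a strict
  # C++11 flag (needed by OpenFST 1.4.x), the GNU dialect keeps POSIX/GNU
  # declarations visible; defining _GNU_SOURCE would be an alternative.
  sed -i 's/-std=c++\(0x\|11\)/-std=gnu++\1/' src/makefiles/cygwin.mk

  # Keep -O2, since lower optimisation reportedly triggers "too many sections".
  grep -nE 'enable-auto-import|std=|-O2' src/makefiles/cygwin.mk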
From: Daniel P. <dp...@gm...> - 2015-02-03 21:09:57
> I want to develop a simple ASR using Kaldi to recognize a limited set of
> words (e.g. one, two, ... nine). [...] I need to develop a system that
> gives me OOV. What change in L or G should I make?

Identifying OOVs is a difficult problem. You could try randomly replacing some of the instances of words in your training set with a new word you can call "oov", and recognize with that in the vocabulary. But it probably wouldn't work that well.

> Another question: how can I find the start time and the end of a recognized
> word? I can use alignment after recognition, but I believe there is a
> simpler way?

Not really, you need to use something like lattice-align-words. If you do it after getting the best path it will be quite efficient.

Dan
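For the timing question, a sketch of the kind of pipeline being suggested; the directory names and option values are illustrative and the argument order is from memory (steps/get_ctm.sh in the standard recipes does essentially this), so check each program's --help:

  # Word-level start times and durations from existing lattices (hypothetical dirs).
  model=exp/tri3/final.mdl
  lang=data/lang
  dir=exp/tri3/decode_test

  gunzip -c $dir/lat.1.gz | \
    lattice-1best --acoustic-scale=0.083 ark:- ark:- | \
    lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- | \
    nbest-to-ctm ark:- - | \
    utils/int2sym.pl -f 5 $lang/words.txt > $dir/test.ctm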
From: Saman M. <smo...@gm...> - 2015-02-03 11:17:43
Hi All,

I want to develop a simple ASR using Kaldi to recognize a limited set of words (e.g. one, two, ... nine). I have the acoustic models. I made a limited lexicon using these words and the G fst as a uni-gram. It works perfectly if I say any of the words, but if I say e.g. "ten" it gives me one of the words in the lexicon. I need to develop a system that gives me OOV. What change in L or G should I make?

Another question: how can I find the start time and the end of a recognized word? I can use alignment after recognition, but I believe there is a simpler way?

Best regards
Saman
From: Daniel P. <dp...@gm...> - 2015-01-31 01:05:36
I just want to inform people that I have just checked in this multi-threaded version of the online-nnet2 decoding. This should make it possible to decode in real time with larger models and graphs than before, because the decoding and the nnet evaluation are in separate threads and can be done in parallel.

The usage is the same as online2-wav-nnet2-latgen-faster. The C++-level interface looks similar, but behaves a little differently because the decoding happens in background threads, so you don't have to call AdvanceDecoding() any more; it just happens in the background.

Dan

---------- Forwarded message ----------
From: Repository Kaldi code <no...@co...>
Date: Fri, Jan 30, 2015 at 8:01 PM
Subject: [kaldi:code] [r4844] - danielpovey: trunk: add multi-threaded online-nnet2 decoding program, online2-wav-nnet2-latgen-threaded, which does decoding and nnet evaluation in different threads. Usage is otherwise similar to online2-wav-nnet2-latgen-faster.

http://sourceforge.net/p/kaldi/code/4844/
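A sketch of what the drop-in usage might look like, given that the arguments follow the non-threaded program; the paths, config file, and option values here are illustrative and not taken from the commit:

  # Hypothetical invocation: same arguments as online2-wav-nnet2-latgen-faster.
  online2-wav-nnet2-latgen-threaded \
    --config=conf/online_nnet2_decoding.conf \
    --max-active=7000 --beam=15.0 --lattice-beam=6.0 \
    --word-symbol-table=exp/nnet2_online/graph/words.txt \
    exp/nnet2_online/final.mdl exp/nnet2_online/graph/HCLG.fst \
    ark:data/test/spk2utt scp:data/test/wav.scp \
    "ark:|gzip -c > exp/nnet2_online/decode_test/lat.1.gz"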
From: Daniel P. <dp...@gm...> - 2015-01-23 17:53:16
Oh OK, maybe the sem_t is more heavyweight than it needs to be because it is designed to work across processes.

Dan

On Fri, Jan 23, 2015 at 12:47 PM, Nagendra Goel <nag...@go...> wrote:
> [Nagendra's explanation of semaphores vs. mutexes, together with the
> forwarded sem_t thread, quoted in full; see his message of 2015-01-23
> 17:47 below.]
From: Nagendra G. <nag...@go...> - 2015-01-23 17:47:07
This is from:
http://www.experts-exchange.com/Programming/System/Linux/Q_20375101.html

Semaphores can be used between different processes to synchronise access to some shared object, i.e. a file or shared memory.

If you have two processes which require read/write access to some resource, you need to make sure that both are not trying to change the shared resource at the same time. You can then use a semaphore to protect access to the shared resource, i.e. each process has to acquire the semaphore before being allowed to perform an update on the shared resource. Once the update is complete, the process then releases the semaphore to allow another process to access/update the shared resource.

pthread_mutex_t is a similar concept but shared between multiple threads of a single process. If for example you have a multithreaded program which contains global data accessible/updatable by multiple threads, then you would use a mutex to protect the access in the same way.

From: Daniel Povey [mailto:dp...@gm...]
Sent: Friday, January 23, 2015 12:36 PM
To: kal...@li...
Subject: [Kaldi-developers] Fwd: sem_t

> [Dan's original sem_t question and the forwarded exchange with Karel
> quoted in full; see his message of 2015-01-23 17:36 below.]
From: Daniel P. <dp...@gm...> - 2015-01-23 17:36:16
I'm wondering if there is anyone on the list who is familiar with issues such as semaphores?

The semaphore implementation in kaldi-semaphore.h uses two variables:

  pthread_mutex_t mutex_;
  pthread_cond_t cond_;

but I just noticed there is something called sem_t, a semaphore type, from the POSIX standard. Does anyone know what portability issues this would bring, and what the reasons might be for using the two things from pthreads instead?

Dan

---------- Forwarded message ----------
From: Karel Veselý <ive...@fi...>
Date: Fri, Jan 23, 2015 at 6:27 AM
Subject: Re: sem_t
To: dp...@gm...

Hi Dan,
frankly, it is hard for me to answer; years ago we learned how to implement a semaphore in pthreads, so I used the code I had from a school project. Perhaps sem_t can do the same job, and perhaps it internally uses the same functions as my implementation.
Karel.

On 23 Jan 2015 at 4:13, Daniel Povey wrote:
> Karel,
> I just found out there is a semaphore type in POSIX, called sem_t.
> Why doesn't your semaphore code use that, do you know?
> Dan

--
Karel Vesely, Brno University of Technology
ive...@fi..., +420-54114-1300
From: Vesely K. <ve...@gm...> - 2015-01-22 20:01:42
Hi Jerry,
yes, yes, that's a great idea, I'll happily look at it, and thanks for the detailed description!
Karel.

On 01/22/2015 06:59 PM, Daniel Povey wrote:
> Karel, since Jerry is offering that we can use his nnet1 LSTM code in
> Kaldi, how do you feel about doing code review on it right now, since
> it's part of the nnet1 framework? If you don't have time right now,
> though, I could find someone else.
> Dan
>
> [Jerry's full LSTM announcement quoted; see his message of 2015-01-22
> 16:11 below.]
From: Daniel P. <dp...@gm...> - 2015-01-22 17:59:18
Karel, since Jerry is offering that we can use his nnet1 LSTM code in Kaldi, how do you feel about doing code review on it right now, since it's part of the nnet1 framework? If you don't have time right now, though, I could find someone else.

Dan

On Thu, Jan 22, 2015 at 11:11 AM, dophist <do...@gm...> wrote:
> [Jerry's full LSTM announcement quoted; see his message of 2015-01-22
> 16:11 below.]
From: dophist <do...@gm...> - 2015-01-22 16:11:40
Hi Daniel & Kaldi developers,

I saw there was a thread about “WER of LSTM & DNN” in the Kaldi sourceforge forum; I'm the author of the LSTM code. This morning the thread creator Alim emailed me asking if I'd like to share my LSTM implementation with the Kaldi community; my answer is definitely a “yes”, of course.

The code is on github: https://github.com/dophist/kaldi-lstm

1. The implementation is under Karel's nnet1 framework. The whole LSTM architecture is condensed into a single configurable component. So in the forum thread, Daniel asked about the “external tool” Alim used; it's actually “internal”, and all Kaldi users will find it easy to compile and use.

2. There are two versions of my implementation, “standard” & “google”. The “standard” version can be seen as a general-purpose LSTM tool with epoch-wise BPTT; you can even adapt it to train an LSTM-LM if you want, but currently I use it only for sequential training and the decoding tool (nnet-forward). The “google” version is primarily used for cross-entropy training in my experiments. There are docs in my github repo with detailed descriptions.

3. Testing. The code has been tested on an industry-size speech corpus of around 4000+ hours that is not publicly available; my experiments reproduced google's results, and their conclusions are solid. In the last few months I have got feedback from the Siri group and Cambridge Lab and many others; I suppose they have already got similar results.

4. Legal stuff. Although I'm now working at Baidu, the coding was done in my personal spare time, so I have the freedom to make it open-sourced, under Kaldi's license.

Known issues:

1. Gradient explosion. Gradient explosion is far from solved in RNN training; gradient clipping seems to be the best practice from my own experience. It is implemented in the “standard” version, but tuning the clipping threshold can be painful across different tasks. The “google” version is less likely to explode because they limit the BPTT expansion to 20, but explosion still exists in certain cases.

2. Training speed. Training LSTMs is slow, indeed, especially when most institutes don't have huge infrastructures like DistBelief at google. My current implementation is based on nnet1, so it only uses 1 GPU card (or CPU); the training might take months to converge on an industrial-size dataset. Multiple GPU cards in a single host server won't scale as the dataset gets larger and larger, and parallelizing SGD on a GPU cluster is still an open issue; most GPU cluster solutions I know of require an InfiniBand network. Yann LeCun group's EA-SGD seems most promising to me, but I don't have time to try it. Daniel's nnet2 averaging strategy could be another promising option, but I can't be sure it will work on LSTMs.

These remaining issues (particularly training speedup) might require great effort to solve, and I'm not sure if I have enough time to do it. At least I hope my LSTM implementation can be a quick starting point towards RNN acoustic modeling for the Kaldi community.

If anyone has questions about the code, feel free to email me: jer...@gm...

And since the Chinese government occasionally blocks gmail, my back-up email address: jer...@qq...

Best,

Jerry (Jiayu DU)
From: Daniel P. <dp...@gm...> - 2015-01-21 18:23:49
The only way it could possibly deal with this well is if you had songs in the training data, and maybe had a special phone for songs that would eat up that data.

However, you may want to look at the audio-segmentation scripts; search for "reseg" in egs/wsj/s5/run.sh. This may do what you want, and Guoguo recently added an option to filter the segments it discovers by word error rate; this may have the effect of filtering out the segment of speech that had the song in it. Also, once you have the segmentation, you could try steps/cleanup/find_bad_utts.sh to find segments that seem to have difficulty aligning to their audio.

Dan

On Wed, Jan 21, 2015 at 7:36 AM, Saman Mousazadeh <smo...@gm...> wrote:
> Hi All,
> I have trained a model for alignment and I want to use that model for
> aligning an audio file. The audio contains song, i.e. it can be seen as
>
>   speech ....... SONG ... speech
>
> I use gmm-align-compiled with a pre-trained model. My problem is that the
> alignment is fine at the beginning and end, but in the middle I have
> problems. Is there any suggestion to deal with this issue?
> Thanks in advance. Attached the wav and transcript file.
> Best
> Saman
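For the last suggestion, a sketch of how the cleanup script is typically invoked; the directory names are hypothetical and the argument order and output file name are from memory, so check the script's usage message:

  # Score each training utterance by how badly it aligns to its transcript,
  # using an existing triphone system; utterances containing the song should
  # show up with poor alignment / high error in the output.
  steps/cleanup/find_bad_utts.sh --nj 4 --cmd run.pl \
    data/train data/lang exp/tri3 exp/tri3_bad_utts

  # Inspect the worst utterances (assumed output file name).
  head exp/tri3_bad_utts/all_info.sorted.txt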
From: Alexander S. <aso...@gm...> - 2015-01-19 17:17:39
Thank you guys for such a lot of information!

On Tue, Jan 20, 2015 at 12:47 AM, Blaise Potard <bp...@id...> wrote:
> [Blaise's full explanation of the FFTW licensing situation quoted; see his
> message of 2015-01-19 14:48 below.]

--
Sincerely, Alexander
From: Blaise P. <bp...@id...> - 2015-01-19 14:48:07
TL;DR: FFTW (GPL) is a pain to integrate into BSD / Apache code. Use FFTS?

To expand upon Matthew's answer, FFTW is released under the GPL (not LGPL), so according to GNU (http://www.gnu.org/licenses/gpl-faq.html#IfLibraryIsGPL):

"If a library is released under the GPL, it means that any software which uses it has to be under the GPL or a GPL-compatible license."

While LGPL code is quite easy to integrate into other projects, GPL is definitely tricky if your code is not GPL. To summarise:
- if you want to use the GPL code in your project, you need to make sure your project uses a license compatible with the GPL;
- if you want to distribute a compiled version of your project under a license other than GPL, you need to make sure the dependency on the GPL code is not mandatory.

That does not really mean it is impossible to use it in kaldi; it just means you would have to be extra careful about how you do the binding with FFTW. Arguably, all source files that have direct references to FFTW would have to be released as GPL, but that's OK: you could just write a wrapper (released as GPL) with a header (released as Apache). Provided the rest of the source files use a license compatible with the GPL - which the license used by kaldi fortunately is, cf. http://www.apache.org/licenses/GPL-compatibility.html - it is not too big of a problem. However, any version of kaldi compiled with support for FFTW would automatically be GPL.

If FFTW were a mandatory dependency of Kaldi, the main issue would arise when distributing a compiled version of the code; you would then have to distribute it as GPL, and therefore provide all the sources, even the changes you would have liked to keep for yourself. Which is a hassle, most likely not commercially viable, and severely impacts kaldi's distributability.

So if you want to be able to distribute kaldi (or its derivatives) commercially, you really need to make sure linking to fftw is optional, i.e. make sure kaldi can use other fft libraries. Which implies you would need to write an FFT wrapper class, with a header under the normal kaldi license, and a separate implementation for each of the FFT libraries to support - the one for FFTW would have to be GPL-licensed, etc.

Making this effort could make sense if the FFT were one of the most time-consuming components of kaldi, but as far as I am aware, it is only used for feature extraction, which is not really one of the most time-consuming components.

For HTK, I don't think it would be possible to use FFTW with the current HTK license, unless Microsoft bought a license for commercial FFTW inclusion.

For Sphinx, they probably could do the same as for kaldi, but for the same reason as above, it does not really make sense to make this effort, especially since almost no distributed version would want to use it.

This being said, I would say including FFTS (https://github.com/anthonix/ffts) - which has a BSD license - could make sense, especially since it appears to be even faster than FFTW:
http://www.cs.waikato.ac.nz/~ihw/papers/13-AMB-IHW-MJC-FastFourier.pdf

Blaise

On 19/01/15 11:19, Matthew Aylett wrote:
> Licensing constraints make it impossible to use fftw in apache-style /
> bsd-style software.
>
> [Alexander's original question quoted; see his message of 2015-01-19
> 09:22 below.]
From: Matthew A. <mat...@gm...> - 2015-01-19 10:19:37
Licensing constraints make it impossible to use fftw in apache-style / bsd-style software.

Best

Matthew

On Mon, Jan 19, 2015 at 9:22 AM, Alexander Solovets <aso...@gm...> wrote:
> Hi,
>
> I recently came across the fact that Kaldi, Sphinx and HTK have their own
> FFT implementations rather than using the fftw library. Is there a strong
> reason for that in the general case, as well as for Kaldi in particular?
>
> Thanks!
>
> --
> Sincerely, Alexander
From: Alexander S. <aso...@gm...> - 2015-01-19 09:22:35
Hi,

I recently came across the fact that Kaldi, Sphinx and HTK have their own FFT implementations rather than using the fftw library. Is there a strong reason for that in the general case, as well as for Kaldi in particular?

Thanks!

--
Sincerely, Alexander
From: Nagendra G. <nag...@go...> - 2015-01-14 11:35:48
I have seen work on syllables (as opposed to phonemes), and there were some publications from IBM in the '90s where they joined some word pairs into a new lexicon entry and it helped (I think on a voicemail task).

On Jan 13, 2015 6:49 PM, "Nickolay Shmyrev" <nsh...@gm...> wrote:
> [Nickolay's explanation of how decoding searches over all paths quoted in
> full; see his message of 2015-01-13 23:49 below.]
From: Nickolay S. <nsh...@gm...> - 2015-01-13 23:49:06
> On 14 Jan 2015, at 2:37, <Dan...@pa...> wrote:
>
> Hello Nickolay,
>
> Thanks very much for your thoughtful answer. My context was that I wondered
> whether there might occasionally be an advantage to mapping words to word
> phrases in G rather than assigning probabilities to words. I assumed that
> someone had tried it and it was known not to work well, since no one seemed
> to do it. I couldn't find a record of anyone trying it, so thought I'd ask.

In that context it's probably worth describing how recognition works. Many newbies are confused about this, and you might be too. People imagine that audio is converted to phones, then phones are converted to words, and then words are converted to phrases. It is not like that, because there are many, many ways to do such a conversion. Phone boundaries are blurred, and often you cannot easily decide which phone corresponds to which word. Consider the famous «wreck a nice beach» example, which can be confused with «recognize speech». You cannot make a local conversion decision; you need a global 1-best result.

So instead of doing that straightforward process, we consider all possible conversions and select the one with the globally minimum weight. So decoding is not a straightforward transducer application but a scoring of all the possible paths with an acceptor. This is where an acceptor is required and where you need to assign probabilities to results.

The decoding result is not

  G(L(audio))

it is, in simplified form,

  min_{over all possible audio splits} G(L(audio split))

Not a good discussion for the kaldi-developers mailing list; maybe we can move this off-list.
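One way to write the same idea more formally (this is an editorial paraphrase of the standard WFST decoding objective, not part of the original email): in the tropical semiring the decoder keeps every path through the composed graph and returns the one with the smallest total weight, minimizing over all alignments of the audio rather than committing to a local segmentation,

  \hat{W} = \operatorname*{arg\,min}_{W}
            \min_{\pi \in \mathrm{paths}(H \circ C \circ L \circ G,\, W)}
            \big[ -\log p(\mathbf{O} \mid \pi) - \log P(W) \big]

where O is the acoustic feature sequence, \pi ranges over state-level paths whose output label sequence is W, and P(W) comes from G.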
From: <Dan...@pa...> - 2015-01-13 23:37:55
Hello Nickolay,

Thanks very much for your thoughtful answer. My context was that I wondered whether there might occasionally be an advantage to mapping words to word phrases in G rather than assigning probabilities to words. I assumed that someone had tried it and it was known not to work well, since no one seemed to do it. I couldn't find a record of anyone trying it, so thought I'd ask.

Dan

-----Original Message-----
From: Nickolay Shmyrev [mailto:nsh...@gm...]
Sent: Tuesday, January 13, 2015 3:01 PM
To: Davies, Dan <Dan...@pa...>
Cc: Kal...@li...
Subject: Re: [Kaldi-developers] Kaldi comparison with Hydra?

> On 13 Jan 2015, at 3:35, Dan...@pa... wrote:
>
> [Dan's original question about a language model that maps words to word
> phrases quoted in full; see his message of 2015-01-13 00:36 below.]

Hello Dan,

In the ASR task we search for the most likely output label sequence given the input feature sequence. If your transducer has phrases as output labels, you'll have those phrases as output; it should be no problem. The phrases might differ from the actual words; for example, the two recognized words «back after» might output the whole phrase «We'll be back after these messages.» A sort of semantic recognition instead of just recognition of a word sequence.

If you are just interested in using grammars, you can use them in acceptor form. Maybe you could provide some more context so we can clarify.
From: 石伟 <sh...@sz...> - 2015-01-13 15:29:12
Dan, please see the patch.

I first used BaseFloatVectorWriter so that the parameters of ali-to-phones would remain unchanged, but I found the resulting output was not quite ctm-style (there are brackets '[' and ']'). So I changed the last parameter so that it can be a wspecifier or a wxfilename, depending on whether you set the ctm-output option or not.

Xavier: you can test the patch on your local machine. Using a pipeline like lattice-1best | nbest-to-linear | ali-to-phones --ctm-output, I get phone alignments like:

  2 1 0.00 0.36 SIL
  2 1 0.36 0.08 l_B
  2 1 0.44 0.06 i_E
  2 1 0.50 0.05 k_B
  2 1 0.55 0.09 e_I
  2 1 0.64 0.07 j_I
  2 1 0.71 0.20 iang_E
  2 1 0.91 0.08 zh_B
  2 1 0.99 0.08 u_I
  2 1 1.07 0.06 ch_I
  2 1 1.13 0.04 ib_E
  2 1 1.17 0.06 zh_B
  2 1 1.23 0.11 ao_I
  2 1 1.34 0.09 k_I
  2 1 1.43 0.13 ai_E
  2 1 1.56 0.06 g_B
  2 1 1.62 0.06 uo_I
  2 1 1.68 0.07 w_I
  2 1 1.75 0.06 u_I
  2 1 1.81 0.09 uxs_I
  2 1 1.90 0.10 an_E
  2 1 2.00 0.14 ch_B
  2 1 2.14 0.10 ang_I
  2 1 2.24 0.08 w_I
  2 1 2.32 0.04 u_E
  2 1 2.36 0.11 h_B
  2 1 2.47 0.06 ui_I
  2 1 2.53 0.07 y_I
  2 1 2.60 0.14 i_E
  2 1 2.74 0.40 SIL

I believe this is what you want.

Wei

------------------ Original ------------------
From: "Daniel Povey" <dp...@gm...>
Date: Tue, Jan 6, 2015 04:21 AM
To: "Xavier Anguera" <xan...@gm...>; "wei.shi" <we...@im...>; "shiwei" <sh...@sz...>
Cc: "kal...@li..." <kal...@li...>
Subject: Re: [Kaldi-developers] Phonetic decoding

Your whole pipeline is based on using the words in the lattices, not the phones. In your case the words *are* the phones, because you're using a phone bigram LM. So you need to do lattice-align-words, not lattice-align-phones. The confidence algorithm only works on words, so you need to use words.

Alternatively, if you don't need the confidences, a more efficient way to do it without lattice-align-words is to simply do lattice-1best | nbest-to-linear [only keeping the alignment output] | ali-to-phones (--write-lengths=true). You'll have to write a script to convert the output of ali-to-phones to ctm format.

Wei, if you have time, could you please work on adding a boolean option --ctm-output to the program ali-to-phones (and an option --frame-shift, default 0.01, to control the times of the ctm output)? The confidences can just be 1. This issue seems to come up repeatedly.

Dan

On Mon, Jan 5, 2015 at 12:10 PM, Xavier Anguera <xan...@gm...> wrote:
> Hi,
> I am trying to perform phonetic decoding in Kaldi where I would like to
> obtain a final ctm file with a time-aligned 1-best phone sequence given my
> input audio. I must be missing something, as the decoded phones look good
> but their timings are not accurate at all. Here is what I am doing:
>
> 1) I create a phone bigram LM with utils/make_phone_bigram_lang.sh
> 2) I combine LM and acoustic models into a recognition graph with
>    utils/mkgraph.sh
> 3) I perform the decoding of the input audio with steps/decode_si.sh
> 4) Obtain the 1-best CTM using the following command:
>    lattice-align-phones --output-error-lats=true $hmm/final.mdl \
>      "ark:gunzip -c $decodedir/lat.*.gz |" ark:- | \
>    lattice-to-ctm-conf --decode-mbr=true --acoustic-scale=$acwt ark:- - | \
>    utils/int2sym.pl -f 5 $graph_or_lang/words.txt > $odir/$name.ctm || exit 1;
>
> Note that when using the same acoustic models for word decoding I get very
> good word-starting times. In this case I am using, in step 4,
> lattice-align-words instead; could this be the problem?
>
> Thanks,
>
> X. Anguera
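Putting the pieces of this thread together, a sketch of the full phone-CTM pipeline; it assumes the --ctm-output and --frame-shift options from Wei's patch are available and that the option writes integer phone ids, and the model/lattice paths and acoustic scale are illustrative:

  # Phone-level CTM from existing lattices, one best path per utterance.
  model=exp/tri_phone_bigram/final.mdl
  dir=exp/tri_phone_bigram/decode

  gunzip -c $dir/lat.*.gz | \
    lattice-1best --acoustic-scale=0.1 ark:- ark:- | \
    nbest-to-linear ark:- ark:- | \
    ali-to-phones --ctm-output --frame-shift=0.01 $model ark:- - | \
    utils/int2sym.pl -f 5 data/lang/phones.txt > $dir/phones.ctm   # ids -> symbols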
From: <Dan...@pa...> - 2015-01-13 00:36:00
I apologize in advance for asking a newbie question. I've been googling around and haven't seen an obvious answer.

In the same sense that the Lexicon maps phonemes to words, what happens if the Language Model is set up as a Finite State Transducer instead of a Finite State Acceptor and maps words to word phrases? Most of the phrases would be very short (1-2 words), but when used in constrained applications, there might be an interesting number of longer phrases. For example, "How may I help you today?" and "We'll be back after these messages."

Dan
From: Daniel P. <dp...@gm...> - 2015-01-08 22:12:25
No, there is no way to find out earlier. However, in the egs/wsj/s5 setup, see the example script local/run_segmentation.sh, which demonstrates how you can segment long recordings into smaller pieces (thanks to Guoguo Chen, who added it).

Dan

On Thu, Jan 8, 2015 at 3:06 AM, Saman Mousazadeh <smo...@gm...> wrote:
> [Saman's question about alignment of long audio files failing to reach the
> end state quoted in full; see his message of 2015-01-08 08:06 below.]
From: Saman M. <smo...@gm...> - 2015-01-08 08:06:36
Hi Everybody,

I use Kaldi to align text and audio. I have a pre-trained model. My audio files are long. Sometimes gmm-align-compiled cannot reach the end state, and hence it fails to align the data. Is there any way to find this out earlier, i.e. without waiting until the end of the processing?

Best
Saman
From: Daniel P. <dp...@gm...> - 2015-01-07 05:33:10
It's not that much of an issue, because you can easily produce lattices using a small-ish language model, within real time, and rescore them with a much larger one.

Dan

On Tue, Jan 6, 2015 at 7:43 PM, <Dan...@pa...> wrote:
> [Dan Davies' follow-up about Hydra's large language models and accuracy
> expectations quoted in full, including the earlier Hydra-comparison
> exchange; see his message of 2015-01-07 03:43 below.]
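A sketch of the decode-small / rescore-big workflow being described; the script names and directory layout follow the standard recipes as remembered (a pruned LM compiled into the graph for decoding, a const-arpa form of the big LM for rescoring), so treat the exact paths and arguments as illustrative:

  # 1) Decode in (near) real time with a small, pruned LM compiled into HCLG.
  steps/online/nnet2/decode.sh --nj 8 --cmd run.pl \
    exp/nnet2_online/graph_small data/test exp/nnet2_online/decode_test_small

  # 2) Rescore the resulting lattices with a much larger LM (const-arpa form),
  #    which never has to be compiled into the decoding graph.
  steps/lmrescore_const_arpa.sh \
    data/lang_small data/lang_big_const_arpa \
    data/test exp/nnet2_online/decode_test_small exp/nnet2_online/decode_test_big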
From: <Dan...@pa...> - 2015-01-07 03:43:33
Thanks! I won't have to spend a lot of time fretting about the relative merits of Kaldi and whatever ASR system pops up next. Your outline for using Kaldi's decoder is greatly appreciated.

The Hydra folks claim that there's significant value in having a language model that's so large that it's unreasonable to incorporate it into the HCLG WFST. However, figures 3 and 4 in http://www.cs.cmu.edu/~ianlane/publications/2012_Kim_Interspeech.pdf appear to show that a standard bigram language model achieves comparable accuracy with a real-time factor that's still comfortably under 1. Is that a fair conclusion?

I'm trying to set expectations in my organization. With a word accuracy of 95%, the user still has to fix every 20th word on average. That's about one error in each line of text. Does it seem likely that someone will make substantial improvements in accuracy (like 99%) with an online decoder that has a real-time factor of less than 1? With any real-time factor?

Dan

-----Original Message-----
From: Daniel Povey [mailto:dp...@gm...]
Sent: Tuesday, January 06, 2015 4:37 PM
To: Davies, Dan <Dan...@pa...>
Cc: kal...@li...
Subject: Re: [Kaldi-developers] Kaldi comparison with Hydra?

Interestingly, Hydra is what I wanted to call the Kaldi project (I was outvoted).

It's not really possible to compare the two. Hydra is a closed-source decoder, and it's only a decoder; it doesn't have a system for building models like Kaldi does.

I would imagine that the Hydra decoder is faster, since they've obviously put a lot of effort into it, but the online-nnet2 decoder is sufficiently fast, in that you can get it to decode in real time fairly easily, without much loss in accuracy, by using suitable beams and a large enough chunk size (e.g. 20 frames), and by configuring your matrix library (ATLAS, OpenBLAS, MKL) to use, say, 2 threads. Although it would be very easy to use GPUs for the neural-net part of the computation, there hasn't been much demand for it, because if you can decode in real time using a couple of CPU cores, it'll generally be more efficient in terms of hardware cost than using one CPU core plus a GPU.

Note that the online-nnet2 decoder is not really a decoder per se; it just calls the standard decoding code in lattice-faster-decoder.h, which isn't that complicated. But the online-nnet2 code takes care of various online feature-estimation issues and of batching up the features into suitable-size chunks so that matrix operations in the neural-net code will be fast.

Dan

On Tue, Jan 6, 2015 at 4:02 PM, <Dan...@pa...> wrote:
> CMU's Hydra ASR decoder made a splash out here. From the references below
> (or any other info you can find), does anyone have a feeling for how this
> compares with the Kaldi nnet2 online decoder in speed and accuracy?
>
> http://www.cs.cmu.edu/~ianlane/publications/2012_Kim_Interspeech.pdf
> http://on-demand.gputechconf.com/gtc/2013/presentations/S3406-HYDRA-Hybrid-CPU-GPU-Speech-Recognition-Engine.pdf
> http://www.cs.cmu.edu/~ianlane/publications/SLT_JungsukKim.pdf
> http://www.nvidia.com/content/cuda/spotlights/ian-lane-cmu.html
>
> Dan
From: 石伟 <sh...@sz...> - 2015-01-07 03:10:37
Sorry, I didn't check my mailbox yesterday. Dan, is this still needed?

Wei

------------------ Original ------------------
From: "Daniel Povey" <dp...@gm...>
Date: Tue, Jan 6, 2015 04:21 AM
Subject: Re: [Kaldi-developers] Phonetic decoding

> [Dan's request to add a --ctm-output option to ali-to-phones quoted in
> full; see the same quote in Wei's reply of 2015-01-13 15:29 above.]