From: Daniel P. <dp...@gm...> - 2015-03-11 21:24:48
Hi,

I don't recall exactly how much memory the arpa conversion for the 4-gram takes, but I guess it must be more than 32G. Just skip the 4-gram models for now.

Regarding the patches: please send them to me at dp...@gm.... After I look at them we'll discuss the best way to get them checked in.

Guoguo, I just had a look at the code for the const-arpa LM conversion and noticed a couple of things in const-arpa-lm.cc that can be improved, for both speed and memory consumption.

First, speed. Here:

  KALDI_ASSERT(seq_to_state_.find(hist) != seq_to_state_.end());
  seq_to_state_[hist]->AddChild(std::make_pair(word, lm_state));

it would be better to re-use the iterator returned by find, to avoid two consecutive lookups in the seq_to_state_ table. In addition, you could keep a variable "std::vector<int32> cur_hist" equal to the previous history-state; if hist is the same as cur_hist you can skip the associative-array lookup altogether. This might be a little faster in the normal case (worth testing, though).

Second, memory consumption. I noticed that you create this:

  LmState *lm_state = new LmState(is_unigram, logprob, backoff_logprob);

regardless of whether order == ngram_order_ or not. (Note: I would rename the variable order to cur_order.) If order == ngram_order_, there is no reason to allocate this or to insert it into the seq_to_state_ table. This is probably responsible for the bulk of the memory consumption.

Dan

On Wed, Mar 11, 2015 at 4:42 PM, Kirill Katsnelson <kir...@sm...> wrote:
> I am running out of RAM in arpa-to-const-arpa in librispeech/s5 for
> the 4-gram model.
>
> The input argument to arpa-to-const-arpa is the massaged data from
> data/local/lm/lm_fglarge.arpa.gz (61M 4-grams additional).
>
> The 3-gram file passed with ~16G of peak commit size. The 4-gram crashed
> with OOM overnight, running out of the 32G available to it.
>
> What memory usage should I expect?
>
> On the Windows port side: pipes are fixed, the build system works, and I
> have advanced as far as decoding the unigram GMM model through the recipe.
> The troubles I am getting are from Cygwin files. First, absolute paths do
> not work, as files in Cygwin are essentially chrooted to a virtual root
> path. Second, links do not work. So far I am tweaking the scripts to get
> past these problems, but there is a general solution to handle the paths
> in code. However, I want to get my experiments through first; I am already
> spending a lot of time on the technical stuff.
>
> Progress is here <https://github.com/kkm000/kaldi/compare/winbuild>, but
> the history is messy; I'll itemize it.
>
> I fixed a weird bug that would not be caught with gcc because of
> constructor elision, and also added support for WAVEFORMATEXTENSIBLE in
> wave files (my flac 1.3.1 sends this format to stdout). How can I send
> patches for these changes? Let's start with the Windows-unrelated patches.
>
> -kkm
>
> _______________________________________________
> Kaldi-developers mailing list
> Kal...@li...
> https://lists.sourceforge.net/lists/listinfo/kaldi-developers
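
For reference, the suggestions in Dan's reply (do a single find and re-use its result, cache the previous history, and skip allocating an LmState for highest-order n-grams) can be sketched roughly as below. This is a minimal, self-contained illustration, not the actual const-arpa-lm.cc code. Only the names seq_to_state_, ngram_order_, cur_order, cur_hist and LmState come from the message; the Builder class, the LmState layout, the leaf_children field, the counters, and the use of std::map in place of Kaldi's hash table are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for LmState in const-arpa-lm.cc.
struct LmState {
  float logprob;
  float backoff_logprob;
  // Children that have their own state (lower-order n-grams).
  std::vector<std::pair<int32_t, LmState*> > children;
  // Highest-order children: only (word, logprob), no LmState allocated.
  std::vector<std::pair<int32_t, float> > leaf_children;
  LmState(float lp, float bo) : logprob(lp), backoff_logprob(bo) {}
};

// Hypothetical builder; n-grams must arrive in ARPA order (lower orders first).
class Builder {
 public:
  explicit Builder(int ngram_order)
      : ngram_order_(ngram_order), cur_state_(nullptr), num_lookups_(0) {}

  // seq holds the history followed by the predicted word.
  void AddNgram(const std::vector<int32_t> &seq, float logprob,
                float backoff_logprob) {
    assert(!seq.empty());
    const int cur_order = static_cast<int>(seq.size());
    const int32_t word = seq.back();
    LmState *parent = nullptr;
    if (cur_order > 1) {
      std::vector<int32_t> hist(seq.begin(), seq.end() - 1);
      parent = FindHistory(hist);
    }
    if (cur_order == ngram_order_ && parent != nullptr) {
      // Highest order: nothing can extend this n-gram, so store the
      // probability in the parent and skip both allocation and insertion.
      parent->leaf_children.push_back(std::make_pair(word, logprob));
      return;
    }
    std::unique_ptr<LmState> state(new LmState(logprob, backoff_logprob));
    LmState *raw = state.get();
    seq_to_state_[seq] = std::move(state);
    if (parent != nullptr)
      parent->children.push_back(std::make_pair(word, raw));
  }

  std::size_t NumStates() const { return seq_to_state_.size(); }
  int NumLookups() const { return num_lookups_; }

 private:
  // Returns the state for hist, using the cached previous history if possible.
  LmState *FindHistory(const std::vector<int32_t> &hist) {
    if (cur_state_ != nullptr && hist == cur_hist_) return cur_state_;
    // One find; the iterator is re-used instead of calling operator[] again.
    std::map<std::vector<int32_t>, std::unique_ptr<LmState> >::iterator it =
        seq_to_state_.find(hist);
    assert(it != seq_to_state_.end());
    ++num_lookups_;
    cur_hist_ = hist;
    cur_state_ = it->second.get();
    return cur_state_;
  }

  int ngram_order_;
  std::map<std::vector<int32_t>, std::unique_ptr<LmState> > seq_to_state_;
  std::vector<int32_t> cur_hist_;  // previous history ("cur_hist" above)
  LmState *cur_state_;             // state belonging to cur_hist_
  int num_lookups_;                // table lookups actually performed
};
```

With a trigram model, feeding the n-grams (1), (2), (1 2), (1 2 1), (1 2 2) in that order leaves only the three lower-order entries in the table, and the second trigram re-uses the cached history instead of doing another lookup. Whether the history cache is a measurable win in the real code would still need testing, as noted in the message.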