From: Daniel P. <dp...@gm...> - 2015-03-11 21:24:48
Hi,
I don't recall exactly how much memory the ARPA conversion for the 4-gram
model takes, but I guess it must be more than 32G. Just skip the 4-gram
models for now. Regarding the patches: please send them to me at dp...@gm....
After I look at them we'll discuss the best way to get them checked in.
Guoguo, I just had a look at the code for the const-arpa LM conversion and
I noticed a couple of things that can be improved in const-arpa-lm.cc, for
both speed and memory consumption. Here:
KALDI_ASSERT(seq_to_state_.find(hist) != seq_to_state_.end());
seq_to_state_[hist]->AddChild(std::make_pair(word, lm_state));
it would be better to reuse the iterator returned by find(), in order to
avoid two consecutive lookups in the seq_to_state_ table. In addition, you
could keep a variable "std::vector<int32> cur_hist" equal to the previous
history (together with its state), and if hist is the same as cur_hist you
can skip the associative-array lookup entirely. This might be a little
faster in the normal case (worth testing though).
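A minimal sketch of both ideas, not the actual Kaldi code: LmState here is a
stripped-down stand-in, ParentFinder is a hypothetical helper, std::map is
assumed for the table (the real container may differ), and assert stands in
for KALDI_ASSERT.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for the LmState class in const-arpa-lm.cc (assumption: only the
// child list matters for this illustration).
struct LmState {
  std::vector<std::pair<int32_t, LmState*>> children;
  void AddChild(const std::pair<int32_t, LmState*> &child) {
    children.push_back(child);
  }
};

// Assumed table type: history word sequence -> state.
typedef std::map<std::vector<int32_t>, LmState*> StateMap;

// Hypothetical helper. It does one find() per lookup and remembers the last
// history, so a run of n-grams sharing a history costs a single lookup.
class ParentFinder {
 public:
  explicit ParentFinder(const StateMap &states)
      : states_(states), cur_state_(nullptr), num_lookups_(0) {}

  LmState *Find(const std::vector<int32_t> &hist) {
    if (cur_state_ == nullptr || hist != cur_hist_) {
      // Single lookup: the iterator is checked and then dereferenced,
      // instead of calling find() and then operator[].
      StateMap::const_iterator it = states_.find(hist);
      ++num_lookups_;
      assert(it != states_.end());
      cur_hist_ = hist;
      cur_state_ = it->second;
    }
    return cur_state_;
  }

  // Number of table lookups performed so far (for testing the cache).
  int num_lookups() const { return num_lookups_; }

 private:
  const StateMap &states_;
  std::vector<int32_t> cur_hist_;  // history of the previous call
  LmState *cur_state_;             // state found for cur_hist_
  int num_lookups_;
};
```

With this, the two lines above would become something like
finder.Find(hist)->AddChild(std::make_pair(word, lm_state)). The cache holds
a pointer to the state rather than an iterator, so it stays valid while new
entries are inserted into the table.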
Secondly about memory consumption, I noticed that you create this:
LmState *lm_state = new LmState(is_unigram, logprob, backoff_logprob);
regardless of whether order == ngram_order_ or not. (Note: I would rename
the variable order to cur_order.) If order == ngram_order_, there is no
reason to allocate this or to insert it into the seq_to_state_ table, since
a highest-order n-gram is never the history of another n-gram. This is
probably responsible for the bulk of the memory consumption.
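A rough sketch of that idea, under stated assumptions: Builder and this
simplified LmState are hypothetical, not the classes in const-arpa-lm.cc,
and storing highest-order entries as (word, logprob) leaves in the parent
is only one possible way to avoid the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for LmState (assumption).
struct LmState {
  float logprob;
  float backoff_logprob;
  // Children that can themselves be histories (order < ngram_order).
  std::vector<std::pair<int32_t, LmState*>> children;
  // Highest-order children: no state object, just word and log-prob.
  std::vector<std::pair<int32_t, float>> leaves;
  LmState(float lp, float bo) : logprob(lp), backoff_logprob(bo) {}
};

// Hypothetical builder. Entries must arrive in ARPA order (lower orders
// first) so that every history is already in the table.
class Builder {
 public:
  explicit Builder(int ngram_order) : ngram_order_(ngram_order) {
    NewState(std::vector<int32_t>(), 0.0f, 0.0f);  // root: empty history
  }

  // seq is the full n-gram: history words followed by the predicted word.
  void Add(const std::vector<int32_t> &seq, float logprob,
           float backoff_logprob) {
    assert(!seq.empty());
    const int cur_order = static_cast<int>(seq.size());
    assert(cur_order <= ngram_order_);
    const std::vector<int32_t> hist(seq.begin(), seq.end() - 1);
    auto it = seq_to_state_.find(hist);
    assert(it != seq_to_state_.end());
    LmState *parent = it->second;
    if (cur_order == ngram_order_) {
      // Never used as a history: no allocation, no table entry.
      parent->leaves.push_back(std::make_pair(seq.back(), logprob));
    } else {
      LmState *state = NewState(seq, logprob, backoff_logprob);
      parent->children.push_back(std::make_pair(seq.back(), state));
    }
  }

  std::size_t NumStates() const { return seq_to_state_.size(); }

  // Returns nullptr if seq has no state in the table.
  const LmState *Find(const std::vector<int32_t> &seq) const {
    auto it = seq_to_state_.find(seq);
    return it == seq_to_state_.end() ? nullptr : it->second;
  }

 private:
  LmState *NewState(const std::vector<int32_t> &seq, float lp, float bo) {
    storage_.push_back(std::unique_ptr<LmState>(new LmState(lp, bo)));
    LmState *state = storage_.back().get();
    seq_to_state_[seq] = state;
    return state;
  }

  int ngram_order_;
  std::vector<std::unique_ptr<LmState>> storage_;  // owns all states
  std::map<std::vector<int32_t>, LmState*> seq_to_state_;
};
```

In a model where most entries are of the highest order, the table then holds
only the lower-order n-grams.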
Dan
On Wed, Mar 11, 2015 at 4:42 PM, Kirill Katsnelson <
kir...@sm...> wrote:
> I am running out of RAM in arpa-to-const-arpa in librispeech/s5 for
> the 4-gram model.
>
> The input argument to arpa-to-const-arpa is the massaged data from
> data/local/lm/lm_fglarge.arpa.gz (an additional 61M 4-grams).
>
> The 3-gram file passed with ~16G of peak commit size. The 4-gram crashed
> with OOM overnight, running out of the 32G available to it.
>
> What memory usage should I expect?
>
> On the Windows port side: pipes are fixed, the build system works, and I
> have advanced as far as decoding the unigram GMM model through the recipe.
> The troubles I am getting are from Cygwin files. First, absolute paths do
> not work, as files in Cygwin are essentially chrooted to a virtual root
> path. Second, links do not work. So far I am tweaking the scripts to get
> past the problems, but there is a general solution to handle the paths in
> code. I want to get my experiments through first, however, having already
> spent a lot of time on the technical stuff.
>
> Progress is here <https://github.com/kkm000/kaldi/compare/winbuild>, but
> the history is messy; I'll itemize it.
>
> I fixed a weird bug that would not be caught with gcc because of
> constructor elision, and also added support for WAVEFORMATEXTENSIBLE in
> wave files (my flac 1.3.1 sends this format to stdout). How can I send
> patches for these changes? Let's start with the Windows-unrelated patches.
>
> -kkm
>
>
> _______________________________________________
> Kaldi-developers mailing list
> Kal...@li...
> https://lists.sourceforge.net/lists/listinfo/kaldi-developers
>