Re: [Kaldi-developers] arpa-to-const-arpa RAM use


Hi,
I don't recall exactly how much memory that arpa conversion for the 4-gram
takes, but I guess it must be more than 32G.  Just skip the 4-gram models
for now.  Regarding the patches: please send them to me at dp...@gm....
After I look at them we'll discuss the best way to get them checked in.

Guoguo, I just had a look at the code for the const-arpa LM conversion and
I noticed a couple of things that can be improved in const-arpa-lm.cc, for
both speed and memory consumption.  Here:

          KALDI_ASSERT(seq_to_state_.find(hist) != seq_to_state_.end());
          seq_to_state_[hist]->AddChild(std::make_pair(word, lm_state));

it would be better to re-use the iterator returned by find in order to
avoid two consecutive lookups in the seq_to_state_ map.  In addition, you
could have a variable "std::vector<int32> cur_hist" equal to the previous
history-state, and if hist is the same as cur_hist you can avoid the
associative array lookup; this might be a little faster in the normal case
(worth testing though).
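
A minimal sketch of both ideas, not the actual const-arpa-lm.cc code.  The
LmState struct, the ParentCache class and the use of std::map (rather than
whatever hash map the real code uses) are assumptions made only to keep the
example self-contained; plain assert stands in for KALDI_ASSERT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the LmState class in const-arpa-lm.cc.
struct LmState {
  std::vector<std::pair<int32_t, LmState*> > children;
  void AddChild(std::pair<int32_t, LmState*> child) {
    children.push_back(child);
  }
};

typedef std::vector<int32_t> History;
// Assumption: std::map is used here only to avoid writing a vector hasher.
typedef std::map<History, LmState*> SeqToState;

// Remembers the most recently used history so that consecutive n-grams
// sharing that history skip the map lookup entirely.
class ParentCache {
 public:
  explicit ParentCache(const SeqToState &seq_to_state)
      : seq_to_state_(seq_to_state), cur_state_(NULL), num_lookups_(0) {}

  LmState *Find(const History &hist) {
    if (cur_state_ == NULL || hist != cur_hist_) {
      // A single find(); the returned iterator is used for both the
      // existence check and the access.
      SeqToState::const_iterator it = seq_to_state_.find(hist);
      ++num_lookups_;
      assert(it != seq_to_state_.end());
      cur_hist_ = hist;
      cur_state_ = it->second;
    }
    return cur_state_;
  }

  // Number of map lookups actually performed (for testing the idea).
  int num_lookups() const { return num_lookups_; }

 private:
  const SeqToState &seq_to_state_;
  History cur_hist_;
  LmState *cur_state_;
  int num_lookups_;
};
```

With something like this the two original lines would become
"cache.Find(hist)->AddChild(std::make_pair(word, lm_state));".  Whether the
extra vector comparison pays for itself depends on how often consecutive
n-grams share a history, so it should be measured.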

Secondly about memory consumption, I noticed that you create this:
        LmState *lm_state = new LmState(is_unigram, logprob, backoff_logprob);
regardless of whether order == ngram_order_ or not.  (note: I would rename
the variable order to cur_order).  If order == ngram_order_, there is no
reason to allocate this or to insert it into the seq_to_state_ table.  This
is probably responsible for the bulk of the memory consumption.
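
A rough sketch of that change, under stated assumptions: the Builder class,
the simplified LmState and the leaf_logprobs field are hypothetical, and
storing the highest-order log-probability directly in the parent is just one
possible way to keep the information once the LmState is no longer allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for LmState.
struct LmState {
  float logprob;
  float backoff_logprob;
  // Children of lower order, which need their own state.
  std::vector<std::pair<int32_t, LmState*> > children;
  // Highest-order children: only the log-probability is kept.
  std::vector<std::pair<int32_t, float> > leaf_logprobs;
  LmState(float lp, float bo) : logprob(lp), backoff_logprob(bo) {}
};

typedef std::vector<int32_t> WordSeq;
typedef std::map<WordSeq, LmState*> SeqToState;

class Builder {
 public:
  explicit Builder(int ngram_order) : ngram_order_(ngram_order) {}
  ~Builder() {
    for (SeqToState::iterator it = seq_to_state_.begin();
         it != seq_to_state_.end(); ++it)
      delete it->second;
  }

  // Assumes n-grams arrive in increasing order, as in an ARPA file, so the
  // history of every n-gram has already been seen.
  void Consume(const WordSeq &seq, float logprob, float backoff_logprob) {
    int cur_order = static_cast<int>(seq.size());
    LmState *lm_state = NULL;
    // Allocate a state only if some higher-order n-gram may extend it
    // (or if the model is unigram-only).
    if (cur_order < ngram_order_ || ngram_order_ == 1) {
      lm_state = new LmState(logprob, backoff_logprob);
      seq_to_state_[seq] = lm_state;
    }
    if (cur_order > 1) {
      WordSeq hist(seq.begin(), seq.end() - 1);
      SeqToState::iterator it = seq_to_state_.find(hist);
      assert(it != seq_to_state_.end());
      if (lm_state != NULL)
        it->second->children.push_back(std::make_pair(seq.back(), lm_state));
      else
        it->second->leaf_logprobs.push_back(
            std::make_pair(seq.back(), logprob));
    }
  }

  size_t NumStates() const { return seq_to_state_.size(); }

  const LmState *GetState(const WordSeq &seq) const {
    SeqToState::const_iterator it = seq_to_state_.find(seq);
    return it == seq_to_state_.end() ? NULL : it->second;
  }

 private:
  int ngram_order_;
  SeqToState seq_to_state_;
};
```

In a 4-gram model the highest-order entries are typically the most numerous,
so skipping both the allocation and the table insertion for them is where
the saving would come from.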

Dan

On Wed, Mar 11, 2015 at 4:42 PM, Kirill Katsnelson <
kir...@sm...> wrote:

> I am running out of RAM in arpa-to-const-arpa in librispeech/s5 for
> the 4-gram model.
>
> The input argument to arpa-to-const-arpa is the massaged data from
> data/local/lm/lm_fglarge.arpa.gz (61M 4-grams additional).
>
> The 3-gram file passed with ~16G of peak commit size. The 4-gram crashed
> with OOM overnight, running out of 32G available to it.
>
> What memory usage should I expect?
>
> On the Windows port side: pipes are fixed, the build system works, and I
> have advanced as far as decoding the unigram GMM model through the recipe.
> The troubles I am getting are from Cygwin files. First, absolute paths do
> not work, as files in Cygwin are essentially chrooted to a virtual root
> path. Second, links do not work. So far I am tweaking the scripts to get
> past these problems, but there is a general solution that handles the paths
> in code. I want to get my experiments through first, however, as I am
> already spending a lot of time on the technical stuff.
>
> Progress is here <https://github.com/kkm000/kaldi/compare/winbuild>, but
> the history is messy, I'll itemize it.
>
> I fixed a weird bug that would not be caught with gcc because of
> constructor elision, and also supported WAVEFORMATEXTENSIBLE in wave files
> (my flac 1.3.1 sends this format to stdout). How can I send patches for
> these changes? Let's start with windows-unrelated patches now.
>
>  -kkm
>
>