From: Vassil P. <vas...@gm...> - 2012-06-24 08:41:32
Hi,

I don't have much experience with wsj/s3 myself, but as far as I can see the version in Kaldi's trunk has this environment variable set in egs/wsj/s3/path.sh. I think if you "source" (. ./path.sh) this script before running the rest of the recipe LC_ALL should be already set for you.
By the way the currently recommended version of WSJ recipe is "s5", which I think is stable already.

Vassil

On Sat, Jun 23, 2012 at 5:14 PM, Andrew Rosenberg <an...@cs...> wrote:
> Hi all,
>
> I've run into a problem with language model training in the wsj
> training recipe s3. During the LM training an error shows up that i'm
> not quite sure how to fix.
>
> Train LM 3gram min count
> Getting raw N-gram counts
> generating n grams
> discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
> discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
> discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
> error: histories are not in sorted order, "?? ۰" > "? ??"
> merge_ngrams: merge_ngrams.cc:141: void process_line(char*): Assertion
> `comp > 0 || (comp == 0 &&
> entry.predicted.compare(stack.back().predicted) >= 0)' failed.
>
> Looking at the "error: ..." line in emacs rather than the console, to
> see what the '?' characters actually were, it was clear that the issue
> was with the sorting of the ngram tokens.
>
> The recipe runs error free until
> local/wsj_train_lms.sh
>
> within this the line that gives the problem (first) is
>
> train_lm.sh --arpa --lmtype 3gram-mincount $dir
>
> digging deeper into train_lm.sh the line that generates the error is
>
> gunzip -c $dir/train.gz | tail -n +$heldout_sents | \
> get_raw_ngrams 3 | sort | uniq -c | uniq_to_ngrams | \
> sort | discount_ngrams $subdir/config.get_ngrams | \
> sort | merge_ngrams | gzip -c > $subdir/ngrams.gz
>
> Suspecting this is a problem with how sort is working, i tried sort
> -n to see if it would fix the issue to no avail.
>
> Ultimately the fix was to "export LC_ALL=C" to ensure POSIX style sorting.
>
> This is clearly an environment problem, but I figured you guys would
> want to know about it. I'm doing this in bash on CentOS. (if more
> environment information would be useful, let me know.)
>
> Thanks very much for putting this tool together. I'm really enjoying
> getting to know it.
>
> Andrew
>
> _______________________________________________
> Kaldi-developers mailing list
> Kal...@li...
> https://lists.sourceforge.net/lists/listinfo/kaldi-developers
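
For reference, the effect of the locale fix discussed in this thread can be sketched in a few lines of shell. This is only an illustration, assuming a POSIX-compatible sort; the sample tokens are made up and do not come from the WSJ data, and the snippet is not part of the Kaldi recipe.

```shell
#!/usr/bin/env bash
# Force byte-wise (POSIX "C" locale) collation for everything that follows,
# which is what "export LC_ALL=C" does for the sort calls in the pipeline.
export LC_ALL=C

# Hypothetical sample tokens. In byte order, uppercase ASCII letters
# sort before lowercase ones.
sorted=$(printf '%s\n' b A a B | sort)
echo "$sorted"

# "sort -c" only checks the order: it exits non-zero if the input is not
# sorted under the current locale.
printf '%s\n' "$sorted" | sort -c && echo "input is sorted under LC_ALL=C"
```

Tools that compare strings byte by byte expect input ordered this way, so a sort run under a different locale can produce an order they reject, as with the merge_ngrams assertion quoted above.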