From: Charles C. <cha...@nv...> - 2015-06-23 14:53:51
Both the HTK and SRILM toolkits document the ARPA file format (credit to Doug Paul of MIT). See, for example, the following documentation links:

SRILM: http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
HTK: http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/node213_mn.html

It probably dates back at least as far as the original WSJ Corpus (early 1990s), e.g., "The Design for the Wall Street Journal-based CSR Corpus" by Doug Paul and Janet Baker (of Dragon fame), published in "HLT '91 Proceedings of the workshop on Speech and Natural Language".

Perhaps there is a history buff on this mailing list who can provide the definitive answer...

-Charles

-----Original Message-----
From: Daniel Povey [mailto:dp...@gm...]
Sent: Monday, June 22, 2015 1:57 PM
To: Kirill Katsnelson; Guoguo Chen
Cc: kal...@li...
Subject: Re: [Kaldi-developers] spaces in ngram declarations in the \data\ section

I don't know if there is a formal definition of the ARPA format; things like this come up occasionally. The easiest thing is to just allow the format ngram= 12344 as well as ngram = 12345, and also print a warning for any lines after the \data\ marker that are not interpretable.
Guoguo, could you do this?

Dan

On Mon, Jun 22, 2015 at 3:45 PM, Kirill Katsnelson <kir...@sm...> wrote:
> Some LM files have spaces in ngram declarations in the \data\ section:
>
> \data\
> ngram 1=150000
> ngram 2= 9774628
> ngram 3= 44845299
>
> \1-grams:
> -7.89095 <s> -2.06214
> -2.92635 don't -1.85988
>
> arpa-to-const-arpa does not like them in a peculiar way. Namely, it bombs out on the 1st unigram with a backoff weight, because it has decided that 1 is the final order. Looking at the code, the library code in src/lm/const-arpa-lm.cc does not expect any space except between "ngram" and the rest of the line, silently skipping any line that begins with "ngram" and is tokenized on spaces into fewer or more than 2 tokens.
> See line 316
>   if (keyword_found && col.size() == 2 && col[0] == "ngram") {
> -- not even a warning if "col.size() == 2" is false.
>
> Are these spaces legit? Should the tool be fixed, or the grammar? I never saw a formal spec of ARPA LM in my life.
>
> -kkm
>
> _______________________________________________
> Kaldi-developers mailing list
> Kal...@li...
> https://lists.sourceforge.net/lists/listinfo/kaldi-developers
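
For illustration, the tolerant handling suggested in this thread (accept optional spaces around "=" in the \data\ section and warn about any line that cannot be interpreted) might look roughly like the sketch below. This is not the Kaldi code and not the eventual fix; ParseNgramDecl and ReadDataSection are hypothetical names, the sketch assumes every declaration carries an explicit order as in "ngram 2= 9774628", and it does no overflow checking.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of Kaldi): parses one \data\ line such as
// "ngram 1=150000", "ngram 2= 9774628" or "ngram 3 = 5".
// Returns true and fills *order and *count only if the whole line matches.
bool ParseNgramDecl(const std::string &line, int *order, int64_t *count) {
  size_t pos = 0;
  auto is_space = [&](size_t i) {
    return std::isspace(static_cast<unsigned char>(line[i])) != 0;
  };
  auto skip_space = [&]() {
    while (pos < line.size() && is_space(pos)) ++pos;
  };
  // Reads a run of decimal digits; fails if there is none (no overflow check).
  auto read_digits = [&](int64_t *out) {
    size_t start = pos;
    int64_t value = 0;
    while (pos < line.size() &&
           std::isdigit(static_cast<unsigned char>(line[pos]))) {
      value = value * 10 + (line[pos] - '0');
      ++pos;
    }
    *out = value;
    return pos > start;
  };

  skip_space();
  if (line.compare(pos, 5, "ngram") != 0) return false;
  pos += 5;
  // At least one space must separate the keyword from the order.
  if (pos >= line.size() || !is_space(pos)) return false;
  skip_space();
  int64_t n = 0, c = 0;
  if (!read_digits(&n)) return false;
  skip_space();  // optional space before '='
  if (pos >= line.size() || line[pos] != '=') return false;
  ++pos;
  skip_space();  // optional space after '='
  if (!read_digits(&c)) return false;
  skip_space();
  if (pos != line.size()) return false;  // trailing junk
  *order = static_cast<int>(n);
  *count = c;
  return true;
}

// Hypothetical driver: stores the counts per order and warns on stderr about
// every non-blank line that is not a valid declaration.
// Returns the number of warnings issued.
int ReadDataSection(const std::vector<std::string> &lines,
                    std::map<int, int64_t> *counts) {
  int num_warnings = 0;
  for (const std::string &line : lines) {
    if (line.find_first_not_of(" \t\r\n") == std::string::npos) continue;
    int order = 0;
    int64_t count = 0;
    if (ParseNgramDecl(line, &order, &count)) {
      (*counts)[order] = count;
    } else {
      std::cerr << "warning: uninterpretable line in \\data\\ section: "
                << line << "\n";
      ++num_warnings;
    }
  }
  return num_warnings;
}
```

Scanning character by character, rather than splitting on spaces and requiring exactly two tokens, is what lets all three spacing variants through while still rejecting (and reporting) anything else.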