I have a problem with the language modelling toolkit.

The header of the generated vocab file reads:

## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 67731 words ##

The toolkit produces correct files up to and including the text2wngram step,
but wngram2idngram fails to produce any output: the idngram file is 0 bytes.

With a 32k-word vocabulary, the same toolkit produced LM output. Is it a strict
rule that the vocabulary must not exceed 65,000 words?
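A quick way to compare a vocab file against a 16-bit limit is sketched below. It assumes the limit is 65,535 entries (the largest value a two-byte word ID can hold), which is an assumption and not something confirmed in this post. The names `count_vocab_entries` and `$ASSUMED_MAX_VOCAB` are hypothetical, and the vocab file is assumed to hold one word per line after its `##` header lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: word IDs are 16-bit, so at most 65535 entries fit.
# This limit is not confirmed by the post; adjust if the toolkit differs.
my $ASSUMED_MAX_VOCAB = 65535;

# Count vocabulary entries, skipping blank lines and "##" header lines.
sub count_vocab_entries {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot read $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*$/;   # blank line
        next if $line =~ /^##/;     # toolkit header comment
        $count++;
    }
    close($fh);
    return $count;
}

# When called with a vocab file path, report its size.
if (@ARGV) {
    my $n = count_vocab_entries($ARGV[0]);
    print "$ARGV[0]: $n entries\n";
    print "Exceeds assumed limit of $ASSUMED_MAX_VOCAB\n"
        if $n > $ASSUMED_MAX_VOCAB;
}
```

If the count is above the assumed limit, lowering the `-top` value passed to wfreq2vocab would be one way to test whether vocabulary size is what makes wngram2idngram fail.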

I am using the following script to create the language model. I have used the
same script for all my experiments.

#!/usr/bin/perl
#
# Create "grammars" for Asterisk-Sphinx integration.
#
# (c) 2009, Christopher Jansen
#
# Based on SimpleLM by Ricky Houghton

use strict;
use FileHandle;

# All three arguments are required.
if(@ARGV < 3)
{
    print("usage: $0 INTEXT INOUTDICT OUTGRAMMAR\nexample: $0 ./test.txt ./text.dict mygrammar\n");
    print(" INTEXT: plain-text file with one sentence to recognize per line\n");
    print(" INOUTDICT: dictionary file to create; note that switching grammars requires all grammars to share a dictionary\n");
    print(" OUTGRAMMAR: grammar file to create\n\n");
    print("Edit script to change location of master dictionary, temp files, or sphinx binaries.\n");
    exit(0);
}

my $intext = shift;
my $outdictfile = shift;
my $outgrammarfile = shift;

my $sphinxbindir = "/home/bharadwaj/speech/cmuclmtk/bin";
my $workdir = "/home/bharadwaj/speech/";
my $indictfile = "/home/bharadwaj/speech/";
my $CLEANUP = 0; # Remove temporary files?

# parse input - format for use (remove all non-text chars, lowercase, wrap in <s> </s>)
print "Creating $workdir/text\n";
my $ifh = new FileHandle("< $intext") || die("Cannot read: $intext");
my $tfh = new FileHandle("> $workdir/text") || die("Cannot write: $workdir/text");
while(my $line = <$ifh>)
{
    chomp $line;
    $line =~ s/[^ ]//go;
    $tfh->print("<s> $line </s>\n");
}
$ifh->close;
$tfh->close;

# Create CCS file (Cargo-culted from original script - I have no idea what this is for.)
if(!-s "$workdir/ccs.ccs")
{
    print "Creating $workdir/ccs.ccs\n";
    my $ccs = new FileHandle("> $workdir/ccs.ccs") || die("Cannot write to $workdir/ccs.ccs\n");
    $ccs->print("");
    $ccs->close;
}

# Create wfreq (word frequency)
print "Creating $workdir/wfreq\n";
my $cmdline = "cat $workdir/text | $sphinxbindir/text2wfreq 2>/dev/null | sort -T . > $workdir/wfreq";
my $progoutput = qx/$cmdline/;

# Create vocab
print "Creating $workdir/vocab\n";
$cmdline = "cat $workdir/wfreq | $sphinxbindir/wfreq2vocab -top 75000 2>/dev/null > $workdir/vocab";
$progoutput = qx/$cmdline/;

# Create wngram
print "Creating $workdir/wngram\n";
$cmdline = "cat $workdir/text | $sphinxbindir/text2wngram -n 3 -temp $workdir 2>/dev/null > $workdir/wngram";
$progoutput = qx/$cmdline/;

# Create idngram
print "Creating $workdir/idngram\n";
$cmdline = "cat $workdir/wngram | $sphinxbindir/wngram2idngram -vocab $workdir/vocab -n 3 -temp $workdir 2>/dev/null > $workdir/idngram";
$progoutput = qx/$cmdline/;

# Create grammar (lm)
print "Creating $outgrammarfile\n";
$cmdline = "$sphinxbindir/idngram2lm -vocab $workdir/vocab -idngram $workdir/idngram -arpa $outgrammarfile -vocab_type 1 -good_turing -n 3 -calc_mem -context $workdir/ccs.ccs -four_byte_counts -verbosity 1 2>/dev/null";
$progoutput = qx/$cmdline/;

# Optionally delete interim files.
if($CLEANUP)
{
    foreach my $tfn (qw{ccs.ccs text wngram vocab idngram wfreq})
    {
        print("Deleting: $workdir/$tfn\n");
        unlink("$workdir/$tfn");
    }
}
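The script sends every tool's stderr to /dev/null, so any message printed by wngram2idngram is discarded. A minimal sketch of running one stage with its diagnostics captured follows. The helper names `run_step` and `check_step` are hypothetical and not part of the original script, and a POSIX shell is assumed.

```perl
use strict;
use warnings;

# Run a shell command line in a subshell and return (exit_code, messages).
# Anything the command writes to stderr is captured; stdout that the command
# itself redirects to a file still goes to that file.
sub run_step {
    my ($cmdline) = @_;
    my $messages = qx{( $cmdline ) 2>&1};
    my $exit = $? >> 8;
    return ($exit, $messages);
}

# Run one pipeline stage and verify that it exited cleanly and that its
# output file is non-empty. Returns 1 on success, 0 on failure.
sub check_step {
    my ($name, $cmdline, $outfile) = @_;
    my ($exit, $messages) = run_step($cmdline);
    my $size = (-s $outfile) || 0;   # -s is false for missing or empty files
    if ($exit != 0 || $size == 0) {
        print STDERR "$name failed (exit $exit, $size bytes written)\n$messages";
        return 0;
    }
    return 1;
}
```

To use it for the idngram stage, the `2>/dev/null` would be dropped from that stage's `$cmdline` and the call would look like `check_step("wngram2idngram", $cmdline, "$workdir/idngram") or exit(1);`, so that whatever the tool reports is printed instead of hidden.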
