I have a problem with the language modelling toolkit. My vocabulary file starts with this header:
## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
## Includes 67731 words ##
The toolkit produces its files correctly up to text2wngram, but wngram2idngram fails to produce any output: the resulting file is 0 bytes.
With a 32k vocabulary, the same toolkit produced LM output. Is it a strict rule that the vocabulary must not exceed 65,000 words?
I am using the following program to create the language model. I have used the same program for all my experiments.
#!/usr/bin/perl
# Create "grammars" for Asterisk-Sphinx integration.
# (c) 2009, Christopher Jansen
# Based on SimpleLM by Ricky Houghton
use strict;
use FileHandle;
# Show usage unless all three arguments are supplied.
if(@ARGV < 3)
{
    print("usage: $0 INTEXT INOUTDICT OUTGRAMMAR\nexample: $0 ./test.txt ./text.dict mygrammar\n");
    print(" INTEXT: plain-text file with one sentence to recognize per line\n");
    print(" INOUTDICT: dictionary file to create; note that switching grammars requires all grammars to share a dictionary\n");
    print(" OUTGRAMMAR: grammar file to create\n\n");
    print("Edit script to change location of master dictionary, temp files, or sphinx binaries.\n");
    exit(0);
}
my $intext = shift;
my $outdictfile = shift;
my $outgrammarfile = shift;
my $sphinxbindir = "/home/bharadwaj/speech/cmuclmtk/bin";
my $workdir = "/home/bharadwaj/speech/";
my $indictfile = "/home/bharadwaj/speech/";
my $CLEANUP = 0; # Remove temporary files?
# parse input - format for use (remove all non-text chars, lowercase, wrap in <s> </s>)
print "Creating $workdir/text\n";
my $ifh = new FileHandle("< $intext") || die("Cannot read: $intext");
my $tfh = new FileHandle("> $workdir/text") || die("Cannot write: $workdir/text");
while(my $line = <$ifh>)
{
    chomp $line;
    # Character class as posted: it strips every character except spaces.
    $line =~ s/[^ ]//go;
    # Sentence markers assumed; the tags were stripped from the post.
    $tfh->print("<s> $line </s>\n");
}
$ifh->close;
$tfh->close;
# Create CCS file (Cargo-culted from original script - I have no idea what
# this is for.)
if(!-s "$workdir/ccs.ccs")
{
    print "Creating $workdir/ccs.ccs\n";
    my $ccs = new FileHandle("> $workdir/ccs.ccs") || die("Cannot write to $workdir/ccs.ccs\n");
    # Context cue marker assumed to be <s>; the tag was stripped from the post.
    $ccs->print("<s>");
    $ccs->close;
}
# Create wfreq (word frequency)
print "Creating $workdir/wfreq\n";
my $cmdline = "cat $workdir/text | $sphinxbindir/text2wfreq 2>/dev/null | sort -T . > $workdir/wfreq";
my $progoutput = qx/$cmdline/;
# Create vocab
print "Creating $workdir/vocab\n";
$cmdline = "cat $workdir/wfreq | $sphinxbindir/wfreq2vocab -top 75000 2>/dev/null > $workdir/vocab";
$progoutput = qx/$cmdline/;
# Create wngram
print "Creating $workdir/wngram\n";
$cmdline = "cat $workdir/text | $sphinxbindir/text2wngram -n 3 -temp $workdir 2>/dev/null > $workdir/wngram";
$progoutput = qx/$cmdline/;
# Create idngram
print "Creating $workdir/idngram\n";
$cmdline = "cat $workdir/wngram | $sphinxbindir/wngram2idngram -vocab $workdir/vocab -n 3 -temp $workdir 2>/dev/null > $workdir/idngram";
$progoutput = qx/$cmdline/;
# Create grammar (lm)
print "Creating $outgrammarfile\n";
$cmdline = "$sphinxbindir/idngram2lm -vocab $workdir/vocab -idngram $workdir/idngram -arpa $outgrammarfile -vocab_type 1 -good_turing -n 3 -calc_mem -context $workdir/ccs.ccs -four_byte_counts -verbosity 1 2>/dev/null";
$progoutput = qx/$cmdline/;
# Optionally delete interim files.
if($CLEANUP){
    foreach my $tfn (qw{ccs.ccs text wngram vocab idngram wfreq})
    {
        # print rather than printf, so a "%" in a path is not treated as a format
        print("Deleting: $workdir/$tfn\n");
        unlink("$workdir/$tfn");
    }
}
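Every stage in the script above sends stderr to /dev/null, so any message wngram2idngram writes when it fails is discarded. A minimal sketch of a helper that keeps stderr in a log file and reports whether a stage succeeded and how large its output is. The name `run_stage` and the log file argument are hypothetical, and the sketch assumes paths without spaces or shell metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: run one shell pipeline, writing stdout to $outfile
# and stderr of the last command in the pipeline to $logfile.
# Returns (ok, size): ok is 1 if the shell reported success, size is the
# output file size in bytes (0 if the file is empty or missing).
sub run_stage {
    my ($cmd, $outfile, $logfile) = @_;
    system("$cmd > $outfile 2> $logfile");
    my $ok   = ($? == 0) ? 1 : 0;
    my $size = (-s $outfile) || 0;
    return ($ok, $size);
}
```

For example, the idngram step could be run as `run_stage("cat $workdir/wngram | $sphinxbindir/wngram2idngram -vocab $workdir/vocab -n 3 -temp $workdir", "$workdir/idngram", "$workdir/idngram.log")`, and the log inspected whenever the returned size is 0.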
Yes. To build a bigger model you need to use other tools, SRILM for example.
It will not be very useful, though, since only sphinx3 supports large language
models.
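Given the limit confirmed above, the vocabulary size can be checked before wngram2idngram is run. A minimal sketch, assuming the vocab file has one word per line with header lines starting with `##` (as in the header quoted in the question). The exact limit of 65,535 entries is an assumption based on 16-bit word IDs; the thread itself only says 65,000. The names `count_vocab` and `$MAX_VOCAB` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed limit (16-bit word IDs); adjust if your toolkit build differs.
my $MAX_VOCAB = 65_535;

# Count vocabulary entries, skipping blank lines and "##" header lines.
sub count_vocab {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot read $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*$/;   # blank line
        next if $line =~ /^##/;     # header comment
        $count++;
    }
    close($fh);
    return $count;
}

# Run the check only when executed directly with a vocab file argument.
if (!caller && @ARGV) {
    my $n = count_vocab($ARGV[0]);
    die "Vocabulary has $n words; assumed limit is $MAX_VOCAB\n" if $n > $MAX_VOCAB;
    print "Vocabulary has $n words\n";
}
```

If the count is too high, lowering the `-top` value passed to wfreq2vocab (75000 in the script above) is one way to stay under the limit, at the cost of dropping the least frequent words.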