
Problem with CMU LMTK

2010-10-06
2012-09-22
  • vijayabharadwaj gsr

    I have a problem with the language modelling toolkit. The header of the
    generated vocab file reads:

        ## Vocab generated by v2 of the CMU-Cambridge Statistical
        ## Language Modeling toolkit.
        ## Includes 67731 words ##

    The toolkit runs correctly up to and including text2wngram, but
    wngram2idngram fails to produce any output: the idngram file is 0 bytes.

    The same toolkit produced an lm for a 32k vocabulary. Is there a strict
    rule that the vocabulary must not exceed 65,000 words?
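    The 0-byte idngram would be consistent with the vocabulary overflowing a 16-bit word-ID limit. A quick way to check the size of a generated vocab file (the helper name and the 16-bit interpretation are my assumptions; the `##` header prefix matches the vocab header quoted above):

    ```shell
    # check_vocab_size FILE - count the real entries in a wfreq2vocab vocab
    # file (header/comment lines start with "##") and compare against the
    # 65535 words that fit in a 16-bit word ID.
    check_vocab_size() {
        n=$(grep -cv '^##' "$1")
        echo "vocabulary size: $n"
        if [ "$n" -gt 65535 ]; then
            echo "over the 16-bit word-ID limit"
        fi
    }
    ```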

    I am using the following program to create the language model; I have
    used the same program for all my experiments.

    #!/usr/bin/perl
    # Create "grammars" for Asterisk-Sphinx integration.
    # (c) 2009, Christopher Jansen
    # Based on SimpleLM by Ricky Houghton

    use strict;
    use FileHandle;

    if(@ARGV != 3)
    {
        print("usage: $0 INTEXT INOUTDICT OUTGRAMMAR\nexample: $0 ./test.txt ./text.dict mygrammar\n");
        print("  INTEXT: plain-text file with one sentence to recognize per line\n");
        print("  INOUTDICT: dictionary file to create; note that switching grammars requires all grammars to share a dictionary\n");
        print("  OUTGRAMMAR: grammar file to create\n\n");
        print("Edit script to change location of master dictionary, temp files, or sphinx binaries.\n");
        exit(0);
    }

    my $intext = shift;
    my $outdictfile = shift;
    my $outgrammarfile = shift;

    my $sphinxbindir = "/home/bharadwaj/speech/cmuclmtk/bin";
    my $workdir = "/home/bharadwaj/speech/";
    my $indictfile = "/home/bharadwaj/speech/";
    my $CLEANUP = 0; # Remove temporary files?

    # Parse input - format for use (remove all non-text chars, lowercase,
    # wrap in <s> ... </s>)
    print "Creating $workdir/text\n";
    my $ifh = new FileHandle("< $intext") || die("Cannot read: $intext");
    my $tfh = new FileHandle("> $workdir/text") || die("Cannot write: $workdir/text");
    while(my $line = <$ifh>)
    {
        chomp $line;
        $line = lc($line);
        $line =~ s/[^a-z' ]//go;   # keep letters, apostrophes, spaces
                                   # (character class restored; the forum ate it)
        $tfh->print("<s> $line </s>\n");
    }
    $ifh->close;
    $tfh->close;

    # Create CCS file (Cargo-culted from original script - I have no idea
    # what this is for.)
    if(!-s "$workdir/ccs.ccs")
    {
        print "Creating $workdir/ccs.ccs\n";
        my $ccs = new FileHandle("> $workdir/ccs.ccs") || die("Cannot write to $workdir/ccs.ccs\n");
        $ccs->print("<s>\n");   # context cue: sentence-start marker
        $ccs->close;
    }

    # Create wfreq (word frequency list)
    print "Creating $workdir/wfreq\n";
    my $cmdline = "cat $workdir/text | $sphinxbindir/text2wfreq 2>/dev/null | sort -T . > $workdir/wfreq";
    my $progoutput = qx/$cmdline/;

    # Create vocab
    print "Creating $workdir/vocab\n";
    $cmdline = "cat $workdir/wfreq | $sphinxbindir/wfreq2vocab -top 75000 2>/dev/null > $workdir/vocab";
    $progoutput = qx/$cmdline/;

    # Create wngram
    print "Creating $workdir/wngram\n";
    $cmdline = "cat $workdir/text | $sphinxbindir/text2wngram -n 3 -temp $workdir 2>/dev/null > $workdir/wngram";
    $progoutput = qx/$cmdline/;

    # Create idngram
    print "Creating $workdir/idngram\n";
    $cmdline = "cat $workdir/wngram | $sphinxbindir/wngram2idngram -vocab $workdir/vocab -n 3 -temp $workdir 2>/dev/null > $workdir/idngram";
    $progoutput = qx/$cmdline/;

    # Create grammar (lm)
    print "Creating $outgrammarfile\n";
    $cmdline = "$sphinxbindir/idngram2lm -vocab $workdir/vocab -idngram $workdir/idngram -arpa $outgrammarfile -vocab_type 1 -good_turing -n 3 -calc_mem -context $workdir/ccs.ccs -four_byte_counts -verbosity 1 2>/dev/null";
    $progoutput = qx/$cmdline/;

    # Optionally delete interim files.
    if($CLEANUP)
    {
        foreach my $tfn (qw{ccs.ccs text wngram vocab idngram wfreq})
        {
            print("Deleting: $workdir/$tfn\n");
            unlink("$workdir/$tfn");
        }
    }
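    Every stage in the script sends stderr to /dev/null, so whatever error wngram2idngram prints is being discarded. A first debugging step (a sketch, using the same paths the script uses) is to re-run just that stage with diagnostics visible:

    ```shell
    # Re-run only the failing stage, keeping stderr on the terminal so the
    # toolkit can report why idngram ends up empty.
    cd /home/bharadwaj/speech
    cat wngram | ./cmuclmtk/bin/wngram2idngram \
        -vocab vocab -n 3 -temp . > idngram
    ```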

     
  • Nickolay V. Shmyrev

    The same toolkit produced an lm for a 32k vocabulary. Is there a strict
    rule that the vocabulary must not exceed 65,000 words?

    Yes. To build a bigger model you need to use other tools, for example
    SRILM. It will not be very useful anyway, since only sphinx3 supports
    large language models.
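    For reference, a rough SRILM equivalent of the pipeline in the posted script. This is a sketch only: it assumes SRILM's ngram-count is installed and on PATH, and the CMU vocab file would need its "##" header lines stripped first, since SRILM expects one word per line.

    ```shell
    # Build a trigram ARPA model with SRILM instead of the CMU toolkit;
    # SRILM does not have the 65k vocabulary limit.
    ngram-count -order 3 \
        -text /home/bharadwaj/speech/text \
        -vocab /home/bharadwaj/speech/vocab \
        -lm mygrammar.arpa
    ```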

     
