CMU Sphinx / Forums / Speech Recognition Theory: building a language model

R. Paul McCarty - 2000-02-01

In order to make any use of the sphinx package, you really need to build a new language model, correct? I've seen the web page where you can submit a corpus:

http://alf14.speech.cs.cmu.edu:8044/lmtool.html

and get back a model, but my first attempt failed to give me anything. But assuming it works, it returns a .lm file, correct? Then we need the pronunciation dictionary, which we can generate using:

http://alf14.speech.cs.cmu.edu:8044/pronounce.html

What else is needed?
What are the *.map and *.phone files for? Can these be used in a new model? Or are new ones generated by the page above?

-Paul

- Kevin A. Lenzo - 2000-02-02
  
  Actually, the LM building page gives you a tarball that contains both a dictionary (.dict) and a language model (.lm) [and other things, but ignore those for now :)].
  
  It turns out that there is a script to turn what the CMU pronouncing dictionary has in it into what sphinx2 wants. Getting the pronunciation and language modeling tools ready is one of the front-burner projects; people need to be able to do things on their own machines without depending on the net in order to do dynamic language models in domains that have new words -- like window managers.
  
  kevin
  
  - Paul Fenwick - 2000-02-02
    
    Kevin, do you have the location of the script to convert the CMU dictionary into a .dic and/or .lm files that sphinx2 wants? You mentioned it exists, but not where it is.
    
    - Paul Fenwick - 2000-02-02
      
      The following is a very simple script which can take a cmudict file, and a list of words, and generate a dic file suitable for sphinx2. It's only had very very minor testing.
      
      With permission from a project admin I'll submit this into the CVS source-tree.
      
      #!/usr/bin/perl -w
      use strict;
      
      =head1 NAME
      
      make-dic - make a sphinx2 dictionary from the CMU dictionary.
      
      =head1 SYNOPSIS
      
      make-dic [-d dictionary-location] < wordlist > mywords.dic
      
      =head1 DESCRIPTION
      
      make-dic takes a list of words (one per line) and converts them into dic
      format as used by sphinx2. It requires the cmu dictionary (or similar).
      By default it looks in /usr/local/share/sphinx2/cmudict for the dictionary,
      but this can be overridden using the -d option.
      
      =head1 BUGS
      
      The script is still very simple. It expects one word per line and
      gets unhappy if it gets anything else.
      
      Punctuation is not handled correctly.
      
      =head1 AUTHOR
      
      Paul Fenwick <pjf@schools.net.au>
      
      =cut
      
      use Getopt::Std;
      my %options = ('d'=>0);
      
      getopts('d:h', \%options);
      
      if ($options{'h'}) {
              print "Try 'perldoc $0' for help.\n";
              exit 0;
      }
      
      my $dictionary = $options{'d'} || "/usr/local/share/sphinx2/cmudict";
      
      open DICT, $dictionary or die "Cannot open $dictionary - $!";
      
      # Find all the words in the file we've been given.
      my %words;
      my @line;
      while (<>) {
              @line = ();
              chomp;
              @line = split(/\s+/,$_);
              next unless @line;
              foreach my $word (@line) {
                      $word = uc $word;
                      $words{$word} = 1;
              }
      }
      
      # Take our list of words, and go through the dictionary to find them.
      while (<DICT>) {
              chomp;
              # Alternate pronunciations in cmudict look like WORD(2), so match an optional "(n)" suffix.
              my ($word, $instance, $phones) = /^(\w+)(\(\d+\))?\s+(.*)/;
              next unless $word;
              $phones =~ tr/0-9//d;
              if (exists $words{$word}) {
                      if ($instance) {
                              print "$word$instance\t$phones\n";
                      } else {
                              print "$word\t$phones\n";
                      }
              }
      }
      __END__
      
      - Kevin A. Lenzo - 2000-02-03
        
        There is one catch -- the phone set in CMUDICT is slightly different from the one in the current hmm/4k models. When I get back from NYC I'll make sure a transformation script is made available if someone doesn't beat me to it :)
        
        The phoneset differs in two major ways: the 4k model phoneset doesn't use lexical stress, and it inserts "deletable stops" when a stop consonant (p, t, k, b, d, g) precedes another stop or is word-final (or sentence-final). There is also a conflation involving AX.
        
        We should really do some tests on the usefulness of deletable stops as base phones of triphones. I think these are legacies from the days before triphones and training advanced as far as they have; of course, they've been working for us.
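
The two differences named above (no lexical stress, deletable stops) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the transformation script mentioned in the thread: the function names are hypothetical, the "D" suffix used to mark a deletable stop (for example KD, TD) is an assumption about the 4k phone names, and the AX conflation is not handled.

```python
import re

# Stop consonants listed in the post above.
STOPS = {"P", "T", "K", "B", "D", "G"}


def strip_stress(phones):
    """Drop the stress digits that CMUdict appends to vowels (AH0 -> AH)."""
    return [re.sub(r"\d", "", p) for p in phones]


def mark_deletable_stops(phones, suffix="D"):
    """Mark a stop as deletable when it precedes another stop or is word-final.

    The suffix is an assumption about how the 4k models name these phones.
    """
    out = []
    for i, phone in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if phone in STOPS and (nxt is None or nxt in STOPS):
            out.append(phone + suffix)
        else:
            out.append(phone)
    return out


def cmudict_to_4k(pronunciation):
    """Apply both steps to a space-separated CMUdict pronunciation."""
    return mark_deletable_stops(strip_stress(pronunciation.split()))


if __name__ == "__main__":
    print(" ".join(cmudict_to_4k("AE1 K T")))
```

For the CMUdict entry `AE1 K T` this yields `AE KD TD`: K precedes another stop and T is word-final. A real conversion would need more rules than these two.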
        
- Kevin A. Lenzo - 2000-02-02
  
  The .map and .phone files are vestigial right now. There are generic map and phone files in the model/hmm/4k directory that apply to that set of acoustic models. The phoneset in the map and phone files that the LM page generates is tailored to the particular task, but is currently incompatible with the 4k HMMs in the distribution.
  
  kevin
  
- Ricky Houghton - 2000-02-03
  
  I completed the first pass at a perl script for generating the necessary sphinx LM data files. I put the two necessary files in CVS tonight under SimpleLM. The script requires the CMU-Cambridge LM Toolkit binaries as well as cmudict.
  
  SimpleLM.pl takes as input a text document, does the necessary pre-processing (but not text normalization) and outputs a .dict, a .arpabo and a .DMP file for SPHINX. There are a few internal parameters, the most important being the ability to limit the final size of the dictionary. It is set to 3000.
  
  I will release a version that does weighted merging of documents once we can generate pronunciations for words not in cmudict.
  
  Ricky
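
Limiting the dictionary size, as described above, could look something like the sketch below. It is not taken from SimpleLM.pl; the function name `top_words` is hypothetical, and uppercasing plus whitespace splitting stand in for whatever pre-processing the real script does. Only the default of 3000 comes from the post.

```python
from collections import Counter


def top_words(lines, max_words=3000):
    """Return the max_words most frequent words in the corpus, most frequent first."""
    counts = Counter()
    for line in lines:
        # Uppercase to match the dictionary convention used elsewhere in this thread.
        counts.update(line.upper().split())
    return [word for word, _ in counts.most_common(max_words)]


if __name__ == "__main__":
    corpus = ["open the window", "close the window", "open the file"]
    for word in top_words(corpus, max_words=2):
        print(word)
```

Words that fall outside the cut would then have to be dropped from the training text or mapped to an unknown-word token before the language model is built.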
  
  - Kevin A. Lenzo - 2000-02-03
    
    CMUDICT has a slightly different phone set than the current one in the hmm/4k models; for instance, there are (somewhat vestigial) deletable stops. There is a script (somewhere) called new2sphinx that used to do this mapping.
    
    We should seriously consider removing deletable stops. It's my opinion that these are left over from the pre-triphone days.
    In any event, the lmtool page mistakenly makes all word-final stops deletable, so when these pronunciations are interpolated, the phone sequence is wrong. (not actually "wrong," since they're "deletable," but at least inaccurate).
    
    kevin
    
    - Alex Rudnicky - 2000-02-03
      
      Note that the transform from cmudict to sphinx dict involves the application of about a dozen phonological rules and so is not a straight mapping. We'll make the relevant script available.
      
      I agree that deletable stops are, well, disposable. Once a new set of models is available, the dictionaries can be adjusted. Of course the fun really starts when people start to experiment with arbitrary phone sets. Each new phone set implies a new rule set.
      
      You're right that the tool more-or-less simply concatenates pronunciations in the case of compound words. It does currently use splicing rules (e.g., to deal with geminates) but it looks like these should be extended to deal with other cases.
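
As a rough illustration of concatenation with a geminate splicing rule, here is a minimal sketch. The tool's actual rules are not shown in this thread, so the single rule here (merge a phone that is doubled across the seam) and the function names are assumptions.

```python
def splice(left, right):
    """Concatenate two phone lists, merging a phone doubled across the seam."""
    if left and right and left[-1] == right[0]:
        return left + right[1:]
    return left + right


def compound_pronunciation(parts):
    """Build a compound word's pronunciation from its parts, left to right."""
    result = []
    for phones in parts:
        result = splice(result, phones)
    return result


if __name__ == "__main__":
    book = ["B", "UH", "K"]
    case = ["K", "EY", "S"]
    print(" ".join(compound_pronunciation([book, case])))
```

Joining B UH K and K EY S gives B UH K EY S, with the doubled K collapsed into one. Deletable stops at the seam are one of the other cases such a rule set would have to cover.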
      
- R. Paul McCarty - 2000-02-09
  
  I've tried the web page to generate a model from a plain-text corpus of sentences, and every time I run it I get nothing back: just a page that starts loading and reads "building corpus". When the page stops loading (after about 5 minutes), nothing happens.
  
  Can someone check the logs and tell me if there is something wrong?
  
  Any suggestions?
  
  -Paul
  
  - Kevin A. Lenzo - 2000-02-10
    
    Can you give us a URL or something to the file you're uploading? I can do some diagnosis with that.
    
    kevin
    
- Anonymous - 2000-03-17
  
  We would like to build a large-vocabulary language model. I read over most of the documentation of the CMU/CSL Modeling Toolkit, but I don't see how this will work with sphinx. I've created just about all the types of files that are specified in the documentation. Now which ones do I need for sphinx? Our vocabulary is at the limit of 65535 words allowed by the tools. We wanted to use the CMU Dictionary, but it's too big, although we could set the 4-bit flag specified in the documentation. Any ideas? Please help!
  
  Thanks Edgar
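
One possible way around a dictionary that is too big is to keep only the entries for words that actually occur in the corpus, in the spirit of the make-dic script earlier in the thread. The sketch below assumes cmudict-style lines (`WORD  PHONES`, with alternates written `WORD(2)`); the function names are hypothetical and the 65535 figure is the limit quoted in the question.

```python
MAX_VOCAB = 65535  # limit quoted in the question above


def corpus_vocabulary(lines):
    """Collect the distinct (uppercased) words in a corpus."""
    vocab = set()
    for line in lines:
        vocab.update(line.upper().split())
    return vocab


def restrict_dictionary(dict_lines, vocab):
    """Keep only dictionary entries whose base word appears in vocab."""
    kept = []
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue
        # "HELLO(2)" is an alternate pronunciation of "HELLO".
        base = fields[0].split("(")[0]
        if base in vocab:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    vocab = corpus_vocabulary(["hello there", "hello again"])
    print(len(vocab) <= MAX_VOCAB)
    cmudict = ["HELLO  HH AH0 L OW1", "HELLO(2)  HH EH0 L OW1", "WORLD  W ER1 L D"]
    for entry in restrict_dictionary(cmudict, vocab):
        print(entry)
```

Corpus words with no dictionary entry are not reported by this sketch; they would still need pronunciations from somewhere else.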
  
  - c0re - 2000-03-25
    
    I downloaded the language models I uploaded to the server, un-tar-gzipped them into a directory, and edited sphinx2-demo to point to that directory. It doesn't get past the "Initializing..." message. Am I missing something? Is there any kind of documentation on this?
    
    - Kevin A. Lenzo - 2000-03-30
      
      Try running sphinx2-simple instead of sphinx2-demo. It gives a lot more diagnostic info.
      
      kevin
      