Hello,
I've extended the DynamicLanguageModel class so you can give it a larger text corpus and create new language models on-the-fly. Now I'd like to save the model to a file so I can load it with pocketsphinx too.
Any ideas how to do this?
I know the probabilities for certain n-grams (1-3) are saved in the logProbs HashMap, and I can print a list of all the n-grams and their linear probabilities with this code:
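A sketch of it, with plain string keys standing in for the real WordSequence objects and base-10 logs assumed (sphinx-4's LogMath may use a different base internally):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramDump {
    // Convert each stored log probability back to a linear probability.
    // Assumes base-10 logs (the ARPA convention); sphinx-4's LogMath may
    // use a different log base internally.
    static Map<String, Double> linearProbs(Map<String, Float> logProbs) {
        Map<String, Double> linear = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : logProbs.entrySet()) {
            linear.put(e.getKey(), Math.pow(10, e.getValue()));
        }
        return linear;
    }

    public static void main(String[] args) {
        // Toy stand-in for the model's logProbs map
        Map<String, Float> logProbs = new LinkedHashMap<>();
        logProbs.put("hello", -1.0f);       // unigram
        logProbs.put("hello world", -0.5f); // bigram
        for (Map.Entry<String, Double> e : linearProbs(logProbs).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```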
But if I look at a simple trigram model (created with a web service), I see the word sequences given with two numbers (one before and one after the word sequence), and I don't fully understand what they mean. Also, I don't know what to do with the logBackoffs.
Any language model experts around? :-)
http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats
http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
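In the ARPA format described there, the number before the word sequence is its base-10 log probability, and the number after it is the base-10 log backoff weight, for example:

```
\2-grams:
-0.3010 i am    -0.1761
-0.6021 am fine
```

Here -0.3010 is log10 P(am | i), and -0.1761 is the log10 backoff weight applied when a trigram starting with "i am" is not in the model; the backoff weight is omitted for the highest order.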
Last edit: Nickolay V. Shmyrev 2015-03-04
Thanks! With the help of the links I managed to write the store() method :-D
I made several changes to the DynamicTrigramModel class but tried to keep it compatible with the old one. I'm not sure, though, whether I succeeded ^^ (see attachment).
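The store() method boils down to something like this (a simplified sketch, not the actual attachment; the name save and the flat string-keyed maps are illustrative):

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

public class ArpaStore {
    // Simplified sketch of an ARPA dump: counts header, then one block per
    // n-gram order. Keys are space-joined words; values are log10 values.
    static void save(Writer out, Map<String, Float> logProbs, Map<String, Float> logBackoffs) {
        PrintWriter w = new PrintWriter(out);
        // Group the n-grams by order; TreeMap keeps both the orders and the
        // n-grams within each order sorted for the dump.
        TreeMap<Integer, TreeMap<String, Float>> byOrder = new TreeMap<>();
        for (Map.Entry<String, Float> e : logProbs.entrySet()) {
            int order = e.getKey().split("\\s+").length;
            byOrder.computeIfAbsent(order, k -> new TreeMap<>()).put(e.getKey(), e.getValue());
        }
        w.println("\\data\\");
        for (Map.Entry<Integer, TreeMap<String, Float>> o : byOrder.entrySet()) {
            w.println("ngram " + o.getKey() + "=" + o.getValue().size());
        }
        for (Map.Entry<Integer, TreeMap<String, Float>> o : byOrder.entrySet()) {
            w.println();
            w.println("\\" + o.getKey() + "-grams:");
            for (Map.Entry<String, Float> e : o.getValue().entrySet()) {
                Float bo = logBackoffs.get(e.getKey());
                // log probability, the words, then the optional backoff weight
                w.println(e.getValue() + " " + e.getKey() + (bo == null ? "" : " " + bo));
            }
        }
        w.println();
        w.println("\\end\\");
        w.flush();
    }
}
```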
It's hard to assess your changes because it is not a diff, but from what I can see, the probability calculation is broken. Next time, consider sending a pull request on GitHub.
Last edit: Nickolay V. Shmyrev 2015-03-04
Hi Alexander,
thanks for looking at the code!
Can you be a bit more specific, maybe? The additional code I added at the top is basically just another for-loop that runs through the complete text and adds all the sentences one by one to the nGrams HashMap. It ends before the probability calculation, and I didn't change anything there. The rest is only about saving the model.
I checked some of the probabilities manually and they seem reasonable. Also, when I use the model, I get very good results with a vocabulary of around 500 words created from around 600 sentences. So to me it looks like it's working :-)
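The extra loop is essentially this kind of counting pass (a sketch with illustrative names, not the actual fields of the class):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramCount {
    // Run through the whole corpus sentence by sentence and count every
    // n-gram up to maxOrder in a single map.
    static Map<String, Integer> countNGrams(List<String> sentences, int maxOrder) {
        Map<String, Integer> nGrams = new HashMap<>();
        for (String sentence : sentences) {
            String[] words = sentence.trim().split("\\s+");
            for (int n = 1; n <= maxOrder; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    // Space-join the window of n words to form the key
                    String key = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    nGrams.merge(key, 1, Integer::sum);
                }
            }
        }
        return nGrams;
    }
}
```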
Dear Florian
Thank you for your contribution! I could commit it, but I hope you'll enjoy fixing a few remaining issues first.
1) It is better to throw the exception rather than ignore it when saving the model
2) Method names could be simple verbs (save) instead of (saveIt)
3) It is better to save to a stream for compatibility, not just to a file
4) It is better to use PrintWriter to store text files
5) It is better to exit early instead of increasing indentation with one big nested block. Instead of

    if (!allocated) {
        allocate();
    }

it's better to use something like

    if (allocated) {
        return;
    }
    allocate();
6) I'm not sure why you replaced the split on whitespace symbols \s+ with a split on a plain space
7) I am not sure why you need another map with bigrams when you can just access the key set in logProbs; that would give you the same list of sequences.
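Regarding 6), the difference shows up as soon as the text contains consecutive whitespace, for example:

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String s = "hello  world"; // note the double space
        // Splitting on a single space leaves an empty token behind
        System.out.println(Arrays.toString(s.split(" ")));    // [hello, , world]
        // Splitting on \s+ collapses any run of whitespace
        System.out.println(Arrays.toString(s.split("\\s+"))); // [hello, world]
    }
}
```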
I will :-)
7) I am not sure why you need another map with bigrams when you can just access the key set in logProbs; that would give you the same list of sequences.

This is kind of an awkward workaround, because I needed a sorted list, and with the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?
Last edit: Florian 2015-03-07
This is kind of an awkward workaround, because I needed a sorted list, and with the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?

I would sort the bigrams when you dump the model: copy them to a list and sort them. There is no need to keep them in memory in a separate map during recognition; you only need the sorted order during the dump.
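Something like this at dump time (a sketch; the string keys are stand-ins for the real word sequences):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortedDump {
    public static void main(String[] args) {
        // Stand-in for the logProbs map used during recognition
        Map<String, Float> logProbs = new HashMap<>();
        logProbs.put("b a", -0.5f);
        logProbs.put("a b", -0.3f);

        // Copy the key set to a list and sort it only for the dump,
        // instead of maintaining a second sorted map the whole time.
        List<String> keys = new ArrayList<>(logProbs.keySet());
        Collections.sort(keys);
        for (String k : keys) {
            System.out.println(k + " " + logProbs.get(k));
        }
    }
}
```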
Sorry, but it does not make sense to me why you needed to modify anything in the loading code. The class was quite complete, except for the method that saves the model, which I didn't check. Overall, it's not clear why you need to save the model at all: n-gram models are usually static, and there are tools as well as web services to generate them from text files.
For ILA I need to create larger language models on-the-fly from all the data saved inside the program. This was not possible with the original class, because it cannot handle more than one independent sentence, but I need a real "corpus" ...
I've now integrated the pocketsphinx command-line tool into ILA (Java), but it can only use the language model if I save it first. That way I can run sphinx-4 and pocketsphinx in parallel (e.g. for keyphrase recognition).
Makes sense now? :-)