
Saving a dynamic language model to a file?

Florian
2015-03-03
2015-03-09
  • Florian

    Florian - 2015-03-03

    Hello,

    I've extended the DynamicLanguageModel class so you can give it a larger text corpus to create new language models on the fly. Now I'd like to save the model to a file so I can load it with pocketsphinx too.
    Any ideas how to do this?
    I know the probabilities for the n-grams (orders 1-3) are kept in the logProbs HashMap, and I can print a list of all the n-grams with their linear probabilities using this code:

    HashMap<WordSequence, Float> logProbs = dynLangModel.getLogProbs();
    Iterator<WordSequence> iterator = logProbs.keySet().iterator();
    WordSequence key;
    double value;
    while (iterator.hasNext()) {
        key = iterator.next();
        value = LogMath.getLogMath().logToLinear(logProbs.get(key));
        System.out.println(key.toString().replaceAll("\\[", "").replaceAll("\\]", " ")
                + String.format("%.4f", value));
    }
    

    But if I look at a simple trigram model (created with a web service), I see each word sequence listed with two numbers, one before and one after the sequence, and I don't fully understand what they mean. I also don't know what to do with the logBackoffs.
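    An example of the kind of line I mean, from the bigram section (the numbers here are made up for illustration, not copied from my model):

    ```
    \2-grams:
    -0.2553 a nice     -0.2231
    ```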

    Is there a language model expert around? :-)

     
    • Alexander Solovets

       

      Last edit: Nickolay V. Shmyrev 2015-03-04
      • Florian

        Florian - 2015-03-04

        Thank you! With the help of the links I managed to write the store() method :-D
        I made several changes to the DynamicTrigramModel class but tried to keep it compatible with the old one. I'm not sure whether I succeeded, though ^^ (see attachment).

         
        • Alexander Solovets

          It's hard to assess your changes because you did not send a diff, but from what I can see the probability calculation is broken. Next time, consider sending a pull request on GitHub.

           

          Last edit: Nickolay V. Shmyrev 2015-03-04
          • Florian

            Florian - 2015-03-04

            Hi Alexander,
            thanks for looking at the code!
            Can you maybe be a bit more specific? The additional code I added at the top is basically just another for loop that runs through the complete text and adds all the sentences one by one to the nGrams HashMap. It ends before the probability calculation, and I didn't change anything there. The rest is only about saving the model.
            I checked some of the probabilities manually and they seem reasonable. Also, when I use the model I get very good results with a vocabulary of around 500 words created from around 600 sentences. So to me it looks like it's working :-)

             
        • Nickolay V. Shmyrev

          Dear Florian

          Thank you for your contribution! I could commit it, but I hope you will enjoy fixing a few remaining issues.

          1) It is better to throw the exception rather than ignore it when saving the model

          2) Method names should be simple verbs (save) instead of (saveIt)

          3) It is better to save to a stream rather than just to a file, for compatibility

          4) It is better to use PrintWriter to write text files

          5) It is better to exit early instead of nesting one big block. Instead of

           if (!allocated) {
               allocate;
           }
          

          it's better to use something like

           if (allocated) {
               return;
           }
           allocate;
          

          6) I'm not sure why you replaced the split on whitespace (\s+) with a split on a single space

          7) I am not sure why you need another map with bigrams when you can just access the key set of logProbs; that would give you the same list of sequences.
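          Points 1-5 together could look roughly like this (just a sketch; the ModelSaver class and the map contents here are made up for illustration, not the real DynamicTrigramModel code):

          ```java
          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;
          import java.io.PrintWriter;
          import java.util.Map;
          import java.util.TreeMap;

          public class ModelSaver {
              private boolean allocated;
              // Hypothetical stand-in for the logProbs map; keys are space-joined n-grams.
              private final Map<String, Float> logProbs = new TreeMap<>();

              void allocate() {
                  // Point 5: exit early instead of wrapping everything in one big block.
                  if (allocated) {
                      return;
                  }
                  logProbs.put("hello", -0.30f);
                  logProbs.put("hello world", -0.48f);
                  allocated = true;
              }

              // Points 1-4: a simple verb name, a stream argument for compatibility,
              // PrintWriter for text output, and the IOException is propagated
              // instead of being swallowed.
              void save(OutputStream out) throws IOException {
                  PrintWriter writer = new PrintWriter(out);
                  for (Map.Entry<String, Float> entry : logProbs.entrySet()) {
                      writer.printf("%.4f\t%s%n", entry.getValue(), entry.getKey());
                  }
                  writer.flush();
                  if (writer.checkError()) {
                      throw new IOException("failed to write language model");
                  }
              }

              public static void main(String[] args) throws IOException {
                  ModelSaver saver = new ModelSaver();
                  saver.allocate();
                  ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                  saver.save(buffer);
                  System.out.print(buffer);
              }
          }
          ```

          A caller that wants a file can simply pass a FileOutputStream to save(), which is the point of taking a stream instead of a file name.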

           
          • Florian

            Florian - 2015-03-07

            I hope you will enjoy fixing a few remaining issues

            I will :-)

            7) I am not sure why you need another map with bigrams when you can just access the key set of logProbs; that would give you the same list of sequences.

            This is kind of an awkward workaround: I needed a sorted list, and when I used the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?

             

            Last edit: Florian 2015-03-07
            • Nickolay V. Shmyrev

              This is kind of an awkward workaround: I needed a sorted list, and when I used the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?

              I would sort the bigrams when you dump the model: copy them to a list and sort. There is no need to keep them in a separate map in memory during recognition; you only need a sorted order during the dump.
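              For example, roughly (the n-grams and values here are made up):

              ```java
              import java.util.ArrayList;
              import java.util.Collections;
              import java.util.HashMap;
              import java.util.List;
              import java.util.Map;

              public class SortedDump {
                  public static void main(String[] args) {
                      // Hypothetical stand-in for the logProbs map, keyed by n-gram text.
                      Map<String, Float> logProbs = new HashMap<>();
                      logProbs.put("world peace", -0.7f);
                      logProbs.put("hello world", -0.5f);
                      logProbs.put("hello", -0.3f);

                      // Copy the keys to a list and sort only at dump time;
                      // no extra sorted map is kept around during recognition.
                      List<String> ngrams = new ArrayList<>(logProbs.keySet());
                      Collections.sort(ngrams);

                      for (String ngram : ngrams) {
                          System.out.printf("%.4f\t%s%n", logProbs.get(ngram), ngram);
                      }
                  }
              }
              ```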

               
  • Alexander Solovets

    Sorry, but it does not make sense to me why you needed to modify anything in the loading code. The class was quite complete, except for the method that saves the model, which I didn't check. Overall it's not clear why you need to save the model at all: n-gram models are usually static, and there are tools as well as web services to generate them from text files.

     
    • Florian

      Florian - 2015-03-04

      For ILA I need to create larger language models on the fly from all the data saved inside the program. This was not possible with the original class because it cannot handle more than one independent sentence, but I need a real "corpus" ...
      I've now integrated the pocketsphinx command-line tool into ILA (Java), but it can only use the language model if I save it first, so I can run sphinx-4 and pocketsphinx in parallel (e.g. for keyphrase recognition).
      Does it make sense now? :-)

       
