Hello,
I've extended the DynamicLanguageModel class so you can give it a larger text corpus and create new language models on-the-fly. Now I'd like to save the model to a file so I can load it with pocketsphinx too.
Any ideas how to do this?
I know the probabilities for certain n-grams (1-3) are saved in the logProbs HashMap, and I can print a list of all the n-grams and their linear probabilities with this code:
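A sketch of it, with plain string keys standing in for the real WordSequence objects and base-10 logs assumed (sphinx-4's LogMath may use a different base internally):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramDump {
    // Convert each stored log probability back to a linear probability.
    // Assumes base-10 logs (the ARPA convention); sphinx-4's LogMath may
    // use a different log base internally.
    static Map<String, Double> linearProbs(Map<String, Float> logProbs) {
        Map<String, Double> linear = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : logProbs.entrySet()) {
            linear.put(e.getKey(), Math.pow(10, e.getValue()));
        }
        return linear;
    }

    public static void main(String[] args) {
        // Toy stand-in for the model's logProbs map
        Map<String, Float> logProbs = new LinkedHashMap<>();
        logProbs.put("hello", -1.0f);       // unigram
        logProbs.put("hello world", -0.5f); // bigram
        for (Map.Entry<String, Double> e : linearProbs(logProbs).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```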
But if I look at a simple trigram model (created with a web service), I see the word sequences given with two numbers (one before and one after the word sequence), and I don't fully understand what they mean. Also, I don't know what to do with the logBackoffs.
Any language model experts around? :-)
http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats
http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
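In the ARPA format described there, the number before the word sequence is its base-10 log probability, and the number after it is the base-10 log backoff weight, for example:

```
\2-grams:
-0.3010 i am    -0.1761
-0.6021 am fine
```

Here -0.3010 is log10 P(am | i), and -0.1761 is the log10 backoff weight applied when a trigram starting with "i am" is not in the model; the backoff weight is omitted for the highest order.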
Last edit: Nickolay V. Shmyrev 2015-03-04
Thanks! With the help of the links I managed to write the store() method :-D
I made several changes to the DynamicTrigramModel class but tried to keep it compatible with the old one. I'm not sure, though, whether I succeeded ^^ (see attachment).
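The store() method boils down to something like this (a simplified sketch, not the actual attachment; the name save and the flat string-keyed maps are illustrative):

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;

public class ArpaStore {
    // Simplified sketch of an ARPA dump: counts header, then one block per
    // n-gram order. Keys are space-joined words; values are log10 values.
    static void save(Writer out, Map<String, Float> logProbs, Map<String, Float> logBackoffs) {
        PrintWriter w = new PrintWriter(out);
        // Group the n-grams by order; TreeMap keeps both the orders and the
        // n-grams within each order sorted for the dump.
        TreeMap<Integer, TreeMap<String, Float>> byOrder = new TreeMap<>();
        for (Map.Entry<String, Float> e : logProbs.entrySet()) {
            int order = e.getKey().split("\\s+").length;
            byOrder.computeIfAbsent(order, k -> new TreeMap<>()).put(e.getKey(), e.getValue());
        }
        w.println("\\data\\");
        for (Map.Entry<Integer, TreeMap<String, Float>> o : byOrder.entrySet()) {
            w.println("ngram " + o.getKey() + "=" + o.getValue().size());
        }
        for (Map.Entry<Integer, TreeMap<String, Float>> o : byOrder.entrySet()) {
            w.println();
            w.println("\\" + o.getKey() + "-grams:");
            for (Map.Entry<String, Float> e : o.getValue().entrySet()) {
                Float bo = logBackoffs.get(e.getKey());
                // log probability, the words, then the optional backoff weight
                w.println(e.getValue() + " " + e.getKey() + (bo == null ? "" : " " + bo));
            }
        }
        w.println();
        w.println("\\end\\");
        w.flush();
    }
}
```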
It's hard to assess your changes because it is not a diff, but from what I can see, the probability calculation is broken. Next time, consider sending a pull request on GitHub.
Last edit: Nickolay V. Shmyrev 2015-03-04
Hi Alexander,
thanks for looking at the code!
Can you be a bit more specific, maybe? The additional code I added at the top is basically just another for-loop that runs through the complete text and adds all the sentences one by one to the nGrams HashMap. It ends before the probability calculation, and I didn't change anything there. The rest is only about saving the model.
I checked some of the probabilities manually and they seem reasonable. Also, when I use the model, I get very good results with a vocabulary of around 500 words created from around 600 sentences. So to me it looks like it's working :-)
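The extra loop is essentially this kind of counting pass (a sketch with illustrative names, not the actual fields of the class):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramCount {
    // Run through the whole corpus sentence by sentence and count every
    // n-gram up to maxOrder in a single map.
    static Map<String, Integer> countNGrams(List<String> sentences, int maxOrder) {
        Map<String, Integer> nGrams = new HashMap<>();
        for (String sentence : sentences) {
            String[] words = sentence.trim().split("\\s+");
            for (int n = 1; n <= maxOrder; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    // Space-join the window of n words to form the key
                    String key = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    nGrams.merge(key, 1, Integer::sum);
                }
            }
        }
        return nGrams;
    }
}
```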
Dear Florian
Thank you for your contribution! I could commit it, but I hope you'll enjoy fixing a few remaining issues first.
1) It is better to throw the exception rather than ignore it when saving the model
2) Method names could be simple verbs (save) instead of (saveIt)
3) It is better to save to a stream for compatibility, not just to a file
4) It is better to use PrintWriter to store text files
5) It is better to exit early instead of increasing indentation with one big nested block. Instead of

    if (!allocated) {
        allocate();
    }

it's better to use something like

    if (allocated) {
        return;
    }
    allocate();
6) I'm not sure why you replaced the split on whitespace symbols \s+ with a split on a plain space
7) I am not sure why you need another map with bigrams when you can just access the key set in logProbs; that would give you the same list of sequences.
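Regarding 6), the difference shows up as soon as the text contains consecutive whitespace, for example:

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String s = "hello  world"; // note the double space
        // Splitting on a single space leaves an empty token behind
        System.out.println(Arrays.toString(s.split(" ")));    // [hello, , world]
        // Splitting on \s+ collapses any run of whitespace
        System.out.println(Arrays.toString(s.split("\\s+"))); // [hello, world]
    }
}
```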
I will :-)
7) I am not sure why you need another map with bigrams when you can just access the key set in logProbs; that would give you the same list of sequences.

This is kind of an awkward workaround, because I needed a sorted list, and with the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?
Last edit: Florian 2015-03-07
This is kind of an awkward workaround, because I needed a sorted list, and with the previous loop (as you can see in my comments) it was not sorted. Maybe I'm missing something here; how would you do it?

I would sort the bigrams when you dump the model: copy them to a list and sort them. There is no need to keep them in memory in a separate map during recognition; you only need the sorted order during the dump.
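Something like this at dump time (a sketch; the string keys are stand-ins for the real word sequences):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortedDump {
    public static void main(String[] args) {
        // Stand-in for the logProbs map used during recognition
        Map<String, Float> logProbs = new HashMap<>();
        logProbs.put("b a", -0.5f);
        logProbs.put("a b", -0.3f);

        // Copy the key set to a list and sort it only for the dump,
        // instead of maintaining a second sorted map the whole time.
        List<String> keys = new ArrayList<>(logProbs.keySet());
        Collections.sort(keys);
        for (String k : keys) {
            System.out.println(k + " " + logProbs.get(k));
        }
    }
}
```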
Sorry, but it does not make sense to me why you needed to modify anything in the loading code. The class was quite complete, except for the method that saves the model, which I didn't check. Overall, it's not clear why you need to save the model at all: n-gram models are usually static, and there are tools as well as web services to generate them from text files.
For ILA I need to create larger language models on-the-fly from all the data saved inside the program. This was not possible with the original class, because it cannot handle more than one independent sentence, but I need a real "corpus" ...
I've now integrated the pocketsphinx command-line tool into ILA (Java), but it can only use the language model if I save it first. That way I can run sphinx-4 and pocketsphinx in parallel (e.g. for keyphrase recognition).
Makes sense now? :-)