Hi,
Can anybody point me to the cause of the following error in lm_combine:
"Error - Repeated 2-gram in ARPA format language model."
It occurs when I try to combine two trigram LMs.
There is no repeated bigram entry in either LM.
Thanks for any information.
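One way to double-check the "no repeated bigram" claim is to compare the word sequences of one n-gram section mechanically. The following is a minimal sketch, not the check that lm_combine itself performs: `extract_ngram`, `count_repeated_ngrams` and `KEY_MAX` are hypothetical names, and the caller is assumed to pass only the lines of a single \N-grams: section.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_MAX 256   /* illustrative limit on line and key length */

/* Copy the `order` words that follow the leading log10 probability on an
 * ARPA n-gram line into `out`, separated by single spaces.
 * Returns 1 on success, 0 if the line has too few fields or does not fit. */
int extract_ngram(const char *line, int order, char *out, size_t outsz)
{
    char buf[KEY_MAX];
    char *tok;
    int taken = 0;

    assert(order > 0);
    if (strlen(line) >= sizeof buf || outsz == 0)
        return 0;
    strcpy(buf, line);
    out[0] = '\0';

    tok = strtok(buf, " \t\r\n");              /* skip the probability field */
    if (tok == NULL)
        return 0;
    while (taken < order && (tok = strtok(NULL, " \t\r\n")) != NULL) {
        if (strlen(out) + strlen(tok) + 2 > outsz)
            return 0;
        if (taken > 0)
            strcat(out, " ");
        strcat(out, tok);
        taken++;
    }
    return taken == order;
}

static int cmp_keys(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Count lines whose word sequence repeats that of another line.
 * Returns -1 if a line cannot be parsed or memory runs out. */
int count_repeated_ngrams(const char *lines[], size_t n, int order)
{
    char (*keys)[KEY_MAX];
    size_t i;
    int repeats = 0;

    if (n == 0)
        return 0;
    keys = malloc(n * sizeof *keys);
    if (keys == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        if (!extract_ngram(lines[i], order, keys[i], KEY_MAX)) {
            free(keys);
            return -1;
        }
    }
    qsort(keys, n, sizeof *keys, cmp_keys);     /* duplicates become adjacent */
    for (i = 1; i < n; i++)
        if (strcmp(keys[i - 1], keys[i]) == 0)
            repeats++;
    free(keys);
    return repeats;
}
```

Only the words are compared, so two lines with the same bigram but different probabilities or back-off weights still count as a repeat.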
Apparently, this error message occurs when the LMs are built with the cmuclmtk idngram2lm option -vocab_type 0.
Rebuilding the two LMs with -vocab_type 1 or 2 avoids the error, but lm_combine then segfaults a bit later.
(Partial) output is:
combine lms
Reading in a 3-gram language model.
Number of 1-grams = 44.
Number of 2-grams = 42.
Number of 3-grams = 42.
Reading unigrams...
Reading 2-grams...
Reading 3-grams...
loading context cues.
recaculate oov probabilities.
check probabilities
Processing 1-gram
Segmentation fault.
Has anyone encountered this kind of error? (I am probably not passing correct LMs to lm_combine.)
Thanks.
Hm, can you provide the sources to reproduce this crash? Since you have only 42 bigrams they shouldn't be big.
Hi Nshmyrev,
What happened after this?
I am also getting a segmentation fault in my LM compilation. Please help!
Sure. Thanks for your help.
Here are the LMs. This is a quick test, so they are pretty crude (e.g. figures appear in numeric form). Also, they are based on French-language texts; I'll try with other texts.
Some other info:
- Initially, I had not included <s> and </s> in the tiny corpus, but adding them resulted in the same segfault (they probably get added by cmuclmtk?).
- Recompiling lm_combine with VERY_VERBOSE does not display anything obvious to me (but I can provide the output).
- I notice that when check_prob calls ids2words for uni-grams, voc[id[0]]=(nil) (null pointer). I can't see how that happens yet.
--lm1:
\data\
ngram 1=24
ngram 2=23
ngram 3=23
\1-grams:
-1.1244 <UNK> 0.0000
-1.4254 5.75 -0.4605
-1.7264 </s> 0.0000
-1.4254 <s> -0.4605
-1.4254 Fed -0.4605
-1.4254 Le -0.4605
-1.4254 a -0.4605
-1.4254 abaissé -0.4605
-1.4254 continue -0.4605
-1.4254 d'entraîner -0.4605
-1.4254 d'escompte -0.4376
-1.4254 effectué -0.4605
-1.4254 financiers -0.4376
-1.4254 geste -0.4605
-1.7264 hausse. 0.0000
-1.0607 la -0.6738
-1.4254 les -0.4605
-1.4254 marchés -0.4605
-1.4254 par -0.4376
-1.4254 qui -0.4605
-1.4254 son -0.4605
-1.4254 taux -0.4605
-1.4254 vendredi -0.4605
-1.0607 à -0.6412
\2-grams:
-0.1761 5.75 continue 0.1761
-0.1761 <s> Le 0.1761
-0.1761 Fed qui 0.1761
-0.1761 Le geste 0.1761
-0.1761 a abaissé 0.1761
-0.1761 abaissé vendredi 0.1761
-0.1761 continue d'entraîner 0.1761
-0.1761 d'entraîner les 0.1761
-0.1761 d'escompte à -0.0792
-0.1761 effectué par 0.1761
-0.1761 financiers à -0.0792
-0.1761 geste effectué 0.1761
-0.3979 la Fed 0.1761
-0.3979 la hausse. -0.2928
-0.1761 les marchés 0.1761
-0.1761 marchés financiers 0.1761
-0.1761 par la -0.0792
-0.1761 qui a 0.1761
-0.1761 son taux 0.1761
-0.1761 taux d'escompte 0.1761
-0.1761 vendredi son 0.1761
-0.3979 à 5.75 0.1761
-0.3979 à la -0.0792
\3-grams:
-0.3010 5.75 continue d'entraîner
-0.3010 <s> Le geste
-0.3010 Fed qui a
-0.3010 Le geste effectué
-0.3010 a abaissé vendredi
-0.3010 abaissé vendredi son
-0.3010 continue d'entraîner les
-0.3010 d'entraîner les marchés
-0.3010 d'escompte à 5.75
-0.3010 effectué par la
-0.3010 financiers à la
-0.3010 geste effectué par
-0.3010 la Fed qui
-0.3010 la hausse. </s>
-0.3010 les marchés financiers
-0.3010 marchés financiers à
-0.3010 par la Fed
-0.3010 qui a abaissé
-0.3010 son taux d'escompte
-0.3010 taux d'escompte à
-0.3010 vendredi son taux
-0.3010 à 5.75 continue
-0.3010 à la hausse.
\end\
--lm2:
\data\
ngram 1=25
ngram 2=23
ngram 3=23
\1-grams:
-1.1187 <UNK> 0.0000
-1.4197 <s>Révélatrice -0.4603
-1.4197 Evrard -0.4603
-1.4197 Le -0.4603
-1.4197 au -0.4603
-1.4197 aujourd'hui. -0.4603
-1.4197 centre -0.4603
-1.4197 contre -0.4376
-1.7207 d'exemples 0.0000
-1.4197 d'une -0.4603
-1.4197 dans -0.4376
-1.4197 des -0.4603
-1.4197 dysfonctionnements -0.4603
-1.4197 gouvernement -0.4603
-1.4197 interministérielle -0.4603
-1.4197 l'affaire -0.4603
-1.0607 la -0.6646
-1.4197 lutte -0.4603
-1.4197 pourrait -0.4603
-1.4197 pédophilie, -0.4603
-1.4197 réunion -0.4603
-1.4197 s'inspirer -0.4688
-1.4197 va -0.4603
-1.7207 étrangers.</s> 0.0000
-1.4197 être -0.4603
\2-grams:
-0.1761 <s>Révélatrice des 0.1761
-0.1761 Evrard va 0.1761
-0.1761 Le gouvernement 0.1761
-0.1761 au centre 0.1761
-0.1761 aujourd'hui. Le 0.1761
-0.1761 centre d'une 0.1761
-0.1761 contre la -0.0792
-0.1761 d'une réunion 0.1761
-0.1761 dans la -0.0792
-0.1761 des dysfonctionnements 0.1761
-0.1761 dysfonctionnements dans 0.1761
-0.1761 gouvernement pourrait 0.1761
-0.1761 interministérielle aujourd'hui. 0.1761
-0.1761 l'affaire Evrard 0.1761
-0.3979 la lutte 0.1761
-0.3979 la pédophilie, 0.1761
-0.1761 lutte contre 0.1761
-0.1761 pourrait s'inspirer 0.1761
-0.1761 pédophilie, l'affaire 0.1761
-0.1761 réunion interministérielle 0.1761
-0.1761 s'inspirer d'exemples -0.2927
-0.1761 va être 0.1761
-0.1761 être au 0.1761
\3-grams:
-0.3010 <s>Révélatrice des dysfonctionnements
-0.3010 Evrard va être
-0.3010 Le gouvernement pourrait
-0.3010 au centre d'une
-0.3010 aujourd'hui. Le gouvernement
-0.3010 centre d'une réunion
-0.3010 contre la pédophilie,
-0.3010 d'une réunion interministérielle
-0.3010 dans la lutte
-0.3010 des dysfonctionnements dans
-0.3010 dysfonctionnements dans la
-0.3010 gouvernement pourrait s'inspirer
-0.3010 interministérielle aujourd'hui. Le
-0.3010 l'affaire Evrard va
-0.3010 la lutte contre
-0.3010 la pédophilie, l'affaire
-0.3010 lutte contre la
-0.3010 pourrait s'inspirer d'exemples
-0.3010 pédophilie, l'affaire Evrard
-0.3010 réunion interministérielle aujourd'hui.
-0.3010 s'inspirer d'exemples étrangers.</s>
-0.3010 va être au
-0.3010 être au centre
\end\
Looking for a reason why vocab[0] is null (previous post), I found something strange in file lm_combine.c, function combine_lm():
[...]
printf("Reading unigrams...\n");
i = 1;
begin_browse_union(lm1,lm2,i,&bru);
while (get_next_ngram_union(words,&bru)) {
  word_copy = salloc(words[0]);
  /* Do checks about open or closed vocab */
  check_open_close_vocab(arpa_lm,word_copy,&i);
}
[...]
This may be a problem, since check_open_close_vocab expects i to start at 0 for an open vocabulary.
I tried passing 1 to begin_browse_union() and 0 to check_open_close_vocab(); the program then runs without crashing and appears to work correctly (I still need to verify the results of the combination).
Could this be the cause of the problem, or related to it?
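While debugging a null entry like the voc[id[0]] one reported earlier, a small scan of the vocabulary table can show exactly which slot was never filled. This is only a sketch: `first_null_slot`, `vocab`, `first` and `last` are illustrative names and not symbols from lm_combine.c, and the table layout (one `char *` per word id) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first NULL entry in vocab[first..last],
 * or -1 if every slot in that range holds a word.
 * All names here are hypothetical; this is not toolkit code. */
long first_null_slot(char *const vocab[], size_t first, size_t last)
{
    size_t id;

    assert(first <= last);
    for (id = first; id <= last; id++)
        if (vocab[id] == NULL)
            return (long)id;
    return -1;
}
```

If the table is expected to be filled from index 0 (as the post suggests for an open vocabulary) but the fill loop starts at 1, a scan over 0..n reports slot 0, while a scan over 1..n reports nothing.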
After the modification above, I do not get the expected results. For example, when I combine an LM with itself, with a weight of 0.5 for each copy, I do not get the same LM back, which I suppose would be the expected behavior. (Both LMs were built with vocab_type 2, and are different from the ones posted above.)
So I changed lm_combine back to what it was, but then I get the segmentation fault I already mentioned. I tried this with vocab types 2 and 0.
So I suppose lm_combine does not work as I expect it to (or maybe not at all).
Any help or advice is welcome! In the meantime, as I need this kind of functionality, I will look for alternative programs.
Sorry, I've just checked out the latest trunk of cmuclmtk and ran it on your LMs; everything seems to work fine. Here is the result I got:
Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
Ronald Rosenfeld and Philip Clarkson
Version 3, Copyright (c) 2006, Carnegie Mellon University
Contributors includes Wen Xu, Ananlada Chotimongkol,
David Huggins-Daines, Arthur Chan and Alan Black
=============================================================================
=============== This file was produced by the CMU-Cambridge ===============
=============== Statistical Language Modeling Toolkit ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 45 words,
which begins "5.75", "</s>", "<s>"...
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
else p(wd3|w2)
p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
else bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
end of data mark: \end\
| the header of the first LM
| the header of the second LM
\data\
ngram 1=46
ngram 2=46
ngram 3=46
\1-grams:
-2.4734 <UNK> 0.0000
-1.6882 5.75 -0.1682
-1.9540 </s> 0.0000
-1.6882 <s> -0.1719
-1.6850 <s>Révélatrice -0.1681
-1.6850 Evrard -0.1681
-1.6882 Fed -0.1682
-1.4225 Le -0.4604
-1.6882 a -0.1682
-1.6882 abaissé -0.1682
-1.6850 au -0.1681
-1.6850 aujourd'hui. -0.1718
-1.6850 centre -0.1681
-1.6882 continue -0.1682
-1.6850 contre -0.1659
-1.6882 d'entraîner -0.1682
-1.6882 d'escompte -0.1571
-1.9529 d'exemples 0.0000
-1.6850 d'une -0.1681
-1.6850 dans -0.1659
-1.6850 des -0.1681
-1.6850 dysfonctionnements -0.1681
-1.6882 effectué -0.1682
-1.6882 financiers -0.1571
-1.6882 geste -0.1682
-1.6850 gouvernement -0.1681
-1.9540 hausse. 0.0000
-1.6850 interministérielle -0.1681
-1.6850 l'affaire -0.1681
-1.0607 la -0.6694
-1.6882 les -0.1682
-1.6850 lutte -0.1681
-1.6882 marchés -0.1682
-1.6882 par -0.1659
-1.6850 pourrait -0.1681
-1.6850 pédophilie, -0.1681
-1.6882 qui -0.1682
-1.6850 réunion -0.1681
-1.6850 s'inspirer -0.1723
-1.6882 son -0.1682
-1.6882 taux -0.1682
-1.6850 va -0.1681
-1.6882 vendredi -0.1682
-1.3448 à 0.0000
-1.9529 étrangers.</s> 0.0000
-1.6850 être 0.0000
\2-grams:
-0.4749 5.75 continue 0.0513
-0.4530 <s> Le 0.0512
-0.4750 <s>Révélatrice des 0.0513
-0.4750 Evrard va 0.0513
-0.4749 Fed qui 0.0513
-0.4764 Le geste 0.0513
-0.4764 Le gouvernement 0.0513
-0.4749 a abaissé 0.0513
-0.4749 abaissé vendredi 0.0513
-0.4750 au centre 0.0513
-0.4533 aujourd'hui. Le 0.0512
-0.4750 centre d'une 0.0513
-0.4749 continue d'entraîner 0.0513
-0.4239 contre la -0.0280
-0.4749 d'entraîner les 0.0513
-0.4749 d'escompte à -0.1170
-0.4750 d'une réunion 0.0513
-0.4239 dans la -0.0280
-0.4750 des dysfonctionnements 0.0513
-0.4750 dysfonctionnements dans 0.0545
-0.4749 effectué par 0.0545
-0.4749 financiers à -0.1114
-0.4749 geste effectué 0.0513
-0.4750 gouvernement pourrait 0.0513
-0.4750 interministérielle aujourd'hui. 0.0525
-0.4750 l'affaire Evrard 0.0513
-0.6981 la Fed 0.0513
-0.6981 la hausse. -0.1211
-0.6982 la lutte 0.0513
-0.6982 la pédophilie, 0.0513
-0.4749 les marchés 0.0513
-0.4750 lutte contre 0.0545
-0.4749 marchés financiers 0.0513
-0.4239 par la -0.0280
-0.4750 pourrait s'inspirer 0.0513
-0.4750 pédophilie, l'affaire 0.0513
-0.4749 qui a 0.0513
-0.4750 réunion interministérielle 0.0513
-0.4750 s'inspirer d'exemples -0.1210
-0.4749 son taux 0.0513
-0.4749 taux d'escompte 0.0513
-0.4750 va être -0.1168
-0.4749 vendredi son 0.0513
-1.6882 à 5.75 0.0000
-1.0607 à la 0.0000
-1.6850 être au 0.0000
\3-grams:
-0.5990 5.75 continue d'entraîner
-0.6010 <s> Le geste
-0.5992 <s>Révélatrice des dysfonctionnements
-0.5992 Evrard va être
-0.5990 Fed qui a
-0.5990 Le geste effectué
-0.5992 Le gouvernement pourrait
-0.5990 a abaissé vendredi
-0.5990 abaissé vendredi son
-0.5992 au centre d'une
-0.6010 aujourd'hui. Le gouvernement
-0.5992 centre d'une réunion
-0.5990 continue d'entraîner les
-0.6014 contre la pédophilie,
-0.5990 d'entraîner les marchés
-0.5990 d'escompte à 5.75
-0.5992 d'une réunion interministérielle
-0.6014 dans la lutte
-0.5992 des dysfonctionnements dans
-0.5324 dysfonctionnements dans la
-0.5324 effectué par la
-0.5324 financiers à la
-0.5990 geste effectué par
-0.5992 gouvernement pourrait s'inspirer
-0.5706 interministérielle aujourd'hui. Le
-0.5992 l'affaire Evrard va
-0.5990 la Fed qui
-0.5990 la hausse. </s>
-0.5992 la lutte contre
-0.5992 la pédophilie, l'affaire
-0.5990 les marchés financiers
-0.5324 lutte contre la
-0.5990 marchés financiers à
-0.6014 par la Fed
-0.5992 pourrait s'inspirer d'exemples
-0.5992 pédophilie, l'affaire Evrard
-0.5990 qui a abaissé
-0.5992 réunion interministérielle aujourd'hui.
-0.5992 s'inspirer d'exemples étrangers.</s>
-0.5990 son taux d'escompte
-0.5990 taux d'escompte à
-0.5992 va être au
-0.5990 vendredi son taux
-0.4749 à 5.75 continue
-0.6981 à la hausse.
-0.4750 être au centre
\end\
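The back-off rule for p(wd2|wd1) quoted in the header of the output above can be applied by hand to entries of this combined model. Since all values are in log10 form, the product bo_wt_1(wd1)*p_1(wd2) becomes a sum. A minimal sketch, with `bigram_log10` as a hypothetical helper name:

```c
#include <assert.h>

/* log10 p(wd2|wd1) under the back-off rule from the ARPA header:
 * use the bigram entry when it exists; otherwise add the back-off weight
 * of wd1 to the unigram log10 probability of wd2.
 * p2 is ignored when bigram_exists is 0. */
double bigram_log10(int bigram_exists, double p2, double bo_wt1, double p1)
{
    /* log10 probabilities are never positive; back-off weights may be. */
    assert(p1 <= 0.0 && (!bigram_exists || p2 <= 0.0));
    return bigram_exists ? p2 : bo_wt1 + p1;
}
```

For example, "la Fed" is listed as a bigram, so p(Fed|la) is read directly as -0.6981. "la geste" is not listed, so p(geste|la) backs off to bo_wt_1(la) + p_1(geste) = -0.6694 + -1.6882 = -2.3576.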
You are right, it does work fine.
Looking back at this error message: the initial error message ("repeated 2-gram in ARPA") was correct, as the LM data was incorrect (not the data posted above). The vocab types also needed adjusting. After that, I'm not sure what happened. The code in lm_combine.c that I questioned above is correct in the svn version, so I probably introduced the error myself while trying to figure things out...
I still get some errors when combining another LM, but this is probably an error in that LM.
Sorry for the hassle, and thanks for the help!