I am getting several errors from lm_combine. As a test, I am trying to combine a language model with itself. The first error said there were repeated n-grams, and the program exited. I commented out that check in the source so the program would run, and now I get a new error:
"Error : Cannot generate probability for <UNK> since this is a closed vocabulary model."
I need help with the usage of lm_combine, as I don't understand most of the code. I have spent a lot of time trying to figure this out myself, but I am stuck now.
The command I am using is
lm_combine.exe -lm1 5431.lm -lm2 5431.lm -weight prob.wt -lm out.lm
where the prob.wt file contains
5431.lm 0.2
5431.lm 0.8
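For context on what this self-combination test should produce: if the tool linearly interpolates the two models' probabilities (an assumption about lm_combine's behaviour, not something confirmed here), then mixing a model with itself at weights 0.2 and 0.8 should give back the same probabilities. A minimal Python sketch of that arithmetic, using the log10 unigram value for TALK from the file below; the function name is hypothetical and backoff-weight renormalization is not shown:

```python
import math

def interpolate_log10(logp1, logp2, w1, w2):
    """Linearly interpolate two log10 probabilities (weights assumed to sum to 1)."""
    p = w1 * 10 ** logp1 + w2 * 10 ** logp2
    return math.log10(p)

# Unigram log10 probability of TALK in 5431.lm, mixed with itself.
mixed = interpolate_log10(-1.0370, -1.0370, 0.2, 0.8)
print(round(mixed, 4))
```

Under that assumption, any difference between out.lm and 5431.lm beyond rounding would point to a problem in the combination step rather than in the input model.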
The LM file was generated using the online lm tool and is appended below:
\data\
ngram 1=33
ngram 2=58
ngram 3=57
\1-grams:
-1.0370 </s> -0.3010
-1.0370 <s> -0.2315
-1.8151 ABOUT -0.2852
-2.2923 AND -0.2592
-2.2923 AT -0.2921
-2.2923 BALK -0.2988
-2.2923 BASEBALL -0.2592
-2.2923 BLOCK -0.2592
-2.2923 BOCK -0.2592
-2.2923 CAN'T -0.2943
-2.2923 CAULK -0.2943
-1.6902 CHALK -0.2129
-2.2923 DON'T -0.2988
-2.2923 HAWK -0.2592
-2.2923 IS -0.2943
-2.2923 LET'S -0.2592
-2.2923 MOCK -0.2988
-1.9912 PEOPLE -0.2567
-2.2923 PHONE -0.2592
-2.2923 SOME -0.2966
-2.2923 TAKE -0.2592
-1.0370 TALK -0.1641
-1.9912 THE -0.2943
-2.2923 TIC -0.2592
-1.8151 TO -0.2943
-1.9912 TOMORROW -0.2518
-2.2923 TURKEY -0.2988
-2.2923 UHH -0.2592
-2.2923 US -0.2943
-1.9912 VOICE -0.2592
-1.8151 WALK -0.2567
-2.2923 WITH -0.2966
-1.9912 YOUR -0.2966
\2-grams:
-1.5563 <s> CAULK 0.0000
-1.5563 <s> CHALK -0.2430
-1.5563 <s> DON'T 0.0000
-1.5563 <s> LET'S 0.0000
-1.5563 <s> MOCK 0.0000
-1.5563 <s> SOME 0.0000
-1.5563 <s> TAKE 0.0000
-0.5563 <s> TALK -0.1249
-1.5563 <s> TIC 0.0000
-0.7782 ABOUT BASEBALL 0.0000
-0.7782 ABOUT CHALK -0.0969
-0.7782 ABOUT YOUR 0.0000
-0.3010 AND TALK -0.2499
-0.3010 AT CHALK -0.0969
-0.3010 BALK AT 0.0000
-0.3010 BASEBALL </s> -0.3010
-0.3010 BLOCK </s> -0.3010
-0.3010 BOCK </s> -0.3010
-0.3010 CAN'T WALK -0.2218
-0.3010 CAULK WALK -0.1249
-0.4260 CHALK </s> -0.3010
-0.9031 CHALK TALK -0.2888
-0.3010 DON'T BALK 0.0000
-0.3010 HAWK </s> -0.3010
-0.3010 IS ABOUT -0.2218
-0.3010 LET'S TALK -0.2888
-0.3010 MOCK BOCK 0.0000
-0.6021 PEOPLE </s> -0.3010
-0.6021 PEOPLE CAN'T 0.0000
-0.3010 PHONE </s> -0.3010
-0.3010 SOME PEOPLE -0.1761
-0.3010 TAKE TALK -0.2632
-0.9542 TALK </s> -0.3010
-1.5563 TALK BLOCK 0.0000
-1.5563 TALK CHALK -0.0969
-1.5563 TALK HAWK 0.0000
-1.5563 TALK IS 0.0000
-1.2553 TALK TALK -0.2499
-1.0792 TALK TO 0.0000
-1.2553 TALK TOMORROW 0.0000
-1.5563 TALK TURKEY 0.0000
-1.5563 TALK UHH 0.0000
-1.5563 TALK WALK -0.1249
-0.6021 THE PEOPLE -0.1761
-0.6021 THE PHONE 0.0000
-0.3010 TIC TALK -0.2499
-0.4771 TO THE 0.0000
-0.7782 TO US 0.0000
-0.6021 TOMORROW </s> -0.3010
-0.6021 TOMORROW ABOUT -0.2218
-0.3010 TURKEY WITH 0.0000
-0.3010 UHH TALK -0.2762
-0.3010 US ABOUT -0.2218
-0.3010 VOICE </s> -0.3010
-0.4771 WALK </s> -0.3010
-0.7782 WALK AND 0.0000
-0.3010 WITH YOUR 0.0000
-0.3010 YOUR VOICE 0.0000
\3-grams:
-0.3010 <s> CAULK WALK
-0.3010 <s> CHALK TALK
-0.3010 <s> DON'T BALK
-0.3010 <s> LET'S TALK
-0.3010 <s> MOCK BOCK
-0.3010 <s> SOME PEOPLE
-0.3010 <s> TAKE TALK
-1.3010 <s> TALK BLOCK
-1.3010 <s> TALK CHALK
-1.3010 <s> TALK HAWK
-1.0000 <s> TALK TALK
-1.0000 <s> TALK TO
-1.3010 <s> TALK TOMORROW
-1.3010 <s> TALK TURKEY
-1.3010 <s> TALK WALK
-0.3010 <s> TIC TALK
-0.3010 ABOUT BASEBALL </s>
-0.3010 ABOUT CHALK </s>
-0.3010 ABOUT YOUR VOICE
-0.3010 AND TALK </s>
-0.3010 AT CHALK </s>
-0.3010 BALK AT CHALK
-0.3010 CAN'T WALK AND
-0.3010 CAULK WALK </s>
-0.3010 CHALK TALK IS
-0.3010 DON'T BALK AT
-0.3010 IS ABOUT BASEBALL
-0.3010 LET'S TALK UHH
-0.3010 MOCK BOCK </s>
-0.3010 PEOPLE CAN'T WALK
-0.3010 SOME PEOPLE CAN'T
-0.3010 TAKE TALK TO
-0.3010 TALK BLOCK </s>
-0.3010 TALK CHALK </s>
-0.3010 TALK HAWK </s>
-0.3010 TALK IS ABOUT
-0.3010 TALK TALK </s>
-0.4771 TALK TO THE
-0.7782 TALK TO US
-0.6021 TALK TOMORROW </s>
-0.6021 TALK TOMORROW ABOUT
-0.3010 TALK TURKEY WITH
-0.3010 TALK UHH TALK
-0.3010 TALK WALK </s>
-0.3010 THE PEOPLE </s>
-0.3010 THE PHONE </s>
-0.3010 TIC TALK </s>
-0.6021 TO THE PEOPLE
-0.6021 TO THE PHONE
-0.3010 TO US ABOUT
-0.3010 TOMORROW ABOUT YOUR
-0.3010 TURKEY WITH YOUR
-0.3010 UHH TALK TOMORROW
-0.3010 US ABOUT CHALK
-0.3010 WALK AND TALK
-0.3010 WITH YOUR VOICE
-0.3010 YOUR VOICE </s>
\end\
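Since the first error was about repeated n-grams and the second about <UNK>, one way to rule out the input file itself is to check it independently of lm_combine. The following is a rough Python sketch, not part of the toolkit: `check_arpa` is a hypothetical helper that assumes the standard ARPA layout shown above (log10 probability, then the words, then an optional backoff weight) and reports the declared counts, the counts actually found, any duplicated n-grams, and whether <UNK> appears.

```python
import re
from collections import Counter

def check_arpa(text):
    """Sanity-check ARPA-format LM text; returns a small report dict."""
    declared = {}        # n-gram order -> count declared in the \data\ header
    found = Counter()    # n-gram order -> number of entries actually listed
    seen = set()         # every n-gram (as a tuple of words) seen so far
    duplicates = []      # n-grams that were listed more than once
    has_unk = False
    order = 0            # 0 means "not inside an n-gram section"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)$", line)
        if m:
            declared[int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"\\(\d+)-grams:$", line)
        if m:
            order = int(m.group(1))
            continue
        if line.startswith("\\"):   # \data\ or \end\
            order = 0
            continue
        if order == 0:
            continue
        fields = line.split()
        words = tuple(fields[1:1 + order])  # skip the leading log10 probability
        found[order] += 1
        if words in seen:
            duplicates.append(words)
        seen.add(words)
        if any(w.upper() == "<UNK>" for w in words):
            has_unk = True
    return {
        "declared": declared,
        "found": dict(found),
        "duplicates": duplicates,
        "has_unk": has_unk,
    }
```

It could be run as `print(check_arpa(open("5431.lm").read()))`. If the file is clean, `duplicates` is empty and `found` matches `declared`, which would suggest the repeated n-grams come from loading the same model twice rather than from the file.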