Hello,
I am trying to build French and Spanish language models from a Wikipedia dump corpus (formatted as described in the tutorial, with the <s> and </s> tags, and edited accordingly).
After successfully creating the vocabulary files for each language, I ran into an error when running the text2idngram command. I have tried both small and large vocabularies (65,000 and 2,000,000 words) and I get the same error for both Spanish and French.
Assuming FrenchLM.txt is my corpus, below is the error log I get when running:
text2idngram -vocab french.vocab -idngram FrenchLM.idngram < FrenchLM.txt
FrenchLM.txt is a large file (around 2.4 GB).
This is the tail end of the output (the earlier part runs fine).
The error is on the last line of the log.
[...]
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/29
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/30
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/31
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/32
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/33
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/34
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/35
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/36
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/37
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/38
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/39
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/40
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/41
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/42
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/43
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/44
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/45
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/46
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/47
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/48
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/49
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/50
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/51
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
..................................................
..................................................
..................................................
.....................
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-a07004/52
Merging 52 temporary files...
Error reading temp file cmuclmtk-a07004/1
D:\FYP\French>
And afterwards the command stops executing.
What would you recommend?
Thank you.
Use SRILM.
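As a rough sketch of what that could look like: SRILM's ngram-count reads the text corpus and writes an ARPA language model in a single step, so there is no separate text2idngram merge stage. The output name FrenchLM.lm and the trigram order are assumptions for illustration, not something taken from this thread, and the script only runs the tool if it is actually installed and the corpus is present.

```shell
#!/bin/sh
# Corpus name from the question; the output name and order are assumptions.
corpus="FrenchLM.txt"
lm_out="FrenchLM.lm"
order=3

# Build the command line first so it can be inspected before running.
cmd="ngram-count -order $order -text $corpus -lm $lm_out"

if command -v ngram-count >/dev/null 2>&1 && [ -f "$corpus" ]; then
    # Counts the n-grams and estimates the ARPA model in one pass.
    $cmd
else
    # Dry run: show what would be executed instead of failing.
    echo "Not running (ngram-count or $corpus not found). Command would be:"
    echo "$cmd"
fi
```

To restrict the model to the vocabulary file you already built, ngram-count also accepts a -vocab option; check the ngram-count manual page for the exact behaviour with out-of-vocabulary words before relying on it.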
This really helped. Thank you!!