Hi all!
I just finished my Spanish language model, following the "Building a language model" tutorial.
First of all, I created a reference text with no punctuation or other strange characters and one sentence per line. Here is my first question: are the <s> and </s> delimiters really necessary? Is there any problem if I don't add them? When I add them, the vocabulary file generated with CMUCLMTK contains <s> and </s> as words; I thought the tool would ignore them.
A sample of my reference text:
Se conoce como micronutrientes a aquellas sustancias que el organismo de los seres vivos necesita en pequeñas dosis
Son indispensables para los diferentes procesos bioquímicos y metabólicos de los organismos vivos y sin ellos morirán
En los animales engloba las vitaminas y minerales y estos últimos se dividen en minerales y oligoelementos
Se ha podido estudiar bien en ellas cuáles necesitan gracias a cultivos sin suelo que pudiesen alterar los resultados
Then, with the CMUCLMTK tools text2wfreq and wfreq2vocab, I obtain my vocabulary file. I check it and remove any strange characters that are present.
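The text preparation and vocabulary check described above could be scripted roughly as follows. This is only a sketch in Python: the function names `prepare_sentence` and `suspicious_vocab_entries` are hypothetical, lowercasing is an assumption (the sample text above keeps its capitals), and the check assumes vocabulary header lines start with `##`.

```python
import re

SENTENCE_START, SENTENCE_END = "<s>", "</s>"
SPECIAL_TOKENS = {SENTENCE_START, SENTENCE_END, "<unk>", "<UNK>"}


def prepare_sentence(line):
    """Lowercase a sentence, drop punctuation and digits, wrap it in <s> ... </s>."""
    # Replace anything that is not a Unicode letter or whitespace with a space.
    # Lowercasing is an assumption; skip it if the dictionary is case-sensitive.
    cleaned = re.sub(r"[^\w\s]|[\d_]", " ", line.lower())
    words = cleaned.split()
    if not words:
        return ""  # nothing left once punctuation and digits are removed
    return " ".join([SENTENCE_START, *words, SENTENCE_END])


def suspicious_vocab_entries(vocab_lines):
    """Return vocabulary entries that contain anything other than letters."""
    flagged = []
    for raw in vocab_lines:
        entry = raw.strip()
        # Skip blank lines, "##" header comments, and the special tokens.
        if not entry or entry.startswith("##") or entry in SPECIAL_TOKENS:
            continue
        if not re.fullmatch(r"[^\W\d_]+", entry):
            flagged.append(entry)
    return flagged
```

Each corpus line would go through `prepare_sentence` before the vocabulary is built, and the generated vocabulary file could then be passed line by line to `suspicious_vocab_entries` to list entries worth reviewing by hand.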
And finally I create the language model. When I checked it for the first time, I noticed that it contained some <unk> entries. On this page I found that <unk> represents any word that isn't present in the vocabulary. Is that right? So, since <unk>, <s> and </s> are all covered by the ARPA language model format, I assume this is not an error in my language model. Is that right?
Here is a sample of my language model:
To include my language model in my sphinx4 code, I created a dictionary using the vocabulary file and the g2p-seq2seq tool.
When I run the program, I get the following messages:
As my language model was built from a 25k-word vocabulary, this message is repeated many times, and my program either crashes or spends several minutes printing it.
I thought that <unk> would be omitted, since it is the ARPA token denoting "any other word". Is it possible to configure the CMUCLMTK tools so that they do not include <unk> in the language model?
**Is there any problem (in terms of accuracy) if I remove it from the language model?**
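One way to see how the language model and the dictionary relate is to list the unigrams in the ARPA file that have no dictionary entry. The sketch below is in Python and makes several assumptions: the function names are hypothetical, the dictionary uses the plain `word PH1 PH2 ...` layout with `word(2)` for alternate pronunciations, and <s> and </s> are ignored on the assumption that the recognizer handles sentence markers itself. Whether missing entries are what triggers the messages above is not something this check can confirm.

```python
import re


def arpa_unigrams(arpa_lines):
    """Collect the words listed in the 1-grams section of an ARPA file."""
    words, in_unigrams = [], False
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers start with a backslash: \data\, \1-grams:, \end\ ...
            in_unigrams = line == "\\1-grams:"
            continue
        if in_unigrams and line:
            fields = line.split()  # log-probability, word, optional backoff weight
            if len(fields) >= 2:
                words.append(fields[1])
    return words


def dictionary_words(dict_lines):
    """Collect head words from a 'word PH1 PH2 ...' pronunciation dictionary."""
    words = set()
    for raw in dict_lines:
        fields = raw.split()
        if fields:
            # Strip the (2), (3), ... suffix used for alternate pronunciations.
            words.add(re.sub(r"\(\d+\)$", "", fields[0]))
    return words


def missing_from_dictionary(arpa_lines, dict_lines, ignore=("<s>", "</s>")):
    """Return unigrams, in file order, that have no dictionary entry."""
    known = dictionary_words(dict_lines)
    return [w for w in arpa_unigrams(arpa_lines)
            if w not in known and w not in ignore]
```

If <unk> shows up in the result, the model contains a token for which g2p-seq2seq produced no pronunciation, which is worth checking before deciding whether to drop it from the model.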
Okay, with the -vocab_type 0 option of the idngram2lm command I can create a closed-vocabulary language model without <unk>.
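For anyone following along, the whole build with that option might look like the sketch below. It is written in Python and only assembles the command lines; the file names and the helpers `build_lm_commands` and `run_lm_pipeline` are hypothetical, and the CMUCLMTK tools are assumed to be on the PATH.

```python
import shlex
import subprocess


def build_lm_commands(text, vocab, idngram, arpa, vocab_type=0):
    """Return the shell commands for the vocabulary, id n-gram, and ARPA steps."""
    q = shlex.quote
    return [
        # Word frequencies -> vocabulary file.
        f"text2wfreq < {q(text)} | wfreq2vocab > {q(vocab)}",
        # Reference text + vocabulary -> id n-gram file.
        f"text2idngram -vocab {q(vocab)} -idngram {q(idngram)} < {q(text)}",
        # vocab_type 0 requests a closed-vocabulary model (no <unk>).
        f"idngram2lm -vocab_type {int(vocab_type)} -idngram {q(idngram)} "
        f"-vocab {q(vocab)} -arpa {q(arpa)}",
    ]


def run_lm_pipeline(*args, **kwargs):
    """Run each step in order, stopping at the first failing command."""
    for command in build_lm_commands(*args, **kwargs):
        subprocess.run(command, shell=True, check=True)
```

Calling `build_lm_commands("corpus.txt", "es.vocab", "es.idngram", "es.lm")` returns the three commands without running them, which makes it easy to inspect them first. Note that a closed vocabulary assumes every word in the reference text is in the vocabulary file, so any word removed by hand from the vocabulary would also need to be removed from the text.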
But if I did need to use <unk>, how could I configure Sphinx to treat it as a word that does not exist in the vocabulary?