First of all, I created a reference text with no punctuation or other strange characters, one sentence per line. Here is my first question: are the <s> and </s> delimiters really necessary? Is there any problem if I don't add them? When I add them, the vocabulary file generated with CMUCLMTK contains <s> and </s> as words; I thought the tool would ignore them.
A sample of my reference text:
Se conoce como micronutrientes a aquellas sustancias que el organismo de los seres vivos necesita en pequeñas dosis
Son indispensables para los diferentes procesos bioquímicos y metabólicos de los organismos vivos y sin ellos morirán
En los animales engloba las vitaminas y minerales y estos últimos se dividen en minerales y oligoelementos
Se ha podido estudiar bien en ellas cuáles necesitan gracias a cultivos sin suelo que pudiesen alterar los resultados
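For illustration, this is roughly how each line of the reference text could be cleaned and wrapped in the delimiters. It is only a sketch: the function name prepare_sentence and the normalization rules (lowercasing, dropping punctuation and digits) are my own assumptions, not something CMUCLMTK requires.

```python
import re


def prepare_sentence(line: str) -> str:
    """Normalize one sentence and wrap it in <s> ... </s> (hypothetical helper)."""
    text = line.lower()
    # Replace punctuation, digits and underscores with spaces.
    # \w is Unicode-aware in Python 3, so accented letters and ñ are kept.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    text = " ".join(text.split())
    return f"<s> {text} </s>"


if __name__ == "__main__":
    print(prepare_sentence("En los animales engloba las vitaminas y minerales."))
```

Running it over the whole corpus gives one delimited sentence per line, which is the layout I feed to text2wfreq.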
Then, with the CMUCLMTK tools text2wfreq and wfreq2vocab, I obtain my vocabulary file. I check it and remove any strange characters that are present.
Finally, I create the language model. When I checked it for the first time, I noticed that there were some <UNK> words. On this page I found that the word <UNK> represents any word that isn't present in the vocabulary. Am I right? So, since this word, <s> and </s> are all contemplated by the ARPA language model format, I assume that this is not an error in my language model. Am I right?
Here is a sample of my language model:
\1-grams:
-1.1974 <UNK> -0.4059
-1.7203 a -1.4692
-5.3345 aaron -0.3948
-5.2994 ab -0.4322
-5.5114 aba -0.3841
-5.0201 abad -0.4608
-4.2065 abajo -0.7631
-4.7754 abandona -0.9374
-4.9210 abandonada -0.6420
-5.2559 abandonadas -0.5052
-4.5763 abandonado -0.8070
-5.1057 abandonados -0.5412
-5.3605 abandonan -0.7941
[...]
-2.2587 <UNK> intencionalmente una
-1.4825 <UNK> intenciones <UNK>
-2.1145 <UNK> intenciones agresivas
-2.1145 <UNK> intenciones aunque
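To make the sample easier to read: in the ARPA format each entry is a log10 probability, then the words of the n-gram, then an optional backoff weight. Below is a small sketch that splits one entry into those parts. The helper name parse_arpa_entry is hypothetical, and I am assuming the last four lines of the sample are 3-grams (the highest order, so they carry no backoff weight).

```python
def parse_arpa_entry(line: str, order: int):
    """Split one ARPA n-gram line into (log10 prob, words, backoff or None)."""
    fields = line.split()
    logprob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The backoff weight is optional; highest-order n-grams do not have one.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return logprob, words, backoff


if __name__ == "__main__":
    unigram = parse_arpa_entry("-1.1974 <UNK> -0.4059", order=1)
    trigram = parse_arpa_entry("-2.2587 <UNK> intencionalmente una", order=3)
    print(unigram)
    print(trigram)
    # Convert the log10 probability of <UNK> back to a plain probability.
    print(10 ** unigram[0])
```

For the <UNK> unigram above, 10 ** -1.1974 is roughly 0.06, so the model reserves a noticeable share of probability for out-of-vocabulary words.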
To include my language model in my sphinx4 code, I create a dictionary using the vocabulary file and the g2p-seq2seq tool.
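One thing I could try at this step is leaving the special tokens out of the word list that goes to g2p-seq2seq, since <UNK>, <s> and </s> have no real pronunciation. This is only a sketch under my own assumptions: the helper name words_for_g2p is made up, I am assuming the vocabulary file has one token per line with comment lines starting with "#", and I do not know whether this is related to the messages I get.

```python
# Tokens that belong to the language model but are not pronounceable words.
SPECIAL_TOKENS = {"<UNK>", "<s>", "</s>"}


def words_for_g2p(vocab_lines):
    """Return the vocabulary entries that should get a pronunciation."""
    words = []
    for line in vocab_lines:
        token = line.strip()
        # Skip blank lines, comment lines and the special LM tokens.
        if not token or token.startswith("#") or token in SPECIAL_TOKENS:
            continue
        words.append(token)
    return words


if __name__ == "__main__":
    sample = ["## comment line", "</s>", "<s>", "a", "abad", "abajo"]
    print("\n".join(words_for_g2p(sample)))
```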
When I run the program, I get the following messages:
As my language model is created from a 25k-word vocabulary, this message is repeated many times, and my program either crashes or spends many minutes printing it.
I thought that <UNK> would be omitted, since it is the ARPA token that denotes "any other word". Is it possible to configure the CMUCLMTK tools so that they do not include <UNK> in the language model? Is there any problem (in terms of accuracy) if I remove it from the language model?
Hi all!
I just finished my Spanish language model. I followed the tutorial "Building a language model".
OK, with the -vocab_type 0 option of the idngram2lm command I can create a closed-vocabulary language model without <UNK>.
But I would need to use <UNK>. How can I configure Sphinx to understand it as a word that does not exist in the vocabulary?