CMU Sphinx / Forums / Help: Language models doubts

Hi all!

I just finished my Spanish language model, following the "Building a language model" tutorial.

First of all, I created a reference text with no punctuation or other strange characters, one sentence per line. Here is my first question: are the <s> and </s> delimiters really necessary? Is there any problem if I don't add them? When I add them, the vocabulary file generated with CMUCLMTK contains <s> and </s> as words; I thought the tool would ignore them.
A sample of my reference text:

Se conoce como micronutrientes a aquellas sustancias que el organismo de los seres vivos necesita en pequeñas dosis
Son indispensables para los diferentes procesos bioquímicos y metabólicos de los organismos vivos y sin ellos morirán
En los animales engloba las vitaminas y minerales y estos últimos se dividen en minerales y oligoelementos
Se ha podido estudiar bien en ellas cuáles necesitan gracias a cultivos sin suelo que pudiesen alterar los resultados
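
For illustration, the text preparation described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name `normalize_sentence` is hypothetical, lowercasing is assumed because the words in the language model are lowercase, and the `<s>` / `</s>` wrapping follows the delimiter convention asked about above.

```python
import re
import unicodedata

# Runs of letters only; Unicode-aware, so "ñ" and accented vowels are kept.
LETTERS = re.compile(r"[^\W\d_]+")

def normalize_sentence(line: str) -> str:
    """Lowercase a sentence, drop punctuation and digits, wrap it in <s> ... </s>."""
    # Assumption: compose accents (NFC) so "é" is one character, not "e" + accent.
    text = unicodedata.normalize("NFC", line).lower()
    words = LETTERS.findall(text)
    if not words:
        return ""  # nothing usable on this line
    return "<s> " + " ".join(words) + " </s>"

if __name__ == "__main__":
    print(normalize_sentence("Se conoce como micronutrientes, a aquellas sustancias."))
```

Each non-empty result would be written as one line of the reference text.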

Then, with the CMUCLMTK tools text2wfreq and wfreq2vocab, I obtained my vocabulary file. I checked it and removed any strange characters that were present.
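
The manual check of the vocabulary file could also be automated. The sketch below is an assumption about how that might look, not part of CMUCLMTK: `clean_vocab` is a hypothetical helper, comment lines are assumed to start with `#`, and keeping the `<s>` and `</s>` entries is a choice made here for illustration.

```python
import re

# Assumption: the sentence markers are kept in the vocabulary on purpose.
SPECIAL_TOKENS = {"<s>", "</s>"}
# An entry counts as "clean" if it consists of letters only (Unicode-aware).
WORD_RE = re.compile(r"[^\W\d_]+")

def clean_vocab(lines):
    """Split vocabulary entries into (kept, dropped) lists."""
    kept, dropped = [], []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if entry in SPECIAL_TOKENS or WORD_RE.fullmatch(entry):
            kept.append(entry)
        else:
            dropped.append(entry)  # contains digits, symbols, or other debris
    return kept, dropped

if __name__ == "__main__":
    kept, dropped = clean_vocab(["## header", "</s>", "<s>", "años", "ab3", "«"])
    print("kept:", kept)
    print("dropped:", dropped)
```

Reviewing the `dropped` list before deleting anything avoids silently losing real words.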

Finally, I created the language model. When I checked it for the first time, I noticed that it contained some <UNK> words. On this page I found that <UNK> represents any word that isn't present in the vocabulary; is that right? Since this word, <s>, and </s> are all part of an ARPA language model, I assume this is not an error in my language model. Is that right?
Here is a sample of my language model:

1-grams:
-1.1974 <UNK>   -0.4059
-1.7203 a   -1.4692
-5.3345 aaron   -0.3948
-5.2994 ab  -0.4322
-5.5114 aba -0.3841
-5.0201 abad    -0.4608
-4.2065 abajo   -0.7631
-4.7754 abandona    -0.9374
-4.9210 abandonada  -0.6420
-5.2559 abandonadas -0.5052
-4.5763 abandonado  -0.8070
-5.1057 abandonados -0.5412
-5.3605 abandonan   -0.7941
[...]
-2.2587 <UNK> intencionalmente una 
-1.4825 <UNK> intenciones <UNK> 
-2.1145 <UNK> intenciones agresivas 
-2.1145 <UNK> intenciones aunque
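
To inspect entries like the ones above, the 1-gram lines can be read with a few lines of Python. This is a minimal sketch that assumes the layout shown in the sample (log10 probability, word, optional backoff weight, separated by whitespace); `parse_unigrams` is a hypothetical name, and lines that do not fit that shape are simply skipped.

```python
def parse_unigrams(lines):
    """Yield (log10_prob, word, backoff) for lines shaped like the 1-gram sample."""
    for line in lines:
        parts = line.split()
        if len(parts) not in (2, 3):
            continue  # headers, blank lines, and longer n-grams are skipped
        try:
            logprob = float(parts[0])
            backoff = float(parts[2]) if len(parts) == 3 else None
        except ValueError:
            continue  # not a numeric 1-gram line
        yield logprob, parts[1], backoff

if __name__ == "__main__":
    sample = [
        "1-grams:",
        "-1.1974 <UNK>   -0.4059",
        "-1.7203 a   -1.4692",
        "-2.2587 <UNK> intencionalmente una",
    ]
    for logprob, word, backoff in parse_unigrams(sample):
        # ARPA files store base-10 logarithms, so 10 ** logprob is the probability.
        print(word, 10 ** logprob, backoff)
```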

To include my language model in my sphinx4 code, I created a dictionary from the vocabulary file using the g2p-seq2seq tool.

When I run the program, I get the following messages:

20:44:29.719 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'

Since my language model was built from a 25k-word vocabulary, this message is repeated many times, and my program either crashes or spends several minutes printing it.
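
One way to see in advance which language model words have no entry in the dictionary is to compare the two word lists offline. The sketch below assumes a dictionary with one `word phone phone ...` entry per line and alternate pronunciations marked as `word(2)`; the function names and the phones in the example are hypothetical.

```python
import re

# Assumed suffix for alternate pronunciations, e.g. "abad(2)".
ALT_SUFFIX = re.compile(r"\(\d+\)$")

def dictionary_words(dict_lines):
    """Collect the words that have at least one transcription."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if parts:
            words.add(ALT_SUFFIX.sub("", parts[0]))
    return words

def missing_from_dictionary(lm_words, dict_words):
    """Return the language model words that have no dictionary entry, sorted."""
    return sorted(set(lm_words) - dict_words)

if __name__ == "__main__":
    # Phones below are made up for the example.
    dict_lines = ["abajo a b a x o", "abad a b a d", "abad(2) a b a t"]
    lm_words = ["<UNK>", "a", "abad", "abajo"]
    print(missing_from_dictionary(lm_words, dictionary_words(dict_lines)))
```

If only `<UNK>` shows up in the result, the dictionary covers every real word in the model.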

I thought that <UNK> would be omitted, since it is the ARPA token that denotes "any other word". Is it possible to configure the CMUCLMTK tools so that they do not include <UNK> in the language model?
**Is there any problem (in terms of accuracy) if I remove it from the language model?**
