
Language model doubts

2017-09-28
  • Adrián Amarante

    Hi all!

    I just finished my Spanish language model. I followed the "Building a language model" tutorial.

    First of all I created a reference text, with no punctuation or other strange characters, and one sentence per line. Here is my first question: are the <s> and </s> delimiters really necessary? Is there any problem if I don't add them? When I add them, the vocabulary file generated with CMUCLMTK contains <s> and </s> as words; I thought the tool would ignore them.
    A sample of my reference text:

    Se conoce como micronutrientes a aquellas sustancias que el organismo de los seres vivos necesita en pequeñas dosis
    Son indispensables para los diferentes procesos bioquímicos y metabólicos de los organismos vivos y sin ellos morirán
    En los animales engloba las vitaminas y minerales y estos últimos se dividen en minerales y oligoelementos
    Se ha podido estudiar bien en ellas cuáles necesitan gracias a cultivos sin suelo que pudiesen alterar los resultados
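
A minimal sketch of the text preparation described above, in case it helps anyone reproduce it. The lowercasing step and the function names are my assumptions, not something the post states; the <s> ... </s> wrapping follows the corpus format shown in the tutorial.

```python
import re
import unicodedata

def normalize_sentence(line: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace; keep accented letters."""
    line = unicodedata.normalize("NFC", line).lower()  # lowercasing is an assumption
    # Replace anything that is not a letter, digit or whitespace with a space.
    line = re.sub(r"[^\w\s]|_", " ", line)
    return " ".join(line.split())

def wrap(line: str) -> str:
    """Add the sentence delimiters around one normalized sentence."""
    return f"<s> {line} </s>"

if __name__ == "__main__":
    sample = "¿Se conoce como micronutrientes, a aquellas sustancias?"
    print(wrap(normalize_sentence(sample)))
```

Each output line can then be written to the reference text file, one sentence per line.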
    

    Then, with the CMUCLMTK tools text2wfreq and wfreq2vocab, I obtain my vocabulary file. I check it and remove any strange characters that are present.
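
The manual vocabulary check could be partly automated. The sketch below is only an illustration: the allowed alphabet (lowercase Spanish letters) and the function name are assumptions, and it treats lines starting with "##" as header comments, which I believe matches what wfreq2vocab writes but have not verified for every version.

```python
import re

# Assumed alphabet: lowercase Spanish letters only.
ALLOWED = re.compile(r"^[a-záéíóúüñ]+$")
# Special tokens that are expected and should not be flagged.
SPECIAL = {"<s>", "</s>", "<UNK>"}

def suspicious_entries(lines):
    """Return vocabulary entries containing characters outside the allowed alphabet."""
    bad = []
    for raw in lines:
        word = raw.strip()
        # Skip blanks, header comments and the special tokens.
        if not word or word.startswith("##") or word in SPECIAL:
            continue
        if not ALLOWED.match(word):
            bad.append(word)
    return bad

if __name__ == "__main__":
    vocab = ["## header comment", "abajo", "</s>", "caf3", "niño", "a-b"]
    print(suspicious_entries(vocab))
```

Entries it reports still need a human decision: some may be legitimate words that simply fall outside the assumed alphabet.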

    Finally, I created the language model. When I checked it for the first time, I noticed that it contained some <UNK> entries. On this page I found that <UNK> represents any word that isn't present in the vocabulary. Is that right? Since <UNK>, <s> and </s> are all provided for in an ARPA language model, I assume this is not an error in my language model. Is that right?
    Here is a sample of my language model:

    1-grams:
    -1.1974 <UNK>   -0.4059
    -1.7203 a   -1.4692
    -5.3345 aaron   -0.3948
    -5.2994 ab  -0.4322
    -5.5114 aba -0.3841
    -5.0201 abad    -0.4608
    -4.2065 abajo   -0.7631
    -4.7754 abandona    -0.9374
    -4.9210 abandonada  -0.6420
    -5.2559 abandonadas -0.5052
    -4.5763 abandonado  -0.8070
    -5.1057 abandonados -0.5412
    -5.3605 abandonan   -0.7941
    [...]
    -2.2587 <UNK> intencionalmente una 
    -1.4825 <UNK> intenciones <UNK> 
    -2.1145 <UNK> intenciones agresivas 
    -2.1145 <UNK> intenciones aunque 
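
To see which special tokens actually ended up in the model, the unigram section can be listed with a few lines of Python. This is a rough sketch under the assumption that each entry is laid out as "logprob word [backoff]" separated by whitespace, as in the sample above; the function name is hypothetical.

```python
def unigram_words(arpa_lines):
    """Collect the words listed in the 1-grams section of an ARPA file."""
    words = []
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        # Accept the section header with or without its leading backslash.
        if line.lstrip("\\") == "1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            # Any later section header (e.g. the 2-grams header) ends the unigrams.
            if line.startswith("\\") or line.endswith("-grams:"):
                break
            parts = line.split()
            if len(parts) >= 2:
                words.append(parts[1])
    return words

if __name__ == "__main__":
    sample = ["\\1-grams:", "-1.1974 <UNK>\t-0.4059", "-1.7203 a\t-1.4692", "", "\\2-grams:"]
    specials = [w for w in unigram_words(sample) if w.startswith("<")]
    print(specials)
```

Comparing this list against the dictionary shows which model words have no pronunciation entry.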
    

    To include my language model in my sphinx4 code, I created a dictionary using the vocabulary file and the g2p-seq2seq tool.

    When I run the program, I get the following messages:

    20:44:29.719 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    20:44:57.245 INFO dictionary           The dictionary is missing a phonetic transcription for the word '<UNK>'
    

    Because my language model was created from a 25k-word vocabulary, this message is repeated many times, and my program either crashes or spends several minutes printing it.

    I thought that <UNK> would be omitted, as it's the ARPA token that denotes "any other word". Is it possible to configure the CMUCLMTK tools so that they do not include <UNK> in the language model?
    Is there any problem (in terms of accuracy) if I remove it from the language model?

     
  • Adrián Amarante

    Okay, with the -vocab_type 0 option of the idngram2lm command I can create a closed-vocabulary language model without <UNK>.
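
For reference, the closed-vocabulary invocation might look like the sketch below. The file names are hypothetical placeholders, and the -idngram, -vocab and -arpa options are the ones used in the tutorial's idngram2lm example; the helper only builds the argument list so it can be inspected before being run.

```python
def closed_vocab_command(idngram, vocab, arpa):
    """Build an idngram2lm argument list for a closed vocabulary (no <UNK>)."""
    return [
        "idngram2lm",
        "-vocab_type", "0",   # 0 = closed vocabulary, as described in the post
        "-idngram", idngram,  # hypothetical input file name
        "-vocab", vocab,      # hypothetical vocabulary file name
        "-arpa", arpa,        # hypothetical output file name
    ]

if __name__ == "__main__":
    cmd = closed_vocab_command("es.idngram", "es.vocab", "es.lm")
    print(" ".join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```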

    However, I would still need to use <UNK>. How can I configure Sphinx to treat it as a word that does not exist in the vocabulary?

     
