
How to perform keyword spotting in Italian

Help
AiCoder8
2015-08-25
2015-09-09
  • AiCoder8

    AiCoder8 - 2015-08-25

    Hello,

    I'm new to ASR. I'm currently using PocketSphinx from GitHub master to perform keyword spotting.

    I tried it on English and it works, but I got lost when I tried Italian.

    After a lot of googling, for en, I'm using:

    $ pocketsphinx_continuous -infile input.wav -hmm ./en-us-8khz -samprate 8000 -kws keywords.txt -kws_threshold 0.95

    It works. I'm doing keyword spotting on phone call recordings, hence 8 kHz, but this is fine for English. I still have to tune it a bit, but the results are encouraging. (I also tried to use pocketsphinx_kws, as suggested in many places, but I guess it has been merged.)
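For reference, the keyword list passed with -kws is a plain text file with one keyphrase per line; an optional per-keyphrase detection threshold can be given between slashes. The keyphrases and threshold values below are only illustrative and need tuning per keyword:

```
hello world /1e-10/
good morning /1e-20/
forecast /1e-5/
```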

    Anyway, how do I do that for Italian?

    If I simply use Italian keywords in keywords.txt, I'm told that the words are not in the dictionary, which makes sense since I'm using -hmm en-us-8khz; but even without using this model, the Italian words are not in the dictionary.
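The "not in the dictionary" error means the keyword has no pronunciation entry in the decoder's .dic file. A quick sanity check can be scripted; this is a minimal sketch assuming a CMU-style dictionary (one `WORD PH ON EMES` entry per line, alternate pronunciations marked `WORD(2)`), and the entries below are made up for illustration:

```python
def missing_words(keywords, dict_text):
    """Return the keywords that have no entry in a CMU-style .dic file."""
    vocab = set()
    for line in dict_text.splitlines():
        if line.strip():
            word = line.split()[0]
            # strip the alternate-pronunciation suffix, e.g. ciao(2) -> ciao
            vocab.add(word.split("(")[0].lower())
    return [w for w in keywords if w.lower() not in vocab]

# Example with a tiny inline dictionary (pronunciations are illustrative)
dic = "ciao CH AO\nciao(2) K IY AW\nsi S IY\n"
print(missing_words(["ciao", "si", "no"], dic))  # -> ['no']
```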

    I would like to avoid, at least for now, building complex acoustic models for Italian just for keyword spotting.

    Any pointer in the right direction is greatly appreciated.

    Best,
    --C

     
    • Nickolay V. Shmyrev

      To analyze Italian you need to train an Italian acoustic model. The tutorial is here:

      http://cmusphinx.sourceforge.net/wiki/tutorialam

      You can use transcribed Italian audiobooks and podcasts to create a database for training.

       
      • AiCoder8

        AiCoder8 - 2015-08-25

        Hi Nickolay, Thank you for your reply. I'll follow the tutorial.

         
  • AiCoder8

    AiCoder8 - 2015-08-26

    Following the tutorial, and using some files from voxforge_it, I was able to set up the training directory and so on, but running the training I got:

    Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
    Estimated Total Hours Training: 0.0577444444444444
    ERROR: Not enough data for the training, we can only train CI models (set CFG_CD_TRAIN to "no")

    Googling this issue only points me to a code patch that allows training small CI models, but I'm not sure what CI models are, nor what the CFG_CD_TRAIN directive does.

    What I would like to achieve is to distinguish between a small set of words, like { SI, NO } or { UNO, DUE, ..., ZERO }, using keyword spotting, and to compare that with grammar- and n-gram-based approaches. But I need to be able to build the acoustic model to continue. Do I simply need more training data to build the acoustic model?

    Thank you,
    --C

     
    • Nickolay V. Shmyrev

      Do I simply need more training data to build the acoustic model?

      Yes

       
  • AiCoder8

    AiCoder8 - 2015-08-27

    Ok, perfect. To get an idea, what is the bare minimum amount of training data (in hours) needed to correctly build the acoustic model?

    I mean, just to test the build process for the acoustic model, regardless of quality. Then I will add more and more data to improve quality.

    Thanks again,

    --C

     
    • Nickolay V. Shmyrev

      To test the build process you can download the an4 database.

       
  • AiCoder8

    AiCoder8 - 2015-08-27

    Done. Training went fine, but at the end (when running sphinxtrain -s decode run) I got:

    Sphinxtrain path: /usr/local/lib/sphinxtrain
    Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
    MODULE: DECODE Decoding using models previously trained
    Decoding 130 segments starting at 0 (part 1 of 1)
    0% ERROR: FATAL: "batch.c", line 821: PocketSphinx decoder init failed

    ERROR: This step had 3 ERROR messages and 0 WARNING messages. Please check the log file for details.
    ERROR: Failed to start pocketsphinx_batch
    Aligning results to find error rate
    Can't open /home/coder/working/ts/an4/result/an4-1-1.match
    word_align.pl failed with error code 65280 at /usr/local/lib/sphinxtrain/scripts/decode/slave.pl line 173.

    The log file contains no errors, only a lot of:
    INFO: sphinx_fe.c(764): Converting /home/coder/working/ts/an4/wav/an4_clstk/fash/an251-fash-b.sph to /home/coder/working/ts/an4/feat/an4_clstk/fash/an251-fash-b.mfc
    INFO: sphinx_fe.c(764): Converting /home/coder/working/ts/an4/wav/an4_clstk/fash/an253-fash-b.sph to /home/coder/working/ts/an4/feat/an4_clstk/fash/an253-fash-b.mfc
    INFO: sphinx_fe.c(764): Converting /home/coder/working/ts/an4/wav/an4_clstk/fash/an254-fash-b.sph to /home/coder/working/ts/an4/feat/an4_clstk/fash/an254-fash-b.mfc
    INFO: sphinx_fe.c(764): Converting /home/coder/working/ts/an4/wav/an4_clstk/fash/an255-fash-b.sph to /home/coder/working/ts/an4/feat/an4_clstk/fash/an255-fash-b.mfc

    Googling has not turned up any useful solution to this.

    Meanwhile, a very important question. I'm preparing a training set for the acoustic model. I have audio files of conversations of about two minutes each. The transcriptions are not short like in voxforge (a single sentence like: <s> Esempio di frase </s>); each audio file's transcription is composed of 3-5 sentences. I found no info in the documentation about the best strategy.
    The important question is: must I split each audio file into sentences (1 file = 1 sentence), or is it better (and preferred) to have, on each transcription row, the transcription of the whole conversation, with sentences separated by dots? Example:
    "<s> Questa è la prima frase. Questa è la seconda frase. Questa è un'altra frase. </s>"

    I'm finding it very difficult to figure this point out.

    Thank you,
    --C

     

    Last edit: AiCoder8 2015-08-27
  • AiCoder8

    AiCoder8 - 2015-08-27

    Problem fixed:

    Following the tutorial's note "please make sure that you changed an4.lm.DMP to an4.ug.lm.DMP", I had renamed an4.lm.DMP to an4.ug.lm.DMP, and this was the cause of the error.

    Moving back to an4.lm.DMP fixed the error:

    sphinxtrain -s decode run
    Sphinxtrain path: /usr/local/lib/sphinxtrain
    Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
    MODULE: DECODE Decoding using models previously trained
    Decoding 130 segments starting at 0 (part 1 of 1)
    0%
    Aligning results to find error rate
    SENTENCE ERROR: 46.2% (60/130) WORD ERROR RATE: 15.5% (119/773)

    Basically, the training process went fine using an4. It works.
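For context on the numbers that the alignment step prints: WER is the word-level edit distance between reference and hypothesis, divided by the reference length. This is a minimal illustrative sketch, not the actual word_align.pl algorithm (which also reports the alignment itself):

```python
def word_error_rate(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / len(r)

print(word_error_rate("enter three", "enter tree"))  # -> 0.5
```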

    Now I can focus on doing the same for Italian. I copy here an important question from the previous post.

    I'm preparing a training set for the acoustic model. I have audio files of conversations of about two minutes each. The transcriptions are not short like in voxforge (a single sentence like: <s> Esempio di frase </s>); each audio file's transcription is composed of 3-5 sentences. I found no info in the documentation about the best strategy.
    The important question is: must I split each audio file into sentences (1 file = 1 sentence), or is it better (and preferred) to have, on each transcription row, the transcription of the whole conversation, with sentences separated by dots? Example:
    "<s> Questa è la prima frase. Questa è la seconda frase. Questa è un'altra frase. </s>"

    I'm finding it very difficult to figure this point out.

    Thank you,
    --C

     
    • Nickolay V. Shmyrev

      The text for training should have no punctuation or upper case. You can download the an4 database and see the details there.

       
      • AiCoder8

        AiCoder8 - 2015-08-28

        Hi Nickolay,

        The text format for training seems confusing to me.

        The an4 training file has rows like the following:

        <s> ERASE O T E B SEVENTY NINE </s> (an346-mnfe-b)
        <s> HELP </s> (an347-mnfe-b)
        <s> ENTER EIGHT TWENTY SEVEN </s> (an348-mnfe-b)
        <s> RUBOUT R R N A A NINETY FOUR </s> (an349-mnfe-b)
        <s> ENTER THREE </s> (an350-mnfe-b)
        <s> E F R O M </s> (cen1-mnfe-b)
        <s> N E I L </s> (cen2-mnfe-b)

        It looks like spelling or single-word training, and it is all upper case.

        The voxforge_it dataset has rows like the following:

        <s> cretesi eaco appresso i mirmidoni ma per arrivare dove ho l'animo abbaino </s> (it_0011)
        <s> cui il giovane sanza me le conoscerai abandona i pigri sonni e col tuo </s> (it_0012)
        <s> descrive in questa forma dicendo che ella abbaia ha la voce di cagnolino </s> (it_0013)
        <s> documenti che gli parevano dargli un titolo ad accampare de' diritti su quel </s> (it_0014)
        <s> doloroso marito si venne accorgendo che ella nel confortare lui a bere non </s> (it_0015)

        Everything is lowercase and reads more like normal sentences, with no "spelling" of words.

        Why is HELLO often represented in an4 as H E L L O, and what is the best practice for training? I was not able to find any tutorial on this point.

        Please point me in the right direction, using a quasi-real case, like having:

        -> Audio1.wav
        -> Transcription1.txt
        "Buongiorno, oggi parliamo del cloud. I sistemi distribuiti si dividono in più categorie, sistemi ibridi e sistemi basati totalmente su cloud. La lezione di oggi si focalizzerà sull'analisi delle architetture, cosi come introdotte da D'Ambrogi nel relativo libro di testo o negli appunti".

        Now, which is the best practice ?

        A single sentence <s> Buongiorno oggi parliamo ... </s> with all the text and no punctuation, or is it better to split the audio file into one sentence per file, and thus have one sentence per row?

        For a person's name like D'Ambrogi (a very common case in Italian), how do I achieve the best result? Should it be represented as <s> ..... d'ambrogi ..... </s> or <s> .... d ambrogi ... </s> ?

        Please help me understand how to correctly create the transcriptions to get the best results, as this point is very difficult to understand.

        Best,
        --C

         
        • Nickolay V. Shmyrev

          It seems to be like a spelling or single word training, and is all upper case.

          No, it is not spelling; it's just a sequence of single-letter words, which is what that model was created for. You do not need that.

          The voxforge_it dataset has rows like the following:

          This is a correct example to follow.

          A single sentence <s> Buongiorno oggi parliamo ... </s> with all the text and no punctuation, or is it better to split the audio file into one sentence per file, and thus have one sentence per row?

          You need one sentence per row, without punctuation, all lowercase, just like in the voxforge example.

          For a person's name like D'Ambrogi (a very common case in Italian), how do I achieve the best result? Should it be represented as <s> ..... d'ambrogi ..... </s> or <s> .... d ambrogi ... </s> ?

          d'ambrogi
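The cleanup rules above (one sentence per row, all lowercase, no punctuation, apostrophes kept as in d'ambrogi) can be sketched as a small normalizer. This is an illustrative sketch only; the function name is made up, and a real pipeline would also need digits spelled out as words first:

```python
import re

def normalize_transcript(sentence, utt_id):
    """Turn one raw sentence into a SphinxTrain-style transcript row:
    lowercase, punctuation stripped, apostrophes kept, wrapped as
    <s> ... </s> (utterance_id)."""
    text = sentence.lower()
    # keep letters (including accented ones) and apostrophes, drop the rest
    text = re.sub(r"[^\w'\s]", " ", text)
    text = re.sub(r"[\d_]", " ", text)  # digits must be written out as words beforehand
    text = " ".join(text.split())
    return "<s> {} </s> ({})".format(text, utt_id)

print(normalize_transcript("Questa è la prima frase.", "it_0001"))
# -> <s> questa è la prima frase </s> (it_0001)
print(normalize_transcript("come introdotte da D'Ambrogi nel libro", "it_0002"))
# -> <s> come introdotte da d'ambrogi nel libro </s> (it_0002)
```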

           
  • AiCoder8

    AiCoder8 - 2015-08-28

    Thank you, very helpful feedback. Now everything should be in place to start creating a (hopefully) good Italian acoustic model for baseline testing.

    Thanks again,
    --C

     
  • AiCoder8

    AiCoder8 - 2015-09-09

    Hello,

    I was able to build the acoustic model, with the following performance: SER 100%, WER 30.5%.

    Now, in order to better understand pocketsphinx (and how to build a better model), I would like to perform two tasks.

    First task: recognize a single word from the set [ si, no, zero, uno, ..., nove ], basically yes/no plus digits. Given the acoustic model I've built, which is the best strategy: i) build a language model whose words are [si, no, zero, uno, ..., nove], with a matching dictionary; ii) use a grammar; iii) use keyword spotting?

    For the second task (more difficult), I would like to recognize proper city names as well as people's names. This task is quite different. Is it better here to use: i) a custom language model with sentences that contain these names; ii) a grammar; iii) keyword spotting?

    Thanks,
    --C

     

    Last edit: AiCoder8 2015-09-09
    • Nickolay V. Shmyrev

      which is the best strategy: i) build a language model whose words are [si, no, zero, uno, ..., nove], with a matching dictionary; ii) use a grammar; iii) use keyword spotting?

      This question is covered in the tutorial:

      http://cmusphinx.sourceforge.net/wiki/tutoriallm

      It explains the advantages and disadvantages of each method. The decision does not depend on vocabulary size, but on the type of speech you want to recognize (whether you want continuous listening, and so on).
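For the first task above (a closed set of yes/no plus digits), option (ii) could be expressed as a JSGF grammar and passed to pocketsphinx with the -jsgf option. The grammar and rule names below are made up for illustration:

```
#JSGF V1.0;

grammar risposte;

public <risposta> = si | no | <cifra>;
<cifra> = zero | uno | due | tre | quattro | cinque | sei | sette | otto | nove;
```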

       
