Hi everyone,
Before posting, I did several searches in the forum and couldn't find the answer. Thanks in advance, and sorry if I make mistakes in English; it is not my mother tongue ;-)
I'm using SphinxTrain to train my acoustic models, Sphinx-II as the decoder, and the SLM toolkit for language models.
My task is answering queries about prices, timetables, services, etc., for long-distance trains in Spanish. I have collected audio data for the acoustic models and everything seems to be all right. I have trained semi-continuous models with the Perl scripts.
I have two recognizers: Sphinx-II in live mode and in batch mode. In live mode the word accuracy seems to be almost 80%, but when I evaluate my test set in batch mode I obtain only 50% word accuracy. I would like to know how to fix this, and what the difference is between the live-mode front end and the batch front end. Perhaps the transformation from signal to features is done differently, but I really don't know.
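As background for these accuracy figures: word accuracy is typically reported as 100% minus the word error rate (WER), where WER counts substitutions, deletions, and insertions against the reference transcript. A minimal, tool-independent sketch of that computation (the standard definition, not Sphinx code):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by word-level Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # all reference words deleted
    for j in range(len(h) + 1):
        d[0][j] = j  # all hypothesis words inserted
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution and one deletion over six reference words -> WER = 2/6.
print(word_error_rate("a que hora sale el tren", "a que hora llega tren"))
```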
Thank you in advance; I await your answer.
Sergio
Hello Sergio,
I'm Javi from the Universidad Carlos III de Madrid. Forgive me for addressing you without answering your question, but I think it's worth getting in touch, because we're both working on the same thing.
I'm currently doing my own training in Spanish with Sphinx III. When I saw your message, it caught my attention that you obtained 80% accuracy with a language model that was not produced by your own training. I'd like to know which model you used, whether it is in Spanish, and how I could get access to it: is there any Spanish language model on the web?
Well, good luck with the task; it's not easy. I hope we can help each other in the future.
Thanks in advance and best regards.
Javi
Hi,
Could anyone help j_arroba translate his message? I would like to help, but I don't speak Spanish. :-)
Arthur
A rough translation is:
"Forgive me for addressing you while not answering to your question, but it seems interesting that we get in touch, since we're both working in the same field.
I'm currently training models in Spanish with Sphinx-3. Your message caught my attention because you achieved 80% accuracy with an LM not obtained from your own training. I'd like to know which modles you used, whether in Spanish, and how I could access it. Is there any Spanish model on the web?"
Followed by the usual greetings.
In answer to the first post: live and batch processing are roughly the same, but the cepstral mean normalization (CMN) differs. Batch CMN is computed over the whole utterance (non-causal), whereas live CMN uses a sliding window to update the estimated mean vector as frames arrive.
Also, batch processing handles the whole utterance at once, whereas live processing handles frames as they become available. The transition between frames (how many samples go into the overflow buffer, etc.) may have bugs. If you find any and send us a fix, we'll be grateful.
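To make the CMN difference concrete, here is a minimal sketch, not the actual Sphinx-II code. An exponentially decaying running mean stands in for the sliding-window update, and alpha is an illustrative constant, not a Sphinx parameter:

```python
import numpy as np

def cmn_batch(cepstra):
    """Batch CMN: subtract the mean computed over the whole utterance
    (non-causal: requires all frames before any can be normalized)."""
    return cepstra - cepstra.mean(axis=0)

def cmn_live(cepstra, init_mean, alpha=0.99):
    """Live CMN sketch: normalize each incoming frame with the current
    running-mean estimate, then update the estimate."""
    mean = np.asarray(init_mean, dtype=float).copy()
    out = np.empty(cepstra.shape, dtype=float)
    for t, frame in enumerate(cepstra):
        out[t] = frame - mean                        # use current estimate
        mean = alpha * mean + (1.0 - alpha) * frame  # then refine it
    return out
```

Because the live estimate always lags the true utterance mean, live and batch decodes of the same audio need not produce identical scores.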
In answer to the second post, I'm not aware of any publicly available model in Spanish. At CMU, we had models in Spanish some years ago, but these are not public.
--Evandro
First of all, thank you for your answers. (Obrigado Evandro!!)
I don't know how to fix this problem, but it's really frustrating to see high accuracy in live mode and not be able to reproduce it in batch mode. Perhaps there is a problem in my test audio data, although I tried to record it with the same characteristics as the training data.
The acoustic models were trained with data from an easier task. That evaluation gave me 10% WER for the speaker-dependent task and 75% for the speaker-independent one. All the experimentation has been done with Sphinx-II.
For the new (more difficult) task I have created the language model and the dictionary, and reused the acoustic models from the previous task. Perhaps the problem is that, even though I think I recorded my corpus under the same conditions, I need to train new acoustic models on the new audio files.
By the way, I have another question. I had no problem training acoustic models with 16 kHz, 16-bit, A-law data, but I have failed with telephone data (8 kHz, 8-bit, u-law). I get a lot of errors in the process, during the Baum-Welch iterations. I thought there were many misalignments with the transcriptions, and I think the problem is in the wave2feat transformation. Which parameters should I change to convert telephone data to cepstra correctly? The sampling rate is clear, but what about the frame rate? The Hamming window? The filter banks? I would be really happy if you could help me.
As for j_arroba: he asked whether we could collaborate on this task, and whether there is a Spanish language model on the web. Well, I think there are not many resources in Spanish for training acoustic models. I have written a manual for the Sphinx tools in Spanish, in which I explain how to collect data, build the dictionary, create the language models, train acoustic models with the Perl scripts, and modify sphinx2-batch, sphinx2-live, and sphinx2-server. I have recorded my own audio data (ten people reading 200 sentences) and I use my own make_dict for Castilian Spanish. I will try to put all this on my web page as soon as possible. I think Arthur Chan has collected links for Sphinx in languages other than English? If so, I will send you the link!!
Sorry, it's a long post... but I hope it helps someone.
Cheers,
Sergio
Hi Sergio -- re your "other question" --
Forgive me if this is already obvious to you, but it's important! Wave2feat cannot directly process 8-bit u-law audio data; it must be converted to 16-bit linear samples first.
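If you need that conversion yourself, the standard G.711 u-law expansion is short; here is a sketch (generic G.711 decode logic, not taken from any Sphinx tool, with hypothetical file names):

```python
import numpy as np

def ulaw_to_linear16(ulaw_bytes):
    """Expand 8-bit G.711 u-law samples to 16-bit linear PCM."""
    out = np.empty(len(ulaw_bytes), dtype=np.int16)
    for i, u in enumerate(ulaw_bytes):
        u = ~u & 0xFF                    # u-law bytes are stored complemented
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out[i] = -sample if sign else sample
    return out

# Hypothetical file names: raw u-law in, raw 16-bit little-endian PCM out.
with open("utt001.ulaw", "rb") as f:
    pcm = ulaw_to_linear16(f.read())
pcm.astype("<i2").tofile("utt001.raw")
```

Tools such as sox can do the same conversion if you prefer not to script it.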
Wave2feat has a number of processing parameters whose default values assume 16 kHz data. For 8 kHz data, you must set the following four. If you type "wave2feat -help yes", you'll see that these values are recommended:
-lowerf 130
-upperf 3700
-nfilt 31
-nfft 256 or 512 (I think 256 should be OK for 8 kHz data, but the Sphinx-4 people used 512 for training 8 kHz models).
You can leave the frame rate and Hamming window parameters at their defaults. It is essential that these same parameter values be set in the recognizer, since the front-end processing for training and recognition should always be the same.
In addition, I believe it's always advisable to use "-dither yes"; a combined invocation is sketched below.
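Putting the four 8 kHz parameters and the dither flag together, a single-file invocation might look like the sketch below. The -i, -o, -raw, and -srate flag names are from memory of wave2feat's usage, not from this thread, so confirm them with "wave2feat -help yes":

```python
import subprocess

# Flag values follow the 8 kHz recommendations above; file names are
# hypothetical, and -i/-o/-raw/-srate should be checked against your build.
subprocess.run([
    "wave2feat",
    "-i", "utt001.raw",   # 16-bit linear input (after u-law expansion)
    "-o", "utt001.mfc",   # cepstral output
    "-raw", "yes",        # headerless input file
    "-srate", "8000",
    "-lowerf", "130", "-upperf", "3700", "-nfilt", "31", "-nfft", "256",
    "-dither", "yes",
], check=True)
```

The same -lowerf, -upperf, -nfilt, and -nfft values then have to be given to the decoder, per the warning above that the training and recognition front ends must match.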
I hope that helps.
cheers,
jerry wolf
Javi (j_arroba), send me a message through the SourceForge messaging utility (click my name in my posts) with your email address, and I will contact you.
Sergio