Hello,
I'm playing around with pocketsphinx and adaptation (MLLR and MAP) as per your tutorials,
and I'm getting really low accuracy on my voice with the en-us model.
I tried to troubleshoot, but I would love your opinion :)
This is the ffprobe output for my .wav files:
"bitrate: 256 kb/s Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s"
and from "file" (man file, determine file type):
"RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz"
I follow this scenario every time:
- around 5 minutes of data as the adaptation input
- around 2 minutes of different data to test and compare recognition with and w/o the adapted model/MLLR (measured as sketched below)
My native language is Greek, and when using the Greek continuous model
(downloaded from your site, created by Fotis Pantazoglou)
I get around 70% accuracy with the model alone; after MLLR adaptation
and testing, accuracy goes to ~90%, which is satisfactory.
Now with the en-us PTM model (included in pocketsphinx/model)
or with the en-us continuous model (cmusphinx-cont-en-us-5.2, downloaded from your site),
when I run the recognizer (either pocketsphinx_continuous or pocketsphinx_batch) I get poor results (lower than 50%),
and adapting (MLLR or MAP) does not get me anything higher than ~60%.
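For reference, this is roughly how I load the adapted model and the MLLR transform for decoding; a minimal sketch assuming the SWIG-based pocketsphinx Python bindings (all paths are placeholders for my adapted model directory, dictionary, language model, and the mllr_matrix file produced during adaptation):

```python
from pocketsphinx import Decoder

# All paths are placeholders: adapted acoustic model, dictionary,
# language model, and the MLLR transform written during adaptation.
config = Decoder.default_config()
config.set_string("-hmm", "en-us-adapt")
config.set_string("-dict", "cmudict-en-us.dict")
config.set_string("-lm", "en-us.lm.bin")
config.set_string("-mllr", "mllr_matrix")
decoder = Decoder(config)

decoder.start_utt()
with open("test.raw", "rb") as f:  # 16 kHz, 16-bit, mono raw PCM
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()
print(decoder.hyp().hypstr)
```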
I am wondering what my options are to improve my accuracy.
I am attaching the transcription and fileids files along with the audio (.wav files of myself reading the arctic example sentences).
(Do tell me if they were uploaded correctly.)
I am sure I have an accent, and that does not help recognition,
but I thought adaptation could correct for some of it
(it does bring accuracy up by 5-10% as promised, so I'm not complaining).
However, recognition is still low and I don't know what to do...
I would like to avoid training a model,
and I would like to create something that is not based around my own voice,
but rather a general tool that lets the user record, adapt, and test the adapted model,
and then work with that.
Any help would be amazing,
thanks in advance!
Pocketsphinx is very old technology; it doesn't provide enough accuracy by modern standards. You can try Vosk instead (https://github.com/alphacep/vosk-api) with the daanzu English model.
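For example, a minimal decoding loop with the Vosk Python bindings ("model" is a placeholder for an unpacked Vosk model directory such as the daanzu one, and the WAV file must be 16 kHz, 16-bit, mono PCM):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("model")  # placeholder: path to an unpacked Vosk model

wf = wave.open("test.wav", "rb")  # 16 kHz, 16-bit, mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio in chunks, then read the final result.
while True:
    data = wf.readframes(4000)
    if not data:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```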
OK, thank you Nickolay,
I already started looking into it.
I skimmed the documentation and the C++ API a bit
and found it is quite a bit simpler than pocketsphinx.
However, reading the documentation blindly never helps ;)
Do you have any suggestions on where to start?
With Sphinx there was a pretty clean and clear tutorial that helped me a lot.
(I also feel that this conversation should be held elsewhere; should I open a new topic here?
Or, even better: where can I ask about Vosk?)
On GitHub or in a Telegram group.