My current task is to implement SpeechToText on Ubuntu 14.04.
I installed pocketsphinx from packages (python-pocketsphinx pocketsphinx-hmm-en-hub4wsj pocketsphinx-lm-en-hub4) and installed cmuclmtk-0.7 (needed to build the language model and dictionary).
corpus.txt, mklm.sh, mkdict.sh, vc.lm and vc.dic can be found in the attached zip. Note that cmu07a.dic also needs to be present in this folder, since mkdict.sh uses it (copy it from /usr/share/pocketsphinx/model/lm/en_US).
For the cmu-toolkit:
LD_LIBRARY_PATH=/usr/local/lib
export LD_LIBRARY_PATH
I created the language model (vc.lm) and the dictionary (vc.dic) with:
sh mklm.sh
sh mkdict.sh
Change /home/rob/CMUSphinx/vc.lm and /home/.../vc.dic in cmupyt.py to point to the vc.lm and vc.dic files on your computer.
The wav-files are all in the acmds folder (bigger, browser, center, close, down, email, keyboard, left, music, open, play, right, select, smaller, stop, tv, up)
For example:
python ./cmupyt.py acmds/browser.wav
correctly prints: "browser"
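For reference, the old-API script can be sketched roughly like this (the attached cmupyt.py is not reproduced here; the model path and the Decoder keyword arguments are assumptions based on the Ubuntu 14.04 packages):

```python
import sys

def decode(wav_path):
    """Decode one wav file with the old python-pocketsphinx binding."""
    # The acoustic-model path below is an assumption based on the
    # pocketsphinx-hmm-en-hub4wsj package; adjust hmm/lm/dict to your setup.
    import pocketsphinx
    decoder = pocketsphinx.Decoder(
        hmm='/usr/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k',
        lm='/home/rob/CMUSphinx/vc.lm',
        dict='/home/rob/CMUSphinx/vc.dic')
    with open(wav_path, 'rb') as f:
        f.seek(44)              # skip the 44-byte RIFF/WAV header
        decoder.decode_raw(f)
    hypstr, uttid, score = decoder.get_hyp()
    return hypstr

if __name__ == '__main__' and len(sys.argv) > 1:
    print(decode(sys.argv[1]))
```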
The old API (pocketsphinx installed from packages) correctly recognized all the words in the wav-files.
The problem is that the new API (installed from source) doesn't recognize "down", "open", "play", "stop", "up". It gives back an empty string for "down", "open" and "play". It returns "stop" for "select" and "stop" for "up".
I believe it should be at least as good as the old API, so I must be doing something wrong.
I know that it is better to have words with more than 3 syllables, but the old API recognizes all of them.
++++++New API (installation)
Install the build dependencies: sudo apt-get install autoconf automake libtool bison python-dev swig git libasound2-dev
mkdir voice_recognition
cd voice_recognition
git clone git://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
./autogen.sh
make
sudo make install
cd ..
git clone git://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
./autogen.sh
make
sudo make install
cd ..
The acoustic model, dictionary and language model are in /usr/local/share/pocketsphinx/model/en-us.
new.py is included in the attachment.
Copy vc.lm and vc.dic to /usr/local/share/pocketsphinx/model/en-us
+++Usage:
python ./new.py acmds/browser.wav
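The attached new.py is not shown here, but with the new swig-based API such a script typically looks roughly like the following sketch (the paths follow the en-us layout mentioned above and are assumptions):

```python
import sys

MODELDIR = '/usr/local/share/pocketsphinx/model/en-us'

def decode(wav_path, modeldir=MODELDIR):
    """Decode one wav file with the new swig-based pocketsphinx API."""
    # Import kept inside the function so the sketch can be read without
    # pocketsphinx installed; vc.lm and vc.dic were copied into modeldir.
    from pocketsphinx.pocketsphinx import Decoder
    config = Decoder.default_config()
    config.set_string('-hmm', modeldir + '/en-us')   # acoustic model dir
    config.set_string('-lm', modeldir + '/vc.lm')
    config.set_string('-dict', modeldir + '/vc.dic')
    decoder = Decoder(config)
    decoder.start_utt()
    with open(wav_path, 'rb') as f:
        f.seek(44)      # skip the WAV header; data must be 16 kHz 16-bit mono PCM
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()
    hyp = decoder.hyp()
    return hyp.hypstr if hyp is not None else ''

if __name__ == '__main__' and len(sys.argv) > 1:
    print(decode(sys.argv[1]))
```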
I prefer the new API, and it is also the one that is recommended.
Any help is appreciated.
The scripts and wav-files.
I forgot to mention that I correctly uninstalled the old API before installing the new one.
It is "python ./new.py browser.wav".
Your audio files are too loud; the decoder needs some time to adapt to the volume of the speech it receives. If you decode all your files sequentially without resetting the decoder, or if you just set the -cmninit value to 66 in en-us/feat.params, it will decode all your utterances properly. Your files are even clipped to the maximum value; you had better reduce the recording level, and noise also has to be reduced.
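For the recording-level part, already-recorded files can be scaled down before decoding. A hypothetical helper (the function name and the scaling factor are illustration only, not from the attachment):

```python
import wave

import numpy as np

def scale_wav(in_path, out_path, factor=0.5):
    """Read a 16-bit PCM wav, scale its amplitude, and write it back."""
    with wave.open(in_path, 'rb') as win:
        params = win.getparams()
        frames = win.readframes(win.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16)
    scaled = (samples * factor).astype(np.int16)   # attenuate; avoids clipping
    with wave.open(out_path, 'wb') as wout:
        wout.setparams(params)
        wout.writeframes(scaled.tobytes())
```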
Thank you for your answer, Nickolay.
I will try them.
Reducing audio volume (to 50%) solved the problem. Now it works very well.
Of course, noise can still be eliminated with the following steps:
1. Fourier transform
2. Set the low- and high-frequency components of the array to zero
3. Inverse Fourier transform
Maybe the audio level should also be adjusted between steps 2 and 3 (in decibels) by simply multiplying the amplitudes of the remaining frequencies.
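Those three steps can be sketched with numpy as follows (the cut-off frequencies and the gain factor are arbitrary example values, not from the thread):

```python
import numpy as np

def fft_bandpass(samples, rate, low_hz=300.0, high_hz=3400.0, gain=1.0):
    """Crude band-pass filter: FFT, zero out-of-band bins, inverse FFT."""
    # 1. Fourier transform of the (float) sample array
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    # 2. set the low- and high-frequency bins to zero
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    # optional level adjustment by scaling the remaining amplitudes
    spectrum *= gain
    # 3. inverse Fourier transform back to the time domain
    return np.fft.irfft(spectrum, n=len(samples))
```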
Again, thank you for your help.
CMU-Sphinx (pocketsphinx) is wonderful.
This has no effect and is harmful for accuracy. Pocketsphinx does filtering internally by itself, and it simply does not consider the frequencies you filter out.
Thank you for your answer, Nickolay.
I misunderstood your previous comment: "also noise has to be reduced".
I thought you meant to apply filtering.
Now it is clear.