"It shouldn't be using OSS" — OSS is selected during sphinxbase configuration. You need to install the ALSA development headers, reconfigure sphinxbase and make sure ALSA is selected, then reinstall sphinxbase and then pocketsphinx. I also recommend trying more powerful boards for voice experiments.
Try Vosk https://github.com/alphacep/vosk-api/blob/master/java/demo/src/main/java/org/vosk/demo/DecoderDemo.java
Ok, you can try to remove -static-libgcc from the flags, it might help. Otherwise you'd better recompile pocketsphinx/sphinxbase with MinGW too, or use MSVC to compile your binary, since the libraries were built with MSVC. In general the rule is to use a single compiler for all the libraries and binaries in a project. On Windows they are not easily cross-compatible due to different runtimes.
I asked: MinGW OR Visual Studio?
Did you compile pocketsphinx_batch.exe and pocketsphinx.dll with MinGW or with Visual Studio?
Then you need to check the compilation parameters. It is something about the DLL, not about your code. It crashes as soon as it calls the first sphinxbase functions.
Looks good now, thank you! Attention to detail will help you in your programming career. As for the exit after start, it exits because it fails to find the DLLs (sphinxbase and pocketsphinx). They must be in the same folder where you run the program. You can use a DLL explorer to make sure the DLLs are present.
Please click edit button in your original post and format the code properly.
Can you format the code properly? You can use markdown syntax or just the WYSIWYG editor.
You can call speech.ad.stop_recording before calling Festival: https://github.com/bambocher/pocketsphinx-python/blob/769492da47e41b71e3dd57a6b033fbba79e57032/swig/sphinxbase/ad_base.i#L77
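A minimal sketch of what that could look like, assuming `speech` exposes the underlying sphinxbase Ad (audio device) object from the link above; the Festival call is a hypothetical placeholder:

```python
# Pause audio capture so Festival can use the sound device, then resume.
# Assumption: `speech` has an `ad` attribute wrapping the sphinxbase audio device,
# which provides start_recording() / stop_recording().
speech.ad.stop_recording()
speak_with_festival(text)      # hypothetical TTS call to Festival
speech.ad.start_recording()
```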
You can improve the accuracy of recognition with a custom language model: https://cmusphinx.github.io/wiki/tutoriallm/ and you can also adapt the model to your audio with acoustic model adaptation: https://cmusphinx.github.io/wiki/tutorialadapt/ In general you can get much better results, even without adaptation, with more advanced and modern toolkits than pocketsphinx (Vosk).
"I'm doing some work with Sphinxtrain" — it is very outdated technology these days. "Does the training process automatically search both spellings to find the best fit?" — No. "Or, must the transcript file be coded to the correct pronunciation (let's assume the subject enunciated the second pronunciation)?" — You can specify the proper pronunciation variant in the transcription file. You can also run a forced alignment stage, which will try to select the proper pronunciation, but accuracy is not guaranteed.
For example you can check jigasi transcription with vosk server: https://community.jitsi.org/t/jigasi-open-source-alternative-of-google-speech-to-text/20739
JSAPI has been dead for a long time. You can use Vosk: https://github.com/alphacep/vosk-api
"Is this project still being supported?" — No.
"If there is a resource that we can reference that it does not store data then that would be awesome." — The source code is our documentation.
https://cmusphinx.github.io/wiki/tutoriallmadvanced/
That one is very basic; you can try https://github.com/alphacep/vosk-api with https://alphacephei.com/vosk/models/vosk-model-small-pt-0.3.zip, it should work fine.
Pocketsphinx is very old technology and not very accurate. Consider Vosk: https://github.com/alphacep/vosk-api. You can use Vosk with Qt on Android, it is no problem to build the library and link it to Qt.
"frate: default is 100. This means the hop_length is 10 milliseconds, so the frames are generated at 0, 10, 20, ..., 990 ms, right? With -samprate = 16000, the hop_length is 160 samples. Is this correct?" — Yes.
"In each frame, what is the number of samples? The parameter wlen = 0.025625. I interpret this as framesize = 0.025625 seconds, that is 25.625 milliseconds = 410 samples (with 16 kHz sampling rate). Is this correct?" — Yes.
"Or, is it the nfft = 512 parameter that defines framesize as 512 samples?" — No...
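The arithmetic above, spelled out as a small check (plain Python, just the numbers from the answer):

```python
# Frame geometry for the default pocketsphinx front end at 16 kHz.
samprate = 16000
frate = 100                      # frames per second
wlen = 0.025625                  # analysis window length, seconds

hop = samprate // frate          # 160 samples = 10 ms between frame starts
frame = int(wlen * samprate)     # 410 samples = 25.625 ms per frame
nfft = 512                       # FFT size; the 410-sample frame is zero-padded to 512

print(hop, frame, nfft)          # 160 410 512
```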
You can check https://montreal-forced-aligner.readthedocs.io/en/latest/
There is voice activity detection which removes frames; you can add -remove_silence no to see the remaining ones.
"Where can I ask about Vosk?" — On GitHub or in the Telegram group.
Arduino is too slow, it cannot run any serious AI. Ubuntu OS is ok, though on Raspberry Pi you usually use a specialized distro called Raspbian. An RPi3 is ok, you can run Vosk on it and get good accuracy. For more advanced applications you'd better get an RPi4 though.
You have a nice application that should fit Vosk's capabilities. What is your problem then? You can download the library and use it.
Pocketsphinx is very old technology; it doesn't provide enough accuracy by modern standards. You can try Vosk instead, https://github.com/alphacep/vosk-api, with the daanzu English model.
"Does the language model restrict the recognition to its contents (words used to build the language model)?" — Yes. "Or can words outside of the LM (but included in the dictionary) still be recognized?" — No. "Would it be better to build my own LM (myLM) and use it with the en-us dictionary, or build a dictionary (myDict) from the same corpus and use the combination myLM and myDict?" — You have to update both the LM and the dictionary.
https://stackoverflow.com/questions/4727480/how-do-i-make-my-perl-scripts-act-like-normal-programs-on-windows
Also https://github.com/jimregan/wolnelektury-audio-corpus
Check https://github.com/danijel3/ClarinStudioKaldi
"I'd like to have the ability to write to a log file without all the fluff, just the capture. And I don't see any way to get pocketsphinx to stop listening, like pause on command, or take any commands once it's running." — This has to be done through the API, for example the Python API.
You can write a script in Python
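A sketch of the kind of Python script meant above: it writes only the recognized text to a log file using the pocketsphinx LiveSpeech helper (default models assumed; the log file name is just an example):

```python
# Log only the recognized phrases, nothing else.
from pocketsphinx import LiveSpeech

with open('capture.log', 'a') as log:
    for phrase in LiveSpeech():        # iterates over finished utterances
        log.write(str(phrase) + '\n')  # str(phrase) is the hypothesis text
        log.flush()
```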
Try thresholds 1e-20, 1e-30, 1e-40. I replied to you on Stack Overflow already.
"Is it possible to train a model within 10 days for 1400 hours of speech?" — Yes, it is perfectly possible on a machine like the one above.
"Thanks for your reply. I would prefer to adapt my model since I have less time. Do you think it would be possible to adapt the model with 1400 hours of audio?" — It is possible to adapt, but the accuracy will not be the best.
For 1400 hours it is better to train from scratch. 4 GB of memory is pretty low; you want at least 16 GB, better 64 GB. You also need a GPU for modern algorithms, at least a GTX 1080.
Try https://github.com/alphacep/vosk-api, it supports Portuguese.
Use vosk https://github.com/alphacep/vosk-api
For a word as short as "yes" keyword spotting is impossible because the false alarm rate is too high. You have to implement a full LVCSR recognizer with speaker separation and search in the results. The CMUSphinx tutorial is here: https://cmusphinx.github.io/wiki/tutorial
In probability calculations it is important to properly describe the probability spaces. Say you have position 1, that would be space A1, and the next word position 2, that would be space A2. You can write P("how are") = P(are|how) * P(how) and you can reduce it to P(how|are) * P(are) by Bayes' rule, but here you need to be careful: in P(how|are) the first word "how" is still from space A1 and the second word "are" is still from space A2, so you can not really replace it with P("are how"),...
"Thanks Nickolay. So if I want log P(are|how) I should type lm.log_p("how are") - lm.log_p("how"), correct?" — No. P("how are") is about the order, as I wrote above. So it is not simply P("how" & "are") but more like P("how" & "are" & "are follows how"), so lm.log_p("how are") - lm.log_p("how") is P(are | how & "are follows how"), not simply P(are|how). There is an extra term that must balance P("how are") and P("are how"). "I couldn't find the documentation that explained this. Can you point me to it?"...
And even the last thing is wrong, since the order of words is important: you need to adjust this with the probability of "are" following "how" vs "how" following "are".
It doesn't work like that. lm.log_p("how are") is not really log P(are|how), it is more an estimate of the probability of seeing both words together in a text corpus, i.e. log P(are|how) + log P(how).
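The relationship in that last reply, written out as a small sketch (lm is the hypothetical n-gram model object with the log_p() method used in this thread; the caveats about word order from the earlier replies still apply):

```python
# lm.log_p("how are") ~ log P("how") + log P("are" | "how")   -- joint estimate
# so a conditional estimate is obtained by subtraction, with the word order
# fixed as "how" followed by "are":
log_p_joint = lm.log_p("how are")
log_p_how = lm.log_p("how")
log_p_are_given_how = log_p_joint - log_p_how
```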
Well, if you are serious about this project you need a neural spotter, otherwise it's not going to work reliably. You can probably try https://github.com/hyperconnect/TC-ResNet, it supports tflite and should easily work on mobile. If you are not serious, just select a longer keyword and train a better Italian model. Still, you need Linux.
Adaptation doesn't improve the accuracy of keyword detection. For best detection the keyword should have 3-5 syllables; yours has 2. If you still want to keep your keyword, you'd better adopt something like Mycroft Precise, but you will need to record many more keyword samples. Windows is not suitable for any kind of speech work.
What are you trying to achieve overall? Do you want an English model?
"Yeah sure, thanks for your suggestion, I'm going to try Vosk, but is it possible to make my own LM and dict with Vosk like with pocketsphinx?" — Yes, sure, you can use the Kaldi toolkit for that.
The Raspberry Pi CPU is too slow to decode with such a configuration. You need to use a smaller acoustic and language model, or you can try the more modern Vosk library, https://github.com/alphacep/vosk-api; Vosk is much faster at large-vocabulary recognition.
It is not very easy but somewhat doable, you can check for details http://vpanayotov.blogspot.com/2012/06/kaldi-decoding-graph-construction.html
"How's a person supposed to learn it? Just keep coming here and asking questions?" — Yes. Many people can read code too, it is the best documentation. "So little information, and no tutorials. That's a bummer! I'd think there should be some good tutorials around on how to use and modify Vosk." — Eventually there will be some tutorials, but for now the speed of development of the technology makes it very hard to create extensive documentation. You can also check https://groups.google.com/d/topic/kaldi...
"All the ALSA lib comments are coming from PyAudio. So I'll need to look into how I can turn those off." — You need to clean up the ALSA config, see https://stackoverflow.com/questions/7088672/pyaudio-working-but-spits-out-error-messages-each-time
"Do I then need to dig that result out of there to put it into a string I can actually use?" — It is JSON, you can parse it with json.loads:
import json
result = json.loads(rec.Result())
text = result['text']
"Is there any documented source code for Vosk?" — No.
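A fuller sketch of that parsing step in context (assumptions: `model` is a loaded vosk.Model and `stream` yields raw 16 kHz 16-bit mono audio chunks, e.g. from PyAudio):

```python
import json

from vosk import KaldiRecognizer

# model and stream are assumed to exist already (see the lead-in above).
rec = KaldiRecognizer(model, 16000)
while True:
    data = stream.read(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):               # True once an utterance is final
        text = json.loads(rec.Result())['text']
        print(text)
```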
"How do I shut off all the other messages?" — Add config.set_string('-logfn', '/dev/null'), see also https://stackoverflow.com/questions/17825820/how-do-i-turn-off-e-info-in-pocketsphinx
"I'd really like to find documentation on the PocketSphinx source code so I can see what the methods I call actually do, and what parameters I can send them to do things like telling them not to print messages." — There is C documentation here: https://cmusphinx.github.io/doc/pocketsphinx/files.html. There is no Python documentation,...
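In context, a sketch of suppressing the log output when building the decoder yourself (the model and dictionary paths are placeholders):

```python
# Send pocketsphinx logging to /dev/null instead of stderr.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/en-us')                 # placeholder acoustic model path
config.set_string('-dict', '/path/to/cmudict-en-us.dict')   # placeholder dictionary path
config.set_string('-logfn', '/dev/null')                    # silence the INFO messages
decoder = Decoder(config)
```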
"ImportError: libgfortran.so.3: cannot open shared object file: No such file or directory" — You need to install libgfortran.so.3 with sudo apt-get install libgfortran3. "Is there anyone still alive who can answer questions about PocketSphinx? I'd rather be using pocketsphinx to be honest. It was looking really promising for my specific project." — You are welcome to ask.
"What exactly is Vosk, and why is it needed? Isn't Kaldi supposed to be the SRE?" — If you simply want to use a speech recognizer from Python, you can use the Vosk prepackaged wheels and models. Kaldi is more a system for speech researchers, with a complex install, API and usage.
Try pip3 install https://github.com/alphacep/vosk-api/releases/download/0.3.3/vosk-0.3.3-cp36-cp36m-linux_aarch64.whl
"This feels like going back to square one moving over to vosk-kaldi. I hope it's worth it." — Absolutely! Let me know if you have further questions.
"I don't even know what it is, or how to use it?" — You are welcome to ask. "Does Vosk use pocketsphinx?" — No. "If not, what exactly is Vosk? And where do I find detailed information on it beyond that GitHub page, especially in terms of tutorials?" — It is a software library to recognize speech, just like pocketsphinx. "I'm already having close to 100% accuracy with PocketSphinx. It's been decoding everything I throw at it with near perfection. Perhaps it likes the way I speak?" — If it is perfect already, what...
Try https://github.com/alphacep/vosk-api, it is much more accurate.
Modern DNN recognizers mostly use log-mel filterbanks.
No. And LPC is not that good for ASR unfortunately because a lot of information is in the residual.
Good. Try vosk-api in the meantime.
It is a crash due to a runtime mismatch, as described in https://cmusphinx.github.io/wiki/faq/#q-pocketsphinx-crashes-on-windows-in-_lock_file You need to check how your Visual Studio updated the project files; it most likely broke them.
I recommend trying vosk-api, a modern toolkit with higher accuracy: http://github.com/alphacep/vosk-api The installation on an RPi4 is simple: pip3 install vosk. Code samples are here: https://github.com/alphacep/vosk-api/tree/master/python/example
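For orientation, a minimal sketch of decoding a 16 kHz mono WAV file with Vosk (the model directory name "model" and the file name are assumptions; download a model from the page above and unpack it next to the script):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open('test.wav', 'rb')           # 16 kHz, 16-bit, mono WAV
model = Model('model')                     # unpacked model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())['text'])
```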
"As far as I can tell, -DMODELDIR invokes QT." — It doesn't. It just computes the header paths and passes them to the compiler. You can run
pkg-config --cflags --libs pocketsphinx sphinxbase
to see what it outputs. "My next step was to try to compile gcc -o hello_ps hello.c" — You can't do that, you need the header paths from pkg-config. You can substitute the pkg-config result yourself though, for example gcc -o hello_ps hello.c $(pkg-config --cflags --libs pocketsphinx sphinxbase).
Please provide more information on the problem to get help (steps, logs, errors, environment)
vosk is faster than pocketsphinx on compatible hardware
It's not about a config value, it is more about the code. Overall, for Asterisk vosk-api will be much more accurate; I recommend you try it.
start detection threshold setting
See the answer at https://sourceforge.net/p/cmusphinx/discussion/help/thread/e241c19421/?
This? -vad_startspeech 10 — the number of speech frames that triggers the VAD transition from silence to speech.
It does
sphinx4 is very old, try vosk-api: https://github.com/alphacep/vosk-api/tree/master/java
Offline model update: https://github.com/alphacep/vosk-api/blob/master/doc/model.md Online words list: https://github.com/alphacep/vosk-api/blob/master/python/example/test_words.py
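As an illustration of the words-list approach linked above, a sketch of constraining the recognizer to a fixed phrase list (the phrases and the model directory name are examples; audio feeding as in the Vosk samples):

```python
import json

from vosk import Model, KaldiRecognizer

model = Model('model')   # unpacked model directory (assumption)
# Third argument: JSON list of allowed phrases; "[unk]" catches everything else.
rec = KaldiRecognizer(model, 16000,
                      '["turn on the light", "turn off the light", "[unk]"]')

# Feed 16 kHz 16-bit mono audio chunks to rec.AcceptWaveform(...) as usual,
# then read json.loads(rec.FinalResult())["text"].
```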
Your audio is 8 kHz; you should try the 8 kHz model to get good recognition accuracy.
On Windows install vosk-api with pip3 install https://github.com/alphacep/vosk-api/releases/download/0.3.3/vosk-0.3.3-cp37-cp37m-win_amd64.whl To get help on accuracy, share an audio file recorded from your microphone so the accuracy issues can be reproduced.
You can share audio samples, but most likely your Bluetooth recordings have 8 kHz bandwidth and require the 8 kHz model. You can also get much better accuracy with vosk-api instead of LiveSpeech.
"I have successfully run the dialogue demo on my Mac, but I don't quite understand what the result of the demo is. Can it recognize the speech I speak into the microphone? I do speak, by the way, but nothing happens (haha, a bit awkward, I speak several times and nothing happens). Probably your microphone is muted/malfunctioning — should I do some audio check?" — Yes.
Install Linux.
quick_lm only creates the LM. For the dictionary you need a custom tool, since g2p doesn't work for Chinese. You can probably get some inspiration from https://cc-cedict.org/wiki/
Speaker identification is not supported in pocketsphinx. We recently added speaker ID to the Vosk library, you can use it instead: https://github.com/alphacep/vosk-api/blob/master/python/example/test_speaker.py
Yes, you can. To install it on the RPi simply run pip3 install vosk
Speaker identification is not supported in pocketsphinx. We recently added speaker ID to the Vosk library, you can use it instead: https://github.com/alphacep/vosk-api/blob/master/python/example/test_local_speaker.py
Hypothesis.text empty or whitespace
If you listen to the mp3 file you'll hear it is clearly corrupted by sample rate conversions, and so are the other files. Same for the wav file. The raw file you shared is 44100 Hz; with proper sample rate configuration it decodes fine. You can try with the command line first instead of TLSphinx.
Same as https://sourceforge.net/p/cmusphinx/bugs/490/
If you want to reject other words, you need keyword spotting mode, not an LM. You need to tune the thresholds in the keyphrase.list file for reliable detection, but I doubt you will be able to do it with such similar keywords. For more accurate recognition you might try vosk-api: https://github.com/alphacep/vosk-api
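A sketch of what that keyword spotting setup could look like in the Python binding; the phrases, thresholds and paths are placeholders illustrating the keyphrase.list format (one phrase per line, threshold between slashes):

```python
# keyphrase.list (example content):
#   hey computer /1e-20/
#   stop listening /1e-30/
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/en-us')                # placeholder acoustic model
config.set_string('-dict', '/path/to/cmudict-en-us.dict')  # placeholder dictionary
config.set_string('-kws', 'keyphrase.list')                # keyword spotting mode, no LM
decoder = Decoder(config)
```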
The ps_get_hyp function returns the current result, final or intermediate, whenever you call it: https://cmusphinx.github.io/doc/pocketsphinx/pocketsphinx_8h.html#ada74b12d71e9d4db5d959b94004ff812
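In the Python binding the equivalent is the hyp() method; a brief sketch (decoder is assumed to be an already configured pocketsphinx Decoder inside an utterance):

```python
# hyp() returns None until there is something to report, otherwise a hypothesis object.
hyp = decoder.hyp()
if hyp is not None:
    print(hyp.hypstr)   # the current (partial or final) recognized text
```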
Ok then. A timeout is perfectly possible, you just have to implement it yourself in your code. I do not see any problem here: just count the bytes processed and return once you have enough.
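A sketch of that byte-counting approach (assumptions: `stream` yields raw 16 kHz 16-bit mono audio and `decoder` is an already configured pocketsphinx Decoder):

```python
# Stop decoding after roughly 10 seconds of audio.
MAX_BYTES = 16000 * 2 * 10          # samples/sec * bytes/sample * seconds
processed = 0

decoder.start_utt()
while processed < MAX_BYTES:
    buf = stream.read(1024)
    if not buf:
        break
    decoder.process_raw(buf, False, False)  # no_search=False, full_utt=False
    processed += len(buf)
decoder.end_utt()
```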
I'm sorry, it is hard to understand the purpose of the system and give you advice. You simply experience accuracy issues. Asterisk is a bad choice here since it is limited to 8 kHz, which is less accurate than 16 kHz. Another thing would be to have a more accurate system based on neural networks. https://github.com/alphacep/vosk-api should work on an RPi if that is your embedded system. The German model is here: https://github.com/alphacep/kaldi-android-demo/releases/download/2020-01/alphacep-model-android-de-zamia-0.3.tar.gz...
You can create the dictionary before alignment with g2p library like phonetisaurus and then align.
It depends on your problem; please elaborate on what exactly you are trying to achieve.
Case matters: "James" should be lowercase, since our dictionary is all lowercase. Unfortunately you can not recognize a word if it is missing from the dictionary, not with pocketsphinx at least. Some modern recognizers allow vocabulary-free recognition though.
Yes, without a dataset of the size specified in the tutorial it will not work. You can probably record more data and use fewer parameters in the model. There is a lot of research on low-resourced languages, I am just not sure you'll be able to apply it. For example, you can also add English data to training as a helper dataset and use a common phoneset.
Care to say what is your super secret language?
Not necessarily, you can take arbitrary annotated recordings.
No, until you have at least 50 hours of data.
You see the instability of results due to the small amount of training data.
The audio is UK English, not US English, and includes noise and 8 kHz telephony speech. Phonetic recognition of this would be problematic without an accurate model. Even with the best neural network models phonetic recognition is not very accurate, because it is hard to recognize very loosely defined phonemes in continuous speech. If you want to deal with audio like this, you'd better recognize it with a UK large-vocabulary model and then convert to phonemes.