Hi Community,
I would like your help or advice for connecting the sound source separation capabilities of HARK-kinect as described here: https://sourceforge.net/p/cmusphinx/discussion/help/thread/57979bf4/ to Sphinx or Pocketsphinx using socket communication.
My current setup starts with a configured HARK node that produces Sphinx-compliant .wav files of the separated detected speech sources and writes them to a folder. A listener script then looks for new files in this folder and starts a sphinx4 Java client with this file and a grammar file. The results are pretty good, but this is ugly programming and I would like to up the game and let HARK and Sphinx talk through socket communication. How would I do this?
1: HARK is capable of outputting MFCC features over a socket, so could Sphinx use these MFCC features? Are these compatible? And would there be a built-in function that listens on a socket for these features?
I have taken (just a) look at https://github.com/alumae/ruby-pocketsphinx-server, and using Gstreamer this might be a possible framework to accomplish this, but I don't know.
2: The other possibility would be streaming the audio data over a socket to Sphinx and letting Sphinx deal with the whole processing side. Ideally I would like to be able to compare these two methods in terms of recognition rate, but for now I'm really interested in getting either to work.
So, in short: how can I let Sphinx listen on a port, and what kind of data can Sphinx handle (audio data, MFCC features)? And would I need Gstreamer for that, or is there something in Sphinx4 that can deal with this? For now I will be working on the HARK end of the system, getting some data flowing, but for the Sphinx part I would really like your help.
Thanks in advance,
Ben
ps: the only relevant thing I could find on connecting HARK to PocketSphinx was a slide describing a proposed framework. (I will put the link here, I've lost it at the moment)
Last edit: Ben 2014-09-13
Hello Ben
No, those features are not compatible with our models. You need to send raw audio.
It is not really reasonable to compare the recognition rate, it's going to be the same.
As for streaming the data to the socket, you can do it with sphinx4 or pocketsphinx; you can use gstreamer or work without it like the julius server does, there is not much difference. Our gstreamer plugin is unfortunately pretty outdated, both on the gstreamer side and the pocketsphinx side. If you are using Python you can use the Pocketsphinx Python bindings.
There is no magic: you set up a TCP server, listen for the data, process it with the decoder and return the result. If you are using ROS we might update the ROS pocketsphinx plugin for you.
Hi Nickolay,
It would seem that you can read my mind ;) As it happens, I'm currently working with Python for testing purposes, but I'm planning to port the whole thing to ROS to improve the speech recognition for our home service robot.
So I will be going for the Pocketsphinx Python bindings, but if you would find the time to update the ROS plugin, that would be awesome. I'm sure I can find enough information in the docs.
Either way, I'm planning to post an update here when I've made some progress.
Ben
Ok, you can find a Python example here to see how to use the latest Python bindings:
http://cmusphinx.sourceforge.net/2014/08/python-decoding-example/
Please let me know if you have questions; the docs are pretty small, but you can ask here or on the #cmusphinx IRC channel.
Just an update: I have found this Hark node which should make life easier: HarkDataStreamSender. http://winnie.kuis.kyoto-u.ac.jp/HARK/document/2.0.0/hark-document-en/subsec-HarkDataStreamSender.html
Unfortunately Hark seems unable to stream the sound data directly, but rather sends a packet with the src_info (the location and ID of the source) and the src_wav (the bit we're really interested in getting to Sphinx).
I'm having trouble decoding the signal; I'm basically only getting gibberish from the port. So that's great [sarcasm]. I've sent an email to Hark support to help me actually use the node they created. I'm sure I'm just doing something obviously wrong.
Hark is coded mainly in C and C++, but that shouldn't affect the readability, right?
Well, you probably need to provide the code you already wrote and a dump of the data you are receiving. It's hard to help you without seeing the code.
I don't think you need raw data instead of packets, packets are usually better since they contain additional information.
I appreciate you thinking along. Unfortunately Hark is not recognizing my kinect at the moment, so, I'll be fixing that first.
EDIT: for future reference, if one has followed the steps to troubleshoot the kinect not showing up as a sound device: please make sure that the power cable is plugged in -_-
Last edit: Ben 2014-09-14
You were right, decoding the stream wasn't necessary in order to connect HarkDataStreamSender to PocketSphinx, but something still goes wrong. Please listen to these separated sound sources created by my Hark network file:
I know they sound a bit soft, but I wouldn't expect them to yield these results:
I used the tutorial from mattze96, but to be sure, this is the code for listener.py:
Could it be that, since the documentation of HarkDataStreamSender mentions that the first few bytes are ID and location data, these bytes need to be ignored/skipped?
Ok, looks good
Please provide pocketsphinx output when you decode the stream
Please add to decoder initialization:
Then it will store raw data you pass to recognizer into raw files. Please share the raw files stored by decoder.
Unfortunately, there's no .RAW file when using this script, but running "pocketsphinx_continuous -inmic yes -rawlogdir 'logs'" does output something. On the last line you can see that it tries to save to logs/.raw, when instead it should be logs/00000000.raw.
Is there a conflict with the config settings in the hmm?
And I tried to say 'hello'.
EDIT: terminal output as attachment
EDIT2: the naming of the raw file depends on what's used in start_utt(''); in this case '.raw' as the filename was to be expected. FIX: changed it to start_utt('rawlog')
Last edit: Ben 2014-09-14
Ok, it seems that you receive big-endian data over network. Try to convert it to little endian or just add "-input_endian big" to decoder config.
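If the conversion has to happen on the receiving side instead, swapping the byte order of 16-bit samples is short in Python. A sketch, assuming the stream really is 16-bit samples:

```python
import array

def swap16(data):
    """Swap the byte order of a buffer of 16-bit samples
    (big-endian <-> little-endian)."""
    samples = array.array("h", data)
    samples.byteswap()
    return samples.tobytes()
```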
Okay, I added the -input_endian line, to no avail. The result is no better and the RAW file still isn't being written. I will look up whether I can change it to little endian in Hark.
Again the output when saying hello:
EDIT: output of terminal as Attachment for readability
Last edit: Ben 2014-09-14
To write the utt, you need to change decoder.start_utt('') to something meaningful like decoder.start_utt('something').
It also might be possible that hark sends float data instead of shorts. Then you have to convert float to short.
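If it is float data, the conversion might look like this. A sketch assuming packed little-endian 32-bit floats normalized to -1.0..1.0; if HARK sends float samples already scaled to the 16-bit range, drop the 32767 scaling:

```python
import struct

def float_to_short(data):
    """Convert packed little-endian 32-bit floats in [-1.0, 1.0] to
    packed little-endian 16-bit signed samples."""
    n = len(data) // 4
    floats = struct.unpack("<%df" % n, data[:n * 4])
    # Scale to the int16 range and clamp against overflow.
    shorts = [max(-32768, min(32767, int(f * 32767.0))) for f in floats]
    return struct.pack("<%dh" % n, *shorts)
```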
As the table shows, the HDH_SrcData.data_bytes are short integers. I will try something with start_utt
So I got the raw file; can you make anything of it? Using aplay, it sounds corrupted enough:
Last edit: Ben 2014-09-14
Well, I checked the hark code and the docs above; it sends a binary stream with many fields. The first few bytes are a structure:
~~~~~~~~~~~
// Header for one cycle data
typedef struct tag_HD_Header {
    int type;         // variable for bit flag
    int advance;      // shift length
    int count;        // frame ID of HARK
    int64_t tv_sec;   // timestamp of HARK in seconds
    int64_t tv_usec;  // timestamp of HARK in micro-seconds
} HD_Header;
~~~~~~~~~~~

and 0xc is type PACKET_SRC_INFO | PACKET_SRC_WAVE.

Then comes the source info:

~~~~~~~~~~~
// Header for source information
typedef struct tag_HDH_SrcInfo {
    int src_id;   // sound source id
    float x[3];   // position of sound source
    float power;  // power of sound source
} HDH_SrcInfo;
~~~~~~~~~~~
You need to parse those structures to properly parse the stream and extract the audio data from it. You cannot feed this raw data as-is to pocketsphinx.
In a simplified version your code should look like this:
~~~~~~~~~~
header = recv(0x40)               # header length
data_size = last_int_from_header  # must be 0x140, i.e. 320 bytes
data = recv(data_size)            # read the 320 bytes of audio
decoder.process_raw(data)
~~~~~~~~~~
You can use struct.unpack_from(fmt, buffer[, offset=0]) to unpack header data.
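Putting the two structures together, here is a sketch of the unpacking. The `<iiiqq` and `<i3ff` formats are my assumption of a packed little-endian layout with no padding; verify the actual byte offsets against the HarkDataStreamSender documentation.

```python
import struct

# Assumed packed little-endian layout with no padding; check the real
# wire format (alignment, extra fields) against the HARK docs.
HD_HEADER_FMT = "<iiiqq"  # type, advance, count, tv_sec, tv_usec
SRC_INFO_FMT = "<i3ff"    # src_id, x[3], power

def parse_header(buf, offset=0):
    """Unpack an HD_Header; returns (fields, next_offset)."""
    type_, advance, count, tv_sec, tv_usec = struct.unpack_from(
        HD_HEADER_FMT, buf, offset)
    fields = {"type": type_, "advance": advance, "count": count,
              "tv_sec": tv_sec, "tv_usec": tv_usec}
    return fields, offset + struct.calcsize(HD_HEADER_FMT)

def parse_src_info(buf, offset=0):
    """Unpack an HDH_SrcInfo; returns (fields, next_offset)."""
    src_id, x0, x1, x2, power = struct.unpack_from(SRC_INFO_FMT, buf, offset)
    fields = {"src_id": src_id, "x": (x0, x1, x2), "power": power}
    return fields, offset + struct.calcsize(SRC_INFO_FMT)
```

Each parser returns the next offset so the SRC_WAVE bytes can be sliced out right after the headers.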
Thank you for the simple example; I had been wildly googling for network-decoding examples.
And thank you very much for the help so far!!
Using the raw file, some webpages[1][2] and especially your help I finally decoded the stream and learned some new things on the way.
This is an output of just one of the packets, as can be generated by using recvbin.py while first executing sendbin.py
The tricky bit was finding out the byte size of an int64, which turned out to be a signed long long int, or 'q' as a format character.
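For reference, the standard struct sizes involved here (these apply whenever an explicit '<' or '>' byte-order prefix is given):

```python
import struct

# Standard (non-native) sizes, used when a byte-order prefix is present:
assert struct.calcsize("<i") == 4  # C int           -> 'i'
assert struct.calcsize("<q") == 8  # int64_t         -> signed long long, 'q'
assert struct.calcsize("<f") == 4  # float           -> 'f'
assert struct.calcsize("<h") == 2  # 16-bit sample   -> short, 'h'
```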
While this decoding definitely works for non-simultaneously detected sound sources, it currently only decodes the SRC_INFO and SRC_WAVE. I'm sure these scripts can be extended to get the other outputs as well, but I'm curious to see what happens when we introduce simultaneous speakers to the system.
NB: with the old setup, where the separated files were being written, I didn't need to worry about simultaneous signals; they would wind up in their own files to be processed later anyway.
So, let's see whether Sphinx likes its new input.
With some adjustments (it turns out that when there are no sources, there is no source data, who would have guessed) Sphinx is guessing away at my short utterances.

[output listener.py]
Connected with Hark @ 127.0.0.1:34761
Azimuth: -24.9999111771 (right) Source 0 Result: all right
Azimuth: 24.9999111771 (left) Source 2 Result: that occur
Azimuth: -19.9994597863 (right) Source 4 Result: so
Azimuth: 29.9993791202 (left) Source 5 Result: you know
Azimuth: 29.9993791202 (left) Source 7 Result: hello me
Azimuth: -19.9994597863 (right) Source 8 Result: follow me
With the technical bit out of the way, I can focus on improving recognition rates, on both the Hark and Sphinx sides. But that's another task and perhaps another topic if I need more help. And again, thanks for the help :)
Ben
ps: is there a way to hide the "Current configuration" printed at the beginning?
Nice.
For the best accuracy use en-us generic acoustic model:
http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/en-us.tar.gz/download
And create a focused JSGF grammar or LM:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
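For example, a tiny JSGF grammar for a handful of robot commands might look like this (the words here are illustrative):

```
#JSGF V1.0;
grammar commands;

public <command> = <action> [ <object> ];
<action> = follow | stop | bring;
<object> = me | the cup;
```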
add "-logfn /dev/null" to config.
Hi, sorry to bother you,
But I had the line set in the config file already. Running "pocketsphinx_continuous -inmic yes -logfn 'null'" doesn't yield this output.
But when using the script, I do get it, and I think I see why. The configuration line (INFO: cmd_ln.c(696): Parsing command line:) is output by calling decoder = Decoder.default_config().
Which makes sense, because this is called before adding the line to the config. The config hasn't been altered yet.
Can I pass some argument to Decoder.default_config() to make it silent too? Or perhaps can I set the default?
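In the meantime, a workaround that should silence it is redirecting stderr at the file-descriptor level around that one call. A stdlib-only sketch (this is a general trick, not a pocketsphinx option):

```python
import os
import sys

class SuppressStderr:
    """Temporarily redirect file descriptor 2 (stderr) to /dev/null,
    silencing output printed by native libraries as well."""
    def __enter__(self):
        sys.stderr.flush()
        self._saved = os.dup(2)  # remember the real stderr
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, 2)      # point fd 2 at /dev/null
        os.close(devnull)
        return self

    def __exit__(self, *exc):
        os.dup2(self._saved, 2)  # restore the real stderr
        os.close(self._saved)
        return False

# Hypothetical usage around the noisy call:
# with SuppressStderr():
#     config = Decoder.default_config()
```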
Ben