Hi Community,
I would like your help or advice for connecting the sound source separation capabilities of HARK-kinect as described here: https://sourceforge.net/p/cmusphinx/discussion/help/thread/57979bf4/ to Sphinx or Pocketsphinx using socket communication.
My current setup starts with a configured HARK node that produces Sphinx-compliant .wav files of the separated detected speech sources and writes them to a folder. A listener script then looks for new files in this folder and starts a sphinx4 Java client with this file and a grammar file. The results are pretty good, but this is ugly programming and I would like to up the game and let HARK and Sphinx talk through socket communication. How would I do this?
1: HARK is capable of outputting MFCC features over a socket, so could Sphinx use these MFCC features? Are these compatible? And would there be a built-in function that listens on a socket for these features?
I have taken (just a) look at https://github.com/alumae/ruby-pocketsphinx-server, and using Gstreamer this might be a possible framework to accomplish this, but I don't know.
2: The other possibility would be streaming the audio data over a socket to Sphinx and letting Sphinx deal with the whole processing side. Ideally I would like to be able to compare these two methods in terms of recognition rate, but for now I'm really interested in getting either to work.
So, in short: how can I let Sphinx listen on a port, and what kind of data can Sphinx handle (audio data, MFCC features)? And would I need Gstreamer for that, or is there something in Sphinx4 that can deal with this? For now I will be working on the HARK end of the system, getting some data flowing, but for the Sphinx part I would really like your help.
Thanks in advance,
Ben
ps: the only relevant thing I could find on connecting HARK to PocketSphinx was a slide describing a proposed framework. (I will put the link here, I've lost it at the moment)
Last edit: Ben 2014-09-13
Hello Ben
No, those features are not compatible with our models. You need to send raw audio.
It is not really reasonable to compare the recognition rate, it's going to be the same.
As for streaming the data to the socket, you can do it with sphinx4 or pocketsphinx; you can use gstreamer or work without it like the julius server does, there is not much difference. Our gstreamer plugin is unfortunately pretty outdated, both on the gstreamer side and the pocketsphinx side. If you are using Python you can use the Pocketsphinx Python bindings.
There is no magic: you set up a TCP server, listen for the data, process it with the decoder and return the result. If you are using ROS we might update the ROS pocketsphinx plugin for you.
Hi Nickolay,
It would seem that you can read my mind ;) As it happens, I'm currently working with Python for testing purposes, but I'm planning to port the whole thing to ROS to improve the speech recognition for our home service robot.
So I will be going for the Pocketsphinx Python bindings, but if you would find the time to update the ROS plugin, that would be awesome. I'm sure I can find enough information in the docs.
Either way, I'm planning to post an update here when I've made some progress.
Ben
Ok, you can find a Python example here to see how to use the latest Python bindings:
http://cmusphinx.sourceforge.net/2014/08/python-decoding-example/
Please let me know if you have questions; the docs are pretty small, but you can ask here or on the #cmusphinx IRC channel.
Just an update: I have found this Hark node which should make life easier: HarkDataStreamSender. http://winnie.kuis.kyoto-u.ac.jp/HARK/document/2.0.0/hark-document-en/subsec-HarkDataStreamSender.html
Unfortunately Hark seems unable to stream the sound data directly, but rather sends a packet with the src_info (the location and ID of the source) and the src_wav (the bit we're really interested in getting to Sphinx).
I'm having trouble decoding the signal; I'm basically only getting gibberish from the port. So that's great [sarcasm]. I've sent an email to Hark support to help me actually use the node they created. I'm sure I'm just doing something obviously wrong.
Hark is coded mainly in C and C++, but that shouldn't affect the readability, right?
Well, you probably need to provide the code you already wrote and a dump of the data you are receiving. It's hard to help you without seeing the code.
I don't think you need raw data instead of packets, packets are usually better since they contain additional information.
I appreciate you thinking along. Unfortunately Hark is not recognizing my kinect at the moment, so, I'll be fixing that first.
EDIT: for future reference, if one has followed the steps to troubleshoot the kinect not showing up as a sound device: please make sure that the power cable is plugged in -_-
Last edit: Ben 2014-09-14
You were right, decoding the stream wasn't necessary in order to connect HarkDataStreamSender to PocketSphinx, but something still goes wrong. Please listen to these separated sound sources created by my Hark network file:
I know they sound a bit soft, but I wouldn't expect them to yield these results:
I used the tutorial from mattze96, but to be sure, this is the code for listener.py:
Could it be that, since the documentation of HarkDataStreamSender mentions that the first few bytes are ID and location data, these bytes need to be ignored/skipped?
Ok, looks good
Please provide pocketsphinx output when you decode the stream
Please add to decoder initialization:
Then it will store raw data you pass to recognizer into raw files. Please share the raw files stored by decoder.
Unfortunately, there's no .RAW file when using this script, but running "pocketsphinx_continuous -inmic yes -rawlogdir 'logs'" does output something. On the last line you can see that it tries to save to logs/.raw, when instead it should be logs/00000000.raw.
Is there a conflict with the config settings in the hmm?
And I tried to say 'hello'.
EDIT: terminal output as attachment
EDIT2: the naming of the raw file depends on what's used in start_utt(''); in this case '.raw' as the filename was to be expected. FIX: changed it to start_utt('rawlog')
Last edit: Ben 2014-09-14
Ok, it seems that you receive big-endian data over network. Try to convert it to little endian or just add "-input_endian big" to decoder config.
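If the conversion has to happen on the receiving side instead, swapping the byte order of 16-bit samples is short in Python. A sketch, assuming the stream really is 16-bit samples:

```python
import array

def swap16(data):
    """Swap the byte order of a buffer of 16-bit samples
    (big-endian <-> little-endian)."""
    samples = array.array("h", data)
    samples.byteswap()
    return samples.tobytes()
```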
Okay, I added the -input_endian line, to no avail. The result is no better and the RAW file still isn't being written. I will look up whether I can change it to little endian in Hark.
Again the output when saying hello:
EDIT: output of terminal as Attachment for readability
Last edit: Ben 2014-09-14
To write the utt, you need to change decoder.start_utt('') to something meaningful like decoder.start_utt('something').
It also might be possible that hark sends float data instead of shorts. Then you have to convert float to short.
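If it is float data, the conversion might look like this. A sketch assuming packed little-endian 32-bit floats normalized to -1.0..1.0; if HARK sends float samples already scaled to the 16-bit range, drop the 32767 scaling:

```python
import struct

def float_to_short(data):
    """Convert packed little-endian 32-bit floats in [-1.0, 1.0] to
    packed little-endian 16-bit signed samples."""
    n = len(data) // 4
    floats = struct.unpack("<%df" % n, data[:n * 4])
    # Scale to the int16 range and clamp against overflow.
    shorts = [max(-32768, min(32767, int(f * 32767.0))) for f in floats]
    return struct.pack("<%dh" % n, *shorts)
```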
As the table shows, the HDH_SrcData.data_bytes are short integers. I will try something with start_utt
So I got the raw file; can you make anything of it? Using aplay, it sounds corrupted enough:
Last edit: Ben 2014-09-14
Well, I checked the hark code and the docs above; it sends a binary stream with many fields. The first few bytes are a structure:
~~~~~~~~~~~
// Header for one cycle data
typedef struct tag_HD_Header {
    int type;         // variable for bit flag
    int advance;      // shift length
    int count;        // frame ID of HARK
    int64_t tv_sec;   // timestamp of HARK in seconds
    int64_t tv_usec;  // timestamp of HARK in micro-seconds
} HD_Header;
~~~~~~~~~~~

and 0xc is type PACKET_SRC_INFO | PACKET_SRC_WAVE.

Then comes the source info:

~~~~~~~~~~~
// Header for source information
typedef struct tag_HDH_SrcInfo {
    int src_id;   // sound source id
    float x[3];   // position of sound source
    float power;  // power of sound source
} HDH_SrcInfo;
~~~~~~~~~~~
You need to parse those structures to properly parse the stream and extract the audio data from it. You cannot feed this raw data as-is to pocketsphinx.
In a simplified version your code should look like this:
~~~~~~~~~~
header = recv(0x40)               # header length
data_size = last_int_from_header  # must be 0x140, i.e. 320 bytes
data = recv(data_size)            # read the 320 bytes of audio
decoder.process_raw(data)
~~~~~~~~~~
You can use struct.unpack_from(fmt, buffer[, offset=0]) to unpack header data.
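Putting the two structures together, here is a sketch of the unpacking. The `<iiiqq` and `<i3ff` formats are my assumption of a packed little-endian layout with no padding; verify the actual byte offsets against the HarkDataStreamSender documentation.

```python
import struct

# Assumed packed little-endian layout with no padding; check the real
# wire format (alignment, extra fields) against the HARK docs.
HD_HEADER_FMT = "<iiiqq"  # type, advance, count, tv_sec, tv_usec
SRC_INFO_FMT = "<i3ff"    # src_id, x[3], power

def parse_header(buf, offset=0):
    """Unpack an HD_Header; returns (fields, next_offset)."""
    type_, advance, count, tv_sec, tv_usec = struct.unpack_from(
        HD_HEADER_FMT, buf, offset)
    fields = {"type": type_, "advance": advance, "count": count,
              "tv_sec": tv_sec, "tv_usec": tv_usec}
    return fields, offset + struct.calcsize(HD_HEADER_FMT)

def parse_src_info(buf, offset=0):
    """Unpack an HDH_SrcInfo; returns (fields, next_offset)."""
    src_id, x0, x1, x2, power = struct.unpack_from(SRC_INFO_FMT, buf, offset)
    fields = {"src_id": src_id, "x": (x0, x1, x2), "power": power}
    return fields, offset + struct.calcsize(SRC_INFO_FMT)
```

Each parser returns the next offset so the SRC_WAVE bytes can be sliced out right after the headers.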
Thank you for the simple example; I had been wildly googling for network-decoding examples.
And thank you very much for the help so far!!
Using the raw file, some webpages[1][2] and especially your help I finally decoded the stream and learned some new things on the way.
This is an output of just one of the packets, as can be generated by using recvbin.py while first executing sendbin.py
The tricky bit was finding out the byte size of an int64, which turned out to be a signed long long int, or 'q' as a format character.
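For reference, the standard struct sizes involved here (these apply whenever an explicit '<' or '>' byte-order prefix is given):

```python
import struct

# Standard (non-native) sizes, used when a byte-order prefix is present:
assert struct.calcsize("<i") == 4  # C int           -> 'i'
assert struct.calcsize("<q") == 8  # int64_t         -> signed long long, 'q'
assert struct.calcsize("<f") == 4  # float           -> 'f'
assert struct.calcsize("<h") == 2  # 16-bit sample   -> short, 'h'
```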
While this decoding definitely works for non-simultaneously detected sound sources, it currently only decodes the SRC_INFO and SRC_WAVE. I'm sure these scripts can be extended to get the other outputs as well, but I'm curious to see what happens when we introduce simultaneous speakers to the system.
NB: with the old setup, where the separated files were being written, I didn't need to worry about simultaneous signals; they would wind up in their own files to be processed later anyway.
So, let's see whether Sphinx likes its new input.
With some adjustments (it turns out that when there are no sources, there is no source data, who would have guessed) Sphinx is guessing away at my short utterances.

[output listener.py]
Connected with Hark @ 127.0.0.1:34761
Azimuth: -24.9999111771 (right) Source 0 Result: all right
Azimuth: 24.9999111771 (left) Source 2 Result: that occur
Azimuth: -19.9994597863 (right) Source 4 Result: so
Azimuth: 29.9993791202 (left) Source 5 Result: you know
Azimuth: 29.9993791202 (left) Source 7 Result: hello me
Azimuth: -19.9994597863 (right) Source 8 Result: follow me
With the technical bit out of the way, I can focus on improving recognition rates, on both the Hark and Sphinx sides. But that's another task and perhaps another topic if I need more help. And again, thanks for the help :)
Ben
ps: is there a way to hide the "Current configuration" printed at the beginning?
Nice.
For the best accuracy use en-us generic acoustic model:
http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/en-us.tar.gz/download
And create a focused JSGF grammar or LM:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
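For example, a tiny JSGF grammar for a handful of robot commands might look like this (the words here are illustrative):

```
#JSGF V1.0;
grammar commands;

public <command> = <action> [ <object> ];
<action> = follow | stop | bring;
<object> = me | the cup;
```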
add "-logfn /dev/null" to config.
Hi, sorry to bother you,
But I had the line set in the config file already. Running "pocketsphinx_continuous -inmic yes -logfn 'null'" doesn't yield this output.
But when using the script, I do get it, and I think I see why. The configuration line (INFO: cmd_ln.c(696): Parsing command line:) is output by calling decoder = Decoder.default_config().
Which makes sense, because this is called before adding the line to the config. The config hasn't been altered yet.
Can I pass some argument to Decoder.default_config() to make it silent too? Or perhaps can I set the default?
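In the meantime, a workaround that should silence it is redirecting stderr at the file-descriptor level around that one call. A stdlib-only sketch (this is a general trick, not a pocketsphinx option):

```python
import os
import sys

class SuppressStderr:
    """Temporarily redirect file descriptor 2 (stderr) to /dev/null,
    silencing output printed by native libraries as well."""
    def __enter__(self):
        sys.stderr.flush()
        self._saved = os.dup(2)  # remember the real stderr
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, 2)      # point fd 2 at /dev/null
        os.close(devnull)
        return self

    def __exit__(self, *exc):
        os.dup2(self._saved, 2)  # restore the real stderr
        os.close(self._saved)
        return False

# Hypothetical usage around the noisy call:
# with SuppressStderr():
#     config = Decoder.default_config()
```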
Ben