
Connecting HARK to Sphinx through socket

Ben
2014-09-13
2014-09-15
  • Ben

    Ben - 2014-09-13

    Hi Community,

    I would like your help or advice for connecting the sound source separation capabilities of HARK-kinect as described here: https://sourceforge.net/p/cmusphinx/discussion/help/thread/57979bf4/ to Sphinx or Pocketsphinx using socket communication.

    My current setup starts with a configured HARK node that produces Sphinx-compliant .wav files of the separated, detected speech sources and writes them to a folder. A listener script then watches this folder for new files and starts a sphinx4 Java client with each file and a grammar file. The results are pretty good, but this is ugly programming and I would like to up my game by letting HARK and Sphinx talk through socket communication. How would I do this?

    1: HARK is capable of outputting MFCC features over a socket, so could Sphinx use these MFCC features? Are they compatible? And is there a built-in function that listens on a socket for these features?

    I have taken (just a) look at https://github.com/alumae/ruby-pocketsphinx-server, and with GStreamer this might be a possible framework to accomplish this, but I don't know.

    2: The other possibility would be streaming the audio data over a socket to Sphinx and letting Sphinx deal with the whole processing side. Ideally I would like to be able to compare these two methods in terms of recognition rate, but for now I'm really interested in getting either to work.

    So, in short: how can I let Sphinx listen on a port, and what kind of data can Sphinx handle (audio data, MFCC features)? And would I need GStreamer for that, or is there something in Sphinx4 that can deal with this? For now I will be working on the HARK end of the system, getting some data flowing, but for the Sphinx part I would really like your help.

    Thanks in advance,

    Ben

    ps: the only relevant thing I could find on connecting HARK to PocketSphinx was a slide describing a proposed framework. (I will put the link here, I've lost it at the moment)

     

    Last edit: Ben 2014-09-13
  • Nickolay V. Shmyrev

    Hello Ben

    1: HARK is capable of outputting MFCC features over a socket, so could Sphinx use these MFCC features? Are they compatible? And is there a built-in function that listens on a socket for these features?

    No, those features are not compatible with our models. You need to send raw audio.

    2: The other possibility would be streaming the audio data over a socket to Sphinx and letting Sphinx deal with the whole processing side. Ideally I would like to be able to compare these two methods in terms of recognition rate, but for now I'm really interested in getting either to work.

    It is not really reasonable to compare the recognition rate, it's going to be the same.

    As for streaming the data to the socket, you can do it with sphinx4 or pocketsphinx, and you can use GStreamer or work without it like the Julius server does; there is not much difference. Our GStreamer plugin is unfortunately pretty outdated, in both the GStreamer part and the pocketsphinx part. If you are using Python you can use the Pocketsphinx Python bindings.

    There is no magic, you set up a TCP server, listen for the data, process it with decoder and return the result back. If you are using ROS we might update ROS pocketsphinx plugin for you.
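
A minimal sketch of the server loop described above, written for Python 3. Everything named here is illustrative: `decode_audio` is a hypothetical stand-in for the point where the received bytes would be handed to a decoder, and the host and port are arbitrary.

```python
import socket


def decode_audio(raw_bytes):
    """Hypothetical placeholder: hand the bytes to a decoder and return text."""
    return "received %d bytes" % len(raw_bytes)


def handle_connection(conn, decode=decode_audio):
    """Read until the client stops sending, decode, and send the result back."""
    chunks = []
    while True:
        buf = conn.recv(4096)
        if not buf:  # empty read: the client closed its sending side
            break
        chunks.append(buf)
    result = decode(b"".join(chunks))
    conn.sendall(result.encode("utf-8"))
    return result


def serve(host="127.0.0.1", port=5530, decode=decode_audio):
    """Accept clients one at a time and answer each with its decoding result."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        while True:
            conn, _addr = server.accept()
            with conn:
                handle_connection(conn, decode)
```

Calling `serve()` blocks and handles one client after another. `handle_connection` is kept separate from the accept loop so that it can be exercised with any connected socket.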

     
  • Ben

    Ben - 2014-09-13

    Hi Nickolay,

    It would seem that you can read my mind;) As it happens I'm working with python currently for testing purposes, but I'm planning to port the whole thing to ROS to improve the speech recognition for our home service robot.

    So I will be going for the Pocketsphinx Python bindings, but if you would find the time to update the ROS plugin, that would be awesome. I'm sure I can find enough information in the docs.

    Either way, I'm planning to post an update here when I've made some progress

    Ben

     
    • Nickolay V. Shmyrev

      OK, you can find a Python example here that shows how to use the latest Python bindings:

      http://cmusphinx.sourceforge.net/2014/08/python-decoding-example/

      Please let me know if you have questions, docs are pretty small, but you can ask here or on #cmusphinx irc channel.

       
  • Ben

    Ben - 2014-09-14

    Just an update: I have found this Hark node which should make life easier: HarkDataStreamSender. http://winnie.kuis.kyoto-u.ac.jp/HARK/document/2.0.0/hark-document-en/subsec-HarkDataStreamSender.html

    Unfortunately Hark seems unable to stream the sound data directly, but rather sends a packet with the src_info (the location and ID of the source) and the src_wav (the bit we're really interested in getting to Sphinx).

    I'm having trouble decoding the signal; I'm basically only getting gibberish from the port. So that's great [sarcasm]. I've sent an email to Hark support asking for help with actually using the node they created. I'm sure I'm just doing something obviously wrong.

    Hark is coded mainly in C and C++, but that shouldn't affect the readability, right?

     
    • Nickolay V. Shmyrev

      Well, you probably need to provide the code you have already written and a dump of the data you are receiving. It's hard to help you without seeing the code.

      I don't think you need raw data instead of packets, packets are usually better since they contain additional information.

       
      • Ben

        Ben - 2014-09-14

        I appreciate you thinking along. Unfortunately Hark is not recognizing my kinect at the moment, so, I'll be fixing that first.

        EDIT: for future reference, if you have followed the steps to troubleshoot the Kinect not showing up as a sound device: please make sure that the power cable is plugged in -_-

         

        Last edit: Ben 2014-09-14
  • Ben

    Ben - 2014-09-14

    You were right, decoding the stream wasn't necessary in order to connect HarkDataStreamSender to PocketSphinx, but something still goes wrong. Please listen to these separated sound sources created by my Hark network file:

    I know they sound a bit soft, but I wouldn't expect them to yield these results:

    Stream decoding result: add
    Stream decoding result: in
    Stream decoding result: in
    

    I used the tutorial from mattze96, but to be sure, this is the code for listener.py:

    import socket
    import sys
    from pocketsphinx import *
    
    HOST, PORT = "localhost", 5530
    hmm= '/home/ben/Documents/sphinx/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k'
    lm = '/home/ben/Documents/sphinx/pocketsphinx/model/lm/en_US/hub4.5000.DMP'
    dic = '/home/ben/Documents/sphinx/pocketsphinx/model/lm/en_US/hub4.5000.dic'
    config = Decoder.default_config()
    config.set_string('-hmm', hmm)
    config.set_string('-lm', lm)
    config.set_string('-dict', dic)
    config.set_string('-logfn', '/dev/null')
    
    decoder = Decoder(config)
    
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    
    try:
        s.bind((HOST, PORT))
    except socket.error , msg:
        print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
        sys.exit()
    
    s.listen(1)
    conn, addr = s.accept()
    print 'Connected with ' + addr[0] + ':' + str(addr[1])
    
    in_speech_bf = True
    decoder.start_utt('')
    while True:
        buf = conn.recv(1024)
        if buf:
            decoder.process_raw(buf, False, False)
            if decoder.get_in_speech() != in_speech_bf:
                in_speech_bf = decoder.get_in_speech()
                if not in_speech_bf:
                    decoder.end_utt()
                    try:
                        if  decoder.hyp().hypstr != '':
                            print 'Stream decoding result:', decoder.hyp().hypstr
                    except AttributeError:
                        pass
                    decoder.start_utt('')
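        else:
            # recv() returned no data: the sender closed the connection
            print 'Connection closed by sender'
            break
    decoder.end_utt()
    # hyp() returns None when nothing was recognized
    if decoder.hyp() is not None:
        print 'Final result:', decoder.hyp().hypstr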
    

    The documentation of HarkDataStreamSender mentions that the first few bytes are related to ID and location data. Could it be that these need to be ignored/skipped?

     
    • Nickolay V. Shmyrev

      Ok, looks good

      Please provide pocketsphinx output when you decode the stream

      Please add to decoder initialization:

          config.set_string('-rawlogdir', "/path/to/some_folder")
      

      Then it will store the raw data you pass to the recognizer in raw files. Please share the raw files stored by the decoder.

       
      • Ben

        Ben - 2014-09-14

        Unfortunately, no .raw file is written when using this script, but running "pocketsphinx_continuous -inmic yes -rawlogdir 'logs'" does output something. In the last line you can see that it tries to save to logs/.raw, when it should instead be logs/00000000.raw.

        Is there a conflict with the config settings in the hmm?

        And I tried to say 'hello'.

        EDIT: terminal output as attachment
        EDIT2: the name of the raw file depends on what is passed to start_utt(''); in this case '.raw' as the filename was to be expected. FIX: changed it to start_utt('rawlog')

         

        Last edit: Ben 2014-09-14
        • Nickolay V. Shmyrev

          Ok, it seems that you receive big-endian data over network. Try to convert it to little endian or just add "-input_endian big" to decoder config.
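
A minimal sketch of the conversion suggested above, written for Python 3 and assuming the payload really is a plain run of 16-bit samples in the wrong byte order. The function name is illustrative and not part of any library.

```python
import array


def swap_16bit_byte_order(raw_bytes):
    """Return the same 16-bit samples with the two bytes of each sample swapped."""
    samples = array.array("h")      # signed 16-bit integers
    samples.frombytes(raw_bytes)    # raises ValueError on an odd number of bytes
    samples.byteswap()
    return samples.tobytes()
```

The swap is its own inverse, so applying it to data that is already in the right order makes things worse; it is only worth trying when a dump of the received bytes suggests big-endian samples.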

           
          • Ben

            Ben - 2014-09-14

            Okay, I added the -input_endian line, to no avail. The result is no better and the raw file still isn't being written. I will look up whether I can change it to little endian through Hark.

            Again the output when saying hello:

            EDIT: output of terminal as Attachment for readability

             

            Last edit: Ben 2014-09-14
            • Nickolay V. Shmyrev

              To write the utterance to a raw file, you need to change

                decoder.start_utt('')
              

              to something meaningful like decoder.start_utt('something')

              It is also possible that Hark sends float data instead of short integers. Then you have to convert float to short.
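
A minimal sketch of such a float-to-short conversion, written for Python 3. It assumes little-endian 32-bit floats; the `scale` parameter is hypothetical and depends on the value range the sender uses (for example 32767.0 if the floats lie in [-1, 1], or 1.0 if they are already in the 16-bit range).

```python
import struct


def floats_to_int16(raw_bytes, scale=1.0):
    """Convert little-endian 32-bit floats to little-endian 16-bit integers."""
    count = len(raw_bytes) // 4
    values = struct.unpack("<%df" % count, raw_bytes[:count * 4])
    # Scale, round, and clip to the range a signed 16-bit integer can hold.
    clipped = [max(-32768, min(32767, int(round(v * scale)))) for v in values]
    return struct.pack("<%dh" % count, *clipped)
```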

               
              • Ben

                Ben - 2014-09-14

                As the table shows, the HDH_SrcData.data_bytes are short integers. I will try something with start_utt

                 
              • Ben

                Ben - 2014-09-14

                So I got the raw file; can you make anything of it? Using aplay, the thing sounds corrupted enough:

                Playing raw data 'logs/hello.raw' : Unsigned 8 bit, Rate 8000 Hz, Mono
                
                 

                Last edit: Ben 2014-09-14
                • Nickolay V. Shmyrev

                  Well, I checked the Hark code and the docs above: it sends a binary stream with many fields. For example, the first few bytes are

                    0C 00 00 00 │ A0 00 00 00 │ 42 04 00 00 │ F1 7F 00 00 │ 18 FA 15 54
                  

                  These bytes are the structure

                  ~~~~~~~~~~~
                  // Header for one cycle data
                  typedef struct tag_HD_Header {
                      int type;        // variable for bit flag
                      int advance;     // shift length
                      int count;       // frameID of HARK
                      int64_t tv_sec;  // timestamp of HARK in seconds
                      int64_t tv_usec; // timestamp of HARK in micro-seconds
                  } HD_Header;
                  ~~~~~~~~~~~

                  and 0xC is the type PACKET_SRC_INFO | PACKET_SRC_WAVE. Then comes the source info:

                  ~~~~~~~~~~~
                  // Header for source information
                  typedef struct tag_HDH_SrcInfo {
                      int src_id;  // sound source id
                      float x[3];  // position of sound source
                      float power; // power of sound source
                  } HDH_SrcInfo;
                  ~~~~~~~~~~~

                  You need to parse those structures to properly parse the stream and extract the audio data from it. You cannot feed this raw data as is to pocketsphinx.

                   
                  • Nickolay V. Shmyrev

                    In a simplified version, your code should look like this:

                    ~~~~~~~~~~
                    header = recv(0x40)  # fixed-size header part, 64 bytes
                    data_size = last_int_from_header  # must be 0x140, i.e. 320 bytes
                    data = recv(data_size)  # get the 320 bytes of audio
                    decoder.process_raw(data, False, False)
                    ~~~~~~~~~~~

                    You can use struct.unpack_from(fmt, buffer[, offset=0]) to unpack header data.
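
A minimal sketch of that parsing with `struct`, written for Python 3. The two struct layouts are taken from the C definitions quoted above; everything else is an assumption made for illustration: little-endian byte order, 4 padding bytes before `tv_sec`, an `int` source count after the header, and a two-`int` data header (sample count, byte count) before each waveform. These were chosen so that the fixed part for one source adds up to the 0x40 bytes mentioned above. Only packets of type 0xC are covered, and a multi-source layout of repeated info/data/wave groups is a guess.

```python
import struct

# Assumed wire layouts (see the caveats in the text above).
HD_HEADER = struct.Struct("<3i4x2q")   # type, advance, count, tv_sec, tv_usec
SRC_COUNT = struct.Struct("<i")        # number of sources in this packet
HDH_SRC_INFO = struct.Struct("<i4f")   # src_id, x[3], power
HDH_SRC_DATA = struct.Struct("<2i")    # length in samples, length in bytes


def parse_packet(buf):
    """Split one packet into its header tuple and a list of per-source dicts."""
    offset = 0
    header = HD_HEADER.unpack_from(buf, offset)
    offset += HD_HEADER.size
    (n_sources,) = SRC_COUNT.unpack_from(buf, offset)
    offset += SRC_COUNT.size
    sources = []
    for _ in range(n_sources):
        info = HDH_SRC_INFO.unpack_from(buf, offset)
        offset += HDH_SRC_INFO.size
        n_samples, n_bytes = HDH_SRC_DATA.unpack_from(buf, offset)
        offset += HDH_SRC_DATA.size
        wave = bytes(buf[offset:offset + n_bytes])  # 16-bit samples for the decoder
        offset += n_bytes
        sources.append({"src_id": info[0], "position": info[1:4],
                        "power": info[4], "n_samples": n_samples, "wave": wave})
    return header, sources
```

Each `wave` value is what would be passed to `decoder.process_raw`. When reading from a socket, note that `recv(n)` may return fewer than `n` bytes, so a real client has to keep reading until a complete packet has arrived before calling `parse_packet`.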

                     
                    • Ben

                      Ben - 2014-09-14

                      Thank you for the simple example, I was wildly using google for decoding network examples.

                      And thank you very much for the help so far!!

                       
  • Ben

    Ben - 2014-09-15

    Using the raw file, some webpages[1][2] and especially your help I finally decoded the stream and learned some new things on the way.

    This is the output for just one of the packets, as generated by running recvbin.py after first starting sendbin.py:

    Header: (12, 160, 4971, 1410728024, 357855)
    #sources: 1
    SrcInfo: (7, 0.40478238463401794, -0.8680621385574341, 0.2874000072479248, 28.165206909179688)
    SrcData: (160, 320)
    SrcRaw: (0, 1, 4, 3, -3, -6, -1, 2, 3, 5, 3, 4, 0, 4, 5, 0, -4, 4, -2, -10, 10, -1, 3, 3, -2, 0, 3, 0, -9, 1, -6, 2, 0, 3, 2, -5, -9, -4, -2, 4, 5, 5, -3, 5, 3, -3, 0, 0, 0, -8, -1, -1, -4, -2, 0, -2, 4, -1, -2, -1, -1, 5, 1, -1, -1, -3, -1, -1, 0, 0, 0, -3, 0, 0, 2, -3, 0, 1, 0, 1, 0, 0, -1, 0, 1, 1, 0, 0, 1, 3, 2, 0, 1, -2, -1, -1, 0, 2, 0, 0, 1, 3, 0, 1, -1, 3, 0, 1, 4, 2, 0, -1, 1, 0, -2, -2, 0, -1, -2, 0, 0, 0, -2, -1, -1, 0, 0, 0, 0, 0, 0, -2, 0, 0, 0, 0, -1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
    

    The tricky bit was finding out the byte size of an int64, which turned out to be a signed long long int, or 'q' as a format character.
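
The sizes involved can be checked with `struct.calcsize`. The sketch below (Python 3) uses the standard-size little-endian mode (`<`), in which `i` is 4 bytes and `q` is 8 bytes. The explicit `4x` padding is an assumption about how the sender's C compiler aligns the two `int64_t` fields after three `int`s.

```python
import struct

INT_SIZE = struct.calcsize("<i")             # 4 bytes
INT64_SIZE = struct.calcsize("<q")           # 8 bytes, signed long long
PACKED_HEADER = struct.calcsize("<3i2q")     # 28 bytes without padding
ALIGNED_HEADER = struct.calcsize("<3i4x2q")  # 32 bytes with 4 padding bytes

# Round-trip a timestamp-sized value through the 'q' format character.
(tv_sec,) = struct.unpack("<q", struct.pack("<q", 1410728024))
```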

    While this decoding currently definitely works for sound sources that are not detected simultaneously, it only decodes SRC_INFO and SRC_WAVE. I'm sure these scripts can be used to get the other outputs as well, but I'm curious to see what happens if we introduce simultaneous speakers to the system.

    NB: with the old setup where the separated files were being written, I didn't need to worry about the simultaneous signals, they would wind up in their own file to be processed later anyway.

    So, let's see whether Sphinx likes its new input.

     
  • Ben

    Ben - 2014-09-15

    With some adjustments (it turns out that when there are no sources, there is no source data, who would have guessed) Sphinx is guessing away at my short utterances.

    [output listener.py]
    Connected with Hark @ 127.0.0.1:34761
    Azimuth: -24.9999111771 (right)
    Source 0 Result: all right
    
    Azimuth: 24.9999111771 (left)
    Source 2 Result: that occur
    
    Azimuth: -19.9994597863 (right)
    Source 4 Result: so
    
    Azimuth: 29.9993791202 (left)
    Source 5 Result: you know
    
    Azimuth: 29.9993791202 (left)
    Source 7 Result: hello me
    
    Azimuth: -19.9994597863 (right)
    Source 8 Result: follow me
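
The thread does not show how listener.py computes the azimuth. One plausible way, sketched here in Python 3 as an assumption only, is to take the angle of the source position in the horizontal plane, with x pointing forward and y pointing left so that positive angles mean "left", as in the output above. Both function names are illustrative.

```python
import math


def azimuth_degrees(x, y):
    """Angle of the source in the horizontal plane, in degrees (assumed axes)."""
    return math.degrees(math.atan2(y, x))


def side(azimuth):
    """Label an azimuth the way the listener output does."""
    if azimuth > 0:
        return "left"
    if azimuth < 0:
        return "right"
    return "front"
```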
    

    With the technical bit out of the way, I can focus on improving recognition rates, both on Hark and Sphinx side. But that's another task and perhaps another topic if I need more help. And again, thanks for the help:)

    Ben

    ps: is there a way to hide the "Current configuration" printed at the beginning?

     
      • Nickolay V. Shmyrev

        ps: is there a way to hide the "Current configuration" printed at the beginning?

        add "-logfn /dev/null" to config.

         
        • Ben

          Ben - 2014-09-15

          Hi, sorry to bother you,

          But I had that line set in the config file already. Using pocketsphinx_continuous doesn't yield this output when running:

          $pocketsphinx_continuous -inmic yes -logfn 'null'
          

          But when using the script, I do get it. And I seem to get why. The configuration line (INFO: cmd_ln.c(696): Parsing command line:) is outputted by calling the function:

          config = Decoder.default_config()
          

          Which makes sense, because this is called before the line is added to the config; the config hasn't been altered yet.
          Can I pass some argument to Decoder.default_config() to make it silent too? Or can I perhaps set the default?

          Ben

           
