I want to incorporate diarization software into my project, and unfortunately LIUM is giving me pretty useless results. After looking through the algorithms, I feel SHoUT is better suited to the task. However, SHoUT seems to need acoustic models (apparently for speech and silence) to run the initial segmentation, and I have no idea how to produce these, since acoustic models are typically trained on individual phonemes, as described on the website as well.
Does anyone have experience with SHoUT? I couldn't find anything else about it on the web. Although the author says the training sets are adaptable for all purposes, e.g. using Swedish broadcast news data to train models for English conference data, there don't seem to be any publicly available training sets.
LIUM should work OK. If you have trouble with it, you probably need to share the data you are trying to run it on.
It seems to be pretty much worst-case data for LIUM: very short speaker turns, overlapping speech, often low SNR, an informal setting with frequent changes within a speaker's voice (e.g. from dull, quiet talking to emphatic, emotional speech), occasional audible non-speech such as clattering and laughter, and many very similar voices (female, middle-aged).
When I did use LIUM, the clusters it produced each contained voices from all speakers (across multiple clusters), with no apparent differences in concentration. It also failed to segment conversations between two speakers, lumping everything together as one segment.
EDIT: This was without threshold optimization, I guess, so I'm giving that a try. I'm not 100% sure what the sinputmask and sinput2mask segmentation files are supposed to contain, though. The documentation says sinputmask is analogous to a NIST UEM, so it seems to be a specification of which time regions to segment. Doesn't providing that information make the scoring meaningless, since I'd be supplying information about the data set, when my goal is blind diarization without that input?
And the ref file that goes into sinput2mask is a reference file, i.e. the guideline for scoring? So a diarization file containing properly clustered speakers? In other words, both of these files are created from my own labels.
Also, I'm wondering what the proper way is to handle audible non-speech in my segmentation files: leave it unlabelled entirely, or cluster it separately as another "speaker", since the system always seems to identify it as speech anyway.
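For example, would a reference file in the .seg format with non-speech kept as its own label look something like this? (I'm guessing the columns are show name, channel, start, length in 10 ms features, gender, band, environment, speaker label; the times here are made up.)
~~~~~~~~~~~
show 1 0   550 F S U spk1
show 1 550 300 U S U noise
show 1 850 700 F S U spk2
~~~~~~~~~~~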
Last edit: Richard Liu 2015-06-18
Dear Richard
Well, the diarization task is far from solved, and there are different ways to improve results. I don't think you can get good results simply by tuning thresholds. For example, you might decode first and then use the decoded text to improve speaker separation. This is a totally different approach from what has been taken in LIUM and other diarization toolkits, but it might be reasonable for you. Special noises also have to be removed before you start processing, or at least detected in the speech. There are methods to separate speech from music and other sources; most of them are quite complex by themselves and not supported in any modern toolkit.
LIUM has pretty good examples on the wiki on how to run diarization:
http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start
http://www-lium.univ-lemans.fr/diarization/doku.php/howto
and so on; you just need to read it. I do not see where you need sinputmask, but in case you still need it, you can compose it from one big segment. The format of all the files is described in the documentation too.
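For reference, the basic one-shot run from the quick_start page is roughly this (adjust the jar name and version to whatever you downloaded; "show" is just a placeholder show name):
~~~~~~~~~~~
# single-command diarization as on the quick_start page; jar name/version may differ
java -Xmx2048m -jar ./LIUM_SpkDiarization.jar --fInputMask=./show.wav --sOutputMask=./show.seg --doCEClustering show
~~~~~~~~~~~
And if you still want an sinputmask, one big segment covering the whole file is enough, e.g. for a 10-minute recording (60000 features of 10 ms, with the other fields left unknown; the values are only an illustration):
~~~~~~~~~~~
show 1 0 60000 U U U S0
~~~~~~~~~~~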
LIUM filters out non-speech such as music with GMM scoring; it is done in this step:
~~~~~~~~~~~
# filter spk segmentation according to pms segmentation
fltseg=./$datadir/$show.flt.$h.seg
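# SFilter keeps the regions marked as speech in the pms (speech/music/silence) segmentation;
# the minimum-length and padding values below are given in features (10 ms frames)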
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.tools.SFilter --help --fInputDesc=audio2sphinx,1:3:2:0:0:0,13,0:0:0 --fInputMask=$features --fltSegMinLenSpeech=150 --fltSegMinLenSil=25 --sFilterClusterName=j --fltSegPadding=25 --sFilterMask=$pmsseg --sInputMask=$adjseg --sOutputMask=$fltseg $show
~~~~~~~~~~~~~~
It is not perfect but should select speech segments pretty well.