
Anyone have experience with SHoUT Speech Diarization?

  • Richard Liu

    Richard Liu - 2015-06-17

    I want to incorporate diarization software into my project, and unfortunately LIUM is giving me pretty useless results. After looking through the algorithms, I feel SHoUT is better suited to the task. However, when trying to use SHoUT, it seems that I need acoustic models (supposedly for speech and silence) to run the initial segmentation, and I have no clue how to produce these, since acoustic models are typically trained on individual phonemes, as described on the website as well.

    Does anyone have experience with SHoUT? I couldn't find anything about it elsewhere on the web. Although the author says the training sets are adaptable for all purposes, e.g. using Swedish broadcast-news models on English conference data, there don't seem to be any publicly available training sets.

     
    • Nickolay V. Shmyrev

      LIUM should work OK. If you have trouble with it, you probably need to share the data you are trying to run it on.

       
      • Richard Liu

        Richard Liu - 2015-06-18

        It seems to be pretty much worst-case data for LIUM: very short speaker intervals, overlapping speech, often low SNR, an informal environment with frequent changes within a speaker's voice (e.g. from dull, quiet talking to emphatic, emotional speech), occasional audible non-speech noises like clattering and laughter, and many very similar voices (female, middle-aged).

        When I did use LIUM, each of the clusters it produced (and there were multiple) contained voices from all speakers, without apparent differences in concentration. It also failed to segment conversations between two speakers, lumping everything together as one segment.

        EDIT: This is without threshold optimizing, I guess, so I'm giving that a try. I'm not 100% sure what the sInputMask and sInput2Mask segmentation files are supposed to contain, though. The docs say sInputMask is analogous to a NIST UEM, so it seems to be a specification of the specific times to segment. Doesn't providing that information make the scoring useless, since I'd be supplying information about the data set, when my goal is blind diarization without that input?

        And the file that goes in sInput2Mask is a reference file, the guideline for scoring? That is, a diarization file containing properly clustered speakers? So both of these files are created from my own labels.

        I'm also wondering what the proper way to handle audible non-speech in my segmentation files is: forgo labelling it entirely, or cluster it separately as another "speaker", since the system seems to identify it as speech anyway.

         

        Last edit: Richard Liu 2015-06-18
        • Nickolay V. Shmyrev

          Dear Richard

          Well, the diarization task is far from solved, and there are different ways to improve results. I don't think you can get good results simply by tuning thresholds. For example, you might decode first and then use the decoded text to improve speaker separation. This is a totally different approach from the one taken in LIUM and other diarization toolkits, but it might be reasonable for you. Special noises have to be removed before you start processing, or at least detected in the speech. There are methods to separate speech from music and other sources; most of them are pretty complex in themselves and not supported in any modern toolkit.
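          To make the "detect noise/non-speech before clustering" idea concrete, here is a toy energy gate in Python. This is only a sketch, not LIUM's GMM-based filtering or anything from SHoUT; the frame size and threshold ratio are made-up values, and a real system would use trained models rather than raw energy.

```python
# Toy energy-based speech/non-speech gate (illustration only, not LIUM's
# GMM scoring). Frames whose RMS energy falls below a fraction of the
# average energy are labelled non-speech.
import math

def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_gate(samples, frame_len=160, ratio=0.5):
    """Label each frame True (speech-like) or False (low energy)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [rms(f) for f in frames]
    threshold = ratio * (sum(energies) / len(energies))
    return [e >= threshold for e in energies]

# Synthetic check: a loud sine ("speech") followed by near-silence.
loud = [int(10000 * math.sin(i / 5.0)) for i in range(1600)]
quiet = [10 if i % 7 else -10 for i in range(1600)]
labels = energy_gate(loud + quiet)
print(labels[:3], labels[-3:])  # → [True, True, True] [False, False, False]
```

          A real recording with laughter or clattering would of course need something smarter than a single energy threshold, which is exactly why those methods get complex.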

          LIUM has pretty good examples on the wiki on how to run diarization:

          http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start

          http://www-lium.univ-lemans.fr/diarization/doku.php/howto

          and so on; you just need to read it. I do not see where you need sInputMask, but in case you still need it, you can compose it from one big segment. The format of all the files is described in the documentation too.
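          For illustration, a one-big-segment mask file might look something like the line below. This is a sketch from memory, not the authoritative format: as far as I recall the .seg fields are show name, channel, start and length in 10 ms feature frames, gender, bandwidth, environment and speaker label, so double-check against the file-format page of the LIUM documentation before using it.

          ~~~~~~~~~~~
          myshow 1 0 30000 U U U S0
          ~~~~~~~~~~~

          Here "myshow" and "S0" are placeholder names, and 30000 frames would cover the first 300 seconds of audio.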

          LIUM filters non-speech such as music with GMM scoring; it is a step like this:

          ~~~~~~~~~~~
          # filter the speaker segmentation according to the speech/non-speech (pms) segmentation
          fltseg=./$datadir/$show.flt.$h.seg
          $java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.tools.SFilter --help --fInputDesc=audio2sphinx,1:3:2:0:0:0,13,0:0:0 --fInputMask=$features --fltSegMinLenSpeech=150 --fltSegMinLenSil=25 --sFilterClusterName=j --fltSegPadding=25 --sFilterMask=$pmsseg --sInputMask=$adjseg --sOutputMask=$fltseg $show
          ~~~~~~~~~~~

          It is not perfect, but it should select speech segments pretty well.

           
