Speaker Diarization

  • Darin Herle

    Darin Herle - 2015-08-25

    Hi,

    I'm trying to use the LIUM Speaker Diarization toolkit on multi-speaker audio files, but am seeing (seemingly) poor results.

    I'd like to do diarization on telephone-quality audio with 2-10 speakers. I have a pair of audio files I'm using for testing. The first is a 30-minute conference-call recording with 5 speakers, good quality audio, with overlapping speech. The second is a short 20 s clip with 3 speakers, good quality audio with no overlapping speech.

    Diarization results, using the "Quick Start", seem poor. For the 5 speaker recording, I'm only seeing a pair of speakers identified, both with incorrect genders. For the shorter recording, which should be simple to diarize, I'm only seeing a single speaker identified.

    I haven't trained any models to do this - is that required for more accurate results? If so, do I create a UBM per speaker and then MAP train each one with successive audio samples for that speaker?

    Thanks!

    Darin

    • Nickolay V. Shmyrev

      Overall, diarization is hard to get right, and the LIUM tools are not perfect. Good diarization of a telephone call is still a research subject.

      The default LIUM configuration is targeted more at TV shows, meaning it expects fairly slow speaker changes, with each speaker active for 3-4 seconds. That is not really the case for telephone conferences, where speakers change much more often, and that hurts diarization altogether.

      For reliable diarization of telephone calls you need to increase the precision of the initial split in order to properly detect clusters in the first segmentation. In this step:

      $java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MSeg --kind=FULL --sMethod=GLR --trace --help --fInputMask=$features --fInputDesc=$fDesc --sInputMask=./$datadir/%s.i.seg --sOutputMask=./$datadir/%s.s.seg $show

      It is better to set

      --sModelWindowSize=100 --sMinimumWindowSize=100

      The default value is 250, which means we only detect about 3 seconds of speech as a segment.
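      To make the change concrete, here is the same MSeg invocation with the smaller windows applied. Everything except the two window-size options is copied verbatim from the command above, so the quick-start variables ($LOCALCLASSPATH, $features, $fDesc, $datadir, $show) are assumed to be set as in that script:

      ```shell
      # Initial GLR split with 100-frame windows instead of the
      # 250-frame default, so short telephone-style speaker turns
      # are not merged away.
      java -Xmx2048m -classpath "$LOCALCLASSPATH" \
        fr.lium.spkDiarization.programs.MSeg \
        --kind=FULL --sMethod=GLR --trace --help \
        --sModelWindowSize=100 --sMinimumWindowSize=100 \
        --fInputMask=$features --fInputDesc=$fDesc \
        --sInputMask=./$datadir/%s.i.seg \
        --sOutputMask=./$datadir/%s.s.seg \
        $show
      ```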

      Further accuracy improvements require more work on the algorithm. It is probably better to decode first, then try to split speakers based on the decoder-produced segmentation.

  • Darin Herle

    Darin Herle - 2015-08-26

    Thanks for the prompt response, Nickolay!

    I re-read the CMU Sphinx documentation and learned that my input audio files were formatted incorrectly (wrong sampling rate and bit depth). Once I fixed this, I was able to get a good diarization from my simple audio file. Re-running on the longer, more complex audio file still exhibited the same behavior: an incorrect number of speakers identified, and the segments appear incorrect as well. I tried adding the window-size parameters as suggested above, with no significant change.
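    For anyone hitting the same formatting problem: a common way to get audio into the 16 kHz, 16-bit, mono PCM WAV layout the LIUM quick-start assumes is sox or ffmpeg. The file names here are placeholders, and the target rate should match whatever your feature description (--fInputDesc) declares:

    ```shell
    # Resample to 16 kHz, 16-bit signed PCM, one channel, with sox.
    sox input.wav -r 16000 -b 16 -c 1 fixed.wav

    # The same conversion with ffmpeg.
    ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le fixed.wav
    ```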

    Just reading about how the diarization algorithm works, it looks like GMMs are computed on the fly, as it's assumed the speakers are unknown. In my case, we can create and train GMMs in advance (we know the pool of potential speakers). Would this improve accuracy / lower the DER?

    • Nickolay V. Shmyrev

      I re-read the CMU Sphinx documentation and learned that my input audio files were formatted incorrectly (wrong sampling rate and bit depth). Once I fixed this, I was able to get a good diarization from my simple audio file. Re-running on the longer, more complex audio file still exhibited the same behavior: an incorrect number of speakers identified, and the segments appear incorrect as well. I tried adding the window-size parameters as suggested above, with no significant change.

      Ok, there could be many problems here; it is hard to guess. You can at least inspect the segmentation at all the stages to see whether it is reasonable.

      Just reading about how the diarization algorithm works, it looks like GMMs are computed on the fly, as it's assumed the speakers are unknown. In my case, we can create and train GMMs in advance (we know the pool of potential speakers). Would this improve accuracy / lower the DER?

      Yes, sure, you can train a UBM, then MAP-train individual GMMs, and then use them for segmentation. The process is described here:

      http://www-lium.univ-lemans.fr/diarization/doku.php/gaussian_gmm_training
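      Roughly, the recipe on that page comes down to three LIUM programs. This is only a sketch under the assumption that the flag names match the linked page (check it for the exact feature description, component counts and EM/MAP control values); "ubm" and "spk1" are placeholder model names that the %s masks resolve against:

      ```shell
      # 1. Initialize a diagonal-covariance UBM from pooled training audio.
      java -cp "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainInit \
        --sInputMask=%s.seg --fInputMask=%s.mfcc --fInputDesc=$fDesc \
        --kind=DIAG --nbComp=16 --emInitMethod=split_all \
        --tOutputMask=%s.init.gmm ubm

      # 2. Refine the UBM with EM iterations.
      java -cp "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainEM \
        --sInputMask=%s.seg --fInputMask=%s.mfcc --fInputDesc=$fDesc \
        --tInputMask=%s.init.gmm --tOutputMask=%s.gmm ubm

      # 3. MAP-adapt one GMM per known speaker from that speaker's audio.
      java -cp "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainMAP \
        --sInputMask=%s.seg --fInputMask=%s.mfcc --fInputDesc=$fDesc \
        --tInputMask=ubm.gmm --tOutputMask=%s.gmm spk1
      ```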

    • Brian Cunnie

      Brian Cunnie - 2016-02-08

      Hi Darin,

      Were you able to find a solution for getting better results on the longer, more complex audio file?

      My brother and I are facing similar problems when using LIUM to diarize a ten-minute 2-speaker file (we're fairly sure we have the correct format: 16kHz, 16-bit PCM mono .wav file), and we're hoping that you've already found the answer.

