Hi,
I'm trying to use the LIUM Speaker Diarization toolkit on multi-speaker audio files, but am seeing (seemingly) poor results.
I'd like to do diarization on telephone-quality audio with 2-10 speakers. I have a pair of audio files I'm using for testing. The first is a 30-minute conference call recording with 5 speakers, good-quality audio, and overlapping speech. The second is a short 20-second clip with 3 speakers, good-quality audio, and no overlapping speech.
Diarization results using the "Quick Start" seem poor. For the 5-speaker recording, I'm only seeing two speakers identified, both with incorrect genders. For the shorter recording, which should be simple to diarize, I'm only seeing a single speaker identified.
I haven't trained any models to do this - is that required for more accurate results? If so, do I create a UBM per speaker and then MAP train each one with successive audio samples for that speaker?
Thanks!
Darin
Overall, diarization is hard to get right, and the LIUM tools are not perfect. Good diarization of telephone calls is a subject of ongoing research.
The default LIUM configuration is targeted more at TV shows, meaning it expects fairly slow speaker changes, with each speaker active for 3-4 seconds. That is not the case for telephone conferences, where speakers change much more often, and this hurts diarization considerably.
For reliable diarization of telephone calls you need to increase the precision of the initial split in order to properly detect clusters in the first segmentation. In this step:
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MSeg --kind=FULL --sMethod=GLR --trace --help --fInputMask=$features --fInputDesc=$fDesc --sInputMask=./$datadir/%s.i.seg --sOutputMask=./$datadir/%s.s.seg $show
It is better to set
--sModelWindowSize=100 --sMinimumWindowSize=100
The default value is 250, which means we only detect about 3 seconds of speech as a segment.
Further accuracy improvements require more work on the algorithm. It is probably better to decode first, then try to split speakers based on the decoder-produced segmentation.
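For example, the quick-start MSeg command above with the two smaller window options added would look roughly like this (a sketch only, using the same shell variables as in the LIUM script):
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MSeg --kind=FULL --sMethod=GLR --sModelWindowSize=100 --sMinimumWindowSize=100 --trace --help --fInputMask=$features --fInputDesc=$fDesc --sInputMask=./$datadir/%s.i.seg --sOutputMask=./$datadir/%s.s.seg $show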
Thanks for the prompt response, Nickolay!
I re-read the CMU Sphinx documentation and learned that my input audio files were formatted incorrectly (wrong sampling rate and bit depth) - once I fixed this, I was able to get a good diarization from my simple audio file. Re-running on the longer, more complex audio file still exhibited the same behavior - an incorrect number of speakers identified, and the segments appear incorrect as well. I tried adding the window size parameters as suggested above with no significant change.
Just from reading how the diarization algorithm works, it looks like GMMs are computed on the fly since it's assumed the speakers are unknown. In my case, we can create and train GMMs in advance (we know the pool of potential speakers) - would this improve accuracy / lower the DER?
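In case it helps others with the same format issue: assuming the expected input is 16 kHz, 16-bit, mono PCM WAV as the Sphinx docs describe, a sox conversion along these lines produces it (filenames here are just placeholders), and soxi verifies the result:
sox meeting.wav -r 16000 -b 16 -c 1 meeting-16k.wav
soxi meeting-16k.wav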
Ok, there could be many problems here; it is hard to guess. You can at least study the segmentation at all the stages to see whether it is reasonable or not.
Yes, sure, you can train a UBM, then MAP-train individual GMMs, and then use them for segmentation; the process is described here:
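For example, you can summarize the .seg file written by any stage with a short one-liner. A sketch, assuming the usual LIUM segment format (whitespace-separated columns with start and length in feature frames, 100 per second, in columns 3 and 4, the cluster label in the last column, and comment lines starting with ";;"):
awk '!/^;;/ {n[$NF]++; len[$NF]+=$4} END {for (c in n) printf "%-10s %5d segments %8.1f s\n", c, n[c], len[c]/100}' ./$datadir/$show.s.seg
This shows how many clusters were found at that stage and how much speech each one got, which makes it easy to spot where the segmentation collapses into too few speakers.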
http://www-lium.univ-lemans.fr/diarization/doku.php/gaussian_gmm_training
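Roughly, the workflow on that page looks like this. This is a sketch only: "ubm" and "speaker1" are placeholder show names, %s is expanded from the show name as in the other LIUM commands, and the exact option names and values should be checked against the page for your LIUM version.
# initialize a diagonal-covariance UBM on pooled speech from many speakers
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainInit --kind=DIAG --nbComp=128 --fInputMask=%s.mfcc --fInputDesc=$fDesc --sInputMask=%s.seg --tOutputMask=%s.init.gmm ubm
# refine the UBM with EM
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainEM --fInputMask=%s.mfcc --fInputDesc=$fDesc --sInputMask=%s.seg --tInputMask=%s.init.gmm --tOutputMask=%s.gmm ubm
# MAP-adapt one GMM per known speaker from the UBM, using that speaker's labelled audio
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.programs.MTrainMAP --fInputMask=%s.mfcc --fInputDesc=$fDesc --sInputMask=%s.seg --tInputMask=ubm.gmm --tOutputMask=%s.gmm speaker1
The adapted per-speaker GMMs can then be used in the identification stage instead of models built on the fly.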
Hi Darin,
Were you able to find a solution for getting better results on the longer, more complex audio file?
My brother and I are facing similar problems when using LIUM to diarize a ten-minute 2-speaker file (we're fairly sure we have the correct format: 16kHz, 16-bit PCM mono .wav file), and we're hoping that you've already found the answer.