I want to incorporate diarization software into my project, and unfortunately LIUM is giving me pretty useless results. After looking through the algorithms, I feel SHoUT is better suited to the task. However, SHoUT seems to need acoustic models (apparently for speech and silence) to run the initial segmentation, and I have no idea how to produce these, since acoustic models are typically trained on individual phonemes, as described on the website as well.
Does anyone have experience with SHoUT? I couldn't find anything else about it on the web. Although the author says the training sets are adaptable for all purposes, e.g. using Swedish broadcast news data to train models for English conference data, there don't seem to be any publicly available training sets.
LIUM should work OK. If you have trouble with it, you probably need to share the data you are trying to run it on.
It seems to be pretty much worst-case data for LIUM: very short speaker turns, overlapping speech, often low SNR, an informal setting with frequent changes within a speaker's voice (e.g. from dull, quiet talking to emphatic, emotional speech), occasional audible non-speech such as clattering and laughter, and many very similar voices (female, middle-aged).
When I did use LIUM, the clusters it produced each contained voices from all speakers (across multiple clusters), with no apparent differences in concentration. It also failed to segment conversations between two speakers, lumping everything together as one segment.
EDIT: This was without threshold optimization, I guess, so I'm giving that a try. I'm not 100% sure what the sinputmask and sinput2mask segmentation files are supposed to contain, though. The documentation says sinputmask is analogous to a NIST UEM, so it seems to be a specification of which time regions to segment. Doesn't providing that information make the scoring meaningless, since I'd be supplying information about the data set, when my goal is blind diarization without that input?
And the ref file that goes into sinput2mask is a reference file, i.e. the guideline for scoring? So a diarization file containing properly clustered speakers? In other words, both of these files are created from my own labels.
Also, I'm wondering what the proper way is to handle audible non-speech in my segmentation files: leave it unlabelled entirely, or cluster it separately as another "speaker", since the system always seems to identify it as speech anyway.
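For example, would a reference file in the .seg format with non-speech kept as its own label look something like this? (I'm guessing the columns are show name, channel, start, length in 10 ms features, gender, band, environment, speaker label; the times here are made up.)
~~~~~~~~~~~
show 1 0   550 F S U spk1
show 1 550 300 U S U noise
show 1 850 700 F S U spk2
~~~~~~~~~~~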
Last edit: Richard Liu 2015-06-18
Dear Richard
Well, the diarization task is far from solved, and there are different ways to improve results. I don't think you can get good results simply by tuning thresholds. For example, you might decode first and then use the decoded text to improve speaker separation. This is a totally different approach from what has been taken in LIUM and other diarization toolkits, but it might be reasonable for you. Special noises also have to be removed before you start processing, or at least detected in the speech. There are methods to separate speech from music and other sources; most of them are quite complex by themselves and not supported in any modern toolkit.
LIUM has pretty good examples on the wiki on how to run diarization:
http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start
http://www-lium.univ-lemans.fr/diarization/doku.php/howto
and so on; you just need to read it. I do not see where you need sinputmask, but in case you still need it, you can compose it from one big segment. The format of all the files is described in the documentation too.
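For reference, the basic one-shot run from the quick_start page is roughly this (adjust the jar name and version to whatever you downloaded; "show" is just a placeholder show name):
~~~~~~~~~~~
# single-command diarization as on the quick_start page; jar name/version may differ
java -Xmx2048m -jar ./LIUM_SpkDiarization.jar --fInputMask=./show.wav --sOutputMask=./show.seg --doCEClustering show
~~~~~~~~~~~
And if you still want an sinputmask, one big segment covering the whole file is enough, e.g. for a 10-minute recording (60000 features of 10 ms, with the other fields left unknown; the values are only an illustration):
~~~~~~~~~~~
show 1 0 60000 U U U S0
~~~~~~~~~~~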
LIUM filters out non-speech such as music with GMM scoring; it is done in this step:
~~~~~~~~~~~
# filter spk segmentation according to pms segmentation
fltseg=./$datadir/$show.flt.$h.seg
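# SFilter keeps the regions marked as speech in the pms (speech/music/silence) segmentation;
# the minimum-length and padding values below are given in features (10 ms frames)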
$java -Xmx2048m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.tools.SFilter --help --fInputDesc=audio2sphinx,1:3:2:0:0:0,13,0:0:0 --fInputMask=$features --fltSegMinLenSpeech=150 --fltSegMinLenSil=25 --sFilterClusterName=j --fltSegPadding=25 --sFilterMask=$pmsseg --sInputMask=$adjseg --sOutputMask=$fltseg $show
~~~~~~~~~~~~~~
It is not perfect but should select speech segments pretty well.