From: Luis G. M. <lgu...@gm...> - 2011-03-16 11:08:37
Hi David,

Great news! :-)

> I looked some more at the code, made some changes, and I have a version that at least claims to detect speaker changes. However, when listening to the indicated times of change, these often don't line up with what I hear as a speaker change, and sometimes it's just that the speaker is changing her tone of voice.

One thing I happen to remember is that the quasi-GMM MarSystem is not storing its internal matrices as MarControls, which means it will probably fail to store them correctly as state, and therefore the divergence and distance computations may actually be bogus... :-\ I would need to really look into it again to recall the current status of things. Maybe you could have a look at this and check whether it all makes sense.

> I plan to implement MFCC and pitch as additional features for the quasi-GMM, and I also plan on keeping the speaker model around, since the speakers alternate a lot, and it may take a number of switches before the models do a good job.

Sounds good. Marsyas already has MFCC and some pitch estimation MarSystems, so you could perhaps start by trying those out. The idea of keeping a "database" of past speakers is also an old idea of mine that never got implemented, but it makes perfect sense.

> If you want, I can send a patch for the files I changed.

Better than that: let me give you commit access to the SVN repo. Please create a SourceForge account (in case you don't have one already) and send your username to George so he can add you to the committers team :-)

> I look forward to working with you on this, and I'll keep you posted on my progress.

Please do! By the way, I've just found an old presentation I did about this work, which may give a bit more insight into the way the code is implemented. You can get it at:
http://dl.dropbox.com/u/3706339/LabMeetings%20-%20Speaker%20Segmentation.pdf

Cheers,
Gustavo

P.S. I'm CCing this message to the Marsyas Developers mailing list, where it seems to be more at home :-)

Luis Gustavo Martins
lgu...@gm...

On Mar 16, 2011, at 10:17, David Cooper wrote:

> Hi Gustavo,
>
> That's great to hear. I looked some more at the code, made some changes, and I have a version that at least claims to detect speaker changes. However, when listening to the indicated times of change, these often don't line up with what I hear as a speaker change, and sometimes it's just that the speaker is changing her tone of voice.
>
> I plan to implement MFCC and pitch as additional features for the quasi-GMM, and I also plan on keeping the speaker model around, since the speakers alternate a lot, and it may take a number of switches before the models do a good job.
>
> If you want, I can send a patch for the files I changed.
>
> I look forward to working with you on this, and I'll keep you posted on my progress.
>
> Thanks,
>
> David
>
> David Cooper
> http://www.cs.umass.edu/~dcooper
>
> On 3/14/11 5:19 PM, Luis Gustavo Martins wrote:
>>
>> Hi David,
>>
>> You're right, that was some code I wrote some years ago using the first version of Marsyas (0.1), and I then started porting it to the current Marsyas version (0.2, 0.3, 0.4). However, I never really had the chance to finish that port (mainly because I was busy with other things in Marsyas), so the code is in the current repository, but there are big portions of it that still need some hacking, testing, and evaluation, I'm afraid.
>>
>> The code in Marsyas 0.1 was used for the ICME 2006 paper, so there is a working code reference (I even developed a GUI version of the speaker segmentation at the time, using Qt, but I have to search my archives to see where it lives nowadays...).
>>
>> So, I would be thrilled to help you out on this task, but I would need some time to get back to that code (and papers!) in order to be in a position to help you figure out your questions.
>> However, I've been struggling to find any time to work on Marsyas lately; but since I'm starting a brand new project that will use Marsyas extensively in early April, I think I will soon be in a good position to also dedicate some time to this old speaker segmentation code of mine.
>>
>> So, thank you for your interest, and if you don't mind waiting a bit so I can get back to it, I would be really glad to collaborate with you. This mail is already on my todo list, and I hope to get back to your questions ASAP. Meanwhile, feel free to dig into the code and keep sending questions and suggestions: I'm pretty sure there are lots of bugs and things to be improved in that code.
>>
>> Cheers from Porto,
>>
>> Gustavo
>>
>> Luis Gustavo Martins
>> lgu...@gm...
>>
>> On Mar 14, 2011, at 20:40, David Cooper wrote:
>>
>>> Hi,
>>>
>>> I think this can best be answered by Gustavo, but I'm not certain.
>>>
>>> After looking around for free speaker segmentation code, I was about to implement the speaker segmentation algorithm of Lu and Zhang, 2002, in Marsyas. When I looked for examples of how to connect LPC to LSP, I found the speakerSeg app, which already appears to implement a good portion of Lu and Zhang, 2002. Upon further evaluation, it looks like it may be an implementation of system 2 from the ICME 2006 paper that Gustavo co-authored. However, there are a few discrepancies that I would like to understand. I guess this may have been written over 5 years ago, so the answers may not be readily available. I would be grateful for any additional clarity on this implementation and the underlying algorithm, as I am not completely clear on how it works. Please find a few questions below:
>>>
>>> 1. It appears that the algorithm expects the input to already be downsampled to an 8 kHz monophonic sound file. Is this correct?
>>>
>>> 2. The number of samples is hard-coded to 125, which is a little under 16 ms at 8 kHz, rather than the 25 ms mentioned in the paper. Is there a reason for the difference?
>>>
>>> 3. Because the hop size is intended to be half of 55, there are actually 54 frames for each segment, and the hop size is 27 frames. Is this known behavior, or a mistake?
>>>
>>> 4. When I run this on a 20-minute sound file, I get no output, which I believe means that no speaker change is detected. However, there are definitely 2 speakers in the audio file. Is there a test file that works with the current implementation that I could use?
>>>
>>> 5. In the BICchangeDetector, there is a call to get the previous distance value:
>>>
>>> mrs_real distanceLeft = pdists_(pdists_.getSize()-1); // i.e. the previous distance value
>>>
>>> This appears to ignore the fact that this vector is circular, so it looks like the value will only change every 3 iterations (since 3 is the size of the memory).
>>>
>>> 6. In the quasi-GMM updateCovModel, the number of feature vectors is not increased in the update, even though the model is now represented by more feature vectors. This makes the most recent values weigh much more than all of the previous ones. Is this intended?
>>>
>>> Thanks,
>>>
>>> David
>>>
>>> David Cooper
>>> http://www.cs.umass.edu/~dcooper
>>>
>>> _______________________________________________
>>> Marsyas-users mailing list
>>> Mar...@li...
>>> https://lists.sourceforge.net/lists/listinfo/marsyas-users