From: Nagendra K. G. <nag...@go...> - 2013-04-18 18:38:42
Arnab,

I prefer not to use soxi, as it's sometimes overkill. Sometimes the data may not even be in wav format (sure, I will convert it before feature extraction, but that's a different pipe).

How about we make the syntax requirements stricter, e.g. require the value to be exactly -1? The only issue is that it's loaded as a float, but we could take the difference from -1 and require that to be very small. This will help you catch bugs in your scripts early on while keeping me safe.

I recall that earlier there was some data with incorrect segmentation (e.g. the end time was rounded off), causing scripts to fail unnecessarily for some segments. However, that data has been cleaned up.

Nagendra

-----Original Message-----
From: Arnab Ghoshal [mailto:ar...@gm...]
Sent: Thursday, April 18, 2013 2:03 PM
To: Nagendra Kumar Goel
Cc: Daniel Povey; kal...@li...
Subject: Re: [Kaldi-developers] extract-segments

The reason I don't like the special value is that there is a check to reject segments that are too small. This is a command line option and is visible to the user. The special value (in the current code it's really an interval) is hidden, and one can only know about it by reading the code. But the hidden option has a higher priority than the visible option. So while it is reasonable for a user to expect any segment with invalid start and end times (i.e. start >= end) to be rejected, sometimes the whole file may actually get included instead. This is, in fact, how we found the problem: a scripting bug caused some end times to be 0, which went undetected until some process way down the line died due to a very big segment that shouldn't have been there.

There is also an option to accept invalid end times (false by default), and I am not sure what the reason is for having that functionality.

The way I would have solved your particular problem is to get the start (which will be 0) and end times for the single-utterance files, while keeping the segments format unchanged. You could use soxi to get the end time. Let me know if this works for you.

-Arnab

On Thu, Apr 18, 2013 at 6:35 PM, Nagendra Kumar Goel <nag...@go...> wrote:
> I have been using this to mix in data that is segmented with data that
> is sentence-by-sentence files. I didn't care if it's 0 or -1.
>
> Is there a specific reason you don't like it? It solves a real problem
> for me.
>
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: Thursday, April 18, 2013 1:32 PM
> To: Arnab Ghoshal; Nagendra Kumar Goel
> Cc: kal...@li...
> Subject: Re: [Kaldi-developers] extract-segments
>
> I think Nagendra may have been using this, he should chime in.
> Dan
>
> On Thu, Apr 18, 2013 at 1:30 PM, Arnab Ghoshal <ar...@gm...> wrote:
>
> Hi all,
>
> we just noticed that there is an (unmentioned) assumption in
> extract-segments.cc that an end time in the interval [-1, 0) in the
> segments file means "include till the end of the file". But there are
> additional logical bugs that cause an end time of 0 to have the same
> effect. I do not like having this special value of the end time and
> plan to remove it. But is there anybody who has a good reason to keep
> such functionality?
>
> -Arnab
>
> _______________________________________________
> Kaldi-developers mailing list
> Kal...@li...
> https://lists.sourceforge.net/lists/listinfo/kaldi-developers
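
For reference, the stricter rule proposed at the top of this thread (treat only an end time of exactly -1, within a small floating-point tolerance, as "until the end of the file", and send every other value through the normal checks) could be sketched roughly as follows. This is an illustration in Python, not the code in extract-segments.cc: the function name, the constants, the tolerance and the default minimum segment length are all assumptions made for the example.

```python
import math

# Hypothetical values chosen for this sketch; not taken from extract-segments.cc.
END_OF_FILE_SENTINEL = -1.0
SENTINEL_TOLERANCE = 1e-6


def resolve_segment(start, end, file_duration, min_segment_length=0.1):
    """Return (start, end) in seconds, or None if the segment is rejected.

    Out-of-range end times (end > file_duration) are not modelled here.
    """
    # Only an end time equal to -1, up to float rounding, means "to end of file".
    if math.isclose(end, END_OF_FILE_SENTINEL, rel_tol=0.0,
                    abs_tol=SENTINEL_TOLERANCE):
        end = file_duration

    # Every other value, including an end time of 0 produced by a scripting
    # bug, goes through the ordinary validity checks.
    if start < 0:
        return None
    if end - start < min_segment_length:
        # Also covers start >= end.
        return None
    return (start, end)
```

With this rule, a segments line ending in -1 still selects the whole file, while an end time of 0 is rejected by the visible minimum-length check rather than silently expanding to the full recording.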