Since I am being cheap (I don't want to send money to the LDC yet ;)), I collected quite a few transcripts and audio files from some public radio stations. A couple of problems/questions I have:
1. The transcript and audio data are from news radio stations, but my target domain is telephone conversations, so I plan to downsample the 16 kHz audio to 8 kHz. In terms of acoustic features, would this cause a problem? For example, radio stations have better equipment than regular telephone hardware.
2. I also have a problem with the transcripts themselves: they are not time-aligned with the audio files. Each transcript is roughly 1 hour long (with punctuation and sentence boundaries), so I need to time-align the transcript with the audio file. I thought about these approaches:
2.1 Force-align with sphinx3_align. Since the transcripts are very long, there is no way to force-align them directly: sphinx3_align only scales up to about 150 s of audio.
2.2 CMUseg, but I had problems compiling it on Fedora 8. Besides, I am not sure it suits my needs; it seems rather outdated. Is there a more current version? I grabbed mine from NIST.
2.3 The mClust tools from LIUM. The documentation mentions some scripts for sentence boundary detection, but the scripts are not provided in the download.
2.4 Write my own script around sphinx3: for each sentence in the transcript, estimate the speech length of the utterance (based on the number of phonemes) and grab the estimated chunk of audio from the large audio file. Do a forced alignment and check the alignment score. Then grow or shrink the window frame by frame; the best score wins. Then move on to the next sentence (see the sketch below).
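Roughly, the search in 2.4 might look like this. This is only a sketch: scorer stands in for a wrapper around sphinx3_align that force-aligns one audio chunk against one sentence and returns its score (I'd still have to write that part), and the constants are guesses:

FRAMES_PER_PHONE = 8   # guess: ~80 ms per phoneme at 100 frames/s
SEARCH = 300           # try end points up to +/- 3 s around the estimate

def align_sentence(scorer, audio, start, n_phones):
    # Estimate where the sentence should end, then slide the end point
    # frame by frame and keep the window with the best alignment score.
    estimate = start + n_phones * FRAMES_PER_PHONE
    best_score, best_end = float("-inf"), estimate
    for end in range(max(start + 1, estimate - SEARCH), estimate + SEARCH + 1):
        score = scorer(audio[start:end])   # forced alignment on this chunk
        if score > best_score:
            best_score, best_end = score, end
    return best_end                        # the next sentence starts here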
Any suggestions?
Ben
For the 8 kHz issue, you can simply change the feature extraction parameters to match the ones used for 8 kHz telephone speech. There's no need to downsample the data; just edit feat.params to contain these parameters:
-nfilt 31
-ncep 13
-lowerf 200
-upperf 3500
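For context, the complete feat.params might then look something like this. The four lines above are the ones that matter for the bandwidth; the other values shown here are typical defaults, not something read off your model, so check them against your own setup:
-alpha 0.97
-samprate 16000
-frate 100
-wlen 0.025625
-nfft 512
-nfilt 31
-ncep 13
-lowerf 200
-upperf 3500
-agc none
-cmn current
-varnorm no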
However, your training data still won't match the channel characteristics of the telephone. In theory, this will be compensated for by cepstral mean normalization: the telephone channel is, after all, just a linear system, and a time-invariant linear filter amounts to an additive constant in the cepstral domain, which is exactly what CMN subtracts away. So I'm not sure how useful it would be to apply a "telephonizing" filter to the data. There is a paper out there from the last millennium called "Continuous Recognition of Large-Vocabulary Telephone-Quality Speech" which talks a bit about "telephonizing" wideband audio for acoustic model training.
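For what it's worth, per-utterance CMN is nothing more than subtracting the mean cepstral vector from every frame. A minimal sketch in Python, assuming the features are already loaded as a NumPy array of shape (frames, coefficients):

import numpy as np

def cmn(feats):
    # A stationary linear channel multiplies the spectrum, which becomes
    # an additive constant in the cepstral domain; subtracting the
    # per-utterance mean removes it along with any other fixed offset.
    return feats - feats.mean(axis=0, keepdims=True)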
I don't recommend trying to use CMUseg, because it seems to be hard-coded for one particular task from 1996. mClust actually does exactly the same thing as CMUseg, which is not exactly sentence boundary detection, but might be okay anyway. Basically it detects acoustic changes, cuts the audio at them, and then clusters all the resulting segments to group together all the ones which come from the same speaker.
The reason this can help you is that it also clusters silence and non-speech regions, so you can just listen to the segments to decide which speaker IDs actually mean "silence" or "non-speech" (see the sketch below). I believe it also allows you to put an upper limit on the length of each segment. Since no human can actually talk for an hour without pausing, I think the output of mClust would be fine for training, even if some of the segments are pretty long.
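Once you've decided which cluster labels mean non-speech, throwing those segments away is trivial. A sketch, assuming a made-up three-column "start end label" segment listing (the format and the label names here are hypothetical; check what mClust actually writes):

NON_SPEECH = {"S3", "S7"}   # hypothetical labels you judged to be silence/noise

def speech_segments(seg_file):
    # Yield (start, end, label) for each segment whose cluster label
    # was not marked as non-speech.
    with open(seg_file) as f:
        for line in f:
            start, end, label = line.split()
            if label not in NON_SPEECH:
                yield float(start), float(end), label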
However, this doesn't solve the problem of actually aligning the transcripts to the audio. I'm ashamed to admit that I've never dealt with broadcast news training data, so I'm not really sure how this is usually done.
There are some people here at CMU working on training speech synthesizers from audiobooks who might have tools that could help you. Try asking Alan Black about it.
Thanks, David. I'm not sure if Professor Black is monitoring this forum, so I just sent an email to his address. Hopefully I'll get a response soon.
BTW, I've been meaning to ask this for a long time: I find the SourceForge forum quite hard to use. Can I load the threads into some sort of reader (such as Outlook)?
Ben
There is an article describing their efforts on alignment. It's available on awb's webpage.
About forums: I'm not comfortable with them either. Subscribe to a mailing list, say cmusphinx-sdmeet, and we'll be happy to help you there :)
Thanks, Nickolay. I only quickly scanned awb's page; I'll look more.
I just joined the mailing list.
Ben
This one. The question is raised very often, so it would be very nice if someone could implement this with Sphinx.
http://www.cs.cmu.edu/~awb/papers/is2007/IS071088.PDF
Very interesting work, and not too complicated a technique. It seems they have this implemented somewhere already, and I suppose sphinx3 forced alignment was used. I hope they will let me use their code (instead of writing it from scratch). Hopefully it's a simple porting issue.
Ben
I got a response back from Kishore Prahallad. The tool is available in Festvox. See excerpts below.
Ben
From Kishore Prahallad:
The package we have developed for the segmentation of large audio files is made available as part of the Festvox (speech synthesis) framework, which is open source and available for download. The package is referred to as Interslice (or islice), which I believe should be easy to work with, without much hassle. Islice is developed on the ehmm framework (which is independent of the Sphinx system).