I've successfully run the SphinxTrain scripts and executables on a training dataset to produce a Sphinx2 model, and I'm grateful for the assistance provided by the distribution's scripts.
We're going to be using Sphinx2, not Sphinx3. I've also compiled Sphinx2 and successfully run sphinx2-test (both on Linux).
The SphinxTrain doc files (from CMU's website) mention force-aligning, and specifically doing so after the CI-model step, but there's nothing specific on how to do it. I assume I must use the recognizer for the force-alignment step. I noticed a script sphinx2-0.4/scripts/sphinx2-align -- is that a prototype for doing this? Can I use the "Module 09" scripts (modified for the appropriate model directory) to convert the CI-model files to Sphinx2 format and then use them with sphinx2-align to do the force-alignment?
Or should I try to download/compile/install Sphinx3 for this step? Any assistance would be appreciated. Thanks.
cheers,
jerry wolf
I'll give you my experience, but hopefully someone else can relate a more successful one.
1) After a bit of struggle, you should be able to get sphinx2-align to work. If you are unable to, please ask me, and I'll give you my copy.
2) After successfully running sphinx2-align, I could not figure out how to take its output to realign the transcript or anything like that. This seems pretty basic. Perhaps I was missing a central concept.
3) I downloaded and installed archive_s3/s3.0 from the cmusphinx CVS site so I could get timealign. (See http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/cmusphinx/archive_s3/s3.0/pgm/timealign/ ) I was unable to get it to work with the Sphinx-2 format model files from SphinxTrain. (I'm actually working on getting my language model (LM) to work first and getting SphinxTrain to stop crashing -- it broke after I did a "cvs update" -- so I haven't had time to work on this further.)
If you do manage to get alignment to work in any form, I'd appreciate hearing about it.
Sorry for not being more helpful.
Anonymous - 2002-12-24
Re #2: As I understand it, the purpose of force-aligning is not to use time-alignment information in the training, but rather to select which pronunciation out of many (for words with multiple pronunciations) is the one to use for the training (e.g., which of the two common pronunciations of THE). The trainer can use only one, but the recognizer can use multiple ones, so it appears they use a recognizer application with a multiple-pronunciation dictionary to make that selection, using a rough model (e.g., the CI model from Module 02), and then go back and do the entire training with these selected pronunciations.
I believe that this step also recognizes and marks interword silences. So you feed in rough transcripts and get back transcripts with selected pronunciations and interword silences, to be used for more accurate training.
From http://www-2.cs.cmu.edu/~rsingh/sphinxman/fr3.html:
"Q: What is force-alignment? Should I force-align my transcripts before I train?
A: The process of force-alignment takes an existing transcript, and finds out which, among the many pronunciations for the words occurring in the transcript, are the correct pronunciations. So when you refer to "force-aligned" transcripts, you are also inevitably referring to a *dictionary* with reference to which the transcripts have been force-aligned. So if you have two dictionaries and one has the word "PANDA" listed as:
PANDA P AA N D AA
PANDA(2) P AE N D AA
PANDA(3) P AA N D AX
and the other one has the same word listed as
PANDA P AE N D AA
PANDA(2) P AA N D AX
PANDA(3) P AA N D AA
And you force-align using the first dictionary and get your transcript to look like:
I SAW A PANDA(3) BEAR,
then if you used that transcript to train but used the second dictionary to train, then you would be giving the wrong pronunciation to the trainer. You would be telling the trainer that the pronunciation for the word PANDA in your corpus is "P AA N D AA" instead of the correct one, which should have been "P AA N D AX". The data corresponding to the phone AX will now be wrongly used to train the phone AA.
What you must really do is to collect your transcripts, use only the first listed pronunciation in your training dictionary, train ci models, and use *those ci models* to force-align your transcripts against the training dictionary. Then go all the way back and re-train your ci models with the new transcripts."
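For what it's worth, here's a minimal sketch of the dictionary-preparation half of that recipe -- stripping a multi-pronunciation dictionary down to first pronunciations only for the initial CI training. It assumes the WORD(2)/WORD(3) variant convention shown above, and the file names are made up:

  # drop PANDA(2)-style alternate entries, keeping only base pronunciations
  grep -v '^[^ ][^ ]*([0-9][0-9]*) ' train.dic > train.firstpron.dic

You would then force-align against the full multi-pronunciation dictionary using the CI models trained this way.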
Re #3: If timealign is a Sphinx3 application, I wouldn't expect it to work with Sphinx2 models. OTOH, since SphinxTrain generates Sphinx3-style models at each stage in the process, I'd guess you should use those.
Hi again. Thanks for replying.
#2: I did know about adding silence symbols (this is the main thing I was looking for), but I didn't know about the pronunciations. Thanks for all the info.
#3: Yes, I was hoping that (as you say) since SphinxTrain uses the Sphinx-3 format until the very end, when the conversion takes place, I should be able to use timealign and do the alignment/re-pronunciation thing to improve the model. The problem I had was that the dimensions of the matrices would never work. (See https://sourceforge.net/forum/forum.php?thread_id=767594&forum_id=5470 for the thread.) I had received a set of modified SphinxTrain scripts from https://sourceforge.net/users/cbquillen/ which I was having trouble getting to work.
Again, I'm sorry I can't be more helpful.
If you are trying to align with sphinx2 models, maybe you should use the sphinx2-align script? This runs the Sphinx-2 decoder with transcripts (supplied by the -tactlfn parameter).
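A minimal sketch of what that invocation might look like -- only -tactlfn is the parameter I'm sure of; -ctlfn and the elided flags are my guess at the usual Sphinx-2 batch-mode conventions and may differ in your build:

  # hypothetical invocation; check the sphinx2-align script itself for the real flag set
  sphinx2-align -ctlfn train.ctl -tactlfn train.transcript ...

where train.ctl (a made-up name) lists the utterance files and train.transcript holds the reference transcripts.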
jerry wolf -
Did you ever get this to work? I'm thinking of giving sphinx2-align another try (due to all my problems with timealign). I still don't understand how I can use the output; did you figure it out?
Thanks.
Anonymous - 2003-01-03
I've been away since 12/24, and over the holiday, we also packed, moved, and unpacked our office, so I'm just getting back to being able to work. So the answer is no, I haven't even been able to try!
But I am inclined to pursue the hypothesis in my 12/23 posting -- convert the S3 CI-model files to S2 format, then see if I can get sphinx2-align to work with them. But I have some infrastructure work to do first, so it looks like you'll get there well ahead of me.
Incidentally, back on the topic of the purpose of force-aligning, there's also this from http://www-2.cs.cmu.edu/~dbansal/max/dip.html: "Step4: Force align: (generating a better transcript)
The original transcript may not be a perfect transcript in the sense that it may lack fillers. After some sort of models are made, force-align the reference transcript to get a transcript with fillers. You may want to use this transcript to train further. This step can be performed at any appropriate time, generally after the models converge on initial transcript. After performing this step, bw is again run over this new transcript." So he doesn't mention alternate pronunciations at all, but rather the addition of "fillers".
Also see http://www-2.cs.cmu.edu/~rsingh/sphinxman/FAQ.html#9, especially Q1 and Q4.
Anonymous - 2003-02-03
robert b -- After a lengthy lag, I'm back to the force-alignment problem and about to try sphinx2-align. Have you pursued this, and what have you achieved?
jerry
I haven't dealt with this in a while. I think I was still unable to get timealign going -- I could never give it the right -feat parameter (I think that was the one). The problem was that I could never get the matrix dimensions to match. I suspect it would take some hacking to get timealign to work with the Sphinx-2 format.
I did manage to get sphinx2-align working, but I still never figured out a good way to massage the output so it would be useful.
BTW, I may not get back to this 'til Wednesday or Friday. Best of luck!
Anonymous - 2003-02-21
Sphinx2 force-aligning: you have to specify -osentfn to tell it where to write its "sentence output", which looks like:
WE WE SIL HAVE FARM TWO HOURS(2) AWAY (st001a011)
SO SIL I DON'T HOW TO LOCATE ANYBODY I DON'T KNOW WHAT TO(2) SIL DO AT THIS(2) POINT SIL (st002a007)
WE WILL(2) CERTAINLY DO OUR(2) VERY BEST TO DO THAT (st002b017)
I found these results to be of mixed quality in terms of inserting silences and selecting correct alternate pronunciations. (In this case, I was using a cd-tied Sphinx2 model from the end of the SphinxTrain process, which later results suggest may not have been a good choice. I have not tried converting the ci S3 model (the result of SphinxTrain step 02) to S2 format and trying that with sphinx2-align.)
Sphinx3 force-aligning: Rita Singh advised me to use Sphinx3 for force-aligning. I checked out from SourceForge CVS archive_s3/s3 (the Sphinx3 "slow decoder") and followed the README instructions to make s3align.
Then I used tests/et94-align.csh as a template for force-aligning with my own data and model (note that Sphinx3 apparently doesn't take input from audio files, but rather from precomputed cepstral data files). I first used a cd-tied S3 model, and the results weren't good; I suspect that this model is too specific, having been trained on transcripts with first pronunciations only and no silences. But when I used the coarser ci model instead, I got much better results, which look like:
<s> WE WE <sil> HAVE FARM TWO HOURS AWAY </s> (st001a011)
<s> SO <sil> I DON'T(2) HOW TO(3) LOCATE ANYBODY I DON'T(2) KNOW WHAT TO(3) DO AT THIS POINT /NOISE/ </s> (st002a007)
<s> WE WILL(2) CERTAINLY DO OUR(3) VERY BEST TO DO THAT(2) </s> (st002b017)