I am in the process of collecting data to use for training. This is the first time I have tried this and have yet to pull everything together so forgive me if this seems like a stupid question.
How well does Sphinx cope with letters that are not always pronounced in normal speech?
For example, the 't' in 'cat' should always be distinctly pronounced, at least according to pronunciation guides. But in normal speech the 't' might be dropped, resulting in something that sounds like 'ca'.
While producing training data, is it best to record examples of each pronunciation and let Sphinx work out that the 't' may or may not be sounded? Or would it be better to put two entries in the dictionary, one where the 't' is sounded and one where it is not? Or is there another solution?
Thank you in advance
Matt
Anonymous
-
2007-06-11
Ok thank you both. I will give both a try.
Matt
Best thing to do (presuming you know what the different pronunciations are) would be to put alternative pronunciations in your dictionary, like this:
CAT K AE T
CAT(2) K AE
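To make the dictionary convention concrete, here is a minimal Python sketch of how entries like the two above could be grouped by word, treating `CAT(2)` as a second pronunciation of `CAT`. The function name `load_pronunciations` and the regular expression are my own assumptions for illustration; this is not Sphinx's own loader, just a reading of the format shown here.

```python
import re
from collections import defaultdict

# Matches an entry head such as "CAT" or "CAT(2)".
# The "(n)" suffix marks an alternative pronunciation of the same word.
HEAD = re.compile(r"^(?P<word>[^\s()]+)(?:\((?P<index>\d+)\))?$")


def load_pronunciations(lines):
    """Group phone sequences by base word (hypothetical helper, not a Sphinx API)."""
    prons = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and entries without phones
        match = HEAD.match(parts[0])
        if match is None:
            continue  # skip heads that do not look like WORD or WORD(n)
        prons[match.group("word")].append(parts[1:])
    return dict(prons)


entries = ["CAT K AE T", "CAT(2) K AE"]
prons = load_pronunciations(entries)
print(prons)  # {'CAT': [['K', 'AE', 'T'], ['K', 'AE']]}
```

Both phone sequences end up under the single key `CAT`, which is the point of the `(2)` suffix: one word, two candidate pronunciations.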
Anonymous
-
2007-06-09
I agree with David's answer, but suggest caution in applying it in cases such as the example cited by Matt.
The phone T (in Sphinx notation) is an unvoiced stop consonant, which is characterized by (1) a brief silence while the vocal tract is closed by the tongue tip, (2) a "burst" as the pent-up air pressure is released, and (3) a brief duration of noise (aspiration) as air flows through the narrow but widening constriction at the tongue tip. In addition, the formants in the surrounding phones will move due to the changing position of the tongue as it moves into and out of the stop. The burst and aspiration may be more or less evident, depending on phonetic context and the way in which the T is pronounced.
I wrote that lengthy explanation to suggest that if Matt doesn't hear a "noisy" T in CAT, it may be a mistake to conclude that the T has been omitted; I suggest that it's just pronounced not as noisily as you might imagine it should be. I don't think I'd use two pronunciations for CAT. With enough data, acoustic model training will "learn" the acoustic characteristics of T in its various contexts.
There are, to be sure, cases where a T is genuinely dropped, and you can find such by a stroll through the CMUdict. For example, consider:
IDENTITY AY D EH N T AX T IY
IDENTITY(2) AY D EH N AX T IY
Note that the T is articulated at the same place as the preceding N, and in rapid speech, one can omit the stop altogether. IMHO this is a valid case for two pronunciations.
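As a rough illustration of that N-T case, the sketch below produces a candidate variant by removing a T that immediately follows an N. The helper `drop_t_after_n` is hypothetical and is not part of Sphinx or CMUdict; it is meant only to list candidates to check by ear, not as a rule to apply across a whole dictionary.

```python
def drop_t_after_n(phones):
    """Return a copy of the phone list without any T that directly follows N.

    Hypothetical helper for illustration only; review each candidate by ear.
    """
    result = []
    prev = None  # previous phone in the original sequence
    for phone in phones:
        if not (phone == "T" and prev == "N"):
            result.append(phone)
        prev = phone
    return result


full = "AY D EH N T AX T IY".split()
reduced = drop_t_after_n(full)

print("IDENTITY   ", " ".join(full))
if reduced != full:
    # Only emit an alternative when the rule actually changed something.
    print("IDENTITY(2)", " ".join(reduced))
```

For CAT (K AE T) the function returns the input unchanged, so no second entry would be proposed for it.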
To summarize, multiple dictionary pronunciations are needed to cope with different pronunciations of many words. I simply urge a little conservatism in deciding what is and isn't a different pronunciation at the broad phonetic level used in Sphinx.
cheers,
jerry