Sorry for my English, but I need your help...
I'm looking for the algorithms that you have used in espeak. It's very difficult to find them in the source code, and I can't find them on the Internet. I hope you can help me...
: I'm looking for the algorithms that you have used in espeak. It's very difficult to find them in the source code, and I can't find them on the Internet.
What exactly do you want?
Here's a discussion of speech generation methods:
http://www.acoustics.hut.fi/~slemmett/dippa/chap5.html
espeak uses the "sinusoidal" method. It makes a vowel sound by adding together the sine waves of the various harmonics (see the wavegen() function in wavegen.cpp). Different vowels have different mixtures of harmonics. Consonants such as [s] and [t] are simply recorded sound samples (.WAV files). Some consonants such as [z] are produced by a mixture of both methods.
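To make the "adding together sine waves" idea concrete, here is a minimal Python sketch. It is not the code from wavegen(); the sample rate, the formant table, and the function names are all assumptions made up for illustration. Each harmonic of the pitch gets an amplitude read off a spectral envelope with one bell-shaped peak per formant, and the weighted sine waves are summed.

```python
import math

SAMPLE_RATE = 22050  # samples per second (assumed for this sketch)

# Hypothetical formant peaks: (centre frequency Hz, width Hz, peak height).
# Illustrative values for an [a]-like vowel, not taken from eSpeak's data.
FORMANTS = [(700.0, 130.0, 1.0), (1200.0, 150.0, 0.6), (2600.0, 200.0, 0.3)]


def harmonic_amplitude(freq_hz, formants=FORMANTS):
    """Height of the spectral envelope at freq_hz (one bell curve per formant)."""
    total = 0.0
    for centre, width, height in formants:
        distance = (freq_hz - centre) / width
        total += height * math.exp(-0.5 * distance * distance)
    return total


def synthesize_vowel(pitch_hz, duration_s, formants=FORMANTS):
    """Sum sine waves at multiples of pitch_hz, weighted by the envelope."""
    n_samples = int(SAMPLE_RATE * duration_s)
    # Keep only harmonics below the Nyquist frequency to avoid aliasing.
    n_harmonics = int((SAMPLE_RATE / 2) // pitch_hz)
    amplitudes = [harmonic_amplitude(h * pitch_hz, formants)
                  for h in range(1, n_harmonics + 1)]
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        value = 0.0
        for h, amp in enumerate(amplitudes, start=1):
            value += amp * math.sin(2.0 * math.pi * h * pitch_hz * t)
        samples.append(value)
    # Normalise so the loudest sample has magnitude 1.0.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


if __name__ == "__main__":
    wave = synthesize_vowel(pitch_hz=100.0, duration_s=0.05)
    print(len(wave), "samples")
```

Changing the entries in the formant table changes the mixture of harmonics, which is what makes one vowel sound different from another in this approach.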
A different method of generating speech sounds is to start with a waveform which is rich in harmonics (e.g. something like a triangle wave) and then apply digital filters. Changing the resonances of the filters produces different sounds. For an example of this method, look at the "rsynth" project on SourceForge. This is based on the "Klatt synthesizer" (do a Google search on that).
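A rough Python sketch of this filter-based approach, for comparison. It uses the standard second-order digital resonator found in Klatt-style synthesizers, but everything else is an assumption for illustration: the sawtooth source (picked here because it contains every harmonic), the sample rate, the formant values, and the function names. It is not code from rsynth or eSpeak.

```python
import math

SAMPLE_RATE = 22050  # samples per second (assumed for this sketch)


def sawtooth(pitch_hz, n_samples):
    """Harmonically rich source signal: a naive sawtooth in the range [-1, 1)."""
    return [2.0 * ((n * pitch_hz / SAMPLE_RATE) % 1.0) - 1.0
            for n in range(n_samples)]


class Resonator:
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""

    def __init__(self, freq_hz, bandwidth_hz):
        t = 1.0 / SAMPLE_RATE
        self.c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
        self.b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
                  * math.cos(2.0 * math.pi * freq_hz * t))
        self.a = 1.0 - self.b - self.c  # gives unity gain at 0 Hz
        self.y1 = 0.0  # previous output sample
        self.y2 = 0.0  # output sample before that

    def process(self, x):
        y = self.a * x + self.b * self.y1 + self.c * self.y2
        self.y2, self.y1 = self.y1, y
        return y


def filter_vowel(pitch_hz, n_samples, formants):
    """Pass the source through one resonator per formant, connected in cascade."""
    resonators = [Resonator(freq, bandwidth) for freq, bandwidth in formants]
    out = []
    for x in sawtooth(pitch_hz, n_samples):
        for resonator in resonators:
            x = resonator.process(x)
        out.append(x)
    return out


if __name__ == "__main__":
    # Hypothetical (frequency, bandwidth) pairs for an [a]-like vowel.
    wave = filter_vowel(100.0, 2000, [(700.0, 130.0), (1200.0, 150.0)])
    print(len(wave), "samples")
```

Here the vowel quality comes from the resonator frequencies rather than from a table of harmonic amplitudes, so moving a resonator moves the corresponding formant.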
The idea of "formants" is fundamental to both these methods. These are peaks in the audio spectrum of vowels. The positions of formants 1, 2, and 3 determine the type of vowel. A good tool for analysing speech sounds is "praat" from www.praat.org. This will display the formants and you can see how they change during a diphthong such as [aI] in "high".
Thanks a lot for your help. I'm very interested in this theme.
I am going to implement intonation in your application, but I have some problems. I don't understand the nature of the different coefficients, matrices and formulas for the pitch increment and for calculating the three components of the speed. I have tried to find information on the Internet, but I have found only an overview. Could you tell me where I can find a full description of these algorithms?
Thank you.
An alternative intonation would be interesting.
If by "intonation" you mean just the pitch variation throughout a sentence, then you should only need to change the intonation.cpp file. This determines a lower-pitch, upper-pitch, and pitch envelope (fall, rise, fall-rise, etc) for each syllable in a clause, taking note of whether the syllable has primary stress, secondary stress, or is unstressed.
The intonation.cpp routines set the PHONEME_LIST pitch1, pitch2, and env fields for each vowel. How you do this is up to you. I came up with the current algorithm by trial and error, adjusting things until it sounded OK. You don't really need to understand how I did it, just come up with a better way :-)
For example, a simple method might be to have the pitch decrease throughout the clause, reducing the pitch at each primary stressed syllable. However, that would have problems for a long clause with many syllables.
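A small Python sketch of that simple method, mainly to show where it breaks down. The stress codes, the pitch numbers (arbitrary units), and the function name are hypothetical and are not taken from intonation.cpp.

```python
# Hypothetical stress codes for this sketch.
UNSTRESSED, SECONDARY, PRIMARY = 0, 1, 2


def declining_pitch(stresses, start=80, floor=30, step=8):
    """Return one pitch level per syllable, stepping down at each primary stress.

    Pitch values are in arbitrary units; `floor` is the lowest level allowed.
    """
    level = start
    contour = []
    for stress in stresses:
        if stress == PRIMARY:
            level = max(floor, level - step)
        contour.append(level)
    return contour


if __name__ == "__main__":
    short_clause = [PRIMARY, UNSTRESSED, PRIMARY, UNSTRESSED]
    print(declining_pitch(short_clause))      # [72, 72, 64, 64]

    long_clause = [PRIMARY] * 10
    print(declining_pitch(long_clause))       # ends with a flat run at the floor
```

With ten primary stresses the level reaches the floor after the seventh and stays flat, which is the long-clause problem: the step size would have to depend on how many stressed syllables the clause contains.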
A good program for displaying how pitch changes throughout a sample of spoken or synthesized speech is "praat" from www.praat.org. It shows the speech waveform of a phrase together with a graph of the pitch and the formants. So you could speak a sentence, look at how the pitch varies, and then try and write an algorithm which does something similar. Of course a simple speech synthesizer doesn't understand the meaning of the sentence so it doesn't know which words to emphasize.
You also mentioned speed coefficients and matrices. These are not concerned with intonation as such, but rather determine how the length of a vowel varies depending on the adjacent sounds, its stress level, and its position in a word. For example, in English a vowel is shorter before an unvoiced consonant such as [s], [p] or [t] than before a voiced consonant such as [z], [b] or [d].
speed1, speed2, speed3 determine the relative lengths of the last syllable of a word, the next to last, and earlier syllables, respectively. They are derived from voice->speedf1, speedf2, speedf3 combined with the overall speaking speed. These factors are set in VoiceReset() in synthdata.cpp and were determined by trial and error. Perhaps different values might be better for a different language or accent.
The stress_lengths array, which gives the relative lengths of vowels with different stress levels, can be set in a voice file using the stressLength command. So you can easily experiment to make stressed vowels longer, or to make stressed and unstressed vowels the same length.
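As a rough illustration of how factors like these might combine, here is a Python sketch. It is not eSpeak's actual formula: the table values, the 0.75 shortening factor, the three-level stress index, and the function name are all invented for the example. Only the three influences (stress level, position in the word, and the following consonant) come from the description above.

```python
# Hypothetical relative vowel lengths by stress level:
# index 0 = unstressed, 1 = secondary stress, 2 = primary stress.
STRESS_LENGTHS = [0.8, 1.0, 1.2]

# Hypothetical position factors: last syllable, next to last, earlier syllables.
POSITION_FACTORS = (1.2, 1.0, 0.9)

# Unvoiced consonants mentioned in the post; a real table would be larger.
UNVOICED = {"s", "p", "t"}

# Assumed shortening applied to a vowel before an unvoiced consonant.
UNVOICED_SHORTENING = 0.75


def vowel_length_ms(base_ms, stress, syllables_from_end, next_consonant=None):
    """Scale a base vowel length by stress, word position, and the next sound."""
    position = POSITION_FACTORS[min(syllables_from_end, 2)]
    length = base_ms * STRESS_LENGTHS[stress] * position
    if next_consonant in UNVOICED:
        length *= UNVOICED_SHORTENING
    return length


if __name__ == "__main__":
    # Same stressed final vowel before [z] and before [s].
    print(vowel_length_ms(100, 2, 0, "z"))
    print(vowel_length_ms(100, 2, 0, "s"))
```

Setting every entry of the stress table to the same value makes stressed and unstressed vowels the same length, which is the kind of experiment the stressLength command allows.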
If you have any specific questions, please ask. Are you interested in improving the intonation/prosody of English or of a different language?
Please check your e-mail on sourceforge.net. We have sent you a Windows-Linux version of eSpeak.
I am also looking for the algorithms used in espeak. I am confused about the synthesis techniques used in espeak: the original eSpeak synthesizer and a Klatt synthesizer. Which one is used nowadays?
Which is best?
I am looking for the algorithm and formulas for the eSpeak synthesizer.