As far as I've briefly seen in the Sphinx source, the delta cepstrum is calculated as just the difference between the frames at +2 and -2 relative to the current frame for the short delta, and at +-4 for the long one.
The fact is that speech spectrum estimation is highly sensitive to any noise and/or pitch asynchrony. E.g., for male speakers the pitch frequency can easily be as low as 50 or even 30 Hz, which corresponds to period lengths of 20-33 ms. Also, for this type of voice the open- and closed-glottis phases are clearly visible in a sound editor, especially in LPC-stripped speech. The open-glottis phase can last up to 4 or even 5 ms! Thus, if MFCCs are computed at a constant 10 ms frame rate (if I understood it the right way), the MFCC vectors become exposed to very strong stroboscopic-type modulation. If a finite difference is taken on top of that, precision degrades considerably.
So it seems direct and clear to do some filtering on the DCEPs. If we take something as simple as a symmetric Fs/2-cutting window (1/4, 1/2, 1/4) and pass the DCEP through it, then we get (for the short DCEPs) the following weights: (-1/4, -1/2, -1/4, 0, 1/4, 1/2, 1/4). Analogous filtering can be performed on the long DCEPs. Is this a good idea, or are there some issues that I haven't understood so far?
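To make the proposal concrete, here is a minimal sketch in Python/NumPy, assuming the short delta really is c[t+2] - c[t-2] as described above (the function name delta_smoothed is mine, not Sphinx's):

```python
import numpy as np

# Quick check of the weight derivation: convolving the raw short-delta
# kernel (-1, 0, 0, 0, 1), i.e. c[t+2] - c[t-2], with the smoothing
# window (1/4, 1/2, 1/4) gives exactly the 7-tap weights quoted above.
print(np.convolve([-1, 0, 0, 0, 1], [0.25, 0.5, 0.25]))
# -> [-0.25 -0.5  -0.25  0.    0.25  0.5   0.25]

def delta_smoothed(cep):
    """Smoothed short delta over a (frames, coeffs) cepstrum array."""
    w = np.array([-0.25, -0.5, -0.25, 0.0, 0.25, 0.5, 0.25])
    padded = np.pad(cep, ((3, 3), (0, 0)), mode="edge")  # replicate edge frames
    return np.stack([w @ padded[t:t + 7] for t in range(len(cep))])
```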
There should be advantages, but overall they are not great; often it even gets worse. See for example this paper:
Combining Spectral Representations for Large Vocabulary Continuous Speech Recognition
Giulia Garau and Steve Renals
http://www.cstr.ed.ac.uk/downloads/publications/2008/garau-taslp08.pdf
I think the biggest reason is that pitch extraction is not stable in various conditions, especially in noise. Much more interest nowadays goes into the domain of sparse signal representations and overcomplete dictionaries. The physics of speech production is not really a popular domain.
"The domain of a physics of the speech production is not really popular. "
- It's a good news! 'Cause it makes place for me to make something new! ==)
@ least, many years ago I've made such a mechanic-based synthesizer. It's
available and anyone can evaluate it even right now:
http://dump.ru/file/5255545
Very conservative people noticed it as very naturally sounding (though as
voice of psychically-ill or very stupid person =) ). I can upload it to prove
my previous statements, but it was for russian language and worked under dos,
but it works well from cmd line under windows. So, if to add together HMM-
secuencing of speech units with physmod naturalness - then it can produce very
good results. That's why I'm researching articulatory representation of
recorded live speech.
"pitch extraction is not stable in various conditions"
Moremoreover: at me IMHO it's impossible to xactly determine pitch in all
conditions, even with strictly no noise! Le'z take as example the song
"Mercedec Benz" performed acapella by J.J. I've looked very attentively to
LPC-stripped version of this: period-to-period modulation of pitch pulses is
so big (as length as form/amplitude) that absolutely can not be decided what
is the pitch period and what is "false peak" inside a bigger period. Some
people have "dirty" (screaming/growling-alike) pitch nature as (s)he's
personal feature. Sometimes at low male pitch the neighboring pitch periods
are differs by 20+% in length! So IMHO the right way is to consider many
rather then one pitch periods: this approach can regenerate ANY LPC-residual
even with no explicit voiced-unvoiced decision - just N periods, some of which
are multiples of fundamental, some - variations of length from period to next.
Moreover: this sequence of "pulses" can be modelled as Markov process and
regenerated in such a way. I tell all this just because i've already did it
and yield tool to regenerate pitch of reference person with his/she
screaming/growling features.
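Roughly, the Markov-process idea looks like this (a minimal sketch in Python/NumPy; the period "types", transition probabilities, and jitter values below are just illustrative, not the values from my actual tool):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy first-order Markov model over pitch-period "types":
# 0 = normal period, 1 = stretched period (e.g. +20% length),
# 2 = sub-period "false peak" (a pulse at roughly half the period).
# Transition probabilities here are invented for illustration.
P = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.8, 0.1, 0.1]])
length_factor = [1.0, 1.2, 0.5]   # relative period length per state

def pulse_times(n_periods, t0=0.01, state=0):
    """Sample pulse instants (seconds) for a nominal period t0."""
    t, times = 0.0, []
    for _ in range(n_periods):
        jitter = rng.normal(1.0, 0.03)      # small period-to-period jitter
        t += t0 * length_factor[state] * jitter
        times.append(t)
        state = rng.choice(3, p=P[state])   # next period type
    return times

print(pulse_times(10))
```

Placing excitation pulses at such instants regenerates the irregular, "dirty" pulse train without ever committing to a single pitch value or a hard voiced/unvoiced decision.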