Why not to enhance DCEP calculation method?

Rmkf
2011-06-03
2012-09-22
  • Rmkf

    Rmkf - 2011-06-03

    From a brief look at the Sphinx source, the delta-cepstrum is calculated as
    just the difference between frames current+2 and current-2 for the short
    delta, and +-4 for the long one. The fact is that speech spectrum estimation
    is highly sensitive to any noise and/or pitch asynchrony. E.g., for male
    speakers the pitch can easily be as low as 50 or even 30 Hz, which
    corresponds to a period of about 20-33 ms. Also, for this type of voice the
    open and closed glottis phases are clearly visible in a sound editor,
    especially on LPC-stripped speech. The open-glottis phase can last up to 4 or
    even 5 ms! Thus, if MFCC is computed at a constant rate of 10 ms (if I
    understood it right), the MFCC vectors become exposed to very intense
    stroboscopic-type modulation. If a finite difference is taken on top of that,
    precision degrades considerably.
    So it seems direct and clear to do some filtering on the DCEPs. If we take
    something as simple as a symmetric window (1/4, 1/2, 1/4), which cuts at
    Fs/2, and pass the DCEP through it, then for short DCEPs we get the following
    weights: (-1/4, -1/2, -1/4, 0, 1/4, 1/2, 1/4). Analogous filtering can be
    performed on long DCEPs. Is this a good idea, or are there some issues that I
    haven't understood yet?
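
    The weights quoted above can be checked numerically. This is only a minimal
    sketch, not code from the Sphinx source: it assumes the short delta is
    c[t+2] - c[t-2] as described in the post, and the helper name
    `smoothed_delta` and its edge handling are hypothetical choices made for
    illustration.

```python
import numpy as np

# Short delta as described in the post: d[t] = c[t+2] - c[t-2].
# Weights are listed for frame offsets -2..+2.
short_delta = np.array([-1.0, 0.0, 0.0, 0.0, 1.0])

# Symmetric smoothing window from the post; its gain at Fs/2 is
# 1/4 - 1/2 + 1/4 = 0, so it removes frame-to-frame alternation.
smooth = np.array([0.25, 0.5, 0.25])

# Smoothing the delta stream is the same as applying one combined kernel
# directly to the cepstra. Offsets -3..+3:
# -1/4, -1/2, -1/4, 0, 1/4, 1/2, 1/4
combined = np.convolve(short_delta, smooth)
print(combined)


def smoothed_delta(cep, weights=combined):
    """Apply the per-offset weights along time to a (frames, coeffs) array.

    Edge frames are handled by repeating the first/last frame, which is an
    assumption of this sketch and not something stated in the post.
    """
    half = len(weights) // 2
    padded = np.pad(cep, ((half, half), (0, 0)), mode="edge")
    # out[t] = sum over k of weights[k] * cep[t + k - half]
    return sum(w * padded[k:k + len(cep)] for k, w in enumerate(weights))
```

    On a cepstral track that rises linearly by 1 per frame, both the plain
    and the smoothed short delta give 4 away from the edges, so the window
    changes the noise behaviour but not the scale of the feature.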

     
  • Nickolay V. Shmyrev

    There should be some advantage, but overall it is not great, and often the
    result even gets worse. See for example this paper:

    Combining Spectral Representations for Large Vocabulary Continuous Speech
    Recognition
    Giulia Garau and Steve Renals

    http://www.cstr.ed.ac.uk/downloads/publications/2008/garau-taslp08.pdf

    I think the biggest reason is that pitch extraction is not stable in various
    conditions, especially in noise. Much more interest nowadays goes into the
    domain of sparse signal representations and overlapped dictionaries. The
    domain of a physics of the speech production is not really popular.

     
  • Rmkf

    Rmkf - 2011-06-05

    "The domain of a physics of the speech production is not really popular. "
    - That's good news! Because it leaves room for me to make something new! ==)
    At least, many years ago I made such a mechanics-based synthesizer. It's
    available and anyone can evaluate it right now:
    http://dump.ru/file/5255545
    Even very conservative people found it very natural-sounding (though like
    the voice of a mentally ill or very stupid person =) ). I can upload it to
    prove my previous statements, but it was for the Russian language and was
    written for DOS, though it works well from the command line under Windows.
    So, if HMM sequencing of speech units is combined with the naturalness of
    physical modelling, it can produce very good results. That's why I'm
    researching articulatory representation of recorded live speech.

    "pitch extraction is not stable in various conditions"
    What's more: IMHO it's impossible to determine pitch exactly in all
    conditions, even with strictly no noise! Let's take as an example the song
    "Mercedes Benz" performed a cappella by J.J. I've looked very attentively at
    the LPC-stripped version of it: the period-to-period modulation of the pitch
    pulses is so big (in length as well as in form/amplitude) that it absolutely
    cannot be decided what is the pitch period and what is a "false peak" inside
    a bigger period. Some people have a "dirty" (screaming/growling-like) pitch
    as a personal feature. Sometimes, at low male pitch, neighboring pitch
    periods differ by 20+% in length! So IMHO the right way is to consider many
    pitch periods rather than one: this approach can regenerate ANY LPC residual
    even with no explicit voiced/unvoiced decision - just N periods, some of
    which are multiples of the fundamental, some variations in length from one
    period to the next. Moreover, this sequence of "pulses" can be modelled as a
    Markov process and regenerated in that way. I tell all this just because
    I've already done it and have a tool that regenerates the pitch of a
    reference person with his/her screaming/growling features.
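
    The Markov idea can be illustrated with a rough sketch. This is not the
    tool mentioned in the post: it assumes a first-order chain over quantized
    period lengths only (pulse shape and amplitude are ignored), and the
    function names, the number of states, the add-one smoothing and the toy
    period values are all hypothetical.

```python
import numpy as np


def fit_period_chain(periods_ms, n_states=8):
    """Quantize pitch-period lengths into bins and estimate transitions."""
    periods_ms = np.asarray(periods_ms, dtype=float)
    edges = np.linspace(periods_ms.min(), periods_ms.max(), n_states + 1)
    # Map each period to a bin index in 0..n_states-1.
    states = np.clip(np.digitize(periods_ms, edges) - 1, 0, n_states - 1)
    # Add-one smoothing so unseen transitions keep a small probability
    # (an assumption of this sketch).
    counts = np.ones((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    trans = counts / counts.sum(axis=1, keepdims=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return trans, centers, states[0]


def sample_periods(trans, centers, start, n, seed=0):
    """Regenerate a sequence of n period lengths from the fitted chain."""
    rng = np.random.default_rng(seed)
    state, out = start, []
    for _ in range(n):
        out.append(centers[state])
        state = rng.choice(len(centers), p=trans[state])
    return np.array(out)


# Made-up toy data: periods jittering by about 20% around 20 ms, with
# occasional period doubling. Not measured from any recording.
toy_periods = [20, 24, 20, 24, 40, 20, 24, 20, 44, 24]
trans, centers, start = fit_period_chain(toy_periods, n_states=4)
print(sample_periods(trans, centers, start, 10))
```

    Because doubled periods are just additional states, a chain like this needs
    no explicit decision about which peak is the "true" pitch period.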

     
