Rmkf - 2012-04-05

Hi all!
Let me present a new method of unsupervised blind acoustic-to-articulatory
inversion. A brief description is in the attachment.

The key feature is that the log area vector restored this way has a nearly
piecewise-linear trajectory: the log area vector moves in time from one
vertex to the next, and each such vertex corresponds to a perceptually
distinguishable phone. (The timefade itself can be of any form - linear,
step, parabolic with either positive or negative curvature - the common
property is that it is a number rising monotonically from 0 to 1.) So the
log area vector sequence can be represented by vertexes and the timefade
from one to the next with very little estimation error. This is a very
compact representation that clearly factorizes dynamics vs. destination
positions, i.e. just a sequence of commands controlling an imitating
synthesizer running time-parallel with the input speech sound! Thus, for
recognition purposes only the vertexes must be kept, not the whole o(t)
sequence. Could this significantly speed up graph search (especially in the
n-best rather than 1-best case)?
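
To make the representation concrete, here is a minimal sketch in
Python/numpy (the function and variable names are mine, not part of the
method) of how a vertex+timefade sequence rebuilds the full log area
trajectory:

    import numpy as np

    def reconstruct(vertexes, fades):
        """Rebuild a log area trajectory from vertexes plus per-segment
        timefade curves. vertexes: list of m-dim vectors; fades: list of
        1-D arrays, each rising monotonically from 0 to 1 (linear, step,
        parabolic - any monotone shape works)."""
        frames = []
        for a, b, fd in zip(vertexes[:-1], vertexes[1:], fades):
            for f in fd:
                frames.append(a + (b - a) * f)  # line from a to b, paced by f
        return np.array(frames)

    # example: two segments with linear and parabolic pacing
    v = [np.zeros(8), np.ones(8), 2.0 * np.ones(8)]
    fades = [np.linspace(0, 1, 10), np.linspace(0, 1, 10) ** 2]
    traj = reconstruct(v, fades)  # 20 frames, piecewise-linear in space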

Even a simple method of finding an optimal set of vertexes (which I
developed from scratch in a couple of hours) gives almost exact phone
labelling! This method can speed up extraction of articulatory features and
automatically set phone labels 80-90% correctly (with only a few doubled
labels placed on the same phone). Moreover, extracting direct dynamic
features of speech can obviously help to create a naturally sounding
synthesizer while fully eliminating the need for careful diphone matching!
An example of how it works is here: http://zalil.ru/33011434

Is this idea interesting enough to be incorporated into Sphinx?

=================
ATTACHMENT:

QPRS as an HMM feature vector refinement method.
I want to introduce a new approach to speech analysis/synthesis which I've
named QPRS (quasi-physical re-synthesizer).
QPRS means that for every windowed array of speech samples a "log area +
excitation mode" vector is calculated in such a way that it can be
considered to have some physical (mechanical) sense. To achieve this I
divide the local speech spectrum into 2 factors:
S(t) = Ssc(t) * Sat(t), where
S(t) is the full speech spectrum at time t,
Ssc(t) is the source/channel spectrum, and
Sat(t) is the acoustic tube spectrum at time t.
The factorisation is made in the cepstral domain:
C(t) = Cat(t) + CscS(t) + CscL(t), where
C(t) is the full cepstrum, C(t) = FFT( log( S(t) * Sconj(t) + deltaS ) / 2 ),
Cat(t) is the acoustic tube cepstrum,
CscS(t) is the short-term source/channel cepstrum,
CscL(t) is the long-term source/channel cepstrum, which contains info about
the pitch period and so on,
deltaS is a constant added to the power spectrum to make it possible to work
with a near-zero-valued spectrum.
Cat(t) = C(t) * Kcat, CscS(t) = C(t) * KcscS, CscL(t) = C(t) * KcscL, where
Kcat, KcscS, KcscL are weighting windows, with the condition that
Kcat + KcscS + KcscL = 1 for all cepstrum indexes.
CscS is concentrated near zero cepstrum indexes, corresponding to a very
wide-band spectrum factor; Cat is such that the acoustic tube spectrum
factor contains middle bands - narrower than the short-term source spectrum,
while wider than the long-term (pitch-related) spectrum factor.
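
A minimal numpy sketch of this cepstral split (the rectangular lifter
windows and the cut indexes n_scS and n_at are my illustrative assumptions;
the actual window shapes are not specified above):

    import numpy as np

    def split_cepstrum(frame, deltaS=1e-3, n_scS=4, n_at=40):
        """Factor one windowed speech frame into acoustic tube, short-term
        and long-term source/channel cepstra. Rectangular lifter windows
        over the symmetric quefrency index are assumed."""
        S = np.fft.fft(frame)
        P = (S * np.conj(S)).real                       # power spectrum
        C = np.fft.ifft(np.log(P + deltaS) / 2).real    # full cepstrum
        n = len(C)
        q = np.minimum(np.arange(n), n - np.arange(n))  # symmetric quefrency
        KcscS = (q < n_scS).astype(float)               # near zero: wide-band
        Kcat = ((q >= n_scS) & (q < n_at)).astype(float)  # middle: tube
        KcscL = (q >= n_at).astype(float)               # high: pitch etc.
        assert np.all(Kcat + KcscS + KcscL == 1)        # partition of unity
        return C * Kcat, C * KcscS, C * KcscL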
deltaS is a valuable thing, because it makes it possible to unify speech
data with different upper spectrum corners by padding the spectrum to full
band with a constant. Really, all speech data has a band-limited spectrum
because of the ADC/resampler lowpass filter (ARLPF), which is always
applied, while an ideal acoustic tube is not band-limited, so it is
necessary to eliminate the influence of the transition band of the ARLPF.
The other valuable property of deltaS is this: the fricative source is
always positioned inside the acoustic tube at a place where the 1st formant
is not generated, so in fricative sounds the 1st formant is always near
zero; the raising of very low frequencies made by deltaS compensates for its
absence in the full speech spectrum of fricative phones and imitates the
missing 1st formant, thus yielding a physically sensible log area profile
(proofs are provided by my experiments).
After this I calculate the uncepstrum of Cat(t), which is used as the
autocorrelation input for the standard LPC Levinson-Durbin routine:
ACFat(t) = FFT( exp( 2 * FFT(Cat(t)) ) )
The Levinson-Durbin routine run with ACFat(t) as input yields reflection
coefficients Rat(t), from which I compute what I have called the
quasi-mechanical log area Aat(t):
Aat(t,j) = Aat(t,j+1) + log( (1+Rat(t,j)) / (1-Rat(t,j)) ), Aat(t,m) = 0,
where m is the LPC order.
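
Continuing the sketch above, the uncepstrum + Levinson-Durbin step might
look like this (the Levinson-Durbin recursion is a textbook implementation,
not taken from the description above; the LPC order 16 is a placeholder):

    import numpy as np

    def tube_autocorrelation(Cat):
        """Uncepstrum: invert the /2 of the cepstrum step to get the tube
        power spectrum back, then take its autocorrelation."""
        power = np.exp(2 * np.fft.fft(Cat).real)
        return np.fft.ifft(power).real

    def levinson_durbin(r, order):
        """Levinson-Durbin recursion: autocorrelation -> reflection coeffs."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        refl = np.zeros(order)
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            refl[i - 1] = k
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]  # symmetric update
            err *= 1.0 - k * k
        return refl

    def log_area(refl):
        """Reflection coefficients -> quasi-mechanical log area profile:
        Aat(j) = Aat(j+1) + log((1+R(j))/(1-R(j))), Aat(m) = 0."""
        m = len(refl)
        A = np.zeros(m + 1)
        for j in range(m - 1, -1, -1):
            A[j] = A[j + 1] + np.log((1 + refl[j]) / (1 - refl[j]))
        return A[:m]

    # per frame: Cat, CscS, CscL = split_cepstrum(frame)
    #            Aat = log_area(levinson_durbin(tube_autocorrelation(Cat), 16))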
To prove that Aat(t) is really relevant to the source speech spectrum, the
results of inverse filtering and synthesis by Rat(t) are provided in the
experimental data bundle.

Now the complete HMM observation vector is the concatenation of Aat(t), the
first several coefficients of CscS(t) and the several maximum values of
lowpass-filtered CscL(t) in the form of amplitude+index pairs. Note that
IMHO for coarse/diplophonic/triplophonic/screaming voices the need to keep
several peaks of the long-term autocorrelation is absolutely necessary,
because the pitch peak amplitude/position modulation of irregular pitch can
be of any depth, so there is no threshold, even a dynamic one, which can
make pitch period evaluation robust in all cases. So I prefer to keep
several pitch peaks, because physically it is right: the pitch
amplitude/position modulation is a real physical fact, so I see no reason to
ignore it.
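
A sketch of how such an observation vector could be assembled (the 3-tap
smoother and the numbers of kept coefficients and peaks are my
placeholders):

    import numpy as np

    def observation_vector(Aat, CscS, CscL, n_coeffs=4, n_peaks=3):
        """Concatenate the log area profile, the first CscS coefficients
        and the n_peaks largest values of lowpass-filtered CscL as
        amplitude+index pairs (largest values, not strict local maxima)."""
        smooth = np.convolve(CscL, np.ones(3) / 3, mode='same')  # crude lowpass
        half = smooth[:len(smooth) // 2]   # cepstrum is symmetric; keep one half
        idx = np.argsort(half)[-n_peaks:][::-1]            # top-value indexes
        peaks = np.column_stack([half[idx], idx]).ravel()  # (amp, index) pairs
        return np.concatenate([Aat, CscS[:n_coeffs], peaks])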

Now let's consider the sequence of Aat(t). I have discovered that its
trajectory has the approximate form of a piecewise-linear (vertex-based)
type. More exactly: let's say that the sequence Aat(t(i)) = A(i), where
t0 < t1 < t2 < … < tN, has a vertex of order k and precision eps0 at index i
if all vectors between the i-th and the (i+k)-th are placed on the straight
line in m-dimensional space from A(i) to A(i+k), i.e.:
For all j = 0…k: A(i+j) = A(i) + (A(i+k) - A(i)) * fd(i,k,j) + eps(i,k,j), where
fd(i,k,j) is the time-fade coefficient from A(i) to A(i+k), which changes
monotonically from 0 to 1 as j goes from 0 to k,
eps(i,k,j) is the approximation error vector with modulus less than the
threshold eps0 (which may be adaptive).
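
A direct numpy translation of this vertex test (projection onto the segment
from A(i) to A(i+k) gives fd, the residual gives eps; the exact tolerance
handling is my choice):

    import numpy as np

    def is_vertex(A, i, k, eps0):
        """True if A[i]..A[i+k] lie on the straight line from A[i] to
        A[i+k] within eps0, with a monotonically rising time-fade fd."""
        d = A[i + k] - A[i]
        dd = np.dot(d, d)
        if dd < 1e-12:  # degenerate segment: all points must coincide
            return all(np.linalg.norm(A[i + j] - A[i]) <= eps0
                       for j in range(k + 1))
        last_fd = 0.0
        for j in range(k + 1):
            fd = np.dot(A[i + j] - A[i], d) / dd  # projection -> time-fade
            eps = A[i + j] - (A[i] + d * fd)      # off-line residual
            if np.linalg.norm(eps) > eps0 or fd < last_fd - 1e-9:
                return False   # off the line, or fade not monotone
            last_fd = fd
        return True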

Next, I find some set of vertexes Av(n), placed optimally in the sense of
the maximum possible inter-vertex time distances while keeping the error
acceptable. Experiments have shown that vertexes are placed mainly at the
centers of pronounced phones, sometimes more often; the mean distance is in
the range of 30-60 ms, sometimes up to 100+ ms.
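
One simple greedy strategy for such placement (reusing is_vertex from the
sketch above; the actual optimality criterion used in the experiments may
differ):

    def place_vertexes(A, eps0):
        """Greedily place vertexes: from each vertex, extend the segment
        as far as the straight-line approximation stays within eps0."""
        verts = [0]
        i, n = 0, len(A)
        while i < n - 1:
            k = 1                  # a 2-point segment is always exact
            while i + k + 1 < n and is_vertex(A, i, k + 1, eps0):
                k += 1
            i += k
            verts.append(i)
        return verts

    # usage: verts = place_vertexes(np.stack(per_frame_Aat), eps0=0.5)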