I was reading this paper: "Spoken term detection based on the most probable
phoneme sequence", Gosztolya, G.; Toth, L.; SAMI 2011. On page 4, the paper
states: "The acoustic models may be further refined by using more
sophisticated machine learning techniques. One possibility is to apply
Artificial Neural Nets (ANNs) to estimate the local probability values instead
of Gaussian curves. The resulting construct is called the HMM/ANN hybrid.
Thanks to the advantages of ANNs, a 1-state monophone hybrid model can produce
just as good an accuracy score as a standard 3-state triphone HMM."
Is this true? I thought ANN-based speech recognition systems were outdated;
I have seen nobody using them, and I was told that HMMs are the best
performers. If the statement above is true, why does nobody use ANN-based ASR
or build ANN-based toolkits nowadays?
"I have seen nobody using them. I was told that HMMs are the best performers."
That's not true. If you search the latest conference proceedings you'll see a
lot of papers about MLP features, which are basically ANNs. Many decoders use
them, for example RWTH-ASR or Kaldi. Moreover, the recently introduced deep
belief networks, which are multi-layer ANNs, are known to provide the best
phonetic accuracy.
The issue with ANNs is how to adapt the model to the speaker, but that has
some solutions. In any case, as a phonetic classifier this approach is quite
successful.
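To make the hybrid idea concrete, here is a toy sketch (all numbers made up, not from the paper) of how a net's phone posteriors are typically turned into emission scores for an HMM: the posteriors P(phone | frame) are divided by the phone priors P(phone), giving scaled likelihoods that can replace the Gaussian emission densities.

```python
import numpy as np

# Toy sketch of the HMM/ANN hybrid scoring scheme. A real system would
# use a trained MLP; here random scores stand in for its outputs.
rng = np.random.default_rng(0)
n_frames, n_phones = 5, 3

# Pretend-MLP outputs: one posterior distribution P(phone | frame) per
# frame, obtained here by a softmax over random logits.
logits = rng.normal(size=(n_frames, n_phones))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Phone priors, normally estimated from training-frame frequencies
# (made up here).
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods via Bayes' rule, dropping the constant p(frame):
#   p(frame | phone)  is proportional to  P(phone | frame) / P(phone)
scaled_likelihoods = posteriors / priors

# In a decoder, these log-scores replace the Gaussian emission
# log-likelihoods of a standard HMM state.
log_scores = np.log(scaled_likelihoods)
print(log_scores.shape)
```

This division by the prior is the standard trick that lets a discriminatively trained classifier plug into a generative HMM framework.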
Just a note on the terminology: in the phonetic recognition field, a
distinction is made between phonetic 'classification', where the phone
boundaries are known, and 'recognition', where they are not.
See for example: http://groups.google.com/group/phnrec/msg/356dce67789f2c08
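The distinction can be illustrated with a toy sketch (all scores made up): in classification each known segment gets the single best label, while in recognition the label sequence must be decoded from the frames themselves (a real system would run Viterbi over an HMM; the per-frame argmax below is just the degenerate no-transition case).

```python
import numpy as np

# Toy frame scores: 6 frames, 3 hypothetical phone classes.
rng = np.random.default_rng(1)
phones = ["a", "t", "s"]
frame_scores = rng.normal(size=(6, len(phones)))

# Classification: boundaries are given, e.g. frames 0-2 and 3-5 form
# two segments; pick the best label per segment independently.
segments = [(0, 3), (3, 6)]
labels = [phones[int(np.argmax(frame_scores[b:e].sum(axis=0)))]
          for b, e in segments]

# Recognition: no boundaries, so the sequence is decoded from the
# frames (here trivially, frame by frame; Viterbi in a real system).
decoded = [phones[i] for i in frame_scores.argmax(axis=1)]
print(labels, decoded)
```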
But why not use ANN classifiers on an articulatory (log area) vector instead
of, or better, in addition to the acoustic one? Most phones have one or two
main articulation points (constrictions), which is enough to distinguish them
from the others. Moreover, this works well even in cases with missing/coupled
formants!