From: Nickolay S. <nsh...@gm...> - 2015-01-13 23:49:06
> On 14 Jan 2015, at 2:37, <Dan...@pa...> <Dan...@pa...> wrote:
>
> Hello Nicolay,
>
> Thanks very much for your thoughtful answer. My context was that I wondered whether there might occasionally be an advantage to mapping words to word phrases in G rather than assigning probabilities to words. I assumed that someone had tried it and it was known not to work well, since no one seemed to do it. I couldn't find a record of anyone trying it, so I thought I'd ask.
In that context it’s probably worth describing how recognition works. Many newcomers are confused about this, and you might be too. People imagine that audio is converted to phones, then phones are converted to words, and then words are converted to phrases. It does not work like that, because there are many, many ways to do such a conversion. Phone boundaries are blurred, and often you cannot easily decide which phone corresponds to which word. Consider the famous «wreck a nice beach» example, which can be confused with «recognize speech». You cannot make the conversion decision locally; you need a global 1-best result.
So instead of following that straightforward process, we consider all possible conversions and select the one with the globally minimal weight. Decoding is therefore not a straightforward application of a transducer but the scoring of all possible paths with an acceptor. This is where the acceptor is required, and why you need to assign probabilities to the results.
The decoding result is not
G(L(audio))
It is, in simplified form,
min_{over all possible audio splits} G(L(audio split))
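To make the difference concrete, here is a minimal toy sketch in Python. It is not how Kaldi or any real decoder is implemented. Everything in it is hypothetical: the lexicon, the word costs, and the function names are invented for illustration. The "audio" is reduced to a fixed sequence of phone-like symbols, whereas a real decoder scores acoustic frames and never commits to a phone string first. The sketch only shows that a locally cheapest choice can lose to the minimum taken over all possible splits.

```python
from functools import lru_cache

# Hypothetical toy lexicon (the role of L): word -> sequence of units.
LEXICON = {
    "a": ("ah",),
    "an": ("ah", "n"),
    "nice": ("n", "ay", "s"),
    "ice": ("ay", "s"),
}

# Hypothetical word costs (the role of G as a weighted acceptor),
# e.g. negative log probabilities; lower is better.
COST = {"a": 2.0, "an": 1.0, "nice": 1.0, "ice": 3.0}


def words_at(units, start):
    """Yield (word, end) for every lexicon word that matches at `start`."""
    for word, phones in LEXICON.items():
        end = start + len(phones)
        if units[start:end] == phones:
            yield word, end


def greedy_decode(units):
    """Local decisions: always take the cheapest word that fits here."""
    pos, words, total = 0, [], 0.0
    while pos < len(units):
        candidates = list(words_at(units, pos))
        if not candidates:
            return None  # dead end, a local decision cannot recover
        word, pos = min(candidates, key=lambda c: COST[c[0]])
        words.append(word)
        total += COST[word]
    return total, words


def global_decode(units):
    """Global decision: minimum total cost over all possible splits."""

    @lru_cache(maxsize=None)
    def best_from(start):
        if start == len(units):
            return 0.0, ()
        best = None
        for word, end in words_at(units, start):
            tail = best_from(end)
            if tail is None:
                continue  # this split cannot cover the rest of the input
            candidate = (COST[word] + tail[0], (word,) + tail[1])
            if best is None or candidate[0] < best[0]:
                best = candidate
        return best

    result = best_from(0)
    return None if result is None else (result[0], list(result[1]))


UNITS = ("ah", "n", "ay", "s")
print(greedy_decode(UNITS))  # (4.0, ['an', 'ice'])
print(global_decode(UNITS))  # (3.0, ['a', 'nice'])
```

With these made-up costs the greedy pass takes «an» because it is the cheapest first word, and is then forced into the expensive «ice»; the global search finds that «a nice» is cheaper overall. This is the min over all splits from the formula above, in miniature.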
This is not really a good discussion for the kaldi-developers mailing list; maybe we can move it off-list.