On Mon, Aug 11, 2014 at 10:26 AM, Horia Cucu horiacucu@users.sf.net wrote:
Hi all,
First of all, I must mention that this is my first contact with Kaldi. I have some experience with other speech recognition toolkits (HTK, Sphinx) and have used them for small- and large-vocabulary ASR tasks.
I haven't installed anything yet and I'm not quite sure where to begin, but my goal for now is to create posterior features for a speech database using Kaldi.
Can you give me some guidelines on how to begin?
Thanks,
Horia
On 11 August 2014 21:38, Daniel Povey danielpovey@users.sf.net wrote:
Can you say what you intend to use these posterior features for?
Dan
On Tue, Aug 12, 2014 at 7:40 AM, Horia Cucu horiacucu@users.sf.net wrote:
I want to use them for spoken term detection. My experiments cover two
scenarios:
a) Spoken term detection based on phone posterior features
b) Spoken term detection based on the actual phones (the 1-best hypothesis
string of phones)
Horia
It would probably be better to generate a lattice, possibly a phone-level
lattice, and do keyword search on the lattice. Kaldi already has
keyword-search code, which was used for the BABEL project (see
egs/babel/s5b), but the setup is fairly complicated. There is also an
example script for keyword search in the WSJ example, but I don't know how
recently it has been tested, or whether it handles words that are not in
the vocabulary (probably not).
To generate a phone-level lattice you could do one of two things. You could
convert a word lattice to a phone lattice using lattice-align-phones with
--replace-output-symbols=true, but the result will only contain phone
sequences that correspond to actual word sequences. Or you could build a
language model at the phone level and create a decoding graph from it. The
latter approach is probably only practical if you have a system without
word-position-dependent phones (--position-dependent-phones false to
prepare_lang.sh), and I'm afraid no script currently exists for it, at
least in the checked-in code, although it should be doable.
If you really want phone-posterior features, not from a lattice, one way to
do it is to train a neural net to get the posteriors of context-dependent
states and evaluate it using nnet-forward (nnet1 setup) or nnet-compute
(nnet2 setup). Then convert the output to pdf-level posteriors using
logprob-to-post or prob-to-post, and finally to phone-level posteriors
using post-to-phone-post.
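To make the last step concrete, here is a minimal Python sketch of the idea behind going from pdf-level to phone-level posteriors: for each frame, the posteriors of all pdfs that belong to the same phone are summed. This is only an illustration of the arithmetic, not how the Kaldi binaries are implemented. The function name, the pdf_to_phone table and the toy numbers are all made up for the example, and it assumes every pdf belongs to exactly one phone, which may not hold for a real decision tree.

```python
from collections import defaultdict


def pdf_post_to_phone_post(frame_post, pdf_to_phone):
    """Sum pdf-level posteriors into phone-level posteriors for one frame.

    frame_post: list of (pdf_id, probability) pairs for a single frame.
    pdf_to_phone: dict mapping each pdf-id to one phone label (hypothetical;
        assumes every pdf belongs to exactly one phone).
    """
    phone_post = defaultdict(float)
    for pdf_id, prob in frame_post:
        phone_post[pdf_to_phone[pdf_id]] += prob
    total = sum(phone_post.values())
    if total == 0.0:
        return {}
    # Renormalize in case low-probability entries were pruned upstream.
    return {phone: p / total for phone, p in phone_post.items()}


# Toy data, invented for illustration: 4 pdfs shared between 2 phones.
pdf_to_phone = {0: "a", 1: "a", 2: "b", 3: "b"}
utterance = [
    [(0, 0.5), (1, 0.25), (2, 0.25)],  # frame 0
    [(2, 0.125), (3, 0.875)],          # frame 1
]

# Scenario (a): per-frame phone posterior features.
phone_posteriors = [pdf_post_to_phone_post(f, pdf_to_phone) for f in utterance]
# Crude per-frame stand-in for scenario (b): the most likely phone per frame.
best_phones = [max(post, key=post.get) for post in phone_posteriors]

print(phone_posteriors)
print(best_phones)
```

Note that taking the per-frame maximum is not the same as a decoded 1-best phone string, since it ignores the language model and HMM topology. For that, decoding with a phone-level graph or taking the best path through a phone lattice would be the closer match.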
Guoguo may want to add more regarding the keyword search.
Dan