Suppose I have a sentence:
Well let me see yes that's it.
The actual spoken speech sounds like this:
Well (short pause) let me see (short pause) yes that's it.
I used Karel's nnet to do the alignment, and the result is that the whole sentence was aligned to the final part ("yes that's it") of the actual spoken period, while the earlier period was assigned to NSN.
Let us call it a chunking effect, because the phones tend to be chunked together within one spoken period.
Just wondering: is this normal, or is it due to a parameter setting?
Sounds like an alignment error - could be a search error, fixable by a
higher beam, or simply a modeling error. However, it's unusual for
NSN to appear in the alignment if there was no such marking in the
transcript, because it's not normally an optional silence that the
lexicon would allow between words (normally only SIL is allowed).
Dan
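If it is a search error, re-running alignment with a wider beam may fix it. A minimal sketch, assuming the standard Kaldi alignment scripts (the directory names below are example paths; the nnet alignment script in Karel's setup, steps/nnet/align.sh, accepts the same beam options):

```shell
# Re-run alignment with a wider beam than the usual defaults
# (beam=10, retry-beam=40). All paths below are examples; substitute
# your own data, lang, and model directories.
steps/align_si.sh --nj 8 \
  --beam 20 --retry-beam 80 \
  data/train data/lang exp/tri3 exp/tri3_ali_beam20
```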
I thought both SIL and NSN are generated by Kaldi, no? I am using the TedLium scripts.
In the transcript the first phones are supposed to be w eh l..., but the alignment output gives NSN_S before w eh l... There is some rain sound in the background, so I had supposed NSN_S means that kind of noise, but if it is not generated by Kaldi then it would be very strange.
I guess the main reason is that the model for silence is poor, so it cannot add SIL between the words of a sentence, hence all the words stick together. I am thinking of some possible solutions but am not sure which is best:
[a] add a symbol <sp> between every word;
[b] boost silence;
[c] add some "silent sentences" to the transcript.
What do you think?
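For option [b], boosting silence is already exposed as a flag on the standard alignment scripts, so it may be the easiest to try. A sketch (directory names are example paths):

```shell
# --boost-silence scales up the acoustic likelihood of the silence
# phones during alignment, making it easier for SIL to be inserted
# between words.
steps/align_si.sh --boost-silence 1.5 --nj 8 \
  data/train data/lang exp/tri3 exp/tri3_ali_boostsil
```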
I think the issue is likely that you have OOVs in your transcript,
which are getting turned into <UNK>, which has a pronunciation of NSN
in the Tedlium setup, which is eating up a bunch of speech. [Note:
using NSN to model unknown words is slightly against the meaning of
NSN, which I intended to mean non-spoken noise, but it doesn't really
matter; it's just a cosmetic issue.]
Likely you made some kind of scripting error and your "text" file has
some kind of garbage before the "well" that doesn't appear in
words.txt.
Dan
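A quick way to check for OOVs is to compare the transcript vocabulary against words.txt. A self-contained sketch with made-up file contents (point the same pipeline at your own data/train/text and data/lang/words.txt):

```shell
# Build a tiny example: a Kaldi-style "text" file (utterance id followed
# by the words) and a words.txt symbol table (word plus integer id).
# "that's" is deliberately left out of words.txt to play the role of an OOV.
mkdir -p /tmp/oov_demo && cd /tmp/oov_demo
printf "utt1 well let me see yes that's it\n" > text
printf "well 1\nlet 2\nme 3\nsee 4\nyes 5\nit 6\n" > words.txt

# Words that appear in the transcript...
cut -d' ' -f2- text | tr ' ' '\n' | sort -u > transcript_vocab
# ...versus words the symbol table knows about.
awk '{print $1}' words.txt | sort -u > known_vocab
# Anything printed here would be mapped to <UNK> (and hence NSN) at alignment time.
comm -23 transcript_vocab known_vocab   # -> that's
```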
In the alignment output, there are 2 NSN_S between EVERY sentence. In the transcript (the STM file that I feed in), I have already removed all punctuation marks, so I cannot see where the NSN_S comes from (unless Kaldi automatically inserts them between sentences?). Which files in data/ can I check?
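A few places to look, assuming the standard data/lang layout that the Tedlium recipe produces (the exact OOV symbol, <unk> or <UNK>, depends on the recipe; adjust paths to your setup):

```shell
cat data/lang/oov.txt                 # the word that OOVs are mapped to
grep NSN data/lang/phones.txt         # confirm NSN (and variants like NSN_S) are in the phone set
grep NSN data/local/dict/lexicon.txt  # which words have an NSN pronunciation
grep -c '<unk>' data/train/text       # whether the OOV symbol leaked into the transcript itself
```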