I'm using /egs/wsj to get the ctm file for NIST scoring tools.
With the /steps/get_train_ctm.sh script, I can get a final ctm file for the whole training data. With the /steps/get_ctm.sh script, however, I get several ctm files under $decode_dir/score_( is from minlmwt to maxlmwt). Each ctm file contains just a single line like:
"440c0401 1 0.87 0.41"
with no word after the duration, and every one of these ctm files covers only the same single file (like 440c0401). That is completely different from the ctm file produced by get_train_ctm.sh.
Each file in my eval set includes only one utterance, and the eval set is directly extracted from the training data.
How can I get the "real" ctm file I need for the eval set, like the one get_train_ctm.sh produces for the training set? Did I miss something or do something wrong?
The multiple directories from steps/get_ctm.sh are fine -- they are just decodings with different LM scales. Most likely you have some file mismatch. It would help to check the logs at $your_decoding_dir/scoring/log/get_ctm.*.log and see whether they contain warnings or errors.
Guoguo
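One way to do the log check suggested above is sketched here. The function name scan_ctm_logs is hypothetical, and the search pattern (any line containing "warn" or "error", case-insensitive) is an assumption about what the messages look like; only the log path comes from the reply above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print warning/error lines from the get_ctm logs.
# Argument: the decoding directory (the one that holds scoring/log/).
scan_ctm_logs() {
  local decode_dir=$1
  local log_dir="$decode_dir/scoring/log"
  # -i: ignore case, -H: show the file name, -n: show the line number,
  # -E: extended regex. The pattern is an assumption, adjust as needed.
  grep -iHnE 'warn|error' "$log_dir"/get_ctm.*.log
}
```

It would be called as scan_ctm_logs "$your_decoding_dir". grep returns a non-zero status when nothing matches, so an empty result means no line matched the pattern, not necessarily that the run was clean.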
That output is consistent with what Guoguo said, i.e. a lexicon
mismatch that caused int2sym.pl to die. I see now that get_ctm.sh and
get_train_ctm.sh are not properly detecting that error due to not
setting pipefail for the pipe that had a problem, and I just committed
a fix for this that should make this problem cause a failure earlier
on.
Dan
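To illustrate why a missing pipefail hides this kind of failure, here is a minimal bash sketch. It is not the actual change that was committed: false simply stands in for a command that dies in the middle of a pipe (as int2sym.pl did), and cat stands in for whatever comes after it.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so a failure earlier in the pipe goes unnoticed.
set +o pipefail
status_without=0
false | cat || status_without=$?

# With pipefail, the pipeline fails if ANY command in it fails.
set -o pipefail
status_with=0
false | cat || status_with=$?

echo "without pipefail: $status_without, with pipefail: $status_with"
# prints "without pipefail: 0, with pipefail: 1"
```

With the option set, a script that checks the pipeline's status can stop as soon as the failing stage dies instead of writing a truncated ctm file.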
On Tue, May 5, 2015 at 7:40 AM, Guoguo Chen chenguoguo@users.sf.net wrote: