I'm the author of a subtitle editor and I'm in the process of adding speech
recognition so that a rough first draft of subtitles -- at least start and end
times if not text -- could be automatically generated from video. I've found
your wiki page "Using PocketSphinx with GStreamer and Python" most helpful.
I'm primarily interested in getting the times when someone starts speaking and
when that speaking ends and having this done in roughly subtitle length
pieces. I'm using the GStreamer vader element's "vader-start" and
"vader-stop" signals to get those times.
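For illustration, here is a minimal sketch of how the collected signal times could be paired into subtitle intervals. It assumes each signal handler records its timestamp in nanoseconds as a ("start", ts) or ("stop", ts) tuple; the function name and the event format are hypothetical and not part of the vader element.

```python
NANOSECONDS_PER_SECOND = 10 ** 9


def pair_speech_intervals(events):
    """Pair recorded start/stop events into (start, end) times in seconds.

    `events` is an iterable of (kind, timestamp_ns) tuples in the order
    they were received, where kind is "start" or "stop" (a hypothetical
    format, assumed to be filled in by the signal handlers).
    """
    intervals = []
    start_ns = None
    for kind, timestamp_ns in events:
        if kind == "start":
            # Remember the most recent start; a repeated start replaces it.
            start_ns = timestamp_ns
        elif kind == "stop" and start_ns is not None:
            intervals.append((start_ns / NANOSECONDS_PER_SECOND,
                              timestamp_ns / NANOSECONDS_PER_SECOND))
            start_ns = None
        # A stop without a preceding start is ignored.
    return intervals
```

Each resulting pair can then serve as the start and end time of one draft subtitle.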
I plan to expose the vader parameters "threshold" and "run-length" (with
"auto-threshold" set to false) to users in a GUI dialog. Despite
experimentation, I'm having trouble understanding the threshold parameter.
The valid values range from -1 to 1. What do the negative values mean? What
can I expect to happen if I decrease/increase the value from the default of
0.0078125?
Negative values have no meaning. There is a special value, -1.0/32768, that
enables auto threshold by means of setting the threshold, but it's not public.
What can I expect to happen if I decrease/increase the value from the
default of 0.0078125?
The threshold is the value of the noise level energy. The default is
256/32768 (0.0078125), so everything below 256 in volume will be counted as
noise and everything above it as speech. You can reduce the level to, say, 10
(a threshold of 10/32768), and then you will filter less noise and detect
more speech. You can increase the level to 1000 (a threshold of 1000/32768),
and then quiet speech will be counted as noise.
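The arithmetic above can be sketched as a small conversion helper, which may be handy if a GUI shows the level on the 0 to 32768 scale instead of the raw property value. The function names are made up for illustration and are not part of the vader element.

```python
FULL_SCALE = 32768.0  # divisor used for the levels quoted above


def level_to_threshold(level):
    """Convert a volume level (0..32768) to a "threshold" property value."""
    if not 0 <= level <= FULL_SCALE:
        raise ValueError("level must be between 0 and 32768")
    return level / FULL_SCALE


def threshold_to_level(threshold):
    """Convert a "threshold" property value (0..1) back to a volume level."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return threshold * FULL_SCALE


if __name__ == "__main__":
    print(level_to_threshold(256))   # the default, 0.0078125
    print(level_to_threshold(10))    # lower threshold: more audio counted as speech
    print(level_to_threshold(1000))  # higher threshold: quiet speech counted as noise
```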
The trunk version has been updated to reflect this.