In the unit test file test_state_align.c provided with pocketsphinx, words are added like below.
My question is whether I can get more accurate word boundaries if I add a SIL after every word.
I read somewhere that in Sphinx the SIL phone has a skippable state, so I assume that adding SIL won't degrade alignment accuracy. Is that correct?
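For illustration, here is a rough sketch of what I mean (based on the internal ps_alignment API that test_state_align.c uses; the exact prototypes are in ps_alignment.h and may differ between versions, and the paths and the "<sil>" word name are placeholders):

    /* Sketch only, not the verbatim unit test: roughly how words are added in
     * test_state_align.c, using pocketsphinx's *internal* ps_alignment API,
     * with my proposed extra silence entry after every word.  Builds only
     * inside the source tree; signatures may differ between versions. */
    #include <pocketsphinx.h>
    #include "ps_alignment.h"           /* internal header */
    #include "pocketsphinx_internal.h"  /* internal header: exposes ps->dict, ps->d2p */

    int
    main(void)
    {
        cmd_ln_t *config;
        ps_decoder_t *ps;
        ps_alignment_t *al;
        /* "go forward ten meters" is the transcript the unit test aligns. */
        const char *words[] = { "<s>", "go", "forward", "ten", "meters", "</s>" };
        size_t i, n = sizeof(words) / sizeof(words[0]);

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", "/path/to/en-us",               /* placeholder */
                             "-dict", "/path/to/cmudict-en-us.dict", /* placeholder */
                             NULL);
        ps = ps_init(config);

        al = ps_alignment_init(ps->d2p);
        for (i = 0; i < n; ++i) {
            ps_alignment_add_word(al, dict_wordid(ps->dict, words[i]), 0);
            /* Proposed change: force a silence entry after every word except
             * the final </s>.  The silence word is "<sil>" in the default
             * dictionary; adjust the name to match yours. */
            if (i + 1 < n)
                ps_alignment_add_word(al, dict_wordid(ps->dict, "<sil>"), 0);
        }
        ps_alignment_populate(al);

        /* ... run the state-align search over the utterance, as the test does ... */

        ps_alignment_free(al);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }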
No. Optional silence is not supported in ps_alignment yet. It's not a good idea to add silence after every word either.
No, that's not the case.
Then how should I handle the case where there is a long pause between two words in the audio during forced alignment?
You need to update the aligner code to include optional silence.
By "optional silence", do you mean SIL with skippable states?
Using the current PS codebase, if I add a SIL after a word as above, why is it not a good idea? I'm not challenging you here; I just want to understand the reason and know the details.
As far as I understand:
(1) A short pause (silence) usually appears between words when people speak. In this case, adding a SIL improves word boundary accuracy.
(2) When there is no pause or break between words, an extra SIL occupies only 3~4 frames (30~40 ms at the default 100 frames per second), which only degrades accuracy a little.
Last edit: puluzhe 2014-03-13
There is no such thing in CMUSphinx; it's not the same as in HTK. In CMUSphinx, SIL is a regular phone with 3 states.
Because if there is no silence between words, the aligner will still try to match a silence there.
This is wrong; people do not usually pause between words.
It's better to make silence optional.
I see. Thank you very much for your patience. :)
I will look into the aligner code and try to make some changes.
I'm having the same problem: the recognizer will only detect <sil> for long silences, but not for short ones. When I perform alignment on the results, some words (and their first phone) will start a little too early because they contain the short silence before them.
Has there been any development since the question was asked? Is there any way to get more precise alignment around short pauses? I don't think I have the knowledge required to hack the aligner.
Short silences should be detected by the recognizer; you might want to increase the -silprob value.
It might also be helpful to use a more accurate acoustic model here, because silence detection depends a lot on the quality of the acoustic model. If the original model was not accurate, the phonemes might eat the silence.
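For example, an illustrative configuration (paths are placeholders; -silprob defaults to 0.005, so a larger value makes the decoder more willing to insert silence between words):

    /* Illustrative sketch: build a decoder configuration with a higher
     * -silprob.  All paths are placeholders. */
    #include <pocketsphinx.h>

    int
    main(void)
    {
        cmd_ln_t *config;
        ps_decoder_t *ps;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", "/path/to/en-us",               /* placeholder */
                             "-lm", "/path/to/en-us.lm.bin",         /* placeholder */
                             "-dict", "/path/to/cmudict-en-us.dict", /* placeholder */
                             "-silprob", "0.05",                     /* default is 0.005 */
                             NULL);
        ps = ps_init(config);

        /* ... decode / align as usual ... */

        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }

The same option can also be passed on the command line, e.g. -silprob 0.05 to pocketsphinx_continuous or pocketsphinx_batch.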
Last edit: Nickolay V. Shmyrev 2016-11-24
Thanks, I'll try out different -silprob values.
Regarding the acoustic model: I'm using the generic US English acoustic model, continuous, v5.2. Is there any higher-quality acoustic model available?
It is trained on a lot of data, but it might not be very accurate for silences and fillers, since the training data is not carefully transcribed. Maybe you can try hub4wsj_sc_8k:
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Archive/US%20English%20HUB4WSJ%20Acoustic%20Model/hub4wsj_sc_8k.tar.gz/download
or even
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Archive/US%20English%20HUB4%20Acoustic%20Model/hub4opensrc.cd_continuous_8gau.zip/download
I'm a bit overwhelmed by the number of available acoustic models.
All my recordings are high-quality, low-noise, at least 44 kHz (which I downsample). So my guess is that 8 kHz models won't be optimal.
I'd like to support a wide range of speakers (male, female, children) and speaking modes (normal speech, shouting, whispering, etc.). So my guess is that broadcast news will be too limited.
I'm using the results for lip-sync, so phones and noises should be recognized as accurately as possible.
Is there an existing acoustic model that fits these requirements?
Hi Daniel
Yes, it's true that there is no perfect match. The WSJ models are high quality but not that large. The en-us generic model is larger, but it is trained from a less accurate source. I would try to recognize a test set and check; you need a test set anyway, as it's critical for many other things.
Hi Nickolay,
Thanks for the tip! I have a 30-minute test set, which should be sufficient. So I'll do some tests against it. Before I do, let me make sure I understand my options.
The US English generic acoustic models are generated from about 800 hours of non-public data. The best version for high-quality recordings is cmusphinx-en-us-5.2.tar.gz (16kHz, continuous).
The HUB4 acoustic model is generated from broadcast news (16kHz).
The HUB4WSJ acoustic model is generated from two sources: the HUB4 broadcast news and recordings of adults reading news texts from the Wall Street Journal in dictation style. This model is available only in 8kHz.
Besides these, there are two large open-source speech corpora, LibriSpeech and TED-LIUM. Both offer a wide range of speaking styles. But there are no ready-made acoustic models based on them, so I'd have to train my own acoustic model.
Is that correct? Did I miss anything?
This is correct.
Also, both TED-LIUM and LibriSpeech are automatically annotated, not manually annotated, which means their results on silences/fillers might be suboptimal.
I processed my 27 minutes of audio using HUB4. I found that I only got a total accuracy of 55%, whereas I'd gotten 72% with the US English generic acoustic model. So it seems that the generic model is still the best model for my purposes.
Regarding correct alignment around pauses: I experimented with -silprob. The default value is 0.005; I tried values between 0.001 and 0.1. I found that for my test recording, the value had no impact on detected silences between words. All it did was turn some +BREATH+s into silences.
On the whole, however, my largest problem is the alignment itself. It seems to me that the alignment algorithm likes to begin words a little early. That means that the first phone of a word often starts with a bit of silence (about 60-100 ms). On the other hand, the ending of a word is usually right on time. For my purposes, the opposite would be preferable: to have words start on time and (if necessary) end a little later. Is there a way to influence the alignment process?
This might be an effect of issues in the codebase. You should first try without fwdflat and bestpath and check whether the times are correct.
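A sketch of that (same placeholder setup as in the -silprob example above); on the command line this is simply -fwdflat no -bestpath no:

    /* Sketch: disable the flat-lexicon second pass (-fwdflat) and the
     * best-path lattice rescoring (-bestpath) to check whether they are
     * what shifts the word times.  Paths are placeholders. */
    #include <pocketsphinx.h>

    int
    main(void)
    {
        cmd_ln_t *config;
        ps_decoder_t *ps;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", "/path/to/en-us",               /* placeholder */
                             "-lm", "/path/to/en-us.lm.bin",         /* placeholder */
                             "-dict", "/path/to/cmudict-en-us.dict", /* placeholder */
                             "-fwdflat", "no",
                             "-bestpath", "no",
                             NULL);
        ps = ps_init(config);

        /* ... decode and compare the word/phone timings ... */

        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }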
I tried it with fwdflat and bestpath disabled. At least for the one recording (60s) I tested, there was no difference at all regarding phone alignment. Some words still start up to 100ms before there is any audible sound.