Does introducing SIL between words improve forced alignment accuracy?

  • puluzhe

    puluzhe - 2014-03-12

    In the unit test file test_state_align.c provided by pocketsphinx, words are added as shown below.
    My question is whether I can get more accurate word boundaries if I add a SIL after every word.
    I read somewhere that in Sphinx, SIL has a skippable state, so I assume that adding SIL won't degrade alignment accuracy. Is that correct?

    al = ps_alignment_init(d2p);
    TEST_EQUAL(1, ps_alignment_add_word(al, dict_wordid(dict, "<s>"), 0));
    TEST_EQUAL(2, ps_alignment_add_word(al, dict_wordid(dict, "go"), 0));
    TEST_EQUAL(3, ps_alignment_add_word(al, dict_wordid(dict, "forward"), 0));
    TEST_EQUAL(4, ps_alignment_add_word(al, dict_wordid(dict, "ten"), 0));
    TEST_EQUAL(5, ps_alignment_add_word(al, dict_wordid(dict, "meters"), 0));
    TEST_EQUAL(6, ps_alignment_add_word(al, dict_wordid(dict, "</s>"), 0));
    TEST_EQUAL(0, ps_alignment_populate(al));
    
     
  • Nickolay V. Shmyrev

    My question is whether I can get more accurate word boundaries if I add a SIL after every word.

    No. Optional silence is not supported in ps_alignment yet. It's not a good idea to add silence after every word either.

    I read somewhere that in Sphinx, SIL has a skippable state.

    No, that's not the case.

     
  • puluzhe

    puluzhe - 2014-03-12

    Then how should I handle the case where there is a long pause between two words in the audio during forced alignment?

     
  • Nickolay V. Shmyrev

    Then how should I handle the case where there is a long pause between two words in the audio during forced alignment?

    You need to update the aligner code to include optional silence.
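
    To illustrate the idea, here is a self-contained toy sketch. This is not the pocketsphinx internals; the state names, frame counts, and scores below are all invented. It shows a Viterbi pass over a linear state chain in which each SIL node carries an extra skip transition, so a SIL that matches no frames simply drops out of the best path:

    #include <stdio.h>

    #define NSTATES 5    /* toy chain: go, SIL?, forward, SIL?, ten */
    #define NFRAMES 10
    #define NEG_INF (-1e30)

    int main(void)
    {
        /* optional[s] marks the skippable SIL nodes in the chain. */
        const int optional[NSTATES] = { 0, 1, 0, 1, 0 };
        const char *name[NSTATES] = { "go", "SIL", "forward", "SIL", "ten" };

        /* score[t][s]: made-up per-frame log-likelihoods; a real aligner
         * gets these from the acoustic model.  Here "forward" follows "go"
         * with no pause (the first SIL should be skipped), while a real
         * pause precedes "ten" (the second SIL should be kept). */
        double score[NFRAMES][NSTATES];
        for (int t = 0; t < NFRAMES; t++)
            for (int s = 0; s < NSTATES; s++)
                score[t][s] = -10.0;
        for (int t = 0; t < 3; t++)  score[t][0] = -1.0;  /* go      */
        for (int t = 3; t < 6; t++)  score[t][2] = -1.0;  /* forward */
        for (int t = 6; t < 8; t++)  score[t][3] = -1.0;  /* pause   */
        for (int t = 8; t < 10; t++) score[t][4] = -1.0;  /* ten     */

        double best[NFRAMES][NSTATES];
        int from[NFRAMES][NSTATES];
        for (int s = 0; s < NSTATES; s++) {
            best[0][s] = NEG_INF;
            from[0][s] = s;
        }
        best[0][0] = score[0][0];

        for (int t = 1; t < NFRAMES; t++) {
            for (int s = 0; s < NSTATES; s++) {
                double b = best[t - 1][s];               /* self loop */
                int f = s;
                if (s >= 1 && best[t - 1][s - 1] > b) {  /* advance */
                    b = best[t - 1][s - 1];
                    f = s - 1;
                }
                if (s >= 2 && optional[s - 1]
                    && best[t - 1][s - 2] > b) {         /* skip the SIL */
                    b = best[t - 1][s - 2];
                    f = s - 2;
                }
                best[t][s] = b + score[t][s];
                from[t][s] = f;
            }
        }

        /* Backtrace from the final state and print the frame alignment. */
        int path[NFRAMES];
        int cur = NSTATES - 1;
        for (int t = NFRAMES - 1; t >= 0; t--) {
            path[t] = cur;
            cur = from[t][cur];
        }
        for (int t = 0; t < NFRAMES; t++)
            printf("frame %2d -> %s\n", t, name[path[t]]);
        return 0;
    }

    With these made-up scores, the backtrace assigns frames 0-2 to "go", frames 3-5 to "forward" (the first SIL is skipped), frames 6-7 to SIL, and frames 8-9 to "ten".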

     
  • puluzhe

    puluzhe - 2014-03-13

    By "optional silence", do you mean a SIL with skippable states?

    Using the current PS codebase, if I add a SIL after a word as below, why is it not a good idea? I'm not challenging you here; I just want to understand the reason and the details.
    As I understand it:
    (1) A short pause (silence) usually appears between words when people speak. In this case, adding a SIL improves word boundary accuracy.
    (2) When there is no pause or break between words, an extra SIL occupies only 3~4 frames statistically, which degrades accuracy only a little.

    al = ps_alignment_init(d2p);
    TEST_EQUAL(1, ps_alignment_add_word(al, dict_wordid(dict, "<s>"), 0));
    TEST_EQUAL(2, ps_alignment_add_word(al, dict_wordid(dict, "go"), 0));
    TEST_EQUAL(3, ps_alignment_add_word(al, dict_wordid(dict, "<sil>"), 0));
    TEST_EQUAL(4, ps_alignment_add_word(al, dict_wordid(dict, "forward"), 0));
    TEST_EQUAL(5, ps_alignment_add_word(al, dict_wordid(dict, "<sil>"), 0));
    TEST_EQUAL(6, ps_alignment_add_word(al, dict_wordid(dict, "ten"), 0));
    TEST_EQUAL(7, ps_alignment_add_word(al, dict_wordid(dict, "<sil>"), 0));
    TEST_EQUAL(8, ps_alignment_add_word(al, dict_wordid(dict, "meters"), 0));
    TEST_EQUAL(9, ps_alignment_add_word(al, dict_wordid(dict, "</s>"), 0));
    TEST_EQUAL(0, ps_alignment_populate(al));
    
     

    Last edit: puluzhe 2014-03-13
  • Nickolay V. Shmyrev

    By "optional silence", do you mean a SIL with skippable states?

    There is no such thing in CMUSphinx; it's not the same as in HTK. In CMUSphinx, SIL is a usual phone with 3 states.

    Using the current PS codebase, if I add a SIL after a word as below, why is it not a good idea?

    Because if there is no silence between words, it will still try to match silence.

    A short pause (silence) usually appears between words when people speak. In this case, adding a SIL improves word boundary accuracy.

    This is wrong; people do not usually pause between words.

    When there is no pause or break between words, an extra SIL occupies only 3~4 frames statistically, which degrades accuracy only a little.

    It's better to make silence optional.

     
  • puluzhe

    puluzhe - 2014-03-13

    I see. Thank you very much for your patience:)

    I will look into aligner code and try to make some changes.

     
  • Daniel Wolf

    Daniel Wolf - 2016-11-22

    I'm having the same problem: the recognizer will only detect <sil> for long silences, but not for short ones. When I perform alignment on the results, some words (and their first phone) will start a little too early because they contain the short silence before them.

    Has there been any development since the question was asked? Is there any way to get more precise alignment around short pauses? I don't think I have the knowledge required to hack the aligner.

     
    • Nickolay V. Shmyrev

      Short silences should be detected by the recognizer; you might want to increase the -silprob value.

      It might also be helpful to use more accurate acoustic models here, because silence detection depends a lot on the quality of the acoustic model. If the original model was not accurate, the phonemes might eat the silence.
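
      For reference, a minimal configuration sketch through the pocketsphinx C API (the model and dictionary paths are placeholders; substitute your own):

      #include <pocketsphinx.h>

      int main(void)
      {
          /* Placeholder paths; point these at your own model files. */
          cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
              "-hmm", "/path/to/en-us",
              "-dict", "/path/to/cmudict-en-us.dict",
              "-silprob", "0.05",  /* default is 0.005; raise to favor SIL */
              NULL);
          ps_decoder_t *ps = ps_init(config);
          /* ... run recognition / alignment as usual ... */
          ps_free(ps);
          cmd_ln_free_r(config);
          return 0;
      }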

       

      Last edit: Nickolay V. Shmyrev 2016-11-24
      • Daniel Wolf

        Daniel Wolf - 2016-11-24

        Thanks, I'll try out different -silprob values.

        Regarding the acoustic model: I'm using the generic US English acoustic model, continuous, v5.2. Is there any higher-quality acoustic model available?

         
        • Nickolay V. Shmyrev

          Regarding the acoustic model: I'm using the generic US English acoustic model, continuous, v5.2. Is there any higher-quality acoustic model available?

          It is trained on a large amount of data, but it might not be very accurate for silences and fillers, since the training data is not carefully transcribed. Maybe you can try hub4wsj_sc_8k:

          https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Archive/US%20English%20HUB4WSJ%20Acoustic%20Model/hub4wsj_sc_8k.tar.gz/download

          or even

          https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Archive/US%20English%20HUB4%20Acoustic%20Model/hub4opensrc.cd_continuous_8gau.zip/download

           
          • Daniel Wolf

            Daniel Wolf - 2016-11-24

            I'm a bit overwhelmed by the number of available acoustic models.

            • All my recordings are high-quality, low-noise, at least 44kHz (which I downsample). So my guess is that 8kHz models won't be optimal.
            • I'd like to support a wide range of speakers (male, female, children) and speaking modes (normal speech, shouting, whispering, etc.). So my guess is that broadcast news will be too limited.
            • I'm using the results for lip-sync. So phones and noises should be recognized as accurately as possible.

            Is there an existing acoustic model that fits these requirements?

             
            • Nickolay V. Shmyrev

              Hi Daniel

              Yes, it's true there is no perfect match. The WSJ models are high quality but not that large. The en-us generic models are larger, but they are trained from a less accurate source. I would try to recognize a test set and check; you need a test set anyway, as it's critical for many other things.

               
              • Daniel Wolf

                Daniel Wolf - 2016-11-28

                Hi Nickolay,

                Thanks for the tip! I have a 30-minute test set, which should be sufficient. So I'll do some tests against it. Before I do, let me make sure I understand my options.

                1. The US English generic acoustic models are generated from about 800 hours of non-public data. The best version for high-quality recordings is cmusphinx-en-us-5.2.tar.gz (16kHz, continuous).
                2. The HUB4 acoustic model is generated from broadcast news (16kHz).
                3. The HUB4WSJ acoustic model is generated from two sources: The HUB4 broadcast news and recordings of adults reading news texts from the Wall Street Journal in dictation style. This model is available only in 8kHz.

                Besides these, there are two large open-source speech corpora, LibriSpeech and TED-LIUM. Both offer a wide range of speaking styles. But there are no ready-made acoustic models based on them, so I'd have to train my own acoustic model.

                Is that correct? Did I miss anything?

                 
                • Nickolay V. Shmyrev

                  This is correct.

                   
                  • Nickolay V. Shmyrev

                    Also, both TED-LIUM and LibriSpeech are automatically annotated, not manually annotated. This means their results on silences/fillers might be suboptimal.

                     
  • Daniel Wolf

    Daniel Wolf - 2016-12-07

    I processed my 27-minute body of audio using HUB4. I found that I got a total accuracy of only 55%, whereas I'd gotten 72% with the US English generic acoustic model. So it seems that the generic model is still the best model for my purposes.

    Regarding correct alignment around pauses: I experimented with -silprob. The default value is 0.005; I tried values between 0.001 and 0.1. I found that for my test recording, the value had no impact on detected silences between words. All it did was turn some +BREATH+s into silences.

    On the whole, however, my biggest problem is the alignment itself. It seems to me that the alignment algorithm likes to begin words a little early. That means that the first phone of a word often starts with a bit of silence (about 60-100ms). On the other hand, the ending of a word is usually just on time. For my purposes, the opposite would be preferable: to have words start on time and (if necessary) end a little later. Is there a way to influence the alignment process?

     
    • Nickolay V. Shmyrev

      This might be an effect of codebase issues. You should first try without fwdflat and bestpath and check whether the timings are correct.
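
      As a sketch, that means initializing the decoder with both passes turned off; -fwdflat and -bestpath are ordinary boolean parameters, and the paths are placeholders:

      /* Only the first (tree) pass then determines the timings. */
      cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
          "-hmm", "/path/to/en-us",
          "-dict", "/path/to/cmudict-en-us.dict",
          "-fwdflat", "no",
          "-bestpath", "no",
          NULL);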

       
      • Daniel Wolf

        Daniel Wolf - 2016-12-08

        I tried it with fwdflat and bestpath disabled. At least for the one recording (60s) I tested, there was no difference at all regarding phone alignment. Some words still start up to 100ms before there is any audible sound.

         
