How to improve accuracy of phone detection?

2016-01-08
2019-07-08
  • Nickolay V. Shmyrev

    Does it give me timestamps for each phoneme (like pocketsphinx), or only per word?

    No, not yet, unfortunately. The Sphinx4 aligner is still in development, so it might not be that great. An accurate algorithm would require some more attention.

    The pocketsphinx aligner is very fast. Someone wrote that the Sphinx4 aligner takes 30min for 3min of audio. Is that still correct?

    No, it should not be that bad. The alignment should run in close to real time.

    Actually, I recommend you check out this project:

    https://github.com/lowerquality/gentle

    It should output what you need, words and phonemes with timestamps:

    https://github.com/lowerquality/gentle/blob/master/tests/data/lucier_golden.json
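As a rough sketch of how such output could be turned into per-phoneme timestamps: the snippet below assumes a result shaped like Gentle's JSON, that is, a `words` list whose entries carry `start`, `end`, and a `phones` list of items with `phone` and `duration`. That layout and the sample values are assumptions made for illustration, not copied from the linked file, so check them against lucier_golden.json before relying on them.

```python
def phone_timestamps(result):
    """Return a list of (word, phone, start, end) tuples in seconds.

    Assumes each aligned word has 'start' and a 'phones' list whose
    items have 'phone' and 'duration' (an assumed layout, see above).
    """
    rows = []
    for word in result.get("words", []):
        if "start" not in word:
            continue  # word carries no timing, so skip it
        t = word["start"]
        for ph in word.get("phones", []):
            # Phones are laid end to end, starting at the word's start time.
            rows.append((word["word"], ph["phone"], t, t + ph["duration"]))
            t += ph["duration"]
    return rows


# Hypothetical sample data, made up for illustration.
sample = {
    "words": [
        {"word": "hey", "start": 0.5, "end": 0.7,
         "phones": [{"phone": "hh", "duration": 0.08},
                    {"phone": "ey", "duration": 0.12}]},
        {"word": "move"},  # no timing information for this word
    ]
}

for row in phone_timestamps(sample):
    print(row)
```

Accumulating durations from the word's start keeps the phone boundaries contiguous within each word.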

     
    • Daniel Wolf

      Daniel Wolf - 2016-02-11

      Thanks a lot!

       
  • Saurabh Shrivastava

    Hi Daniel! I have the exact same requirement. I see you have successfully implemented this in Rhubarb Lip-sync. I am trying to accomplish something similar and was hoping you could give me some pointers. I am hoping to work on a project for GSoC this summer.

    I have sent you a message on SourceForge about this.

     
  • Yuheng Zou

    Yuheng Zou - 2019-07-08

    Hello, Daniel and Nickolay! I found that in rhubarb-lip-sync, the VAD part is done by WebRTC. Is there any solution that does VAD and forced phoneme alignment in a single unified framework using only pocketsphinx? Since pocketsphinx can do VAD, I think this may be possible.

     
  • Ezra Miller

    Ezra Miller - 2016-02-11

    Hi Daniel, Nickolay,

    I am trying to solve a very similar problem, and I spent hours trying to figure out what the example code in test_state_align.c does (unfortunately, it has no comments at all).

    I was finally able to figure out how to use this phoneme alignment function, and managed to print out more accurate phonemes (and timestamps). However, I am running into the same problem that Daniel is facing now.

    If, for example, I try to get the phoneme time stamps for the following sentence:
    Hey <0.3 sec pause> Move

    The phonemes I'm getting are HH, EY, M, UW, V. These are the correct phonemes, but the problem is that the 0.3 sec pause is being added to the duration of the "M" phoneme.

    Is there any way I can fix it?

    Thanks

     

    Last edit: Ezra Miller 2016-02-11
    • Nickolay V. Shmyrev

      If you have silence between "hey" and "move", you need to insert <sil> there in the alignment word sequence before starting alignment.
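A minimal sketch of preparing such a word sequence, assuming the transcript is already split into a list of words. The helper name `insert_silences` is hypothetical and not part of pocketsphinx; it only builds the list that would then be handed to the aligner.

```python
def insert_silences(words, sil="<sil>"):
    """Return a new word list with a silence token between adjacent words."""
    out = []
    for i, word in enumerate(words):
        if i > 0:
            out.append(sil)  # a pause may occur between any two words
        out.append(word)
    return out


print(insert_silences(["hey", "move"]))  # ['hey', '<sil>', 'move']
```

Placing the token between every pair of words lets the aligner absorb a pause wherever one occurs, instead of stretching the next phoneme over it.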

       
      • Ezra Miller

        Ezra Miller - 2016-02-11

        Thank you Nickolay,

        By adding "<sil>" between every word I was able to fix long gaps between phonemes.

        However, if the recording starts with silence, this offset gets ignored when looking at phoneme timestamps.

        Is there a way to know at which time the first phoneme (in my example, "HH") is starting - relative to the actual file, and not the word? (In other words, if there is a 0.5 sec silence/non-speech in the beginning of the file, I would like to see "HH" start at around 0.5s).

        Thanks again!

         
        • Nickolay V. Shmyrev

          Pocketsphinx performs voice activity detection, so you need to shift the results by the detected offset if you want to print them relative to the file. acmod_stream_offset returns that offset. You can also disable voice activity detection with -remove_silence no, though I don't really recommend doing that.
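The shift itself is simple arithmetic. The sketch below assumes the offset has already been obtained and expressed in frames, that segments are (phone, start_frame, duration_frames) tuples relative to the detected speech start, and that the frame rate is 100 frames per second. All three are assumptions for illustration; `shift_segments` is a hypothetical helper, not a pocketsphinx function.

```python
FRAME_RATE = 100  # frames per second (assumed)


def shift_segments(segments, offset_frames, frame_rate=FRAME_RATE):
    """Convert frame-based segments to seconds relative to the file start.

    segments: list of (phone, start_frame, duration_frames), with start
    measured from the detected speech start.
    offset_frames: frames skipped before the detected speech start.
    """
    shifted = []
    for phone, start, duration in segments:
        start_sec = (start + offset_frames) / frame_rate
        shifted.append((phone, start_sec, duration / frame_rate))
    return shifted


# 50 skipped frames at 100 frames/s puts the first phone at 0.5 s.
for seg in shift_segments([("HH", 0, 8), ("EY", 8, 12)], offset_frames=50):
    print(seg)
```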

           
          • Ezra Miller

            Ezra Miller - 2016-02-12

            Hi Nickolay,

            I wanted to thank you again for your advice.

            Running with the -remove_silence no flag seems to work pretty well for my use case.

            Also, as I mentioned earlier, adding "<sil>" between every word solved the problem with the long gaps. If there is only a short gap between words, SIL gets a duration of 0.03 seconds (this seems to be the minimum). In that case I simply ignore it and add its duration to the following phoneme.

            Appreciate your help, it works great now!
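The post-processing described above could look roughly like this. The sketch assumes segments are (phone, start_frame, duration_frames) tuples at 100 frames per second, so the 0.03 s minimum corresponds to 3 frames; the tuple layout, the frame rate, and the name `merge_short_silences` are all assumptions for illustration.

```python
def merge_short_silences(segments, min_frames=3, sil="SIL"):
    """Fold minimum-length silences into the phoneme that follows them.

    segments: list of (phone, start_frame, duration_frames), contiguous.
    A SIL of at most min_frames is dropped, and the next phoneme starts
    earlier and lasts longer by the same amount.
    """
    merged = []
    carry_start, carry = None, 0
    for phone, start, duration in segments:
        if phone == sil and duration <= min_frames:
            if carry_start is None:
                carry_start = start
            carry += duration
            continue
        if carry_start is not None:
            start, duration = carry_start, duration + carry
            carry_start, carry = None, 0
        merged.append((phone, start, duration))
    if carry_start is not None:
        # Trailing short silence with no following phoneme: keep it as is.
        merged.append((sil, carry_start, carry))
    return merged


segments = [("HH", 0, 8), ("EY", 8, 12), ("SIL", 20, 3),
            ("M", 23, 10), ("UW", 33, 9), ("V", 42, 7)]
print(merge_short_silences(segments))
```

Longer silences are left untouched, so real pauses such as the 0.3 sec gap in the example still show up as their own segment.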

             
  • Seif Mostafa

    Seif Mostafa - 2017-03-19

    What about Arabic?

     
