Does it give me timespamps for each phoneme (like pocketsphinx), or only per word?
No, not yet, unfortunately. The Sphinx4 aligner is still at the development stage, so it might not be that great; an accurate algorithm would require some more attention.
The pocketsphinx aligner is very fast. Someone wrote that the Sphinx4 aligner takes 30min for 3min of audio. Is that still correct?
No, it should not be that bad; the alignment should run in near real time.
Actually, I recommend checking this project:
https://github.com/lowerquality/gentle
It should output exactly what you need - words and phonemes with timestamps:
https://github.com/lowerquality/gentle/blob/master/tests/data/lucier_golden.json
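For reference, gentle's JSON output (as in the linked lucier_golden.json) lists each word with absolute start/end times plus its phones with per-phone durations, so absolute phoneme times are recovered by accumulating durations from each word's start. A small sketch of that; the field names follow gentle's output format, but the sample timing values here are invented for illustration:

```python
import json

# Made-up sample in the shape of gentle's output: each word carries absolute
# "start"/"end" times, while its "phones" carry durations only.
sample = json.loads("""
{
  "words": [
    {"word": "hey", "case": "success", "start": 0.50, "end": 0.86,
     "phones": [{"phone": "hh_B", "duration": 0.12},
                {"phone": "ey_E", "duration": 0.24}]},
    {"word": "move", "case": "success", "start": 1.20, "end": 1.62,
     "phones": [{"phone": "m_B", "duration": 0.10},
                {"phone": "uw_I", "duration": 0.20},
                {"phone": "v_E", "duration": 0.12}]}
  ]
}
""")

def phone_times(doc):
    """Yield (phone, start, end) with times relative to the whole file."""
    for word in doc["words"]:
        if word.get("case") != "success":  # skip words gentle failed to align
            continue
        t = word["start"]
        for p in word["phones"]:
            yield p["phone"], t, t + p["duration"]
            t += p["duration"]

for phone, start, end in phone_times(sample):
    print(f"{phone:6s} {start:.2f} - {end:.2f}")
```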
Thanks a lot!
Hi Daniel! I have the exact same requirement. I see you have successfully implemented this in Rhubarb Lip Sync. I am trying to accomplish something similar and was hoping you could give me some pointers. I am hoping to work on a project for GSoC this summer.
I have sent you a message on SourceForge regarding the same.
Hello, Daniel and Nickolay! I found that in rhubarb-lip-sync the VAD part is done by WebRTC. Are there any solutions that do VAD and forced phoneme alignment in a single unified framework with only pocketsphinx? Since pocketsphinx can do VAD, I think this may be possible.
Hi Daniel, Nickolay,
I am trying to solve a very similar problem, and I spent hours trying to figure out what the example code in test_state_align.c does (unfortunately, there are literally no comments in it).
I was finally able to figure out how to use this phoneme alignment function and managed to print out more accurate phonemes (and timestamps). However, I am now running into the same problem that Daniel is facing.
If, for example, I try to get the phoneme timestamps for the following sentence:
Hey <0.3 sec pause> Move
The phonemes I'm getting are HH, EY, M, UW, V. These are the correct phonemes, but the problem is that the 0.3 sec pause is being added to the duration of the "M" phoneme.
Is there any way I can fix it?
Thanks
Last edit: Ezra Miller 2016-02-11
If you have silence between "hey" and "move", you need to insert <sil> there in the alignment word sequence before starting the alignment.
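Concretely, the fix amounts to interleaving `<sil>` tokens with the words before handing the sequence to the aligner. A tiny pure-Python helper (independent of pocketsphinx) sketching that; the `edge_sil` option is just a convenience for also padding both ends of the utterance:

```python
def with_silences(words, edge_sil=True):
    """Interleave "<sil>" between words; optionally pad both ends too,
    so leading/trailing non-speech can be absorbed by a silence token."""
    out = []
    if edge_sil:
        out.append("<sil>")
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1 or edge_sil:
            out.append("<sil>")
    return out

print(" ".join(with_silences(["hey", "move"])))
# "<sil> hey <sil> move <sil>"
```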
Thank you Nickolay,
By adding "<sil>" between every word I was able to fix long gaps between phonemes.
However, if the recording starts with a silence, this offset gets ignored when looking at phoneme time stamps.
Is there a way to know at what time the first phoneme (in my example, "HH") starts, relative to the actual file rather than the word? (In other words, if there is 0.5 sec of silence/non-speech at the beginning of the file, I would like to see "HH" start at around 0.5s.)
Thanks again!
Pocketsphinx performs voice activity detection, so you need to shift the results by the detected offset if you want to print them; acmod_stream_offset returns that offset. You can also disable voice activity detection with -remove_silence no, though I don't really recommend doing that.
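In other words, the aligner's frame indices are relative to where VAD started passing audio, and file-relative times are obtained by adding the stream offset back in. A sketch of the arithmetic, under two stated assumptions: the usual pocketsphinx defaults of 100 feature frames per second and 16 kHz input, and that acmod_stream_offset reports the offset in samples (worth verifying against the source):

```python
SAMPLE_RATE = 16000  # assumed input sample rate
FRAME_RATE = 100     # pocketsphinx default: one feature frame per 10 ms

def absolute_time(frame_index, stream_offset_samples):
    """Convert a decoder frame index (relative to where VAD started
    passing audio) into seconds relative to the start of the file,
    assuming the stream offset is reported in samples."""
    return stream_offset_samples / SAMPLE_RATE + frame_index / FRAME_RATE

# 0.5 s of leading silence skipped by VAD = 8000 samples at 16 kHz,
# so frame 0 of the alignment maps back to t = 0.5 s in the file.
print(absolute_time(0, 8000))   # 0.5
print(absolute_time(30, 8000))  # 0.8
```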
I wanted to thank you again for your advice.
Running with the -remove_silence no flag seems to work pretty well for my use case.
Also, as I mentioned earlier, adding "<sil>" between every word solved the problem with the long gaps. If there is a short gap between words, then SIL gets a duration of 0.03 seconds (this seems to be the minimum); in that case I simply ignore it and add its duration to the following phoneme.
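That post-processing step - dropping the minimum-length SIL entries and folding their duration into the next phoneme - can be sketched as a small pure-Python pass over (phone, duration) pairs. The 0.035 s threshold is an assumption chosen to sit just above the 0.03 s floor observed above:

```python
SIL_THRESHOLD = 0.035  # just above the ~0.03 s minimum SIL the aligner emits

def merge_short_sil(segments, threshold=SIL_THRESHOLD):
    """segments: ordered list of (phone, duration) pairs.
    Drop SIL entries at or below the threshold, adding their duration
    to the phone that follows (or the one before, at the very end)."""
    out = []
    carry = 0.0
    for phone, dur in segments:
        if phone == "SIL" and dur <= threshold:
            carry += dur
            continue
        out.append((phone, dur + carry))
        carry = 0.0
    if carry and out:  # trailing short SIL: fold into the last phone
        phone, dur = out[-1]
        out[-1] = (phone, dur + carry)
    return out

# The "hey <pause> move" example: the spurious 0.03 s SIL disappears
# and its duration is added to the "M" that follows it.
print(merge_short_sil([("HH", 0.08), ("EY", 0.20), ("SIL", 0.03),
                       ("M", 0.10), ("UW", 0.21), ("V", 0.12)]))
```

Genuinely long silences (above the threshold) pass through untouched, so real pauses keep their own SIL segment.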
Appreciate your help, it works great now!
Does this also work for Arabic?