Hello
I need to compare two audio files or two microphone recording sessions with each other (same voice, same accent). They will contain short phrases, and I need to detect whether the second recording is similar to the first one. I've tried using dictionaries containing the phrase text and also phoneme comparison, but neither gives enough accuracy.
Example:
User says: This is test phrase
Then he says: Something else
The comparison result should show very low similarity.
Another example:
User says: This is test phrase
Then he says: This is test phrase
The comparison result should give a high value.
Is it possible to achieve this using the pocketsphinx library on Android? Can anybody point me in the right direction?
P.S.: When I try phoneme recognition, it can give me very different phonemes each time for the same phrase, for example:
SIL Z IH S IH S UW B OW D ER S SIL
SIL Z IH S UW S NG P OW Z SIL D ER TH SIL
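To put a number on how far apart two decoded phoneme strings like these are, one option is a token-level edit distance (Levenshtein). This is only a rough sketch: the class and method names are made up for illustration and are not part of pocketsphinx, and every insertion, deletion, or substitution is assumed to cost 1.

```java
/** Illustrative helper, not a pocketsphinx API. */
final class PhonemeEditDistance {

    /** Number of token insertions, deletions, and substitutions needed to turn a into b. */
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty prefix takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int substitute = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        String[] first = "SIL Z IH S IH S UW B OW D ER S SIL".split(" ");
        String[] second = "SIL Z IH S UW S NG P OW Z SIL D ER TH SIL".split(" ");
        System.out.println(distance(first, second) + " edits, longest sequence has "
                + Math.max(first.length, second.length) + " tokens");
    }
}
```

Dividing the result by the length of the longer sequence gives a value between 0 and 1 that can be compared across phrases of different lengths.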
No
Search for dynamic time warping (DTW).
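For reference, the core of dynamic time warping is a small dynamic-programming table. Below is a minimal sketch in plain Java, assuming each recording has already been turned into a sequence of feature frames (for example MFCC vectors) of equal dimension. The class and method names are made up for illustration and do not come from pocketsphinx, and the Euclidean frame distance is just one possible choice.

```java
import java.util.Arrays;

/** Illustrative DTW sketch, not a pocketsphinx API. */
final class DtwSketch {

    /** Euclidean distance between two feature frames of the same dimension. */
    static double frameDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Accumulated cost of the cheapest monotonic alignment between sequences a and b. */
    static double dtw(double[][] a, double[][] b) {
        int n = a.length;
        int m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0; // both sequences empty: nothing to align
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Cheapest way to reach (i, j): diagonal match, or repeat a frame of a or b.
                double best = Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = frameDistance(a[i - 1], b[j - 1]) + best;
            }
        }
        return cost[n][m];
    }

    public static void main(String[] args) {
        // Toy one-dimensional "features"; real input would be MFCC frames.
        double[][] reference = {{0}, {1}, {2}, {1}, {0}};
        double[][] stretched = {{0}, {0}, {1}, {1}, {2}, {2}, {1}, {0}};
        double[][] different = {{2}, {2}, {0}, {0}, {2}};
        System.out.println("same shape, slower: " + dtw(reference, stretched));
        System.out.println("different shape:    " + dtw(reference, different));
    }
}
```

In this toy data the stretched copy aligns with the reference at zero cost, while the different sequence does not. With real speech the cost is never exactly zero, so a threshold has to be tuned on recorded examples.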
Thank you for the answer. I've found your more detailed answer here: https://sourceforge.net/p/cmusphinx/discussion/help/thread/d4ca2b80/#8d7f. Is there any way to extract those MFC coefficients using pocketsphinx on Android?
You'd be better off using managed Java code for that.
Thank you for the help. I've run some tests using DTW and FastDTW on MFC coefficients extracted with the sphinx_fe tool. The DTW approach seems to work reasonably well for single words, but for a phrase the warp distance can be very large. Maybe there is a way to split the phrase into words and then compare them.
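One simple way to try the splitting idea is to cut the recording wherever the signal stays quiet for a while. The sketch below does that on 16-bit PCM samples using per-frame RMS energy. It is only an assumption about how the splitting could be done: the class name, the threshold, and the minimum pause length are hypothetical values that would need tuning, and words in fluent speech are often not separated by silence at all, so the segments may be word groups rather than single words.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative energy-based splitter, not a pocketsphinx API. */
final class SilenceSplitter {

    /**
     * Returns [start, end) sample ranges of the loud segments in pcm.
     * A segment is closed once minSilenceFrames consecutive frames have RMS at or below threshold.
     */
    static List<int[]> split(short[] pcm, int frameSize, double threshold, int minSilenceFrames) {
        List<int[]> segments = new ArrayList<>();
        int start = -1; // first sample of the open segment, or -1 if none is open
        int quiet = 0;  // consecutive quiet frames seen inside the open segment
        int frames = pcm.length / frameSize; // a trailing partial frame is ignored
        for (int f = 0; f < frames; f++) {
            double energy = 0.0;
            for (int i = f * frameSize; i < (f + 1) * frameSize; i++) {
                energy += (double) pcm[i] * pcm[i];
            }
            boolean loud = Math.sqrt(energy / frameSize) > threshold;
            if (loud) {
                if (start < 0) {
                    start = f * frameSize;
                }
                quiet = 0;
            } else if (start >= 0 && ++quiet >= minSilenceFrames) {
                // Close the segment at the first frame of this quiet run.
                segments.add(new int[] {start, (f - quiet + 1) * frameSize});
                start = -1;
                quiet = 0;
            }
        }
        if (start >= 0) {
            // Recording ended inside a segment; drop any short trailing quiet run.
            segments.add(new int[] {start, (frames - quiet) * frameSize});
        }
        return segments;
    }

    public static void main(String[] args) {
        short[] pcm = new short[32]; // 8 frames of 4 samples, all silent initially
        for (int i = 0; i < 8; i++) pcm[i] = 1000;   // frames 0-1 loud
        for (int i = 16; i < 20; i++) pcm[i] = 1000; // frame 4 loud
        for (int[] s : split(pcm, 4, 100.0, 2)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

Each segment could then be converted to features and compared separately. Independently of splitting, the accumulated DTW cost grows with the number of aligned frames, so dividing it by the combined length of the two sequences makes distances for long phrases comparable to those for single words.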