I'm a final-year M.Tech student working on a project on detecting out-of-vocabulary (OOV) words in test speech for the Kannada language. I would be really grateful if you could tell me how to carry this project forward. For starters, I've installed sphinxbase, sphinxtrain and pocketsphinx, and run sphinxtrain on the an4 database specified in the tutorial. I have 1.8 hours of Kannada speech recorded, with transcriptions, a dictionary and a phone set all ready. I've read papers by Dr. Long Qin suggesting that a hybrid language model and hybrid lexicon would help with OOV detection, which he implemented using the Sphinx-3 decoder. Is this enough for the OOV detection part, or do I need to modify the decoder, or even change the decoder altogether? Your help would be great. Thanks
Finding OOV words is an open but interesting research problem. Within your master's project there are certainly many things you can do without rewriting the decoder.
Other forum members may have more suggestions; I'll just develop your own idea a little. One strategy for detecting OOVs is joint word/subword decoding, where the subwords can be defined at the level of syllables or morphemes (if the language has these properties). Implementing a mixed word/subword decoder is straightforward: you decompose the words in your original dictionary, apply the same decomposition to the training text and build a new language model. You will also need to derive pronunciations for the resulting subword units. This lexicon and language model can then be used directly in decoding (a rough sketch of the preparation step is given below). Another question is a little trickier: will it work well? We do not know.
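To make that preparation step concrete, here is a minimal Python sketch of one way to build the hybrid lexicon and LM training text. Everything in it is an assumption for illustration: decompose() is only a placeholder that cuts fixed-size character chunks (for Kannada you would replace it with an akshara/syllable or morpheme segmenter), and the file names are hypothetical.

```python
# Sketch: build a hybrid word+subword training text from an existing
# Sphinx-style dictionary and LM training corpus. Placeholder logic only.

def decompose(word, chunk_size=2):
    """Split a word into marked subword units; the trailing '_' marks a
    subword so it can be told apart from full words later."""
    chunks = [word[i:i + chunk_size] for i in range(0, len(word), chunk_size)]
    return [c + "_" for c in chunks]

def load_vocabulary(dict_path):
    """Vocabulary = first column of a Sphinx dictionary ('WORD PH1 PH2 ...')."""
    with open(dict_path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def build_hybrid_corpus(train_text_path, keep_vocab, out_path):
    """Rewrite the LM training text: words you decide to keep stay as words,
    everything else is replaced by its subword decomposition, so the new LM
    contains both kinds of units."""
    with open(train_text_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = []
            for word in line.split():
                if word in keep_vocab:
                    tokens.append(word)
                else:
                    tokens.extend(decompose(word))
            fout.write(" ".join(tokens) + "\n")

# Usage (hypothetical paths):
# keep = load_vocabulary("kannada.dic")   # or only the most frequent words
# build_hybrid_corpus("train.txt", keep, "train_hybrid.txt")
# Then derive pronunciations for each subword unit (e.g. from your
# grapheme-to-phoneme rules), add them to the dictionary, and train the
# n-gram language model on train_hybrid.txt with your usual LM toolkit.
```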
If I were using Sphinx for the first time, I'd suggest a step-by-step research plan as follows:
- Know your problem: gather some statistics on your test set. How many OOV words does it contain? Are there homophones that will be hard to detect from the acoustics alone? (A simple OOV-rate count is sketched in the first example after this list.)
- Prepare a baseline and know your word error rate. The conventional Sphinx pipeline supports an <UNK> dictionary unit that can potentially point to the places where a word is likely to be OOV. Does this simple approach already find something useful?
- Keep reading the state of the art. Your keywords are: OOV detection, subword decomposition. You can also get ideas from work on other languages (see this paper for example: http://www.mirlab.org/conference_papers/International_Conference/ICASSP%202012/pdfs/0005181.pdf).
- Try modifying the dictionary and language model to decode with subwords. The error rates will be higher, but you can, for example, produce both a full-word hypothesis and a subword-based one, then do some pattern matching to detect where the two differ too much (see the second example after this list). Another option is to build a single hybrid word/subword system. Neither approach requires changing the Sphinx code.
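For the first point, here is a minimal sketch of how you could measure the OOV rate of your test set against the dictionary vocabulary. The file names are placeholders, and tokens in parentheses are assumed to be Sphinx-style utterance ids and skipped.

```python
# Sketch: count OOV tokens in the test transcriptions against the dictionary.
from collections import Counter

def oov_statistics(dict_path, transcript_path):
    # Vocabulary = first column of the Sphinx dictionary
    with open(dict_path, encoding="utf-8") as f:
        vocab = {line.split()[0] for line in f if line.strip()}

    counts = Counter()
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            # skip utterance ids like (utt_0001) if the transcript has them
            counts.update(t for t in line.split() if not t.startswith("("))

    total = sum(counts.values())
    oov = {w: c for w, c in counts.items() if w not in vocab}
    oov_tokens = sum(oov.values())
    print(f"test tokens: {total}, OOV tokens: {oov_tokens} "
          f"({100.0 * oov_tokens / max(total, 1):.2f}%)")
    print(f"distinct OOV words: {len(oov)}")
    return oov

# oov = oov_statistics("kannada.dic", "test_transcriptions.txt")
```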
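For the last point, a rough illustration of the pattern-matching idea: align the full-word hypothesis against the subword hypothesis and flag regions where the subword system emitted subword units while the word system forced a regular word onto the same stretch. This is only a token-level alignment (difflib); comparing time-aligned lattices would be more robust. The '_' marker convention and the example strings are assumptions.

```python
# Sketch: flag OOV candidate regions by comparing two decoder outputs.
import difflib

def oov_candidate_regions(word_hyp, subword_hyp, subword_marker="_"):
    """word_hyp / subword_hyp are whitespace-separated hypothesis strings."""
    a, b = word_hyp.split(), subword_hyp.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    candidates = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A mismatch where the subword side contains marked subword units
        # hints that the word system forced an in-vocabulary word onto a
        # region of speech it had no good word for.
        sub_span = b[j1:j2]
        if any(t.endswith(subword_marker) for t in sub_span):
            candidates.append({"word_side": a[i1:i2], "subword_side": sub_span})
    return candidates

# Made-up example:
# print(oov_candidate_regions("I WENT TO MYSORE PALACE",
#                             "I WENT TO MY_ SO_ RU_ PALACE"))
```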
Yes.
No.
Thank you Sir
Thank you. I will surely look into the inputs you've given and update you on my progress.