I want to use the Sphinx engine for integrity protection of speech data (detecting manipulations, etc.). My idea is to use the phonemes as the relevant features to protect the content (like a fingerprint).
My questions:
- How robust is allphone mode: is it independent of the speaker? Of the dictionary? Of the language? Does it give the same result for the same speaker at a different time? ...
- Should I use Sphinx 2 or 3?
- Would it be better to use only the phonemes, or to use the phonemic transcription of the (dictionary-based) detected words?
I know the allphone mode has been addressed several times before... but I really didn't get it.
thx so much :-)
Sascha
Allphone mode works fine. You don't need a dictionary or a language model. You do need a phoneset - and I imagine you'll want a fairly fine-grained phoneset for this job, which means you'll be training up your own acoustic model (I imagine - if you know any good AMs 'out there' please tell the group :-).
Are you interested in doing speaker recognition/differentiation? I wouldn't have thought phonemic transcription was fine-grained enough for speaker recognition. Similarly with detecting manipulations: wouldn't it be better to work with the audio data (e.g. look for infeasible transitions in the spectral or cepstral files) ... unless you train 'phones' for each manipulation you're interested in ...
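To make the "look for infeasible transitions" suggestion a bit more concrete, here is a minimal Python sketch. It is not anything Sphinx provides: it assumes you have already dumped the cepstral frames into a list of float vectors, and the function names, the outlier rule (median plus k times the median absolute deviation) and the default `k` are all made up for illustration.

```python
import math
import statistics

def frame_distances(frames):
    """Euclidean distance between each pair of consecutive feature frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]

def suspicious_transitions(frames, k=3.0, min_gap=1e-9):
    """Return indices i where the jump from frame i to frame i+1 is an outlier.

    Assumed rule (not from Sphinx): distance > median + k * MAD + min_gap.
    min_gap only guards against flagging floating-point noise when MAD is 0.
    """
    distances = frame_distances(frames)
    if len(distances) < 2:
        return []  # too little data to call anything an outlier
    med = statistics.median(distances)
    mad = statistics.median([abs(d - med) for d in distances])
    threshold = med + k * mad + min_gap
    return [i for i, d in enumerate(distances) if d > threshold]

if __name__ == "__main__":
    # Synthetic 2-dimensional "frames": a smooth ramp with one abrupt splice.
    frames = [[float(i), 0.0] for i in range(10)]
    frames += [[50.0 + i, 0.0] for i in range(10)]
    print(suspicious_transitions(frames))
```

Real speech has plenty of legitimate abrupt transitions (plosives, pauses), so a threshold like this would need tuning on unmodified recordings before it says anything useful about manipulation.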
Anonymous - 2003-02-24
Thanks Ivan,
we didn't have speaker detection/differentiation in mind.
The idea is to use content-dependent features. Manipulations of the content (e.g. cropping or reassembling words) should lead to a different feature extraction result, which reveals the location of the manipulation. Why I chose phonemes (or rather, want to try them):
- Phonemic features may be more robust to (allowed) transformations like DA/AD conversion or audio compression than other, spectrum/cepstrum-based features.
- Phonemic features offer a very(!) low-payload description of the content.
So: what about the AMs provided with Sphinx? Are they "good" enough? Should I use sphinx2-(all)phone or s3allphone?
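The idea of locating a manipulation from a changed phoneme sequence could be sketched roughly as below. This is only an illustration in Python: it assumes the phone recogniser's output is already available as a list of phone labels, the ARPAbet-style labels are invented sample data, and `locate_changes` is a hypothetical helper built on the standard `difflib` module, not part of Sphinx.

```python
from difflib import SequenceMatcher

def locate_changes(reference, observed):
    """Return every non-matching region between two phone sequences.

    Each entry is (tag, ref_start, ref_end, obs_start, obs_end), where tag is
    'replace', 'delete' or 'insert' and the indices are positions in the lists.
    """
    matcher = SequenceMatcher(a=reference, b=observed, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

if __name__ == "__main__":
    # Stored fingerprint vs. phones re-extracted from the audio under test.
    reference = "HH AH L OW W ER L D".split()
    tampered = "HH AH L OW ER L D".split()  # the phone "W" has been cropped
    for change in locate_changes(reference, tampered):
        print(change)
```

For the sample data this reports a single `delete` at reference positions 4 to 5, i.e. where the phone was cropped. In practice allphone output is noisy, so the same untouched recording decoded twice (or after compression) may already differ in a few phones; some tolerance, for example ignoring isolated single-phone substitutions, would probably be needed before treating a difference as a manipulation.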