The documentation I've read indicates that Sphinx requires the notation used
in phonetic transcriptions to be purely alphanumeric and case-insensitive. My
Language Identification project (http://code.google.com/p/spoken-language-recognition/)
needs to be able to recognise phones from a wide range
of languages, so devising a notation that meets these constraints is a bit
unwieldy, besides which it feels a bit like re-inventing the wheel, when I'm
already familiar with pre-existing systems (for example X-Sampa).
I was wondering, would it be worth my while to try to modify Sphinx to remove
the restrictions on phonetic notation, and if so, where would I start?
Assuming that I'm familiar with Java, is it more work to make a transcription
system that's compatible with Sphinx, or to make Sphinx compatible with an
existing transcription system?
Hello
I'm not really sure about the success of this. The main issue is SphinxTrain
(Perl/C) and not Sphinx4 (Java). There are bits here and there, most of them
related to tied-state building. The problem is that the phone name is used there
to create a file in the filesystem, which is not good (you could encode the phone
name somehow before creating the corresponding tree file name). There may be some
other bits in the decoder, in the model definition parser.
I would still write a script which converts each phonetic name to a unique
ASCII-only string. That string can still be readable, for example a_ ->
a_underscore, a+ -> a_plus. It can be used to map the dictionary to
alphanumeric format before training, which solves the problem easily. The same
algorithm can be used in SphinxTrain to work around the ASCII-only restriction
during training, thus relaxing the restrictions on phone names.
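A minimal Java sketch of what such a mapping script could look like (the substitution table, class name, and code-point fallback are my own illustration, not part of Sphinx or SphinxTrain); the table-driven part reproduces the a_ -> a_underscore, a+ -> a_plus example above, and case folding for the case-insensitivity requirement would still need to be handled separately:

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Sketch of the mapping suggested above: rewrite each phone symbol
     * into a unique, readable, ASCII-only name before training, so the
     * dictionary and SphinxTrain only ever see safe phone names.
     */
    public class PhoneNameEncoder {

        // Non-alphanumeric characters that commonly appear in X-SAMPA-style
        // symbols, mapped to readable suffix words. Illustrative table only.
        private static final Map<Character, String> SUBSTITUTIONS = new LinkedHashMap<>();
        static {
            SUBSTITUTIONS.put('_', "underscore");
            SUBSTITUTIONS.put('+', "plus");
            SUBSTITUTIONS.put(':', "colon");
            SUBSTITUTIONS.put('@', "at");
            SUBSTITUTIONS.put('{', "lbrace");
            SUBSTITUTIONS.put('}', "rbrace");
            SUBSTITUTIONS.put('?', "glottal");
            SUBSTITUTIONS.put('\\', "backslash");
        }

        /** Encode one phone name into an ASCII-safe string. */
        public static String encode(String phone) {
            StringBuilder out = new StringBuilder();
            for (char c : phone.toCharArray()) {
                if (Character.isLetterOrDigit(c) && c < 128) {
                    out.append(c);
                } else if (SUBSTITUTIONS.containsKey(c)) {
                    out.append('_').append(SUBSTITUTIONS.get(c));
                } else {
                    // Fall back to the code point so the result stays unique.
                    out.append("_u").append(Integer.toHexString(c));
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Prints: a_ -> a_underscore, a+ -> a_plus, @ -> _at, e: -> e_colon
            for (String phone : new String[] {"a_", "a+", "@", "e:"}) {
                System.out.println(phone + " -> " + encode(phone));
            }
        }
    }

The same table, applied in reverse, can map the trained model's phone names back to the original notation when presenting results.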
Hm... reinventing the wheel... interesting... but why has nobody made
universal phonetic signs based on articulatory features? More generally, I
think speech is gestures: people with hearing/speaking disabilities
(or however one names this politically correctly ;) ) talk with finger gestures, while
other people talk with tongue/jaw/lip gestures. The meaning is in the gesture, while the
sound is only a spotlight that illuminates the shape of the vocal tract. So the
direct and clear way to make a REALLY all-languages-universal phonetic notation
is to make signs representing the vocal tract shape, of course with
modifications for voiced/fricative and so on.
Am I the first to think so? At least I've never read about anything similar...
Fast speech will be very different in terms of gestures from slow
speech. It would be challenging to describe all this properly.
Yeah, it is a challenge, but: if you want to draw a triangle in the air, say,
to show a friend that something is triangle-shaped, you can
draw it very far from a geometrically correct triangle. The same goes for a square and so
on. But your friend will understand you! Moreover, imagine a "game" in which
you must transmit some code by drawing a set of geometric figures in the air with your finger:
square, circle, triangle - let it be some set of symbols shown in
some order, while your friend must write down this "code". You can then "draw"
very distorted, angle-smoothed figures, and your friend will still be able
to tell them apart until the distortion blurs them together.
It is the same in speech: underarticulated phonemes are something like a "pointing
gesture": you don't need to touch an object to make someone else understand that you
mean this object and not another. But if there are two or more objects at a close angle, they
may ask you what you mean more exactly, because it wouldn't be so clear
which direction your finger points in. Likewise the direction of articulation
can be extrapolated by the listener's brain - nobody wonders how people (and even
animals) can predict the mechanical motion of a physical object. Moreover, in
my own experiments with the size of a speech-unit codebook, every noticeable change
in the spectrum is perceived as a new phoneme! BTW, that's why old
concatenative synthesis sounds so bad: a spectral jump in the middle of a vowel in a CV+VC
synthetic sequence is perceived as a false phoneme at that place, i.e. two vowels
instead of one! =)
Of course, sometimes two or more vowels can merge together, but this kind of
speed-based reduction can be explained by coarticulation, and so the
phones that were interpolated into this intermediate position can be restored by
listeners because they know the grammar! Whereas you can't write down a
transcription of fast speech in a language you don't know well - just try it
out! =)
Anyway, all my statements about areagrams are based on my own experiments, both
on excellent-quality sources like audiobooks and on talk-show podcasts, where
participants talk as fast as they can. I can show the
results as spectral-form plots. BTW, isn't it a convenient tool for speech
research to draw the areagram like a spectrogram? If you look at the spectrogram view
of a stereo sound file, it is IMHO very convenient to see the spectrum of the
speech in the left channel and the areagram in the right channel at the same time!
Moreover, the spectral representation makes such data mp3-compressible. Has
anyone here seen such plots before? I have the feeling it's reinventing the
wheel, but I've never seen this representation anywhere.
Oh, excuse me! I was not the first to invent areagrams in speech research:
http://www.ee.iitb.ac.in/~spilab/Publicatios/sfrsm03_shah_areagram.pdf
http://www.smc-conference.org/smc10/smcnetwork.org/files/proceedings/2010/31.pdf
http://www.ee.iitb.ac.in/~spilab/papers/2004/paper_msshah_icsci2004.pdf
But I have done far more. E.g. I can restore the log area for voiceless
fricatives as well as for vowels.
This results in a "quasi-mechanical" model of speech production and recognition.
(Demos are available. =) )