
Phonetic notation

  • petebleackley

    petebleackley - 2011-05-13

    The documentation I've read indicates that Sphinx requires the notation used
    in phonetic transcriptions to be purely alphanumeric and case-insensitive. My
    Language Identification project
    (http://code.google.com/p/spoken-language-recognition/) needs to be able to
    recognise phones from a wide range of languages, so devising a notation that
    meets these constraints is a bit unwieldy; besides which, it feels a bit like
    re-inventing the wheel when I'm already familiar with pre-existing systems
    (for example X-SAMPA).

    I was wondering, would it be worth my while to try to modify Sphinx to remove
    the restrictions on phonetic notation, and if so, where would I start?
    Assuming that I'm familiar with Java, is it more work to make a transcription
    system that's compatible with Sphinx, or to make Sphinx compatible with an
    existing transcription system?

  • Nickolay V. Shmyrev

    Hello

    I'm not really sure about the success of this. The main issue is Sphinxtrain
    (Perl/C) and not Sphinx4 (Java). There are bits here and there; most of them
    relate to tied-state building. The issue is that the phone name is used there
    to create a file in the filesystem, which is not good (you could encode the
    phone name somehow before creating the corresponding tree file name). There
    may be some other bits in the decoder, in the model definition parser.

    I would still write a script which converts each phonetic name to a unique
    ASCII-only string. The string can still be readable, for example a_ ->
    a_underscore, a+ -> a_plus. That mapping can be applied to the dictionary to
    bring it into alphanumeric format before training, which solves the problem
    easily. The same algorithm can be used inside Sphinxtrain to work around the
    ASCII-only restriction during training, thus relaxing the restrictions on
    the phone names.
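
    A minimal sketch of such a mapping script, in Python just to make the idea
    concrete. The substitution table, the helper names and the assumption that
    the dictionary is plain "WORD PH1 PH2 ..." lines are illustrative choices,
    not anything prescribed by Sphinx; the only point is that the output stays
    alphanumeric/underscore, case-safe and unique per phone.

        #!/usr/bin/env python
        # Sketch: encode arbitrary phone symbols (e.g. X-SAMPA) as ASCII-only,
        # case-insensitive-safe names before training. The table is illustrative.
        SUBSTITUTIONS = {
            '_': '_underscore',
            '+': '_plus',
            '-': '_minus',
            ':': '_colon',
            '@': '_at',
            '\\': '_backslash',
            '{': '_lbrace',
            '}': '_rbrace',
        }

        def encode_phone(phone):
            """Return a unique phone name made of ASCII letters, digits and '_'."""
            out = []
            for ch in phone:
                if 'a' <= ch <= 'z' or '0' <= ch <= '9':
                    out.append(ch)
                elif 'A' <= ch <= 'Z':
                    # X-SAMPA distinguishes a/A, so mark upper case explicitly
                    out.append(ch.lower() + '_cap')
                elif ch in SUBSTITUTIONS:
                    out.append(SUBSTITUTIONS[ch])
                else:
                    # fallback: the code point keeps unknown symbols unique
                    out.append('_u%04x' % ord(ch))
            return ''.join(out)

        def encode_dictionary(src, dst):
            """Rewrite a 'WORD PH1 PH2 ...' pronunciation dictionary with encoded phones."""
            with open(src) as fin, open(dst, 'w') as fout:
                for line in fin:
                    fields = line.split()
                    if not fields:
                        continue
                    word, phones = fields[0], fields[1:]
                    fout.write('%s %s\n' % (word, ' '.join(encode_phone(p) for p in phones)))

        if __name__ == '__main__':
            for p in ['a_', 'a+', 'ts\\', 'A:']:
                # e.g. a_ -> a_underscore, a+ -> a_plus, A: -> a_cap_colon
                print('%s -> %s' % (p, encode_phone(p)))

    The encoded dictionary (and the matching phone list) would be generated once
    before training, so the tree file names built during tied-state clustering
    only ever see the ASCII-safe forms.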

  • Rmkf

    Rmkf - 2011-06-03

    Hm... reinventing the wheel... interesting... But why has nobody made a
    universal set of phonetic signs based on articulatory features? More
    generally, I think speech is gesture: people with hearing/speaking
    disabilities (or however one names that politely ;) ) talk with finger
    gestures, while other people talk with tongue/jaw/lip gestures. The meaning
    is in the gesture, while the sound is only a spotlight that illuminates the
    shape of the vocal tract. Thus, to make a REALLY all-languages-universal
    phonetic notation, the direct and clear way is to make signs representing
    the vocal tract shape, with modifiers for voicing, frication and so on.
    Am I the first to think so?... At least I've never read about anything
    similar...

  • Nickolay V. Shmyrev

    Fast speech will be very different from slow speech in terms of gestures.
    It would be challenging to describe all this properly.

  • Rmkf

    Rmkf - 2011-06-05

    Yeah, it is a challenge, but: if you want to draw, say, a triangle in the
    air to show a friend that something is triangular, you can draw it very far
    from a geometrically correct triangle. The same goes for a square and so on.
    But your friend will understand you! Moreover, let's imagine a "game": you
    must transmit some code by drawing a set of geometric figures in the air
    with your finger - square, circle, triangle - some set of symbols shown in
    some order, while your friend must write this "code" down. You can then
    "draw" very distorted, angle-smoothed figures, and your friend will still be
    able to tell them apart, right up until the distortion blurs them together.
    It is the same in speech: underarticulated phonemes are something like a
    pointing gesture: you need not touch an object for someone to understand
    that you mean this object and not another. But if there are two or more
    objects at a close angle, they may ask you to clarify exactly what you mean,
    because it is not clear which direction your finger is pointing in. In the
    same way, the direction of articulation can be extrapolated by the
    listener's brain - nobody wonders that people (and even animals) can predict
    the mechanical motion of a physical object. Moreover, from my own
    experiments with the size of a speech-unit codebook, every noticeable change
    in the spectrum is perceived as a new phoneme! BTW, that's why old
    concatenative synthesis sounds so bad: a spectral jump in the middle of a
    vowel in a CV+VC synthetic sequence is perceived as a spurious phoneme at
    that place, i.e. two vowels instead of one! =)
    Of course, sometimes two or more vowels do merge together, but this kind of
    speed-based reduction can be explained by coarticulation, and the phones
    that were interpolated into that intermediate position can be restored by
    listeners because they know the grammar! Whereas you cannot write down a
    transcription of fast speech in a language you do not know well - just try
    it! =)

    Anyway, all my statements about areagrams are based on my own experiments,
    both on excellent-quality sources like audiobooks and on talk-show podcasts,
    where the participants talk as fast as they can. I can show the results as
    spectral-form plots. BTW, isn't drawing an areagram as a spectrogram a
    convenient tool for speech research? If you open the spectrogram view of a
    stereo sound file, it is IMHO very convenient to see the speech spectrum in
    the left channel and the areagram in the right channel at the same time!
    Moreover, a spectral representation lets such data be mp3-compressed. Has
    anyone here seen such plots before? I have a feeling it's reinventing the
    wheel, but I have never seen this representation anywhere.

