
Phonetic characters

  • Chris von See

    Chris von See - 2005-05-04

    Hi -

    I'm trying to extract the text from a PDF document that contains phonetic spellings of words.  The phonetic characters display fine in the Multivalent browser, but when I extract the text I get garbage.  Can you tell me how to extract these characters correctly?  I'd even be happy to have them encoded as numeric character entities...


    • Tom Phelps

      Tom Phelps - 2005-05-04

      It may be that the text is extracted correctly as Unicode, but the receiving application does not interpret Unicode correctly or lacks the fonts to display it.  Or it may be that the PDF uses a special font that draws ordinary text characters with phonetic-looking glyphs.  Without looking at the PDF I can't say.
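
      One way to tell these two cases apart is to list the code points in the extracted text rather than trusting how an application displays them.  A minimal sketch, assuming the extracted text is already available as a string (the helper name describe is made up for this example):

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per character of the extracted text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Sample input: one ordinary letter and one phonetic character.
for entry in describe("A\u0259"):
    print(entry)
```

      If the list shows phonetic code points, extraction worked and the problem is in the display.  If it shows plain Latin letters, the problem is in the PDF's own mapping.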

    • Chris von See

      Chris von See - 2005-05-04

      The viewing application is Multivalent.  The one doing the extracting is Multivalent's ExtractText tool. 

      If you give me an FTP site I'll send you the PDF...


    • Tom Phelps

      Tom Phelps - 2005-05-05

      You can email it to an address I only look at when I know there is something to retrieve: usenetposter atsign comcast period net, and I'll take a look.

    • Tom Phelps

      Tom Phelps - 2005-05-05

      I looked at the PDF you emailed.  The font with the phonetic glyphs is named Unispell.  With such a name you would expect Unicode compatibility, and furthermore the PDF has an explicit ToUnicode mapping for it.  Inexplicably, however, that mapping gives the Unicode values of the phonetic glyphs as various uppercase Latin-1 letters.  The PDF's producer, "Creo Normalizer JTP", must have a bug.

    • Chris von See

      Chris von See - 2005-05-05

      Hmmm.  If there is indeed a bug in the PDF producer software, how would the Multivalent viewer generate the correct characters?  I'm thoroughly confused.


    • Tom Phelps

      Tom Phelps - 2005-05-05

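      All of the letters are just numbers.  When a number is looked up in the embedded font to get a shape, it yields a glyph that looks like a phonetic character.  When the same number is mapped to Unicode or ASCII, it yields an uppercase Latin character.  The viewer uses the first lookup and text extraction uses the second, so the display is right and the extracted text is wrong.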
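
      As a rough illustration of the two lookups, here is a small sketch.  The character codes and both tables are made up for the example; they are not taken from the Unispell font or from the actual PDF.

```python
# Hypothetical character codes as they might appear in the page content.
codes = [0x41, 0x45, 0x53]

# Made-up embedded-font table: code -> shape, named here by the phonetic
# symbol the shape resembles. A viewer uses this to draw the page.
glyph_shapes = {0x41: "\u0251", 0x45: "\u0259", 0x53: "\u0283"}

# Made-up ToUnicode table as a faulty producer might write it.
# A text extractor uses this and never looks at the shapes.
to_unicode = {0x41: "A", 0x45: "E", 0x53: "S"}


def render(codes):
    """What the reader sees on screen."""
    return "".join(glyph_shapes[c] for c in codes)


def extract(codes):
    """What text extraction returns."""
    return "".join(to_unicode[c] for c in codes)


def repair(text):
    """Undo the bad mapping with a hand-built table (only valid for text
    known to come from the affected font)."""
    fix = {to_unicode[c]: glyph_shapes[c] for c in to_unicode}
    return "".join(fix.get(ch, ch) for ch in text)


def to_entities(text):
    """Encode non-ASCII characters as numeric character entities."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


extracted = extract(codes)      # uppercase Latin letters
repaired = repair(extracted)    # phonetic characters
print(extracted, to_entities(repaired))
```

      If the PDF carries no better information than its ToUnicode mapping, a hand-built repair table along these lines may be the only way to recover the phonetic characters, and it has to be applied only to text that came from the affected font.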

