Hi -
I'm trying to extract the text from a document which contains phonetic spellings of words. The phonetics display fine in the Multivalent browser, but when I extract the text I get garbage. Can you tell me how to correctly extract these characters? I'd even be happy to have them encoded as numeric character entities...
Thanks
Chris
It may be that the text is extracted correctly as Unicode, but the receiving application does not interpret Unicode correctly or lacks the fonts to display it. Or it may be that the PDF uses a special font that draws ordinary text characters with these phonetic-looking shapes. Without looking at the PDF I can't say.
The viewing application is Multivalent. The one doing the extracting is Multivalent's ExtractText tool.
If you give me an FTP site I'll send you the PDF...
Thanks
Chris
You can email it to an address I only look at when I know there is something to retrieve: usenetposter atsign comcast period net, and I'll take a look.
I looked at the PDF you emailed. The font with the phonetic glyphs is named Unispell. With such a name you would expect Unicode compatibility, and furthermore the PDF has an explicit ToUnicode mapping for it. Inexplicably, however, the Unicode given for the phonetic glyphs is various Latin-1 uppercase letters. The PDF's producer, "Creo Normalizer JTP", must have a bug.
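Since the wrong ToUnicode values are stored in the file itself, an extractor that trusts them will return the uppercase letters. One possible workaround is to post-process the text that came from the Unispell font with a hand-built correction table, and then write the result as numeric character entities as requested. A minimal Python sketch follows. The table entries are hypothetical (the real Unispell layout would have to be worked out by comparing the rendered page with the extracted letters), and the names `UNISPELL_FIX`, `fix_run` and `to_entities` are made up for illustration:

```python
# Hypothetical correction table: extracted (wrong) letter -> intended phonetic character.
# These pairs are illustrative only; the real Unispell layout must be determined
# by comparing the rendered glyphs with the extracted text.
UNISPELL_FIX = {
    "A": "\u0259",  # LATIN SMALL LETTER SCHWA
    "E": "\u025B",  # LATIN SMALL LETTER OPEN E
    "S": "\u0283",  # LATIN SMALL LETTER ESH
}


def fix_run(text, table=UNISPELL_FIX):
    """Replace each mis-mapped letter; characters not in the table pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def to_entities(text):
    """Encode every non-ASCII character as a decimal numeric character entity."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)
```

For example, `to_entities(fix_run("SAt"))` gives `&#643;&#601;t` with the table above. The correction should only be applied to runs of text set in the Unispell font; applying it to the whole document would also rewrite genuine uppercase letters.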
Hmmm. If there is indeed a bug in the PDF producer software, how would the Multivalent viewer generate the correct characters? I'm thoroughly confused.
Thanks
Chris
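Inside the PDF, all of the letters are just numbers (character codes). When a code is mapped into the embedded font to get a shape, it corresponds to a glyph that looks like a phonetic character. When the same code is mapped to Unicode or ASCII, it corresponds to an uppercase Latin letter. The viewer uses only the first mapping and text extraction uses only the second, so the display can be right while the extracted text is wrong.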
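A toy model of that double lookup, as a Python sketch. The codes, shape descriptions and table names here are invented for illustration and were not read from the actual PDF; the point is only that both tables are keyed by the same number:

```python
# Toy model of one PDF font: two independent tables keyed by the same character code.
# Codes and shape descriptions are made up for illustration.
GLYPHS = {          # embedded font: code -> outline drawn on the page
    0x41: "schwa-shaped outline",
    0x53: "esh-shaped outline",
}
TO_UNICODE = {      # ToUnicode table: code -> text (the part that is wrong in this PDF)
    0x41: "A",
    0x53: "S",
}


def render(codes):
    """What a viewer does: look up shapes only, never touching TO_UNICODE."""
    return [GLYPHS[c] for c in codes]


def extract(codes):
    """What text extraction does: look up Unicode only, never touching GLYPHS."""
    return "".join(TO_UNICODE[c] for c in codes)
```

With the codes `[0x53, 0x41]`, `render` returns the two phonetic-looking outlines while `extract` returns `"SA"`. Neither function is wrong; the two tables simply disagree about what the codes mean.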