Typing unicode strings problem

Help
2011-07-06
2013-05-28
  • kirill yashuk
    2011-07-06

    Hello! I've got trouble with creator.textShow("my hieroglyph string").

    As I read it in the PDF spec, I should be able to output Unicode strings. My problem with jPod is that textShow() uses the current font encoding. All the fonts supplied are single-byte fonts and are not suitable for Unicode. I don't want to include heavy fonts in my app (as I'm running it on a phone).

    So.. can I somehow just define a font I want my PDF to use (assuming it exists on some other device, which will be opening the resulting document)?

    I also tried storing a FontDescriptor using page.getResources().addFontResource(); but the resources of the pages and of the doc itself are null all the time. Probably if I had the descriptor in my doc, I'd be able to output text with low-level BT…ET.

     
  • mtraut
    2011-07-07

    jPod uses the current font encoding because that's the only way to do it. Otherwise a reader is not able to decode the text. If you want another encoding, you just assign it to the font.

    As far as I remember there are no fonts supplied with jPod (are you talking about jPodRenderer?), as none are needed. The fonts supplied with jPodRenderer are not "single byte fonts". A font is just a collection of glyphs, accessed by index or Adobe glyph name. An encoding maps character codepoints to glyphs. The fonts supplied by URW++ have a complete international character set - just select an appropriate encoding.

    PDF standard encodings (not fonts) are single byte - but that suits a very wide range of applications (remember that most of the time a PDF is static and will not receive new characters in unforeseen languages). These encodings are used by the PDSingleFont subtypes. This is by no means a restriction on the available glyph set.
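
To make the "encoding maps codepoints to codes" step concrete, here is a minimal sketch in plain Java. It is not jPod's API: the class `TinyEncoding` and its two-entry table are hypothetical, and a real single-byte encoding would carry up to 256 entries, each paired with a glyph name on the font side.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-byte encoding: Unicode code point -> one-byte character code. */
final class TinyEncoding {
    private final Map<Integer, Integer> codeByCodePoint = new HashMap<>();

    /** Declares that the given Unicode code point is written as the given code (0..255). */
    void define(int codePoint, int code) {
        if (code < 0 || code > 0xFF) {
            throw new IllegalArgumentException("single-byte code out of range: " + code);
        }
        codeByCodePoint.put(codePoint, code);
    }

    /** Encodes a Java (Unicode) string into the bytes a text-showing operator would carry. */
    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            Integer code = codeByCodePoint.get(cp);
            if (code == null) {
                throw new IllegalArgumentException(
                        "no code assigned for U+" + Integer.toHexString(cp).toUpperCase());
            }
            out.write(code.intValue());
        });
        return out.toByteArray();
    }

    public static void main(String[] args) {
        TinyEncoding enc = new TinyEncoding();
        enc.define(0x041F, 0x80); // CYRILLIC CAPITAL LETTER PE, arbitrary illustrative code
        enc.define(0x0440, 0x81); // CYRILLIC SMALL LETTER ER, arbitrary illustrative code
        for (byte b : enc.encode("\u041F\u0440")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The limit of 256 entries applies to the table, not to the font: any glyph the font contains can be assigned to any free code.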

    If you absolutely need more than 256 characters in a single font, the next step is to construct a multibyte encoding and use a composite font (Type 0). There is a whole chapter on this topic in the spec. This way you can index a font (for example the jPodRenderer Type 1 fonts) using plain double-byte ints, or use a "CID to GID" map with a TrueType font. This is somewhat advanced stuff.

    If your resources are null - create them. It's an object like all the others - see this snippet from CSCreator.

        protected PDResources getResources() {
            if (resources == null) {
                resources = getResourcesProvider().getResources();
                if (resources == null) {
                    resources = (PDResources) PDResources.META.createNew();
                    getResourcesProvider().setResources(resources);
                }
            }
            return resources;
        }
    

    But there's no improvement in doing the work of CSCreator yourself, messing with "BT" and "ET". You can't circumvent the spec… You need a font, set an encoding, and encode the character codepoint with this encoding before streaming the encoded value in a graphics text object.
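
As a rough illustration of what ends up between "BT" and "ET", the sketch below assembles a text object from already-encoded bytes. It is plain Java, not CSCreator: the class `TextObjectSketch`, its method names, and the resource name `F1` are assumptions made for the example.

```java
/** Hypothetical helper that writes a minimal PDF text object as a string. */
final class TextObjectSketch {

    /** Formats bytes as a PDF hex string, e.g. <8081>. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /**
     * Builds a text object: select the font resource, position the text, show the codes.
     * The bytes must already be encoded with the selected font's encoding.
     */
    static String showText(String fontResourceName, int size, int x, int y, byte[] encoded) {
        return "BT\n"
                + "/" + fontResourceName + " " + size + " Tf\n"
                + x + " " + y + " Td\n"
                + hex(encoded) + " Tj\n"
                + "ET\n";
    }

    public static void main(String[] args) {
        byte[] encoded = {(byte) 0x80, (byte) 0x81}; // codes from some font encoding
        System.out.print(showText("F1", 12, 72, 720, encoded));
    }
}
```

The operand of Tj is only a sequence of codes. Without the font named by Tf and its encoding, a viewer has no way to turn those codes into glyphs, which is why writing the operators by hand does not remove the need for a font.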

     
  • kirill yashuk
    2011-07-07

    Thanks for the reply! We're using the jPod library here; it seems like jPodRenderer is something built on top of that. The jPod distribution does have some fonts included, and it seems like all of them are Type1 (PDSingleByteFont is a superclass of PDFontType1, probably some uncommon definition here).

    I don't feel like constructing encodings instead of using Unicode, if that's possible =)

    The PDF spec says "For text strings encoded in Unicode, the first two bytes shall be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard."

    So why can't I just output Unicode strings with BT-ET and point to a TrueType Unicode font descriptor? I mean, the doc will have to be rendered on the viewing machine anyway. I'm almost sure it should be able to look up fonts while rendering.

    Really sorry if I'm writing nonsense, I'm absolutely new to managing fonts/encodings.

    Thanks for the tip in code; I guess setting resources on my own won't cause lifecycle problems for jPod.

     
  • mtraut
    2011-07-07

    Well - you confused me, I really had to go back to verify that there are NO fonts with jPod. Maybe you mean the AFM files, which are the pure metrics for the 14 PDF default fonts - needed to construct correct text graphics.

    Anyway, PDFType1 fonts are single byte fonts in the sense that the PDF spec mandates single byte encodings from Unicode to glyph.

    Regarding your citation: this is completely right, but it has nothing to do with text rendering. On the same page above you can see that the spec is talking about clear text strings, like values in a form field. These strings are in no way associated with fonts.

    When rendering we talk about the "byte string" flavor. Anything you need to know about this you will find in the chapters "graphics" and "text".
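
For comparison, the "text string" flavor from the citation (the kind used for values such as form field contents) can be produced as sketched below. The class and method names are hypothetical; the byte layout follows the quoted rule: 254, 255, then UTF-16BE.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the PDF "text string" flavor (not for text-showing operators). */
final class PdfTextStringSketch {

    /** Returns FE FF followed by the UTF-16BE bytes of the value. */
    static byte[] toTextString(String value) {
        byte[] utf16 = value.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[utf16.length + 2];
        out[0] = (byte) 0xFE; // byte order marker U+FEFF, high byte
        out[1] = (byte) 0xFF; // byte order marker U+FEFF, low byte
        System.arraycopy(utf16, 0, out, 2, utf16.length);
        return out;
    }

    public static void main(String[] args) {
        for (byte b : toTextString("\u0416")) { // CYRILLIC CAPITAL LETTER ZHE
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Bytes built this way are meant for dictionary values. Placed inside a BT..ET block they would simply be read as character codes of whatever font is current, marker bytes included.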

    And again: you always output Unicode strings with jPod (Java strings). The font's encoding (which is mandatory) maps the Unicode codepoint to some value a renderer can use to select the glyph in the font program. The bytes in a byte string for text objects do not necessarily have any human readable structure.

    If you need a font with more than 256 codepoints you MUST construct a Type0 font - read chapter 5.6, Composite Fonts. You may find more information on the default glyph selection by looking in "AdobeGlyphList.txt".

     
  • kirill yashuk
    2011-07-07

    Wow, I can't accept that I'm not able to output Unicode right out of the box =) that's shocking..

    Thanks for the reply, I'll move on to Type 0 fonts now. I just can't get why we need a font with its encoding and metrics while saving the document.. If only we could declare the data encoding + font name and rely on whatever the reader's font does with the doc.

     
  • kirill yashuk
    2011-07-13

    Hello once again! I've extended CIDFontType2 to pick a .ttf file and use it in its descriptor's fontFile2, and put the whole thing into PDFontType0 as the descendant font.

    Now I see some Asian hieroglyphs in the output. The thing is, the text is Cyrillic =) The glyphs are picked with some shift from the codepoint, and this shift changes a few times across the Unicode table. So I get the same wrong glyph for every occurrence of the same codepoint.

    For example, I do creator.textShow(" "); textShow kindly converts this string into 2 bytes (0x00, 0x20; U+0020 is exactly the space character).

    When I check the document in the hex editor, I see these 2 bytes inside the BT..ET.

    When I open the document with the Reader, it outputs "=" (U+003D) instead of my desired U+0020.

    So I can read the whole text, but it's encrypted letter by letter. What could I be missing? I'd really appreciate a hint.. slightly going nuts here.

    P.S. Here are the fonts:

    2 0 obj
    <<
    /BaseFont /Arial
    /Subtype /Type0
    /Encoding /Identity-H
    /DescendantFonts
    /Type /Font
    >>
    endobj
    3 0 obj
    <<
    /Supplement 0
    /Ordering (Identity)
    /Registry (Adobe)
    >>
    endobj
    4 0 obj
    <<
    /BaseFont /Arial
    /Subtype /CIDFontType2
    /FontDescriptor 5 0 R
    /CIDSystemInfo 3 0 R
    /CIDToGIDMap /Identity
    /Type /Font
    >>
    endobj

    5 0 obj
    <<
    /FontFile2 10 0 R
    /Type /FontDescriptor
    //metrics here
    >>
    endobj

     
  • kirill yashuk
    2011-07-14

    OK, seems like I got it.

    A CID in a TrueType font is not equal to the Unicode value.

    Here is what IdentityCMap does while we're outputting our string:

        @Override
        public void putNextDecoded(OutputStream os, int character)
                throws IOException {
            // write cid value high byte first
            os.write((character >> 8) & 0xff);
            os.write(character & 0xff);
        }

    So it treats a character as if its value were equal to the CID, while the CID should (probably) be looked up in the font for that char.
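
The difference can be sketched in plain Java as follows. This is not jPod code: `IdentityHSketch` and the lookup table are hypothetical, and the glyph indices (3 for space) are an assumed layout that is common in TrueType fonts. Real indices have to be read from the font's own cmap table by a TTF parser.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical comparison of two ways to produce 2-byte codes for Identity-H. */
final class IdentityHSketch {

    /** Same idea as the quoted putNextDecoded: the UTF-16 value itself becomes the code. */
    static byte[] encodeNaive(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.write((c >> 8) & 0xFF); // high byte first
            out.write(c & 0xFF);
        }
        return out.toByteArray();
    }

    /** Looks each character up in a Unicode-to-glyph-index table and writes the glyph index. */
    static byte[] encodeViaGlyphIds(String text, Map<Integer, Integer> glyphIdByCodePoint) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            int codePoint = text.charAt(i);
            // Glyph 0 is the font's "missing glyph"; use it for unmapped characters.
            int glyphId = glyphIdByCodePoint.getOrDefault(codePoint, 0);
            out.write((glyphId >> 8) & 0xFF);
            out.write(glyphId & 0xFF);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> glyphIds = new HashMap<>();
        glyphIds.put(0x20, 3); // assumed: space is glyph 3 in this font
        byte[] naive = encodeNaive(" ");
        byte[] mapped = encodeViaGlyphIds(" ", glyphIds);
        System.out.printf("naive:  %02X %02X%n", naive[0], naive[1]);
        System.out.printf("mapped: %02X %02X%n", mapped[0], mapped[1]);
    }
}
```

With /Encoding /Identity-H and /CIDToGIDMap /Identity, the code 0x0020 selects glyph 32. In a font where space is glyph 3 and the following glyphs are in ASCII order, glyph 32 would be "=", which would match the symptom described above.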

     
  • mtraut
    2011-07-15

    Do it like me - read the "Composite Font" chapter in the spec about a hundred times - and still be confused.

    The way you are doing it in your example is still not conformant.

    - If you use an external TrueType font you must not use a CIDToGIDMap. You must specify a CMap (as you did). The result is explicitly implementation (viewer) dependent. The viewer selects the TrueType cmap depending on the predefined CMap used - I'm not aware of a specification for the behavior implemented in Adobe Reader. I didn't try it, but "UniGB-UTF16-H" for example should result in some useful output. You can download the CMap definitions from Adobe, by the way.

    - I don't know what Adobe does with an Identity map in place. jPodRenderer would simply map to the glyph index (an approach that will most probably fail on different installations because of differences in the font program).

    - If you embed the font you will be better off, as you KNOW the glyph indices and can map them using the CIDToGIDMap.

    - Using one of the standard fonts (Type0) may be the most portable choice, as they MUST be available on each viewer and you can address the glyphs using the standard CIDs.
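
For the embedded-font case, the CIDToGIDMap can also be supplied as a stream instead of /Identity. A minimal sketch of building its data is shown below, assuming the CIDs are simply the Unicode values written by Identity-H and that a TTF parser has already produced the glyph indices. The class name and the sample indices are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical builder for the data of a CIDToGIDMap stream. */
final class CidToGidMapSketch {

    /**
     * Builds a table in which the glyph index for CID c is stored big-endian
     * in bytes 2*c and 2*c+1. CIDs without an entry map to glyph 0.
     */
    static byte[] build(Map<Integer, Integer> glyphIdByCid) {
        int maxCid = -1;
        for (Map.Entry<Integer, Integer> e : glyphIdByCid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid > 0xFFFF || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("CID and glyph index must fit in 2 bytes");
            }
            maxCid = Math.max(maxCid, cid);
        }
        byte[] data = new byte[2 * (maxCid + 1)];
        for (Map.Entry<Integer, Integer> e : glyphIdByCid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            data[2 * cid] = (byte) (gid >> 8);  // high-order byte first
            data[2 * cid + 1] = (byte) gid;     // low-order byte
        }
        return data;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> glyphIds = new HashMap<>();
        glyphIds.put(0x20, 3);  // assumed glyph index of space in the embedded font
        glyphIds.put(0x3D, 32); // assumed glyph index of '=' in the embedded font
        System.out.println("map length in bytes: " + build(glyphIds).length);
    }
}
```

With such a stream in place the content stream can keep carrying Unicode values as codes, and the translation to glyph indices happens in the viewer. Writing glyph indices directly with /CIDToGIDMap /Identity is the other option, and either way the indices are only reliable because the font is embedded.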

     
  • kirill yashuk
    2011-07-15

    Well, now things work fine with embedding the font and picking CIDs from it. Thanks for the Type0 hint, that was real horrorshow =)

    The worst thing about it was that the spec stays silent about the glyph selection procedure for multibyte fonts.

    Another surprise is that I have to explicitly put the glyphs' widths in the font description, as written in "Glyph Metrics in CIDFonts". Seems like PDF readers feel too lazy to pick that from the font.
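
A rough sketch of generating the W array for a CIDFont is given below. It assumes the advance widths were read from the font by a TTF parser, keyed by CID, and scales them to the 1000-units-per-em glyph space used for PDF font widths. The class name and the sample advances are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical builder for the /W array of a CIDFont. */
final class CidWidthsSketch {

    /**
     * Emits entries of the form "c [w1 w2 ...]", one per run of consecutive CIDs.
     * Advances are given in font units and scaled to 1000 units per em.
     */
    static String buildW(Map<Integer, Integer> advanceByCid, int unitsPerEm) {
        Map<Integer, Integer> sorted = new TreeMap<>(advanceByCid);
        StringBuilder sb = new StringBuilder("[");
        int previousCid = 0;
        boolean runOpen = false;
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            int cid = e.getKey();
            long width = Math.round(e.getValue() * 1000.0 / unitsPerEm);
            if (runOpen && cid == previousCid + 1) {
                sb.append(' ').append(width);              // continue the current run
            } else {
                if (runOpen) {
                    sb.append("] ");                       // close the previous run
                }
                sb.append(cid).append(" [").append(width); // start a new run at this CID
                runOpen = true;
            }
            previousCid = cid;
        }
        if (runOpen) {
            sb.append(']');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> advances = new HashMap<>();
        advances.put(3, 569);   // illustrative advance widths in font units
        advances.put(4, 569);
        advances.put(32, 1196);
        System.out.println("/W " + buildW(advances, 2048));
    }
}
```

CIDs that are left out of the array fall back to the default width given by /DW, so only the glyphs actually used need an entry.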

     
  • mtraut
    2011-07-15

    Glad you've made it. Feel free to donate a short multibyte example for the project :-)

    The redundant width definition may be there for the benefit of PDF creation tools, not renderers. Knowing only the widths, you can already create page content (or, much more importantly, form field content) without ever parsing a font.

    You can find example code for embedding font information in the jPodRenderer code (you still have to adapt it to your multibyte scenario).

     
  • kirill yashuk
    2011-07-15

    I had to extend the lib (tuning stuff for Android in parallel) and add an external TTF parser, so I'm afraid that just an example won't work. Also, I can't share the work, because my contract forbids it. I just hope someone will find the thread helpful, because it was hard to find any info on making this work.. except for the spec.