Getting Font information from TextExtract

Tools
2006-05-24
2013-04-29
  • Within a text extraction routine where I iterate over the nodes in a PDF document, I'd like to obtain the font information.

    If n is a leaf node, how can I find out the font of the text it contains?

    This is the very ugly hack I've tried to use, but it only gets the appropriate font information for some of the nodes, not for each one.

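    String font = "";
    // First sticky mark on the node; its owner may be the span that sets the font.
    Mark m = n.getSticky(0);
    if (m.getOwner() instanceof SpanPDF) {
        SpanPDF span = (SpanPDF) m.getOwner();
        font = span.font.getName();
        // Drop everything up to and including the first "+" (e.g. a subset tag such as "ABCDEF+").
        font = font.substring(font.indexOf("+") + 1);
    }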
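The "+" handling in the snippet above deals with what PDF calls a subset tag: an embedded font subset is named with six uppercase letters, a plus sign, and then the real font name. As a minimal standalone sketch of just that clean-up step, the helper below strips the tag only when it has exactly that shape. The class and method names (SubsetFontNames, stripSubsetTag) are made up for illustration and are not part of Multivalent.

```java
// Hypothetical helper, not part of Multivalent: removes a PDF subset tag
// ("ABCDEF+") from the front of a font name, if one is present.
final class SubsetFontNames {
    private SubsetFontNames() {}

    /** Returns fontName without a leading six-uppercase-letter tag and "+", else unchanged. */
    static String stripSubsetTag(String fontName) {
        if (fontName == null) {
            return "";
        }
        // A tagged name needs 6 letters, a '+', and at least one character after it.
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName; // prefix is not a subset tag, leave the name alone
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("ABCDEF+Times-Roman")); // prints "Times-Roman"
        System.out.println(stripSubsetTag("Helvetica"));          // prints "Helvetica"
    }
}
```

Unlike a plain indexOf("+") cut, this leaves names alone when the text before the "+" is not a six-letter uppercase tag, which may or may not matter for the fonts in a given document.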

    Do you have any suggestions, or ideas for places I could look to solve this?

    Thanks

    Gully

     
    • Tom Phelps
      2006-05-24

      The node.getSticky() holds span transitions, and a span may cover many nodes or just part of one node.  Instead, you can get the multivalent.Context for the document from the style sheet, traverse the document tree, invoke context.reset(node, offset) at each span transition, then read the font attributes from the Context.  PDF uses the "spot font" field for embedded fonts.  PDF does not have overlapping font-related spans, so you can be somewhat simpler for this particular task.
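To see why reading only a node's own sticky marks misses some nodes, here is a toy model of non-overlapping font spans, written without any Multivalent classes. It assumes each span transition can be reduced to a character offset plus a font name, so the font at any position is the one set by the most recent transition at or before it. FontRuns, addTransition and fontAt are hypothetical names for this sketch only; in Multivalent itself the Context does this bookkeeping.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: font transitions keyed by the offset where each font starts.
final class FontRuns {
    private final TreeMap<Integer, String> transitions = new TreeMap<>();

    /** Records that fontName is in effect from offset onward, until the next transition. */
    void addTransition(int offset, String fontName) {
        transitions.put(offset, fontName);
    }

    /** Returns the font in effect at offset, or null if no transition precedes it. */
    String fontAt(int offset) {
        // floorEntry finds the latest transition whose key is <= offset.
        Map.Entry<Integer, String> entry = transitions.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        FontRuns runs = new FontRuns();
        runs.addTransition(0, "Times-Roman");
        runs.addTransition(120, "Helvetica-Bold");
        // Offset 60 has no transition of its own but still inherits the earlier font.
        System.out.println(runs.fontAt(60));  // prints "Times-Roman"
        System.out.println(runs.fontAt(120)); // prints "Helvetica-Bold"
    }
}
```

A position with no transition of its own (offset 60 here) still has a font, which is the case a per-node sticky lookup cannot answer.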

       
    • OK, this looks somewhat doable. If I figure out a solution, I'll post it here.

       
    • OK, done, this works perfectly and was very easy to implement.

      Thanks enormously!

      The solution is spread out over many places in the code but is exactly as you suggested (so this is completely superfluous, but I said I'd post it so hey).

      StyleSheet ss = doc.getStyleSheet();
      Context context = ss.getContext();

      // Given a node n, reset the context to offset 1 within n
      context.reset(n, 1);
      // The "spot" field now holds the font in effect at that position
      NFont f = context.spot;

      Thanks again

      Gully