
base attribute in TOKEN

  • Şafak Ökmen

    Şafak Ökmen - 2012-12-13


    What is the difference between "base" and "y" in TOKENs?

    Is there a spec available? DTD or Schema for the generated XML?


  • Herve Dejean

    Herve Dejean - 2012-12-14

    The base attribute corresponds to the typographical baseline.

    The @y attribute corresponds to the top line of the bounding box.

    A schema can be found under the CVS menu.

  • Şafak Ökmen

    Şafak Ökmen - 2012-12-14

    Is it possible

    • for two tokens with the same @y to have different baselines?

    • for tokens with different @y to have the same baseline?
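
    One way to check both cases empirically on a given file is sketched below. It is only an illustration: the inline sample XML, the PAGE wrapper element, and the helper name conflicting_values are made up for the example, and only the TOKEN attributes y and base come from this thread. Values are compared as strings exactly as they appear in the XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; real output will contain more attributes and elements.
SAMPLE = """<PAGE>
  <TOKEN y="100.0" base="110.2">Total</TOKEN>
  <TOKEN y="100.0" base="108.5">note</TOKEN>
  <TOKEN y="100.7" base="110.2">42.00</TOKEN>
</PAGE>"""


def conflicting_values(tokens, key_attr, other_attr):
    """Return {key value: set of other values} for keys seen with more than one other value."""
    seen = defaultdict(set)
    for tok in tokens:
        seen[tok.get(key_attr)].add(tok.get(other_attr))
    return {key: values for key, values in seen.items() if len(values) > 1}


tokens = list(ET.fromstring(SAMPLE).iter("TOKEN"))

# Case 1: same @y, different baselines.
same_y_diff_base = conflicting_values(tokens, "y", "base")
# Case 2: different @y, same baseline.
same_base_diff_y = conflicting_values(tokens, "base", "y")

print(same_y_diff_base)
print(same_base_diff_y)
```

    An empty dictionary for a case means it does not occur in that file; it says nothing about other files.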

    Some PDF pages look tabular but were not created as tables in the first place, or were somehow 'edited' afterwards, I suppose (I am not talking about OCR). Some lines in them are therefore parsed as two different lines (cut in two) and given different @y values and different baselines. The @y values differ only by about 0.9 or less, and a person looking at the document can tell the tokens are supposed to be on the same line. I did not calculate the difference between the baselines, which might be equally big or small.

    By "line" I mean a group of tokens with equal @y.

    For my use, I compared lines with each other and merged them back together when their @y values differed by 0.9 or less, but maybe there is a more elegant solution?
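
    The merging step described above could look roughly like the following sketch. It is an assumption-laden illustration, not the tool's own method: the sample XML, the PAGE wrapper, and the function name group_by_attr are invented for the example, and the 0.9 tolerance is simply the value mentioned in this post. Each token is compared against the first token of the current line, so small offsets cannot chain into one oversized line.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the attributes y and base are taken from the thread.
SAMPLE = """<PAGE>
  <TOKEN y="100.0" base="110.2">Total</TOKEN>
  <TOKEN y="100.7" base="110.9">42.00</TOKEN>
  <TOKEN y="120.0" base="130.2">Next</TOKEN>
</PAGE>"""


def group_by_attr(tokens, attr="y", tol=0.9):
    """Group tokens whose `attr` is within `tol` of the first token of the current line."""
    lines = []
    for tok in sorted(tokens, key=lambda t: float(t.get(attr))):
        value = float(tok.get(attr))
        if lines and value - lines[-1]["anchor"] <= tol:
            lines[-1]["tokens"].append(tok)
        else:
            # Start a new line anchored at this token's value.
            lines.append({"anchor": value, "tokens": [tok]})
    return [line["tokens"] for line in lines]


tokens = ET.fromstring(SAMPLE).iter("TOKEN")
lines = group_by_attr(tokens)
texts = [[tok.text for tok in line] for line in lines]
print(texts)
```

    Passing attr="base" groups on the baseline instead of @y, which makes it easy to compare the two groupings on the same page. Tokens inside a line come out ordered by the grouping attribute, so they would still need to be sorted horizontally for reading order.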

    If baselines could reliably tell which tokens are actually on the same line in the document (verifiable by a person who looks at the document or prints it), regardless of whether someone has post-edited the document and messed up the @y values, that would be a gain, I suppose. But I have no idea whether that is possible. Or, equally, maybe the @y values could stay the same while the baselines are messed up. If both were messed up, how would the PDF viewer be able to render the tokens on the same line? Maybe there is another indicator for tokens being on the same line?


