Thread: [Podofo-users] Margins and Fonts
A PDF parsing, modification and creation library.
From: Trevor K. <en...@gm...> - 2009-07-22 21:10:02
Hello,

First of all, thank you for creating PoDoFo. Writing to existing PDFs is a lot easier with PoDoFo than I ever thought it would be.

Margins. Most PDFs I've seen leave some space between the edge of the "body" text and the edge of the page. I will call this space a margin (I don't know if there is an official PDF term). I'd like to figure out the width of the margins. As far as I can tell, there is no PdfPage.GetMargins(), and the quick look I took through the PDF spec didn't turn up any mention of margins as I have defined them. The PDFs I want to work with will either have a bunch of body text or be a slide (PowerPoint, etc.) converted to PDF. I'd like to be able to find the size of the margins in order to draw inside them.

Fonts. I made a quick modification of the hello world example so that it takes an existing PDF and adds some text to it. The size of the PDF file tripled for a short line of text. I am assuming this is because I embedded a new font when I added the text. If that is not the case, maybe this doesn't matter so much. Anyway, is there a way to get the font(s) that are already embedded in an existing PDF and reuse them?

If this stuff is possible but not currently doable in the code, I am open to helping, provided I get some guidance.

--
Sincerely,
Trevor Kaufman
From: Leonard R. <lro...@ad...> - 2009-07-22 21:19:59
There is no such thing as a margin in PDF. Text (like any other object) is simply drawn wherever on the page the producer wishes. If there happens to be what you (as a human) perceive as "white space" around such objects, that's simply where the producer chose to draw them. It is possible to determine (though not with PoDoFo at this time) the "bounding box" of all content on the page and then compare that to the visible area of the page to determine whether there is any difference and, if so, what size it is.

The current version of PoDoFo does what is called "full embedding", meaning that every byte of the font is included in the PDF, hence the large size. PDF also supports "subset embedding", where only those glyphs used in the text are stored in the PDF, but the work on that isn't complete yet. When it is, your file size will go WAY down.

Leonard
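The comparison Leonard describes (content bounding box against the visible page area) comes down to four subtractions once both rectangles are known. The following is a minimal sketch in plain Python, not PoDoFo code: the names `Box`, `Margins` and `margins` are made up for illustration, and the sample numbers are assumed values, since obtaining the content bounding box is the part PoDoFo does not do.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A rectangle in PDF user-space units (lower-left and upper-right corners)."""
    llx: float
    lly: float
    urx: float
    ury: float


class Margins(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def margins(page: Box, content: Box) -> Margins:
    """Return the gap between the page box and the content bounding box on each side."""
    return Margins(
        left=content.llx - page.llx,
        bottom=content.lly - page.lly,
        right=page.urx - content.urx,
        top=page.ury - content.ury,
    )


# Assumed example values: a 612 x 792 page and a hypothetical content box.
page = Box(0, 0, 612, 792)
content = Box(72, 90, 540, 720)
print(margins(page, content))
```

A negative value on any side would mean the content extends past the page box there, so there is no room to draw on that side.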
From: Trevor K. <en...@gm...> - 2009-07-22 22:28:18
Thanks for the reply. Is there a plan for adding the ability to compute the "bounding box" into PoDoFo?

--
Sincerely,
Trevor Kaufman
From: Pierre M. <pi...@mo...> - 2009-07-22 23:20:45
You (Trevor Kaufman) wrote:
> Is there a plan for adding the ability to compute the "bounding box"
> into PoDoFo?

In my opinion it’s really out of PoDoFo’s scope.

--
Pierre Marchand
From: Craig R. <cr...@po...> - 2009-07-22 23:54:34
On Wed, 2009-07-22 at 17:09 -0400, Trevor Kaufman wrote:
> Margins.. Most PDFs I've seen leave some space between the edge of the
> "body" text and the edge of the page. [...] I'd like to be able to find
> the size of the margins in order to draw inside them.

As Leonard noted, there's no easy way to find the bounding box of the PDF content. PDF does actually have definitions for boxes that you might call margins of various sorts (PDF Reference 10.10.1, "Page Boundaries") - but they're frequently left unset or incorrectly set, and are defined to include "meaningful whitespace" anyway.

Getting the bounding box of the content involves processing the PDF content stream(s) and tracking parts of the graphics state while checking where each drawing operator would draw. At present PoDoFo doesn't do this - and in fact knows nothing about what the operators in content streams do.

I'd really like to see a way to use Poppler's PDF content stream processing with PoDoFo as the PDF file structure access backend, so things like this and thumbnail generation could be handled. Right now, though, there's nothing like that, and I haven't looked at Poppler in detail to see what doing it would involve and how maintainable such a modification would be.

> Fonts. [...] Anyways, is there a way to get the font(s) that are
> already embedded in a existing PDF and reuse them?

The trouble there is that most PDFs contain fonts embedded as subsets. These fonts only contain the glyphs that are actually used in the PDF document. For a PDF with the text "aardvark" the glyphs for "a", "d", "k", "r" and "v" would be included. If you wanted to add the text "hello world", the embedded subset would lack the glyphs you needed, so you'd have to re-embed the font (if you had an _identical_ copy and could identify the embedded font), embed a different subset, or use a different font.

Right now PoDoFo doesn't support subsetting during embedding. Dom's done some work on this but I don't know what the current status of it is.

So - you probably could re-use an already embedded font, IF you could determine that it was fully embedded or contained all the glyphs you needed. Right now, though, PoDoFo doesn't offer you any help with this, so you'd have to do it using the low-level document structure.

> If this stuff is possible but not currently doable in the code, I am
> open to helping provided it with some guidance.

I'm not sure how much help I can be right now. Looking at PdfFont.h it appears that there's some facility for using existing fonts already, but you'd need to be able to find the font in the document structure and determine that it was (a) the font you needed and (b) not a subset, or a subset containing the glyphs you need.

A class to enumerate fonts in a PDF and provide some information about them (like subset glyphs, etc.) would be a good thing to tackle.

--
Craig Ringer
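The two checks Craig lists can be sketched in a few lines of plain Python. This is an illustration only, not PoDoFo code: `is_subset_font_name` and `missing_glyphs` are hypothetical helper names, and characters stand in for glyphs (a real check would have to go through the font's encoding). The name test relies on the PDF Reference convention that a subset font's name starts with a tag of six uppercase letters followed by "+"; a name without the tag is not proof that the font is complete.

```python
import re

# PDF Reference convention: subset font names look like "ABCDEF+FontName".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_font_name(base_font: str) -> bool:
    """Return True if the font name carries a subset tag."""
    return bool(SUBSET_TAG.match(base_font))


def missing_glyphs(available: set, text: str) -> set:
    """Return the characters of `text` that the embedded subset cannot draw.

    Assumption for this sketch: one character corresponds to one glyph.
    """
    return {ch for ch in text if ch not in available}


# The example from the message above: a subset built for "aardvark".
embedded = set("aardvark")
print(sorted(embedded))
print(sorted(missing_glyphs(embedded, "hello world")))
print(is_subset_font_name("ABCDEF+Helvetica"), is_subset_font_name("Helvetica"))
```

If `missing_glyphs` returns an empty set, the embedded font is a candidate for reuse; otherwise one of the three options Craig lists applies.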
From: Martin S. <ma...@on...> - 2009-07-23 00:03:01
2009/7/22 Trevor Kaufman <en...@gm...>:
> Margins.. Most PDFs I've seen leave some space between the edge of the
> "body" text and the edge of the page. [...] I'd like to be able to find
> the size of the margins in order to draw inside them.

You can use ghostscript (gs -sDEVICE=bbox) to get the bbox of the non-white area. Leonard has explained the idea of "margins" in PDF.

> Fonts. [...] Anyways, is there a way to get the font(s) that are
> already embedded in a existing PDF and reuse them?

If you intend to use a lot of fonts (or want to include pdfs with fonts), you should look at pdftex/luatex. We've spent years on getting subsetting etc. right, especially with included pdfs.

Best
   Martin

PS: I'm not dissing PoDoFo - these are simply areas where other applications already deliver what's needed.
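For anyone scripting the Ghostscript route, here is a small sketch of reading the result. It assumes the bbox device reports one `%%BoundingBox:` and one `%%HiResBoundingBox:` line per page, and (as far as I know) writes them to standard error. The helper name `parse_bboxes` and the sample text are made up for illustration; the sample is not captured from a real run.

```python
import re
from typing import List, Tuple

# A typical invocation (file names are placeholders):
#   gs -q -dNOPAUSE -dBATCH -sDEVICE=bbox input.pdf 2> bbox.txt
HIRES = re.compile(
    r"^%%HiResBoundingBox:\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)", re.MULTILINE
)


def parse_bboxes(gs_output: str) -> List[Tuple[float, ...]]:
    """Return one (llx, lly, urx, ury) tuple per reported page."""
    return [tuple(float(v) for v in m.groups()) for m in HIRES.finditer(gs_output)]


# Illustrative text in the assumed output shape, two pages.
sample = (
    "%%BoundingBox: 72 90 540 720\n"
    "%%HiResBoundingBox: 72.000000 90.500000 540.000000 720.000000\n"
    "%%BoundingBox: 36 36 576 756\n"
    "%%HiResBoundingBox: 36.000000 36.000000 576.000000 756.000000\n"
)
for box in parse_bboxes(sample):
    print(box)
```

Each tuple can then be compared with the page size to get the free space on every side.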
From: Leonard R. <lro...@ad...> - 2009-07-23 01:19:54
Connecting up Poppler to PoDoFo wouldn't be that difficult - I've sat Xpdf (on which Poppler is based) on top of other PDF libraries in the past to use its rendering facilities in conjunction with other PDF reading/writing needs. Basically you just need to replace parts of the core Object/Dict/Array/Stream classes.

I would think, however, that the big issue would be the licensing concerns.

Leonard
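To give a feel for what the content-stream processing Craig describes involves (tracking the graphics state while checking where each drawing operator would draw), here is a toy sketch in plain Python. It is not Poppler or PoDoFo code, and `rect_bbox` is a made-up name. It assumes an already decoded, whitespace-separated stream and understands only four operators: `q`, `Q`, `cm` and `re`. Text, images, curves, clipping and whether a path is actually painted are all ignored, which is where the real work lies.

```python
from typing import List, Optional, Tuple

Matrix = Tuple[float, float, float, float, float, float]
IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Return m x ctm, the product the `cm` operator installs as the new CTM."""
    a, b, c, d, e, f = m
    A, B, C, D, E, F = ctm
    return (a * A + b * C, a * B + b * D,
            c * A + d * C, c * B + d * D,
            e * A + f * C + E, e * B + f * D + F)


def transform(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a user-space point through the current transformation matrix."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def rect_bbox(stream: str) -> Optional[Tuple[float, float, float, float]]:
    """Bounding box of every `re` rectangle in a simplified content stream."""
    ctm = IDENTITY
    saved: List[Matrix] = []      # graphics-state stack for q / Q
    operands: List[float] = []
    xs: List[float] = []
    ys: List[float] = []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue              # numbers are operands, keep collecting
        except ValueError:
            pass                  # anything else is treated as an operator
        if token == "q":
            saved.append(ctm)
        elif token == "Q" and saved:
            ctm = saved.pop()
        elif token == "cm" and len(operands) >= 6:
            ctm = concat(tuple(operands[-6:]), ctm)
        elif token == "re" and len(operands) >= 4:
            x, y, w, h = operands[-4:]
            for px, py in ((x, y), (x + w, y), (x, y + h), (x + w, y + h)):
                tx, ty = transform(ctm, px, py)
                xs.append(tx)
                ys.append(ty)
        operands = []             # operands are consumed by each operator
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


print(rect_bbox("q 2 0 0 2 10 10 cm 0 0 50 20 re f Q 5 5 10 10 re f"))
```

Even this reduced version needs a matrix stack and per-operator logic, which helps explain why reusing an existing content-stream interpreter looked more attractive than writing one.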
From: Craig R. <cr...@po...> - 2009-07-23 03:24:18
On Wed, 2009-07-22 at 18:19 -0700, Leonard Rosenthol wrote:
> Connecting up Poppler to PoDoFo wouldn't be that difficult - I've sat
> Xpdf (on which Poppler is based) on top of other PDF libraries in the
> past to use its rendering facilities in conjunction with other PDF
> reading/writing needs. Basically you just need to replace parts of the
> core Object/Dict/Array/Stream classes.
>
> I would think, however, that the big issue would be the licensing
> concerns.

Yeah. It's full GPL. There's nothing wrong with linking PoDoFo to it or providing the facility to do so, but any PoDoFo user using a copy of PoDoFo with support for using poppler enabled would have to follow the rules of the full GPL, not just the lesser/library GPL.

--
Craig Ringer
From: Craig R. <cr...@po...> - 2009-07-23 03:34:26
On Thu, 2009-07-23 at 02:02 +0200, Martin Schröder wrote:
> If you intend to use a lot of fonts (or want to include pdfs with
> fonts), you should look at pdftex/luatex. We've spent years on getting
> subsetting etc. right, especially with included pdfs.

PdfTex is full GPL, isn't it? Pity (as far as I'm concerned), since that means we can't re-use or library-ify any of the font parsing and subsetting code unless that code is explicitly re-licensed to dual GPL/LGPL by its author(s) :S

I'm increasingly irritated by the lack of a good standalone open source library for font subsetting and parsing. Freetype is designed to make it hard to get at the guts of it for this sort of use, and it's not really built to handle subsetting and such anyway. I'd really like to move toward a shared library for font subsetting for cairo, pdftex, Scribus, PoDoFo, etc., but feel somewhat hobbled by the full-GPL-only nature of most of the existing subsetting implementations.

> PS: I'm not dissing PoDoFo - these are simply areas where other
> applications already deliver what's needed.

I think people need to be less defensive about their software ;-)

There are _often_ cases where someone else's software is a better fit for somebody's needs than one's own. For example, PoDoFo's design makes it good for low-level PDF inspection and editing and for the creation of PDFs with complex or custom constructs that higher level tools may not understand or support. On the other hand, both in design and in current implementation state it's not ideal for simple generation of new PDF documents, it's a terrible choice for high level typesetting and layout, and it's not great at working with fonts.

I'll recommend Cairo over PoDoFo for a bunch of uses. I don't have much experience with pdftex, but even so it'd be one of the first things that'd come to mind for document processing needs (gee, surprising that).

I didn't even know about luatex - but it seems like a pretty good idea. Lua is a good choice of language, too. I've had enough experience working with one common/popular embedding language, Python, to know how shocking it is when you want to embed it.

--
Craig Ringer
From: Martin S. <ma...@on...> - 2009-07-24 15:25:58
2009/7/23 Craig Ringer <cr...@po...>:
> PdfTex is full GPL, isn't it? Pity (as far as I'm concerned) since that
> means we can't re-use or library-ify any of the font parsing and
> subsetting code unless that code is explicitly re-licensed to dual
> GPL/LGPL by its author(s) :S

The GPL comes from XPDF. If that were gone (and replaced with some LGPL/BSD thing), one could rethink the licensing of a librarified luaTeX. But that's not in the foreseeable future.

> I'm increasingly irritated by the lack of a good standalone open source
> library for font subsetting and parsing. [...] I'd really like to move
> toward a shared library for font subsetting for cairo, pdftex, Scribus,
> PoDoFo, etc, but feel somewhat hobbled by the full-GPL-only nature of
> most of the existing subsetting implementations.

Agreed - that would be nice. Although luaTeX would lose its lead. :-)

> There are _often_ cases where others' is a better fit for somebody's
> needs than one's own. For example, PoDoFo's design makes it good for
> low-level PDF inspection and editing [...] On the other hand, [...]
> it's a terrible choice for high level typesetting and layout, and it's
> not great at working with fonts.

Amen. podofobrowser is a godsend. :-)

> I didn't even know about luatex - but it seems like a pretty good idea.
> Lua is a good choice of language, too. I've had enough experience
> working with one common/popular embedding language, Python, to know how
> shocking it is when you want to embed it.

My dream is a luatex with a good pdf library that can be used as a pdf (dis)assembler from lua.

Best
   Martin