Thank you for your work; it's very useful in my final year project.
I am using your example TextExtractionSample.cs and have noticed that the combination of letters 'fl' isn't extracted correctly by PDF Clown. I zipped the code to recreate the problem: http://www.geocities.com/kostyabkg/TextExtractionSample_recreate_problem.zip
This code takes in the pdf file, parses it according to my needs and outputs into the out_k.txt file.
Just look for 'fl' sequences in the pdf file, in words like 'floor'. Then check the output file out_k.txt, which doesn't contain those letter sequences, so instead of 'floor' I get 'oor'.
As I explained in other messages that appeared on this forum, I'm absolutely aware (and I stated so explicitly in the comment accompanying the sample code) of the current limitations of the text extraction *sample* (it's just an experimental piece of code that checks some of the low-level parsing functionalities exposed by the library without pretending to be a strong solution for text extraction :-) ).
Your issue is probably caused by a custom encoding (PDF Clown currently doesn't support custom encoding definitions on reading), so I have no ready-made suggestion to offer you, other than being patient (I'm sorry)...
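To illustrate the kind of mechanism suspected here, below is a minimal sketch in plain Java. It is not PDF Clown code, and every name in it is hypothetical. It assumes the font maps an arbitrary character code (0x1F is picked only for the example) to the 'fl' ligature through a custom encoding, as a PDF font can do with a Differences array, and that a reader knowing only the base encoding silently skips codes it cannot resolve.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: none of these names come from PDF Clown.
class LigatureDecodeSketch {

    // Decodes raw character codes to text. Codes missing from the table
    // are silently skipped, which is the failure mode assumed here.
    static String decode(int[] codes, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = table.get(code);
            if (text != null) {
                out.append(text);
            }
        }
        return out.toString();
    }

    // A tiny stand-in for a built-in encoding: only the letters needed.
    static Map<Integer, String> baseEncoding() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x6F, "o");
        table.put(0x72, "r");
        return table;
    }

    // The base encoding plus a font-specific override that maps the
    // arbitrarily chosen code 0x1F to the "fl" ligature.
    static Map<Integer, String> customEncoding() {
        Map<Integer, String> table = baseEncoding();
        table.put(0x1F, "fl");
        return table;
    }

    public static void main(String[] args) {
        // "floor" as the font might encode it: one code for "fl", then o, o, r.
        int[] codes = {0x1F, 0x6F, 0x6F, 0x72};
        System.out.println(decode(codes, baseEncoding()));   // prints "oor"
        System.out.println(decode(codes, customEncoding())); // prints "floor"
    }
}
```

If this assumption holds for the file in question, the missing 'fl' would come from the ligature code having no entry in the table used on reading, rather than from the text being absent from the PDF.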
No need to apologize, I am not complaining at all, it's all fine with me. I was just bringing another issue to your attention (providing feedback). I'll try to give more expanded feedback on the PDFClown library once I am done with my project.
Thank you again,
I really appreciate your feedback (I was just trying to avoid any misunderstanding -- I'd like to say that PDF Clown already features full-fledged text extraction, but that's simply not the case at the moment; it will be progressively achieved in the future releases though).
Here are my personal impressions on working with PDFClown:
- The tutorial (userGuide.pdf) explaining the PDF files and PDFClown interaction is pursuing a good goal, but imho it doesn't reach it. It didn't tell me much as a programmer at all. I had a task to extract the text from the PDF files, but the tutorial didn't help me. Only when I looked at TextExtractionSample I figured out how things are working.
- The structure of the library is not clear to me: what's the difference between it.stefanochizzolini.clown.documents.contents.objects and it.stefanochizzolini.clown.objects? There are so many things called "objects" that it's getting really confusing. IMHO, all I'd need to get to a quickstart is the hierarchy/structure of namespaces in the PDFClown library, the samples that would illustrate it and the brief structure of PDF files, related to PDFClown namespaces. I know there is a PDF reference and I had to look there, but that's no easy reading at all.
I guess that's all, so summarizing I'd say the following: it's a great working attempt to write a compliant PDF library, and it's great how all the classes are related to the PDF specification and reflect it, but it's hard to get your head around it at first and apply it immediately. So imho what needs to be there is a short tutorial kind of thing saying: this is what PDF documents look like, this is how the PDFClown library reflects/implements them and here are a few simple examples (which are already available in the samples directory).
Thank you and I hope I am not too ruthless in my feedback :-)
I've been asking myself for months why nobody had yet suggested something about documentation and usability... so your honest criticism is a good thing. In particular, I admit that the user guide (still a work-in-progress, anyway...) lacks both an exhaustive description of the namespace structure of the library (correlated to the PDF spec) and a hands-on practical explanation of how to apply the elements of that structure to actual user needs. These topics are indisputably relevant, so I'll consider implementing them in the next releases of the guide.
By the way: despite the lack of documentation, I assure you that the structure of the library has been defined carefully to rigorously expose the PDF functionalities. For example, the it.stefanochizzolini.clown.objects namespace contains primitive objects (the basic bricks that build up a PDF document, such as numbers, strings, streams, arrays and so on), whilst the it.stefanochizzolini.clown.documents.contents.objects namespace contains (you guessed it) content objects (the basic bricks that build up a content stream). So: content objects are contained by content streams, whilst content streams are contained by PDF documents.
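As a rough picture of that containment, here is a minimal sketch in plain Java. The class and method names are hypothetical and are not the library's actual types. It assumes that a content object corresponds roughly to one operator together with its operands, that the operands are primitive values (names, numbers, strings), and that the content stream is simplified to whitespace-separated tokens with no spaces inside strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these are not PDF Clown classes.
class ContentStructureSketch {

    // One content object: an operator plus the primitive operands before it.
    static final class Operation {
        final String operator;
        final List<String> operands;

        Operation(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    // Treats names (/F1), strings ((floor)) and numbers as primitive operands.
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '(' || c == '-' || c == '.' || Character.isDigit(c);
    }

    // Splits a simplified content stream into its content objects.
    static List<Operation> parse(String stream) {
        List<Operation> operations = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : stream.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty stream yields no operations
            }
            if (isOperand(token)) {
                operands.add(token);
            } else {
                operations.add(new Operation(token, operands));
                operands = new ArrayList<>();
            }
        }
        return operations;
    }

    public static void main(String[] args) {
        // A content stream such as a page might carry to show the word "floor".
        for (Operation op : parse("BT /F1 12 Tf (floor) Tj ET")) {
            System.out.println(op.operator + " " + op.operands);
        }
    }
}
```

In this simplified view, the stream passed to `parse` plays the role of a content stream held by a page of the document, each `Operation` is a content object, and each operand is a primitive object.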