how to get text position?

Help
Minh Tran
2007-11-28
2013-05-28
  • Minh Tran

    Minh Tran - 2007-11-28

    First of all, thanks for the lib. Currently I need to use pdfclown to extract text chunks with their coordinates. Can you point me to where to start? I am a PDF novice, so if this functionality is not in the current release yet, what would I need to do to write it based on your current implementation? Is it possible to determine the page's border size, to know whether the text will flow down to the next line? Thanks a lot for the help.

     
    • mtraut

      mtraut - 2007-11-29

      First of all, thanks for using jPod...:-)

      ... really, the very first thing to do as a PDF novice is: read the spec. You need a good understanding of the PDF data structures, regardless of which lib you use...

      After that, use one of the examples that deals with page content, for example "Watermark". You should see how to track down the pages in the doc.

      Text extraction is currently not released as it is not "complete", but as you are not the first who needs some hints, I have created a package "snippets" where you can download the current text extraction code. This code is unsupported and will most certainly change by the next release! But it will show you how to create an ICSDevice that filters text and position from a content stream...

      With the following code, you should have a jump start on text extraction:

          {
              ...

              // walk all pages of the document, starting with the first one
              PDPage page = doc.getPageTree().getFirstPage();
              while (page != null) {
                  // use a fresh extractor (the device) for every page
                  CSTextExtractor extractor = new CSTextExtractor();
                  extractText(extractor, page);
                  // "extract" now holds the text found on this page
                  String extract = extractor.getContent();
                  page = page.getNextPage();
              }
              ...
          }

          // feed the page's content stream through the extractor device
          protected void extractText(CSTextExtractor extractor, PDPage page) {
              try {
                  CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(
                          null, extractor);
                  interpreter.process(page.getContentStream(), page.getResources());
              } catch (CSException e) {
                  // exception? not of interest...
              }
          }
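
      To illustrate what "filter text and position from a content stream" means, here is a minimal, self-contained sketch. It does not use jPod: TextDevice, tokenize and interpret are hypothetical names made up for this example, and the toy interpreter understands only the BT, ET, Td and Tj operators. It ignores fonts, the text matrix (Tm), the current transformation matrix and glyph widths, so the reported coordinates are only line origins in text space, not exact glyph positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names belong to jPod. */
final class TextPositionSketch {

    /** Callback that receives each shown string with its line origin. */
    interface TextDevice {
        void textShow(String text, double x, double y);
    }

    /** Splits a content stream into tokens; "(...)" literals stay whole. */
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char ch = content.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (ch == '(') {
                // Simplification: no nested parentheses or escapes.
                int end = content.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                tokens.add(content.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < content.length()
                        && !Character.isWhitespace(content.charAt(i))
                        && content.charAt(i) != '(') {
                    i++;
                }
                tokens.add(content.substring(start, i));
            }
        }
        return tokens;
    }

    /** Walks the operators and reports every Tj string to the device. */
    static void interpret(String content, TextDevice device) {
        List<String> operands = new ArrayList<>();
        double lineX = 0; // start of the current text line
        double lineY = 0;
        for (String token : tokenize(content)) {
            switch (token) {
                case "BT": // begin text object: reset the line origin
                    lineX = 0;
                    lineY = 0;
                    operands.clear();
                    break;
                case "ET": // end text object
                    operands.clear();
                    break;
                case "Td": { // move line origin by (tx, ty), the last two operands
                    int n = operands.size();
                    lineX += Double.parseDouble(operands.get(n - 2));
                    lineY += Double.parseDouble(operands.get(n - 1));
                    operands.clear();
                    break;
                }
                case "Tj": { // show the string given as the last operand
                    String literal = operands.get(operands.size() - 1);
                    String text = literal.substring(1, literal.length() - 1);
                    device.textShow(text, lineX, lineY);
                    operands.clear();
                    break;
                }
                default: // operand, or an operator this sketch does not know
                    operands.add(token);
            }
        }
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        interpret("BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td (World) Tj ET",
                (text, x, y) -> found.add(text + " @ " + x + "," + y));
        found.forEach(System.out::println);
        // prints "Hello @ 72.0,700.0" and "World @ 72.0,686.0"
    }
}
```

      Note that in PDF every string is placed explicitly by operators like these. Text does not flow automatically at the page border, so line breaks have to be inferred from the reported positions.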

      michael

       
    • Minh Tran

      Minh Tran - 2007-11-29

      Thanks a lot for the quick response. I definitely need to read the spec, and I am doing that. But for my urgent project I need a quick start rather than waiting until I finish reading the 1200 pages of the reference. Your help is truly appreciated. Once again, thanks for spending time on the project and for helping others.

       
