how to get text position?

Help
Minh Tran
2007-11-28
2013-05-28
  • Minh Tran
    2007-11-28

    First of all, thanks for the lib. I currently need to use pdfclown to extract text chunks together with their coordinates. Can you point me to where to start? I am a PDF novice, so if this functionality is not in the current release yet, what would I need to do to write it on top of your current implementation? Also, is it possible to determine the page's border size, so that I can tell whether text will flow onto the next line? Thanks a lot for the help.

     
    • mtraut
      2007-11-29

      First of all, thanks for using jPod...:-)

      ... really, the very first thing to do as a PDF novice is: read the spec. You need a good understanding of the PDF data structures, regardless of which lib you use...

      After that, use one of the examples that deals with page content, for example "Watermark". You should see how to track down the pages in the doc.

      Text extraction is currently not released as it is not "complete", but as you are not the first to need some hints, I have created a package "snippets" where you can download the current text extraction code. This code is unsupported and will most certainly change in the next release! But it will show you how to create an ICSDevice that filters text and position out of a content stream...

      With the following code, you should have a jump start on text extraction:

          {
              ...

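              // Walk all pages of the document, starting from the first one.
              PDPage page = doc.getPageTree().getFirstPage();
              while (page != null) {
                  // One extractor per page.
                  CSTextExtractor extractor = new CSTextExtractor();
                  extractText(extractor, page);
                  // The extracted text of this page; process it here.
                  String extract = extractor.getContent();
                  page = page.getNextPage();
              }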
              ...
          }

          protected void extractText(CSTextExtractor extractor, PDPage page) {
              try {
                  CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(
                          null, extractor);
                  interpreter.process(page.getContentStream(), page.getResources());
              } catch (CSException e) {
                  // exception? not of interest...
              }
          }
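
      Once an extractor hands back text chunks together with their positions, the chunks still have to be grouped into lines. The following is only a rough, self-contained sketch of that step and does not use the jPod API: the class TextLineGrouper, the Chunk type and the tolerance parameter are all made up for illustration, and it assumes each chunk carries the x/y of its baseline origin in PDF user space (y growing upward).

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Comparator;
      import java.util.List;

      public class TextLineGrouper {

          /** Hypothetical text chunk: a string plus its baseline origin (not a jPod type). */
          public static final class Chunk {
              public final String text;
              public final double x;
              public final double y;

              public Chunk(String text, double x, double y) {
                  this.text = text;
                  this.x = x;
                  this.y = y;
              }
          }

          /**
           * Groups chunks into lines. A chunk joins the current line when its y is
           * at most "tolerance" below the y of the first chunk of that line.
           * Lines are returned top to bottom, chunks within a line left to right.
           */
          public static List<String> groupIntoLines(List<Chunk> chunks, double tolerance) {
              // Sort top to bottom: in PDF user space a larger y is higher on the page.
              List<Chunk> sorted = new ArrayList<>(chunks);
              sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

              List<List<Chunk>> lines = new ArrayList<>();
              double lineY = 0;
              for (Chunk c : sorted) {
                  // Start a new line when the chunk sits too far below the current one.
                  if (lines.isEmpty() || lineY - c.y > tolerance) {
                      lines.add(new ArrayList<>());
                      lineY = c.y;
                  }
                  lines.get(lines.size() - 1).add(c);
              }

              List<String> result = new ArrayList<>();
              for (List<Chunk> line : lines) {
                  // Within one line, order the chunks left to right.
                  line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
                  StringBuilder sb = new StringBuilder();
                  for (Chunk c : line) {
                      if (sb.length() > 0) {
                          sb.append(' ');
                      }
                      sb.append(c.text);
                  }
                  result.add(sb.toString());
              }
              return result;
          }

          public static void main(String[] args) {
              // Made-up sample data; real chunks would come from the extractor.
              List<Chunk> chunks = Arrays.asList(
                      new Chunk("world", 110, 700.5),
                      new Chunk("Hello", 72, 700.0),
                      new Chunk("Next", 72, 686.0),
                      new Chunk("line", 100, 686.0));
              for (String line : groupIntoLines(chunks, 2.0)) {
                  System.out.println(line);
              }
          }
      }
      ```

      With the made-up sample data in main, the two chunks near y = 700 end up on one line and the two at y = 686 on the next. A fixed tolerance is a simplification; a value derived from the font size would probably work better on real pages.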

      michael

       
    • Minh Tran
      2007-11-29

      Thanks a lot for the quick response. I definitely need to read the spec, and I am doing that. But for my urgent project I need a quick start rather than waiting until I have finished reading the 1200 pages of the reference. Your help is truly appreciated. Once again, thanks for spending time on the project and for helping others.