Tips for using ExtractText on Windows

  • zakwer

    zakwer - 2005-11-04

    Sorry if this subject is more MS-DOS oriented than anything else.

    I am trying to extract the text out of some pdf files on a Windows PC.

    Using the Command Prompt (the MS-DOS console) I have managed to "pipe" (à la the UNIX | command) the output of ExtractText into a text file with this command:

    java -classpath H:\TMP\PDF\Multivalent20040415.jar tool.doc.ExtractText -verbose SAMPLE.pdf | TEXTO.txt

    The extracted text shows up in a notepad file, and I can then copy and paste it into a text editor to remove the funky return characters. Fine. Thanks. Love it.

    I wonder if anyone knows how to automate this a little more so that it works on a folder full of PDFs. The above command only "pipes" the first PDF file into the text file, seemingly ignoring the rest.

    Also, are there any other file formats available for output? I have tried .rtf, .doc and .htm to no avail.


    • Brian

      Brian - 2005-11-04

      MS-DOS does have its own scripting tools, which you use to build batch files (.BAT). They are not particularly extensive or elegant compared to Unix shell scripts, but you should be able to do what you are hoping to here.

      The following URL has some brief guidelines on how to use the handful of DOS scripting commands. http://www.cs.ntu.edu.au/homepages/bea/home/subjects/ith305/description.html

      The FOR command will probably do what you want it to.

    • Robert Eden

      Robert Eden - 2005-11-04

      Instead of sending to a pipe, redirect to a file.

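      > file.txt     (redirect: use this, it writes the output to the file)
      | file.txt     (pipe: not this, it treats file.txt as a program to run)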

      (I'm actually surprised the pipe worked)
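
      For anyone who would rather not write a batch file, here is a rough Java sketch that combines the two suggestions above: loop over every PDF in a folder, and send each one's output to its own text file (the equivalent of "> file.txt"). The jar path, tool class and -verbose flag are copied from the command in the first post. The names BatchExtract, outputNameFor and commandFor are made up for this example, and I have not run it against Multivalent itself, so treat it as a starting point only.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Multivalent.
class BatchExtract {
    // Jar path and tool class as given in the original post; adjust to your setup.
    static final String JAR = "H:\\TMP\\PDF\\Multivalent20040415.jar";
    static final String TOOL = "tool.doc.ExtractText";

    // Maps "SAMPLE.pdf" to "SAMPLE.txt" (only the last extension is replaced).
    static String outputNameFor(String pdfName) {
        int dot = pdfName.lastIndexOf('.');
        String base = (dot > 0) ? pdfName.substring(0, dot) : pdfName;
        return base + ".txt";
    }

    // Builds the same command line as in the original post, for one PDF.
    static List<String> commandFor(String pdfPath) {
        return new ArrayList<>(Arrays.asList(
                "java", "-classpath", JAR, TOOL, "-verbose", pdfPath));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Folder to scan: first argument, or the current folder if none is given.
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] pdfs = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"));
        if (pdfs == null) {
            System.err.println("Not a readable folder: " + dir);
            return;
        }
        for (File pdf : pdfs) {
            File out = new File(dir, outputNameFor(pdf.getName()));
            ProcessBuilder pb = new ProcessBuilder(commandFor(pdf.getPath()));
            pb.redirectOutput(out);                             // same idea as "> file.txt"
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // keep error messages on the console
            int exit = pb.start().waitFor();                    // run one PDF at a time
            System.out.println(pdf.getName() + " -> " + out.getName() + " (exit " + exit + ")");
        }
    }
}
```

      Compile it with javac and run it with the folder as the only argument, e.g. "java BatchExtract H:\TMP\PDF". Each PDF gets a .txt file of the same name next to it, so nothing is overwritten by the next file.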

