
I Need Some Help !!! :(

  • Anonymous

    Anonymous - 2004-05-28

    Hello everyone,

    Thanks a lot for this useful forum, it really helps.
    Anyway, I'm stuck on something. My task is to process a PDF file, extract all of its contents, and send the text to a translator interface that returns the translated text. Then I have to reconstruct the document in the same format, but with the text in a different language :) ... It sounds easy, but there are some really annoying minor things to handle...

    I used the pdftohtml converter to convert the PDF into HTML, and I used htmlparser to parse the HTML document. My problem is that when pdftohtml converts the PDF into HTML, all the text from the PDF is put into HTML tags line by line (long sentences are broken across multiple lines, each line in its own tag). I need to reassemble the text strings into complete sentences or complete paragraphs before I send them to the translation machine, so that the translation makes sense.

    I developed an algorithm to concatenate the strings into sentences, but it doesn't work perfectly. Some strings get lost :S and I don't know why. Moreover, I lose the format of the document, because the translated text gets pasted at a different position in the HTML file (usually at the last retrieved line of text).

    Does anyone have any idea how to handle these things? Could anyone please help me with this? :(

    Your help is really appreciated.
    Thanks,
    Muaz H.

     
    • Derrick Oswald

      Derrick Oswald - 2004-05-28

      Look at http://htmlparser.sourceforge.net/wiki/index.php/CustomTagLinks to see how to write a custom tag.
      Whatever tag surrounds your lines of text, let's say it's <TEXT>, can be recognized as a custom tag. In the doSemanticAction() method you can concatenate your strings.

      You could take any tag class as an example to work from, but this might work:

      import org.htmlparser.tags.CompositeTag;
      import org.htmlparser.util.ParserException;

      public class MyText extends CompositeTag
      {
          public static StringBuffer mBuffer = new StringBuffer ();

          /**
           * The set of names handled by this tag.
           */
          private static final String[] mIds = new String[] {"TEXT"};

          /**
           * Create a new text tag.
           */
          public MyText ()
          {
          }

          /**
           * Return the set of names handled by this tag.
           * @return The names to be matched that create tags of this type.
           */
          public String[] getIds ()
          {
              return (mIds);
          }

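          /**
           * Append this tag's children to the shared buffer.
           */
          public void doSemanticAction () throws ParserException
          {
              // getChildren () is null for a tag with no children
              if (null != getChildren ())
                  mBuffer.append (getChildren ().toString());
          }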
      }

      Then after the parse:
          MyText.mBuffer.toString ()
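
      For the line-joining part of the question, here is a minimal sketch in plain Java, independent of htmlparser. It assumes the line fragments have already been pulled out of the HTML in document order, and that a sentence ends at '.', '!' or '?', which is only a rough heuristic. The names SentenceJoiner, Sentence and firstLine are made up for illustration. Each rebuilt sentence remembers the index of the fragment where it started, so the translated text could be written back at that position rather than at the last retrieved line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of htmlparser or pdftohtml.
class SentenceJoiner {

    /** A rebuilt sentence plus the index of the line fragment where it began. */
    static final class Sentence {
        final int firstLine;
        final String text;

        Sentence(int firstLine, String text) {
            this.firstLine = firstLine;
            this.text = text;
        }
    }

    /** Join consecutive line fragments until one ends with sentence punctuation. */
    static List<Sentence> join(List<String> lines) {
        List<Sentence> result = new ArrayList<Sentence>();
        StringBuilder current = new StringBuilder();
        int start = -1;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // skip blank fragments
            }
            if (current.length() == 0) {
                start = i; // remember where this sentence began
            } else {
                current.append(' ');
            }
            current.append(line);
            if (endsSentence(line)) {
                result.add(new Sentence(start, current.toString()));
                current.setLength(0);
            }
        }
        // Flush whatever is left so a trailing fragment is not dropped.
        if (current.length() > 0) {
            result.add(new Sentence(start, current.toString()));
        }
        return result;
    }

    // Assumption: '.', '!' and '?' mark the end of a sentence.
    private static boolean endsSentence(String line) {
        char last = line.charAt(line.length() - 1);
        return last == '.' || last == '!' || last == '?';
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "The quick brown fox", "jumps over the dog.", "Second one!");
        for (Sentence s : join(lines)) {
            System.out.println(s.firstLine + ": " + s.text);
        }
    }
}
```

      The flush after the loop keeps a final fragment that has no end punctuation, which is one place where text can go missing in a joiner like this. Abbreviations and headings will still confuse the punctuation check, so treat it as a starting point.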

       
