Parse dirty html

Help
Benjamin
2009-06-03
2013-01-03
  • Benjamin

    Benjamin - 2009-06-03

    Hello,

    Thanks for your great library!

    I have a problem with getTextExtractor on some (slightly dirty) HTML code like the following:

    A la demande de la ville, le cinéma. La décision a été prise.<img width='1' height='1' src='http://rss.vcbvbvc.com/c/f/s/47a73bc/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.fghfgh.com/images/partagez.gif" border="0" /></a></td><td valign='middle'><a href="http://res.sdfsdfsdf.com/viral/bookmark_fr.cfm?title=Un film sur fgfgh banni d'un cinéma&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-fgfg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.dfgdfg.com/images/bookmark.gif" border="0" /></a></td></tr></table></div><br/><br/><a href="http://da.dfgdfg.com/r/u/159/f/9916/c/568/s/75133884/a2.htm"><img src="http://da.dfgdfg.com/r//u/159/f/c/568/s/75133884/a2.img" border="0"/></a>

    Source source=new Source(new URL(sourceUrlString));
    System.out.println(source.getTextExtractor().toString());

    The result still contains some HTML.
    Is that normal?
    Couldn't the parser be a little more liberal?

    Thanks again
    Ben

     
    • Martin Jericho

      Martin Jericho - 2009-06-03

      Hi Ben,

      The parser doesn't filter out any HTML tags because the source text doesn't contain any.  For some reason they are all encoded.

      The following code might achieve what you want:

      Source source=new Source(new URL(sourceUrlString));
      source=CharacterReference.decode(source);
      System.out.println(source.getTextExtractor().toString());

      Cheers
      Martin
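
The point about encoded markup can be illustrated without the library. The following is a minimal sketch using only the JDK, under stated assumptions: `EncodedMarkupSketch`, `decodeBasicReferences` and `countTags` are hypothetical names invented for this illustration, the decoder handles only four named references, and the tag pattern is a rough approximation. In Jericho the decoding step discussed in this thread is `CharacterReference.decode`. The sketch only shows that entity-encoded text contains nothing that looks like a tag until it has been decoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodedMarkupSketch {
    // Rough approximation of "something that looks like a start or end tag".
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Hypothetical stand-in for a real character-reference decoder:
    // it only understands &lt; &gt; &quot; and &amp;.
    static String decodeBasicReferences(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // done last so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Counts the substrings that match the rough tag pattern.
    static int countTags(String text) {
        Matcher matcher = TAG.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String encoded = "A la demande de la ville.&lt;br/&gt;"
                + "&lt;a href=&quot;a2.htm&quot;&gt;lien&lt;/a&gt;";
        String decoded = decodeBasicReferences(encoded);

        System.out.println(countTags(encoded)); // 0: no literal '<' in the encoded text
        System.out.println(countTags(decoded)); // 3: <br/>, <a ...> and </a>
    }
}
```

A text extractor that strips tags has nothing to strip in the first string, which is why the encoded references have to be decoded before the markup can be recognised.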

       
      • Benjamin

        Benjamin - 2009-06-03

        Hello Martin,

        Thanks for your answer.

        Actually, this is the exact (dirty) code I'm using:
        Source source=new Source(new URL(sourceUrlString));
        Source source2=new Source(CharacterReference.decode(source.toString()));
        System.out.println("\n\n###\n"+source2.getTextExtractor().toString());

        This code compiles fine; some HTML tags are removed, but not all.

        I tried the cleaner code you sent me, but it doesn't compile:

        ExtractText.java:43: incompatible types
        found   : java.lang.String
        required: net.htmlparser.jericho.Source
                source=CharacterReference.decode(source);

        Thanks
        Ben

         
        • Martin Jericho

          Martin Jericho - 2009-06-03

          Sorry, the line was meant to be:
          source=new Source(CharacterReference.decode(source));

          But forget that anyway, as it would be just reversing what you are doing with the line:
          Source source2=new Source(CharacterReference.decode(source.toString()));

          Why are you doing that?  Remove the line and it should all work fine.

           
          • Benjamin

            Benjamin - 2009-06-03

            OK, I agree with you. I had more or less the same code, which brings us to my problem.

            The following HTML code is not correctly converted to text by the function:

            A la demande de la ville, le cinéma. La décision a été prise.<img width='1' height='1' src='http://rss.vcbvbvc.com/c/f/s/47a73bc/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.fghfgh.com/images/partagez.gif" border="0" /></a></td><td valign='middle'><a href="http://res.sdfsdfsdf.com/viral/bookmark_fr.cfm?title=Un film sur fgfgh banni d'un cinéma&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-fgfg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.dfgdfg.com/images/bookmark.gif" border="0" /></a></td></tr></table></div><br/><br/><a href="http://da.dfgdfg.com/r/u/159/f/9916/c/568/s/75133884/a2.htm"><img src="http://da.dfgdfg.com/r//u/159/f/c/568/s/75133884/a2.img" border="0"/></a>

            Source source=new Source(new URL(sourceUrlString));
            source=new Source(CharacterReference.decode(source));
            System.out.println(source.getTextExtractor().toString());

             
            • Martin Jericho

              Martin Jericho - 2009-06-03

              OK, the problem has nothing to do with the fact that the original file is HTML encoded.

              The output I get from the TextExtractor is:

              A la demande de la ville, le cinéma. La décision a été prise.
              <a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank">

              If you have a look at the log output you will see what the problem is:

              INFO: StartTag a at (r1,c214,p213) has missing whitespace after quoted attribute value at position (r1,c292,p291)
              INFO: StartTag a at (r1,c214,p213) contains attribute name with invalid character at position (r1,c297,p296)
              INFO: StartTag a at (r1,c214,p213) has missing whitespace after quoted attribute value at position (r1,c299,p298)
              INFO: StartTag a at (r1,c214,p213) contains attribute name with invalid character at position (r1,c304,p303)
              INFO: StartTag a at (r1,c214,p213) rejected because it contains too many errors
              INFO: Encountered possible StartTag at (r1,c214,p213) whose content does not match a registered StartTagType

              The <a> element that is not removed from the text contains quote characters inside the href attribute.
              You can increase the number of errors the parser tolerates using the Attributes.setDefaultMaxErrorCount static method.  Try setting it to 5.
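
The log above suggests that the parser counts errors per start tag and rejects the tag once a limit is passed. The following is a toy model of that idea using only the JDK. It is not Jericho's implementation: `ErrorBudgetSketch`, `countQuoteErrors` and `accepts` are hypothetical names, the two checks are simplified guesses inspired by the log messages, and the error count it reports will not match Jericho's. It assumes a tag is accepted while its error count does not exceed the limit.

```java
final class ErrorBudgetSketch {
    // Counts two kinds of suspicious quoting inside a start tag:
    //  1. a closing quote followed directly by something other than
    //     whitespace, '/' or '>'
    //  2. an opening quote that is not directly preceded by '='
    static int countQuoteErrors(String startTag) {
        int errors = 0;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = 0; i < startTag.length(); i++) {
            char c = startTag.charAt(i);
            if (quote == 0) {
                if (c == '"' || c == '\'') {
                    if (i == 0 || startTag.charAt(i - 1) != '=') {
                        errors++; // quote opens where no attribute value is expected
                    }
                    quote = c;
                }
            } else if (c == quote) {
                quote = 0;
                if (i + 1 < startTag.length()) {
                    char next = startTag.charAt(i + 1);
                    if (!Character.isWhitespace(next) && next != '/' && next != '>') {
                        errors++; // missing whitespace after quoted value
                    }
                }
            }
        }
        return errors;
    }

    // Assumption of this sketch: the tag is accepted while errors <= maxErrorCount.
    static boolean accepts(String startTag, int maxErrorCount) {
        return countQuoteErrors(startTag) <= maxErrorCount;
    }

    public static void main(String[] args) {
        // Shortened version of the problematic <a> tag: quotes inside the href value.
        String tag = "<a href=\"sendemail2_fr.html?title=Un film sur \"fgfgd\" banni"
                + "&link=x.html?\" target=\"_blank\">";

        System.out.println(countQuoteErrors(tag)); // 2
        System.out.println(accepts(tag, 1));       // false: over the limit, tag rejected
        System.out.println(accepts(tag, 5));       // true: within the limit, tag kept
    }
}
```

In Jericho itself no such code is needed. The change is the single static call named above, `Attributes.setDefaultMaxErrorCount(5)`, presumably made before the source is parsed.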

               
              • Benjamin

                Benjamin - 2009-06-03

                That works fine!

                Thanks
                Ben

                 
