Parse dirty HTML

Help
Benjamin
2009-06-03
2013-01-03
  • Benjamin
    2009-06-03

    Hello,

    Thanks for your great library!

    I have a problem with getTextExtractor on some (slightly dirty) HTML code like the following:

    A la demande de la ville, le cinéma. La décision a été prise.<img width='1' height='1' src='http://rss.vcbvbvc.com/c/f/s/47a73bc/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.fghfgh.com/images/partagez.gif" border="0" /></a></td><td valign='middle'><a href="http://res.sdfsdfsdf.com/viral/bookmark_fr.cfm?title=Un film sur fgfgh banni d'un cinéma&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-fgfg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.dfgdfg.com/images/bookmark.gif" border="0" /></a></td></tr></table></div><br/><br/><a href="http://da.dfgdfg.com/r/u/159/f/9916/c/568/s/75133884/a2.htm"><img src="http://da.dfgdfg.com/r//u/159/f/c/568/s/75133884/a2.img" border="0"/></a>

    Source source=new Source(new URL(sourceUrlString));
    System.out.println(source.getTextExtractor().toString());

    The result still contains some HTML.
    Is this normal?
    Couldn't the parser be a little more liberal?

    Thanks again
    Ben

     
    • Martin Jericho
      2009-06-03

      Hi Ben,

      The parser doesn't filter out any HTML tags because the source text doesn't contain any.  For some reason they are all encoded.

      The following code might achieve what you want:

      Source source=new Source(new URL(sourceUrlString));
      source=CharacterReference.decode(source);
      System.out.println(source.getTextExtractor().toString());

      Cheers
      Martin
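Martin's point that encoded markup contains no tags can be illustrated without the library. The following is a minimal sketch in plain Java, not Jericho's implementation: the class EntityDecodeSketch and its methods are hypothetical names, and only five common character references are handled, whereas CharacterReference.decode is the library call that does the real decoding.

```java
// Hypothetical illustration only; this is not part of the Jericho API.
final class EntityDecodeSketch {

    /** Returns true if the text contains a literal '<', i.e. something a tag parser could see. */
    static boolean containsTagStart(String text) {
        return text.indexOf('<') >= 0;
    }

    /**
     * Decodes five common character references.
     * "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" and not "<".
     */
    static String decodeBasicEntities(String text) {
        return text
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String encoded = "La d\u00e9cision a \u00e9t\u00e9 prise.&lt;br/&gt;";
        // Before decoding there is no '<' in the text, so no tag can be recognised.
        System.out.println(containsTagStart(encoded));
        // After decoding the '<' is present and a parser can find the tag.
        System.out.println(containsTagStart(decodeBasicEntities(encoded)));
    }
}
```

With the encoded input, a tag parser has nothing to strip; only after decoding does the markup become visible to it.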

       
      • Benjamin
        2009-06-03

        Hello Martin,

        Thanks for your answer.

        Actually, this is the exact (dirty) code I'm using:
        Source source=new Source(new URL(sourceUrlString));
        Source source2=new Source(CharacterReference.decode(source.toString()));
        System.out.println("\n\n###\n"+source2.getTextExtractor().toString());

        This code compiles fine; some HTML tags are removed, but not all.

        I tried the cleaner code you sent me, but it doesn't compile:

        ExtractText.java:43: incompatible types
        found   : java.lang.String
        required: net.htmlparser.jericho.Source
                source=CharacterReference.decode(source);

        Thanks
        Ben

         
        • Martin Jericho
          2009-06-03

          Sorry, the line was meant to be:
          source=new Source(CharacterReference.decode(source));

          But forget that anyway, as it would be just reversing what you are doing with the line:
          Source source2=new Source(CharacterReference.decode(source.toString()));

          Why are you doing that?  Remove the line and it should all work fine.

           
          • Benjamin
            2009-06-03

            OK, I agree with you. I had more or less the same code, which brings us to my problem.

            The following HTML code is not correctly converted to text by the function:

            A la demande de la ville, le cinéma. La décision a été prise.<img width='1' height='1' src='http://rss.vcbvbvc.com/c/f/s/47a73bc/mf.gif' border='0'/><div class='mf-viral'><table border='0'><tr><td valign='middle'><a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.fghfgh.com/images/partagez.gif" border="0" /></a></td><td valign='middle'><a href="http://res.sdfsdfsdf.com/viral/bookmark_fr.cfm?title=Un film sur fgfgh banni d'un cinéma&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-fgfg-banni-d-un-cinema-parisien_764721.html?" target="_blank"><img src="http://rss.dfgdfg.com/images/bookmark.gif" border="0" /></a></td></tr></table></div><br/><br/><a href="http://da.dfgdfg.com/r/u/159/f/9916/c/568/s/75133884/a2.htm"><img src="http://da.dfgdfg.com/r//u/159/f/c/568/s/75133884/a2.img" border="0"/></a>

            Source source=new Source(new URL(sourceUrlString));
            source=new Source(CharacterReference.decode(source));
            System.out.println(source.getTextExtractor().toString());

             
            • Martin Jericho
              2009-06-03

              OK, the problem has nothing to do with the fact that the original file is HTML encoded.

              The output I get from the TextExtractor is:

              A la demande de la ville, le cinéma. La décision a été prise.
              <a href="http://res.cvbbcvbvc.com/viral/sendemail2_fr.html?title=Un film sur "fgfgd" banni&link=http://www.dfgdfg.fr/actualite/societe/un-film-sur-gffg-banni-d-un-cinema-parisien_764721.html?" target="_blank">

              If you have a look at the log output you will see what the problem is:

              INFO: StartTag a at (r1,c214,p213) has missing whitespace after quoted attribute value at position (r1,c292,p291)
              INFO: StartTag a at (r1,c214,p213) contains attribute name with invalid character at position (r1,c297,p296)
              INFO: StartTag a at (r1,c214,p213) has missing whitespace after quoted attribute value at position (r1,c299,p298)
              INFO: StartTag a at (r1,c214,p213) contains attribute name with invalid character at position (r1,c304,p303)
              INFO: StartTag a at (r1,c214,p213) rejected because it contains too many errors
              INFO: Encountered possible StartTag at (r1,c214,p213) whose content does not match a registered StartTagType

              The <a> element that is not removed from the text contains unescaped quote characters inside the href attribute value.
              You can increase the number of errors the parser tolerates using the Attributes.setDefaultMaxErrorCount static method.  Try setting it to 5.
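The log above shows the tag being rejected once its attribute errors pile up. The idea of an error budget can be sketched in plain Java. This is a toy model under stated assumptions, not Jericho's parser: the class AttributeErrorSketch, its rules, and the thresholds 2 and 5 used in main are all hypothetical, and its error count will not necessarily match the library's log. In Jericho itself, the fix Martin describes is the static Attributes.setDefaultMaxErrorCount method, presumably called before the source is parsed.

```java
// Toy model of "reject a start tag when its attributes contain too many errors".
// Hypothetical names and rules; Jericho's real parsing logic is more involved.
final class AttributeErrorSketch {

    /** Counts syntax problems in the attribute portion of a start tag. */
    static int countErrors(String attrs) {
        int errors = 0;
        int i = 0;
        final int n = attrs.length();
        while (i < n) {
            // Skip whitespace between attributes.
            while (i < n && Character.isWhitespace(attrs.charAt(i))) i++;
            if (i >= n) break;

            // Read the attribute name up to '=' or whitespace.
            int nameStart = i;
            while (i < n && attrs.charAt(i) != '=' && !Character.isWhitespace(attrs.charAt(i))) i++;
            String name = attrs.substring(nameStart, i);
            // Error: the name is empty or contains a character such as '"' or '&'.
            if (!name.matches("[A-Za-z_:][A-Za-z0-9_:.-]*")) errors++;

            if (i < n && attrs.charAt(i) == '=') {
                i++; // consume '='
                if (i < n && (attrs.charAt(i) == '"' || attrs.charAt(i) == '\'')) {
                    char quote = attrs.charAt(i++);
                    int close = attrs.indexOf(quote, i);
                    if (close < 0) {
                        errors++; // error: quoted value is never closed
                        i = n;
                    } else {
                        i = close + 1;
                        // Error: missing whitespace after quoted attribute value.
                        if (i < n && !Character.isWhitespace(attrs.charAt(i))) errors++;
                    }
                } else {
                    // Unquoted value: read up to the next whitespace.
                    while (i < n && !Character.isWhitespace(attrs.charAt(i))) i++;
                }
            }
        }
        return errors;
    }

    /** A tag is accepted only while its error count stays within the budget. */
    static boolean isAccepted(String attrs, int maxErrors) {
        return countErrors(attrs) <= maxErrors;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the href in the thread, with quotes inside the value.
        String dirty = "href=\"a?title=Un film sur \"fgfgd\" banni&link=b\" target=\"_blank\"";
        System.out.println("errors: " + countErrors(dirty));
        System.out.println("accepted with max 2: " + isAccepted(dirty, 2));
        System.out.println("accepted with max 5: " + isAccepted(dirty, 5));
    }
}
```

The sketch shows why raising the limit helps: the inner quotes end the href value early, the leftover text is then misread as malformed attributes, and a larger budget lets the tag be recognised anyway so that it can be stripped from the extracted text.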

               
              • Benjamin
                2009-06-03

                That works fine!

                Thanks
                Ben