
turn off extra <html> tags

mpease
2007-03-16
2016-08-17
  • mpease

    mpease - 2007-03-16

    Hi  -

      Cleaning HTML works well, but I don't want these wrapper tags

    <html><head/>
    <body>blah <p>blah</p> blah</body>
    </html>

    in the output.

    Is there any way to instead output this:
    blah <p>blah</p> blah

    ?

    Thank you-
    Matt

     
    • Vladimir Nikic

      Vladimir Nikic - 2007-03-19

      Well, currently it's not so easy.
      The basic HTML skeleton (the HTML, HEAD and BODY tags) is added
      statically during the cleanup process.
      It may sound like a bad solution, but I think the best way at the moment
      is to remove them after cleaning, using string manipulation functions.
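
A minimal sketch of that string-manipulation approach, using only the standard library. `EnvelopeStripper` and `stripEnvelope` are hypothetical names for illustration and are not part of HtmlCleaner. It assumes the cleaned output contains at most one body element and that no attribute value on the body tag contains a `>` character.

```java
// Hypothetical helper, not part of HtmlCleaner.
final class EnvelopeStripper {
    private EnvelopeStripper() {}

    /**
     * Returns the text between the opening body tag and the last closing
     * body tag, or the input unchanged if no body element is found.
     */
    static String stripEnvelope(String cleaned) {
        int open = cleaned.indexOf("<body");
        if (open < 0) {
            return cleaned;               // no envelope to strip
        }
        int openEnd = cleaned.indexOf('>', open);
        if (openEnd < 0) {
            return cleaned;               // malformed opening tag, leave as-is
        }
        if (cleaned.charAt(openEnd - 1) == '/') {
            return "";                    // self-closing <body/> has no content
        }
        int close = cleaned.lastIndexOf("</body>");
        if (close < openEnd) {
            return cleaned;               // no closing tag after the opening one
        }
        return cleaned.substring(openEnd + 1, close);
    }
}
```

Calling `EnvelopeStripper.stripEnvelope(output)` on the cleaned string from the original question should leave just the body content, and a string with no body element is returned untouched.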

       
    • Sam De Block

      Sam De Block - 2007-04-11

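      // Requires java.util.regex.Pattern and java.util.regex.Matcher.
      // DOTALL lets '.' match the line breaks in the cleaner's output.
      Pattern pattern = Pattern.compile("<html>\\s*<head\\s*/>\\s*<body>(.*?)</body>\\s*</html>", Pattern.DOTALL);
      Matcher matcher = pattern.matcher(output);
      if (matcher.find()) {
          System.out.println(matcher.group(1));
      }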

      This should do the trick

       
      • Vladimir Nikic

        Vladimir Nikic - 2007-05-05

        Support for removing the HTML envelope has now been added in version 1.2.

         
    • coffee tea

      coffee tea - 2008-01-15

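      You can do it like this:
          HtmlCleaner hc = new HtmlCleaner(str);
          try {
              // Drop the <html>/<head>/<body> wrapper and the XML declaration.
              hc.setOmitHtmlEnvelope(true);
              hc.setOmitXmlDeclaration(true);
              hc.clean();
              return hc.getXmlAsString();
          } catch (IOException e) {
              // Fall back to the uncleaned input if cleaning fails.
              return str;
          }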

       
      • Werner Donné

        Werner Donné - 2016-08-11

        Hello,

        This is not the solution. I'm trying version 2.16. When I set those properties with CleanerProperties, the HTML envelope is always removed, even if the input contained one. This way it is not possible to see whether the input was a complete document or only a fragment.

        I can understand that it is difficult to make the HTML envelope conditional in the code. A compromise could be to set the "autoGenerated" property on the added envelope tags when the input didn't contain them.

        Best regards,

        Werner.

         
        • Scott Wilson

          Scott Wilson - 2016-08-17

          Hi Werner,

          The envelope is recreated while parsing the document type and encoding, so that we produce a valid output DOM, particularly for the JDOM serializer. However, I can see it being a real pain if you need to know whether the input was a document or a fragment. Another option would be a property that sets OmitHtmlEnvelope dynamically based on the content.
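
Until something like that exists in the library, one possible workaround is to inspect the raw input yourself before cleaning and choose the setting from that. The sketch below is only an illustration: `EnvelopeDetector` and `hasHtmlEnvelope` are hypothetical names, not HtmlCleaner API, and the check is a plain regex, so it can be fooled by an `<html>` tag that appears inside a comment or a script.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
final class EnvelopeDetector {
    // Matches an opening <html> tag, with or without attributes, in any letter case.
    private static final Pattern HTML_OPEN =
            Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    private EnvelopeDetector() {}

    /** Returns true if the raw, uncleaned input appears to contain an <html> opening tag. */
    static boolean hasHtmlEnvelope(String rawInput) {
        return HTML_OPEN.matcher(rawInput).find();
    }
}
```

The result could then be passed to the cleaner, for example by setting the omit-envelope property to `!EnvelopeDetector.hasHtmlEnvelope(input)`, so that a fragment stays a fragment and a full document keeps its envelope.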

           
