Hi -
Cleaning HTML works well, but I don't want these tags involved:
<html><head/>
<body>blah <p>blah</p> blah</body>
</html>
Is there any way to output this instead?
blah <p>blah</p> blah
Thank you-
Matt
Well, currently it's not so easy.
Adding the basic HTML skeleton (the HTML, HEAD and BODY tags) is done
statically during the cleanup process.
It may sound like a bad solution, but I think the best way at the moment
to remove them after cleaning is to use string manipulation functions.
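// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// DOTALL and \s* let the pattern match output that contains line breaks.
Pattern pattern = Pattern.compile(
    "<html>\\s*<head\\s*/>\\s*<body>(.*?)</body>\\s*</html>", Pattern.DOTALL);
Matcher matcher = pattern.matcher(output);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // content between <body> and </body>
}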
This should do the trick
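For anyone who wants the string approach as a reusable method, here is a minimal sketch using only java.util.regex. The class name EnvelopeStripper and the method stripEnvelope are made up for illustration and are not part of HtmlCleaner. It assumes the cleaned output has the shape shown in the question (an html element wrapping an optional head and a body), and it returns the text unchanged when that shape is not found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HtmlCleaner.
final class EnvelopeStripper {
    // Matches <html ...> [head] <body ...> CONTENT </body> </html>, tolerating
    // whitespace, attributes, and either <head/> or <head>...</head>.
    private static final Pattern ENVELOPE = Pattern.compile(
        "\\s*<html[^>]*>\\s*(?:<head\\s*/>|<head[^>]*>.*?</head>)?\\s*"
        + "<body[^>]*>(.*)</body>\\s*</html>\\s*",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the body content, or the input unchanged if no envelope matches.
    static String stripEnvelope(String cleaned) {
        Matcher m = ENVELOPE.matcher(cleaned);
        return m.matches() ? m.group(1) : cleaned;
    }

    public static void main(String[] args) {
        String cleaned = "<html><head/>\n<body>blah <p>blah</p> blah</body>\n</html>";
        System.out.println(stripEnvelope(cleaned));
    }
}
```

Using matches() instead of find() means the whole string has to be an envelope, so a fragment that merely mentions these tags somewhere in the middle is left alone.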
Support for removing the HTML envelope has now been added in version 1.2.
You can do it like this:
HtmlCleaner hc = new HtmlCleaner(str);
try {
    hc.setOmitHtmlEnvelope(true);
    hc.setOmitXmlDeclaration(true);
    hc.clean();
    return hc.getXmlAsString();
} catch (IOException e) {
    return str;
}
Hello,
This is not the solution. I'm trying version 2.16. When I set those properties with CleanerProperties, the HTML envelope is always removed, even if the input contained one. This way it is not possible to see whether the input was a complete document or only a fragment.
I can understand that it is difficult to make the HTML envelope conditional in the code. A compromise could be to turn on the "autoGenerated" property for the added envelope tags if the input didn't contain them.
Best regards,
Werner.
Hi Werner,
The envelope is recreated as part of parsing the document type and encoding, so that we create a valid output DOM, particularly for the JDOM serializer. However, I can see it being a real pain if you do need to know whether you had a document or a fragment as input. Another option would be a property that sets OmitHtmlEnvelope dynamically based on the content.
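Until something like that exists, one workaround is to inspect the raw input yourself before cleaning and choose the OmitHtmlEnvelope setting from the result. The sketch below is only a heuristic using java.util.regex. The names EnvelopeDetector and hasHtmlEnvelope are hypothetical and not part of HtmlCleaner, and a plain regex check can be fooled by an html tag that appears inside a comment or a script string.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HtmlCleaner.
final class EnvelopeDetector {
    // Looks for an opening <html> tag (with or without attributes) in the raw input.
    private static final Pattern HTML_TAG =
        Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Heuristic: true when the raw input already contains an <html> tag.
    static boolean hasHtmlEnvelope(String rawInput) {
        return HTML_TAG.matcher(rawInput).find();
    }

    public static void main(String[] args) {
        String fragment = "blah <p>blah</p> blah";
        String document = "<html><head/><body>blah</body></html>";
        // The negated result could be passed to setOmitHtmlEnvelope(...) before cleaning.
        System.out.println(hasHtmlEnvelope(fragment));
        System.out.println(hasHtmlEnvelope(document));
    }
}
```

With this, a fragment would be cleaned with the envelope omitted and a full document with the envelope kept, so the output shape follows the input shape.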