String html = "<b>One </b>Two";
TagNode node = cleaner.clean(html);
String cleanedHtml = new PrettyHtmlSerializer(props, " ").getAsString(node, "UTF-8");
will produce cleaned HTML of
<b>One</b>Two
dropping the space inside the tag.
SimpleHtmlSerializer preserves the space.
Yes, I can confirm it does this. The offending code is in PrettyHtmlSerializer.getSingleLineOfChildren where it trims the child content. I think a more appropriate algorithm in this instance is a whitespace collapse function, which turns any amount of whitespace into a single space character.
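To make the suggestion concrete, here is a minimal sketch of such a collapse function. It is not the HtmlCleaner code; the class and method names are hypothetical, and it assumes "whitespace" means the HTML whitespace characters (space, tab, line feed, form feed, carriage return).

```java
// Hypothetical helper, not part of HtmlCleaner: collapses each run of
// HTML whitespace characters into a single space instead of trimming.
final class WhitespaceCollapse {
    private WhitespaceCollapse() {}

    static String collapseWhitespace(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        boolean inWhitespace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSpace = c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
            if (isSpace) {
                // Emit one space for the whole run, then skip the rest of it.
                if (!inWhitespace) {
                    out.append(' ');
                    inWhitespace = true;
                }
            } else {
                out.append(c);
                inWhitespace = false;
            }
        }
        return out.toString();
    }
}
```

With this, the text "One " inside the b tag would stay "One " rather than becoming "One". Non-breaking spaces are left alone, and content inside elements such as pre would presumably need to bypass the function altogether.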
Wouldn't the whitespace collapsing fit more logically in the cleaner phase, and not in the renderer? The single/multiple whitespace equivalence is a property of HTML itself, and would apply however it's rendered.
Logically you'd think so; however, in the parsing rules that doesn't seem to be the case. If you put two spaces between words in a tag, then it renders in a browser (e.g. Chrome) as a single space - but if you inspect the DOM via XPath, both spaces are still there. So it's not actually part of the cleaning phase.
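The same behaviour can be checked outside a browser. The sketch below is a stand-in for the Chrome experiment, not a reproduction of it: it uses the JDK's XML parser and XPath rather than an HTML parser, and the class and method names are hypothetical. It only shows that the parsed text node keeps both spaces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical check: parse well-formed markup with the JDK XML parser
// and read the text of the first <b> element through XPath.
final class DomWhitespaceCheck {
    private DomWhitespaceCheck() {}

    static String textOfB(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            // string(//b) returns the string value of the first <b> element.
            return XPathFactory.newInstance().newXPath().evaluate("string(//b)", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query markup", e);
        }
    }
}
```

Calling textOfB("<p><b>One  Two</b></p>") returns the text with both spaces intact, which matches the observation that collapsing happens at render time and not in the parsed tree.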
... but in any case, this suggests that the best option here is for PrettyHtmlSerializer not to trim any spaces in its output.