A patch that allows decoding of entities when deserializing (i.e. "&" in the HTML file becomes "&" in the ContentNode).
The code in the patch supports named, decimal and hexadecimal entities. It is covered by tests.
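For readers who want to see what those three entity forms involve, here is a minimal sketch in Python. It is not the code from the patch: the function name `decode_entities` and the tiny `NAMED_ENTITIES` table are hypothetical, chosen only for illustration, and a real decoder would need the full named-entity list.

```python
import re

# Hypothetical, deliberately tiny table; a real decoder needs the full entity list.
NAMED_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Matches &#xHEX; (hexadecimal), &#DEC; (decimal) or &name; (named).
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace named, decimal and hexadecimal entities with their characters."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))  # hexadecimal form
            if body.startswith("#"):
                return chr(int(body[1:]))      # decimal form
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave the text untouched.
            return match.group(0)
        # Named form; unknown names are left as they were.
        return NAMED_ENTITIES.get(body, match.group(0))

    return _ENTITY_RE.sub(replace, text)


print(decode_entities("Tom &amp; Jerry &#169; &#xA9;"))  # prints: Tom & Jerry © ©
```

The sketch makes a single pass over the input, so already-decoded output is not decoded a second time, and anything it does not recognise (an unknown name, a missing semicolon) is passed through unchanged.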
One of the tests in the patch fails. It's a test for handling entities in CDATA sections. According to the specification, the entity should be left intact, but it appears that the CDATA is not being handled at all. In my application there is no CDATA at all, so it doesn't affect me, but I included the test in case somebody cares.
Huh, SF parser ate my entity. Let's try that again:
... "&amp;" becomes "&" ...
The CDATA issue is probably just that the test isn't including CDATA within a script or style tag, so the tokenizer dumps it. If I change your test to:
... it passes OK.
Thanks for the patch, Alexey - I'll add it for the next release.