HtmlCleaner / Patches / #16 Patch for deserializing entities when reading HTML

#16 Patch for deserializing entities when reading HTML

Milestone: 2.9

Status: closed-accepted

Owner: Scott Wilson

Labels: None

Priority: 5

Updated: 2014-05-19

Created: 2014-05-18

Creator: Alexey Lukashev

Private: No

A patch that allows decoding of entities when deserializing (i.e. "&amp;" in the HTML file becomes "&" in the ContentNode).

The code in the patch supports named, decimal and hexadecimal entities. It is covered by tests.
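For readers without the attachment, here is a minimal sketch of what decoding named, decimal and hexadecimal entities can look like. It is not the code from the patch: the class name `EntityDecoderSketch`, the `decode` method and the tiny named-entity table are assumptions made for illustration, and a real implementation would use the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the patch's implementation.
final class EntityDecoderSketch {
    // Deliberately tiny table of named entities (assumption: a real one is much larger).
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", (int) '&');
        NAMED.put("lt", (int) '<');
        NAMED.put("gt", (int) '>');
        NAMED.put("quot", (int) '"');
        NAMED.put("nbsp", 0x00A0);
    }

    // Replaces every recognised "&...;" sequence with its character;
    // anything unrecognised is copied through unchanged.
    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            Integer codePoint = (semi > i + 1) ? resolve(text.substring(i + 1, semi)) : null;
            if (codePoint != null) {
                out.appendCodePoint(codePoint);
                i = semi + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Maps the text between '&' and ';' to a code point, or null if it is not an entity.
    private static Integer resolve(String body) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return validate(Integer.parseInt(body.substring(2), 16)); // hexadecimal
            }
            if (body.startsWith("#")) {
                return validate(Integer.parseInt(body.substring(1), 10)); // decimal
            }
        } catch (NumberFormatException e) {
            return null; // malformed numeric reference: leave the text as-is
        }
        return NAMED.get(body); // named entity, or null if unknown
    }

    private static Integer validate(int codePoint) {
        return Character.isValidCodePoint(codePoint) ? codePoint : null;
    }
}
```

With this sketch, `EntityDecoderSketch.decode("&amp;")` yields `&`, and `decode("&#65;&#x42;")` yields `AB`. Unknown names such as `&bogus;` and a bare ampersand as in `AT&T` pass through untouched.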

One of the tests in the patch fails. It's a test for handling entities in CDATA sections. According to the specification, the entity should be left intact, but it appears that the CDATA is not being handled at all. In my application there is no CDATA at all, so it doesn't affect me, but I included the test in case somebody cares.

1 Attachment

deserialize-content-entities.patch

Discussion

Alexey Lukashev - 2014-05-18

Huh, SF parser ate my entity. Let's try that again:

... "&amp;" becomes "&" ...


Scott Wilson - 2014-05-19

The CDATA issue is probably just that the test isn't including CDATA within a script or style tag, so the tokenizer dumps it. If I change your test to:

public void testCData() { doTest("<script>" + CData.BEGIN_CDATA + "&amp;" + CData.END_CDATA + "</script>", "&amp;"); }

... it passes OK.
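
To illustrate the behaviour this test checks (an entity inside a CDATA section is left intact while entities elsewhere are decoded), here is a small stand-alone sketch. It does not show how HtmlCleaner's tokenizer works. The class `CDataAwareDecodeSketch`, its methods, and the stand-in decoder that only handles the `amp` entity are all assumptions made for illustration.

```java
// Hypothetical illustration only; not HtmlCleaner's tokenizer logic.
final class CDataAwareDecodeSketch {
    private static final String BEGIN = "<![CDATA[";
    private static final String END = "]]>";

    // Stand-in decoder (assumption): handles only the "amp" entity.
    static String decodeEntities(String s) {
        return s.replace("&amp;", "&");
    }

    // Decodes entities everywhere except inside CDATA sections,
    // which are copied through verbatim, markers included.
    static String decodeOutsideCData(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int pos = 0;
        while (pos < text.length()) {
            int start = text.indexOf(BEGIN, pos);
            if (start < 0) {
                // No further CDATA section: decode the remainder and stop.
                out.append(decodeEntities(text.substring(pos)));
                break;
            }
            out.append(decodeEntities(text.substring(pos, start)));
            int end = text.indexOf(END, start + BEGIN.length());
            // An unterminated section is treated as running to the end of the input.
            int stop = (end < 0) ? text.length() : end + END.length();
            out.append(text, start, stop);
            pos = stop;
        }
        return out.toString();
    }
}
```

For example, `decodeOutsideCData("a &amp; <![CDATA[&amp;]]> &amp;")` returns `a & <![CDATA[&amp;]]> &`: both outer entities are decoded and the one inside the CDATA section is preserved.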

Scott Wilson - 2014-05-19

Thanks for the patch, Alexey - I'll add it for the next release.


Scott Wilson - 2014-05-19

status: open --> closed-accepted

assigned_to: Scott Wilson

Group: 2.6 --> 2.9
