#16 Patch for deserializing entities when reading HTML

Milestone: 2.9
Status: closed-accepted
Labels: None
Priority: 5
Updated: 2014-05-19
Created: 2014-05-18
Private: No

A patch that allows decoding of entities when deserializing (i.e. "&" in the HTML file becomes "&" in the ContentNode).

The code in the patch supports named, decimal and hexadecimal entities. It is covered by tests.
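For readers without the attachment, the three entity forms mentioned above can be illustrated with a minimal sketch. This is not the code from the patch: the class name `EntityDecoderSketch`, the `decode` method, and the five-entry named-entity table are all hypothetical, chosen only to show the idea. Anything that does not parse as a known entity is left untouched.

```java
import java.util.Map;

// Illustrative sketch only; NOT the patch attached to this ticket.
public final class EntityDecoderSketch {

    // Hypothetical, deliberately tiny table; a real decoder needs the full HTML entity list.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    // Replaces named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) entities in text.
    public static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Only look for a terminating ';' when we are standing on '&'.
            int end = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String decoded = (end > i + 1) ? decodeBody(text.substring(i + 1, end)) : null;
            if (decoded != null) {
                out.append(decoded);
                i = end + 1;
            } else {
                // Not a recognisable entity: copy the character through unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Decodes the text between '&' and ';'; returns null if it is not a known entity.
    private static String decodeBody(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        int radix = hex ? 16 : 10;
        String digits = body.substring(hex ? 2 : 1);
        // Reject empty references and signed values such as "&#-5;".
        if (digits.isEmpty() || Character.digit(digits.charAt(0), radix) < 0) {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With this sketch, `EntityDecoderSketch.decode("&amp;")`, `decode("&#38;")` and `decode("&#x26;")` all yield a single ampersand, while an unknown name such as `&bogus;` passes through as written. It makes no attempt to treat CDATA sections specially.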

One of the tests in the patch fails. It's a test for handling entities in CDATA sections. According to the specification, the entity should be left intact, but it appears that the CDATA is not being handled at all. In my application there is no CDATA at all, so it doesn't affect me, but I included the test in case somebody cares.

1 Attachments

Discussion

  • Alexey Lukashev

    Alexey Lukashev - 2014-05-18

    Huh, SF parser ate my entity. Let's try that again:

    ... "&amp;" becomes "&" ...

     
  • Scott Wilson

    Scott Wilson - 2014-05-19

    The CDATA issue is probably just that the test isn't including CDATA within a script or style tag, so the tokenizer dumps it. If I change your test to:

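      // Wrap the CDATA section in <script> so the tokenizer keeps it;
      // the entity inside CDATA is expected to stay undecoded.
      public void testCData() {
        doTest("<script>"+CData.BEGIN_CDATA + "&amp;" + CData.END_CDATA+"</script>", "&amp;");
      }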
    

    ... it passes OK.

     
  • Scott Wilson

    Scott Wilson - 2014-05-19

    Thanks for the patch, Alexey - I'll add it for the next release.

     
  • Scott Wilson

    Scott Wilson - 2014-05-19
    • status: open --> closed-accepted
    • assigned_to: Scott Wilson
    • Group: 2.6 --> 2.9
     
