It seems that duplicate </body> tags are added at the end when parsing the URL below.
It also adds a strange /]]/ before the duplicate scripts.
m.news24.com/news24/World/News/passenger-describes-india-train-derailment-over-100-dead-20161120
Hmm, there are two main issues here.
One is the conditional processing rules in comments before the head tag. It's not really standard HTML, so HC just moves the comments into the BODY. As there are two HTML tags in the document...
The next is the handling of one of the scripts, which looks odd. I'll focus on that as there is clearly something wrong here.
OK, I think the problem here is that the CDATA section doesn't have an end token. So the CDATA is assumed to be everything up to the end of the doc. I've added a check for that - if there's a start CDATA with no end, we wind back to the start.
Fixed in 2.19
Shouldn't you close the CDATA once the end of its wrapping tag is reached?
What happens now is that if the CDATA start token has no corresponding end token, I terminate the section immediately after the start token. Previously it continued to the end of the document before terminating, which is why the output from that page looked so strange.
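To illustrate the difference, here is a minimal sketch of the two behaviours in Python. It is not HtmlCleaner's actual code (HC is a Java library); the function name `cdata_span` and the `fixed` flag are made up for this example, and the sample markup is a simplified stand-in for what the page contains.

```python
from typing import Tuple

CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def cdata_span(doc: str, start: int, fixed: bool = True) -> Tuple[int, int]:
    """Return (content_begin, content_end) for the CDATA section at `start`.

    `start` must be the index of a CDATA start token in `doc`.
    `fixed` is a hypothetical switch between the old and new behaviour.
    """
    content_begin = start + len(CDATA_START)
    end = doc.find(CDATA_END, content_begin)
    if end != -1:
        # Well-formed section: content runs up to the end token.
        return content_begin, end
    if fixed:
        # No end token: terminate immediately after the start token,
        # so the section is empty and normal parsing resumes right there.
        return content_begin, content_begin
    # Old behaviour: the section swallows everything to the end of the document.
    return content_begin, len(doc)


# Simplified stand-in for a script with an unterminated CDATA section.
doc = "<script>//<![CDATA[\nvar a = 1;\n</script></body></html>"
start = doc.find(CDATA_START)

old_begin, old_end = cdata_span(doc, start, fixed=False)
new_begin, new_end = cdata_span(doc, start, fixed=True)

print(repr(doc[old_begin:old_end]))  # old: closing tags end up inside the CDATA
print(repr(doc[new_begin:new_end]))  # new: empty section, closing tags parsed normally
```

With the old behaviour the closing </script>, </body> and </html> are treated as CDATA text, so the cleaner has to synthesise its own closing tags, which would explain the duplicated output.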