User Activity

  • Posted a comment on discussion Help on HtmlCleaner

    Thanks, Scott. Happy to hear the good news. PS: The patch was actually created by one of my colleagues, Roland Stiller, based on what legrass did, so please credit it to them. (He has no SourceForge account, which is why I posted it.)

  • Posted a comment on discussion Help on HtmlCleaner

    Hi Scott, here is the patch we created to address the entity translation issue in DomSerializer. We were considering writing a custom serializer, but in the end we decided to try this minimal patch. (The properties RecognizeUnicodeChars and TranslateSpecialEntities are taken into account.) What do you think, and what else do you need from us to get this feature into one of the upcoming releases?

  • Modified a comment on ticket #185 on HtmlCleaner

    Hi Scott, I made a terrible mistake: I previously attached a version I had been experimenting with. Sorry about that. I have now attached the original, but to be safe, you can also find it online: http://www.jpost.com/Breaking-News/Putin-congratulates-Frances-Macron-urging-for-united-efforts-amid-terror-threats-490105 Here is the offending CDATA section, whose closing tag lacks the '>': //<![CDATA[ (function(){ var b,c=window.deployads_ab_pct=10;b=Math.random()>c/100;var f=location.search.match(/[?&]deployads-ab=([^&]+)/);f&&2===f.length&&(b="pub"===f[1]);...

  • Posted a comment on ticket #185 on HtmlCleaner

    Hi Scott, the exception has disappeared. Thanks for the fix. However, could you have a look at this page (attached)? As far as I can see, an unclosed CDATA section at the beginning of the document makes HtmlCleaner remove almost all valuable content. I'm using the DomSerializer output of HtmlCleaner v2.21. Is there a way to make CDATA parsing a bit more intuitive so that, in such an unbalanced case, it stops where it "should"?

  • Posted a comment on discussion Help on HtmlCleaner

    Yes, it would be very helpful for us as well, because we ran into the same or a very similar situation. We use HtmlCleaner to "normalize" and clean HTML pages so that XPath expressions can be executed against them to get the plain-text equivalent of certain parts of the document. After reading through this thread, I have been considering writing our own serializer too. I think extracting the node-type-specific branches into dedicated methods would help a lot to avoid unnecessary code duplication....

Personal Data

Username:
pgerzson
Joined:
2001-03-03 01:39:38
Location:
Hungary / CEST
Gender:
Male

Projects

  • No projects to display.
