User Activity

  • Posted a comment on discussion Help on Jericho HTML Parser

    Hi Ethan, Thank you for the suggestion. Yes, I already got a request for this last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is that I implemented a new major feature a few years ago (a web crawler API), but it remains poorly documented and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...

  • Modified ticket #95 on Jericho HTML Parser

    Typo in StreamEncodingDetector.isDifinitive

  • Posted a comment on ticket #95 on Jericho HTML Parser

    Haha thanks!

  • Posted a comment on ticket #93 on Jericho HTML Parser

    Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin

  • Posted a comment on discussion Help on Jericho HTML Parser

    P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv=&quot;Content-Type&quot; content=&quot;html; charset=UTF-8&quot;> </head> </html>

  • Posted a comment on discussion Help on Jericho HTML Parser

    Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding, so the parser is handling it correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...

  • Posted a comment on discussion Help on Jericho HTML Parser

    Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA, etc.) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...

  • Posted a comment on ticket #94 on Jericho HTML Parser

    No, you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but that may never even happen if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() on the...


Personal Data

Username:
mjericho
Joined:
2002-01-06 08:10:46

Projects

This is a list of open source software projects that Martin Jericho is associated with:
