Activity for Martin Jericho

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Andy. Thanks for reporting the issue. I see what you mean about sourceforge. I just noticed they removed all of the documentation from my project's website a couple of months ago without notification. I just fixed that. But no I don't have any intention of moving the project to github at this point in time. Firstly, you might like to try using the latest DEV version 3.5. There have been a few improvements and bug fixes to the Renderer class. You can download it here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...

  • Martin Jericho Martin Jericho created ticket #26188

    Page Broken (500): https://sourceforge.net/p/jerichohtml/discussion/350025/moderate/save_moderation

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Ethan, Thank you for the suggestion. Yes I got a request for this already last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...

  • Martin Jericho Martin Jericho modified ticket #95

    Typo in StreamEncodingDetector.isDifinitive

  • Martin Jericho Martin Jericho posted a comment on ticket #95

    Haha thanks!

  • Martin Jericho Martin Jericho posted a comment on ticket #93

    Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv=&quot;Content-Type&quot; content=&quot;html; charset=UTF-8&quot;> </head> </html>

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding and is therefore parsed correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...

  • Martin Jericho Martin Jericho posted a comment on ticket #94

    No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the...
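
    The idea of locating server tags first, so that their contents cannot confuse the later search for HTML tags, can be sketched without the library. This is only an illustration using the standard library; it is not how Jericho implements ignoreWhenParsing(), and the class name, method names and the simple <% ... %> and href patterns are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ServerTagMasking {
    // Assumed server tag syntax for this sketch: <% ... %>
    private static final Pattern SERVER_TAG = Pattern.compile("<%.*?%>", Pattern.DOTALL);
    // Deliberately naive attribute matcher: stops at the first double quote.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    /** Returns a copy of the text with every server tag overwritten by spaces (same length). */
    static String maskServerTags(String text) {
        StringBuilder masked = new StringBuilder(text);
        Matcher m = SERVER_TAG.matcher(text);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                masked.setCharAt(i, ' ');
            }
        }
        return masked.toString();
    }

    /** Finds the first href value: positions come from the masked text, content from the original. */
    static String firstHref(String text) {
        Matcher m = HREF.matcher(maskServerTags(text));
        return m.find() ? text.substring(m.start(1), m.end(1)) : null;
    }
}
```

    Because the masked copy has the same length as the original, offsets found in it can be used directly on the original text, so the quotes inside the server tag no longer end the attribute value early.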

  • Martin Jericho Martin Jericho modified ticket #93

    please support java modules in the next release

  • Martin Jericho Martin Jericho posted a comment on ticket #93

    Hi Samael, Thanks for raising this issue. Sorry I'm not very familiar with the Java modules system so I'd appreciate it if you could give me some advice and assistance. I'm still targeting Java 1.7 to maximise compatibility with older programs. In order to support modules I'd have to compile targeting Java 9, right? I'm not sure whether that would mean the library doesn't work with projects targeting Java 8, which I believe is still very common. Do you know whether that would be an issue? Have you already...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Andrew, The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change. Cheers Martin

  • Martin Jericho Martin Jericho modified a comment on ticket #94

    Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...

  • Martin Jericho Martin Jericho modified ticket #94

    Incorrectly parsing attribute values containing tags containing quotes

  • Martin Jericho Martin Jericho posted a comment on ticket #94

    Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet. The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years. The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute...

  • Martin Jericho Martin Jericho modified ticket #22

    Renderer Always Adds Brackets Around Links

  • Martin Jericho Martin Jericho posted a comment on ticket #22

    Hi Ryan, You can customise this by overriding the renderHyperlinkURL method. An example is provided in the documentation: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html#renderHyperlinkURL(net.htmlparser.jericho.StartTag) If you only want to remove the brackets if the URL is the same as the element contents, you can check for startTag.getAttributeValue("href").equals(startTag.getElement().getTextExtractor().toString()) or if you want it a bit more efficient and disregard...

  • Martin Jericho Martin Jericho modified ticket #92

    Query parameter names in hyperlinks being incorrectly decoded

  • Martin Jericho Martin Jericho posted a comment on ticket #92

    I just confirmed that character references are decoded inside link elements, at least in Chrome. But the character reference in your example is not terminated, meaning it is missing the final semicolon. When this parser was first written, most browsers still decoded unterminated character references, but each browser behaved differently. So I created a CompatibilityMode class to encapsulate the decoding behaviour of unterminated character references. To configure the parser not to decode any unterminated...
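
    The difference between decoding only terminated character references and also decoding unterminated ones can be shown with a small standard-library sketch. It does not use or reproduce Jericho's CompatibilityMode class; the class name, the tiny entity table and the boolean switch are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefDecoding {
    // Tiny illustrative entity table; a real decoder knows many more names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("copy", "\u00A9");
    }

    private static final Pattern STRICT = Pattern.compile("&(\\w+);");   // semicolon required
    private static final Pattern LENIENT = Pattern.compile("&(\\w+);?"); // semicolon optional

    /** Decodes known character references; unterminated ones only when requested. */
    static String decode(String text, boolean decodeUnterminated) {
        Matcher m = (decodeUnterminated ? LENIENT : STRICT).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            // Unknown names are copied through exactly as written.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

    With a URL such as page?a=1&copy=2, the lenient mode turns the query parameter name copy into a copyright sign, while the strict mode leaves the URL untouched because the reference has no final semicolon.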

  • Martin Jericho Martin Jericho posted a comment on ticket #92

    Hi Remi, I'm not aware that HTML character references shouldn't be decoded inside link elements. Do you have a source for that? Cheers Martin

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    Hi Daniel, In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method. Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered. I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler,...

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    Hi Daniel, It is very uncommon for CSS to style a hyperlink to add whitespace around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. Therefore the behaviour of the Renderer class is correct and your end user probably does want to know about the lack of a space between the two words when running the spell checker in your example above. Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks....

  • Martin Jericho Martin Jericho created ticket #364

    Computer hangs while entering password at boot time

  • Martin Jericho Martin Jericho modified ticket #90

    Renderer class picks out content from within a script tag

  • Martin Jericho Martin Jericho posted a comment on ticket #90

    Thanks for the bug report! Fixed in version 3.5. Although the parser was already designed to ignore other tags inside SCRIPT elements, there was a bug triggered by the presence of server tags inside the script element. In your example it was the <%- data.price %> tag causing the problem. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been 5 years since the last official release, version...

  • Martin Jericho Martin Jericho modified ticket #89

    ArrayIndexOutOfBoundsException from Renderer: negative left margin.

  • Martin Jericho Martin Jericho modified ticket #88

    HTML5 parsing problems - links without quotes

  • Martin Jericho Martin Jericho posted a comment on ticket #89

    Thanks for the bug report! Fixed in version 3.5. Negative margins and padding are now treated as zero margin. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been almost 5 years since the last official release, version 3.5 includes a new major feature that requires significant time to document, and I don't envisage having spare time in the foreseeable future. So the official 3.5 release...

  • Martin Jericho Martin Jericho modified a comment on discussion Help

    If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
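
    The stack approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the library: tags are given as plain strings with end tags prefixed by "/", the set of elements with optional end tags is a hypothetical and incomplete sample, and void elements such as BR are not handled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TagBalanceChecker {
    // Hypothetical, incomplete sample of elements whose end tag may be omitted.
    private static final Set<String> OPTIONAL_END_TAG =
            new HashSet<>(Arrays.asList("li", "p", "td", "tr", "option"));

    /** Tags are names like "div" (start tag) or "/div" (end tag), in document order. */
    static boolean isBalanced(List<String> tags) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            if (!tag.startsWith("/")) {
                open.push(tag); // start tag: remember it
                continue;
            }
            String name = tag.substring(1);
            // Discard unclosed start tags whose end tag is optional.
            while (!open.isEmpty() && !open.peek().equals(name)
                    && OPTIONAL_END_TAG.contains(open.peek())) {
                open.pop();
            }
            // The tag now on top of the stack must match this end tag.
            if (open.isEmpty() || !open.pop().equals(name)) {
                return false;
            }
        }
        // Anything still open must be allowed to omit its end tag.
        return OPTIONAL_END_TAG.containsAll(open);
    }
}
```

    For example, the sequence ul, li, li, /ul is accepted because LI may omit its end tag, whereas div, span, /div is rejected because SPAN may not.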

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Jiří, The Source Formatter is a tool for formatting valid HTML with indentation, not for fixing broken HTML. If you need to fix broken HTML you'll need to try a library that specialises in that task, such as HtmlCleaner https://sourceforge.net/projects/htmlcleaner/ (I haven't used it myself, but it's under active development) Cheers, Martin

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, I'm not sure why it's not working for you, but I suspect it will work if you download the Java JDK instead of the Java JRE. If that doesn't work, download the latest development version of the parser from the link below: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Then open a command prompt in the folder samples/console and enter the following command: Encoding.bat arabic-test-file.html > out.txt (replace arabic-test-file.html with the full pathname of your Arabic HTML...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, According to this page: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html the problem is that you have a "European languages" version of Java installed on your computer, which doesn't support the Windows-1256 encoding. This seems to be the default if the Java installer "recognizes that the host operating system only supports European languages". To fix the problem, you could either install the international version (apparently no need to download again, just...
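
    Whether a particular Java installation supports a given encoding can be checked directly with the standard java.nio.charset API. The following is a small diagnostic sketch; the class and method names are made up for the example, and the result for windows-1256 depends on the installed Java runtime.

```java
import java.nio.charset.Charset;

final class CharsetCheck {
    /** Returns true if this JVM can handle the named charset. */
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown when the name is not a syntactically legal charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("windows-1256 supported: " + isAvailable("windows-1256"));
    }
}
```

    If this prints false, the installed runtime lacks the extended charsets and the document cannot be decoded as Windows-1256 on that machine.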

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, Could you please attach the HTML file that isn't doing what you expect? Cheers, Martin

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. Still probably months rather than weeks away though, I don't have a lot of spare time.

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    Fixed in version 3.5. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years. There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Your example code didn't make it into the post, but try something like this:

        OutputDocument outputDocument=new OutputDocument(source);
        outputDocument.remove(source.getAllElements(HTMLElementName.STYLE));
        for (StartTag startTag : source.getAllStartTags("style",null)) {
            // iterate all tags with a style attribute
            outputDocument.remove(startTag.getAttributes().get("style"));
        }
        String newHTML=outputDocument.toString();

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Yes use the Reader constructor for either Source or StreamedSource. For Source you can also pass a String to the constructor that accepts a CharSequence argument. The Source class always generates a String containing the entire document anyway so there's no advantage using the Reader constructor instead.

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Greg, Thanks for using the library and for taking an interest in hosting it on github. I'm curious as to why you thought it necessary to put it on github just to use it in an android app. Why not just use the jar file or the public maven repository? http://repo1.maven.org/maven2/net/htmlparser/jericho/jericho-html/3.4/ I haven't done any android development for a while but I believe it's still possible to specify a jar file as a dependency in an android project. If there is no real need for the...

  • Martin Jericho Martin Jericho posted a comment on ticket #87

    The program works fine for me. The likely reason it is "hanging" when you run it...

  • Martin Jericho Martin Jericho modified ticket #87

    getParentElement hangs

  • Martin Jericho Martin Jericho modified ticket #86

    &nbsp; in link results in space (0x20) rather than no break space (0xC2 0xA0)

  • Martin Jericho Martin Jericho posted a comment on ticket #86

    There is a static configuration variable to control this behaviour: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Config.html#ConvertNonBreakingSpaces...

  • Martin Jericho Martin Jericho posted a comment on ticket #85

    This has been fixed in version 3.5. Until version 3.5 is officially released, the...

  • Martin Jericho Martin Jericho modified ticket #85

    html comments not ignored inside style tags

  • Martin Jericho Martin Jericho modified ticket #85

    html comments not ignored inside style tags

  • Martin Jericho Martin Jericho posted a comment on ticket #85

    Hi Code Buddy, I think you're probably right, the parser should be ignoring the HTML...

  • Martin Jericho Martin Jericho posted a comment on ticket #84

    Hi Guislain, I have (hopefully) fixed this in version 3.5. Until version 3.5 is officially...

  • Martin Jericho Martin Jericho modified ticket #84

    Tag and StartTag static init deadlock under heavy load

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    I've added the requested functionality to version 3.5. Until version 3.5 is officially...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Marco. This wouldn't be a simple tweak to the blockquote element handler, it would...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    No problem. I ran into the same issue myself!

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    This bug will be fixed in version 3.5. Until version 3.5 is officially released,...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Thanks for reporting the problem Damian. Could you please be a bit more specific...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Dear CNTLM users, It appears that the author of this project David Kubíček is no...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Dan, Sorry for the delay responding, I've been on holidays! The normal approach...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Maybe log output goes somewhere else by default on the mac. Probably not an encoding...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    See the documentation of the OutputDocument class http://jericho.htmlparser.net/...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    To avoid the error use the approach I already mentioned so that it doesn't attempt...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    sourceText.substring(startTag.getEnd(),endTag.getBegin())

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Sorry when you said your ${...} tags require context I didn't realise you have other...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    You don't need to do the expression substitution before the tag parsing. Do all parsing...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    The concept of a document hierarchy is fundamentally problematic when you include...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi David, Yes allowing tags with start characters other than "<" would severely impact...

  • Martin Jericho Martin Jericho posted a comment on ticket #84

    Thanks for the bug report Guislain. I sometimes wonder how the jvm figures out these...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    You can create a java 1.6 compatible jar easily by changing the target argument in...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Now released to maven. http://repo1.maven.org/maven2/net/htmlparser/jericho/jeri...

  • Martin Jericho Martin Jericho created a blog post

    Jericho HTML Parser 3.4 released

  • Martin Jericho Martin Jericho modified ticket #80

    Jericho throws a "position discarded" exception with unlimited buffer

  • Martin Jericho Martin Jericho posted a comment on ticket #80

    This bug has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho posted a comment on ticket #72

    This issue has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho modified ticket #72

    Performance issues in handling attributes

  • Martin Jericho Martin Jericho modified ticket #71

    setHRLineLength(0) for Rendered still display "-" should be ""

  • Martin Jericho Martin Jericho posted a comment on ticket #71

    This bug has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho modified ticket #68

    Unmatched <script> tag eats the rest of the markup in the document

  • Martin Jericho Martin Jericho modified ticket #64

    newLogger() does not cache its value, resulting in contention in multithreaded environments

  • Martin Jericho Martin Jericho posted a comment on ticket #64

    This bug has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho modified ticket #63

    The parser uses exceptions for normal operations

  • Martin Jericho Martin Jericho posted a comment on ticket #63

    This bug has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho modified ticket #62

    CharSequence StreamedSource should not require finalization

  • Martin Jericho Martin Jericho posted a comment on ticket #62

    This bug has now been closed - Version 3.4 released.

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Jim, Yes it's probably about time I did an official release. I'll see if I can...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    The explanation is already in the comment. If tag.name() equals one of the strings...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Sandro, You would have to do something like this: Element sourceElement = jspSource.getFirstElement("id",...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    I don't understand why you can't build an index externally rather than internally....

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Just search by tag name first, then check the resulting list for the attributes you...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Jianjin, You are welcome to put the latest code onto github, but you would have...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Sandro, The StreamedSource API doesn't provide a means of finding a matching end...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Nathan, You might want to look at the Element.getChildElements() method instead...

  • Martin Jericho Martin Jericho modified ticket #83

    Handling lang attributes

  • Martin Jericho Martin Jericho posted a comment on ticket #83

    Hi Quang, I suspect you're confusing language with character encoding. Your file...

  • Martin Jericho Martin Jericho modified ticket #82

    proxy connection fail

  • Martin Jericho Martin Jericho posted a comment on ticket #82

    Hi Alex, I'd suggest you use your own code to manage connections and load the content,...
