mjericho Activity

Activity for Martin Jericho

1 year ago
Martin Jericho posted a comment on discussion Help

Hi Andy. Thanks for reporting the issue. I see what you mean about sourceforge. I just noticed they removed all of the documentation from my project's website a couple of months ago without notification. I just fixed that. But no I don't have any intention of moving the project to github at this point in time. Firstly, you might like to try using the latest DEV version 3.5. There have been a few improvements and bug fixes to the Renderer class. You can download it here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...
1 year ago
Martin Jericho created ticket #26188

Page Broken (500): https://sourceforge.net/p/jerichohtml/discussion/350025/moderate/save_moderation
2 years ago
Martin Jericho posted a comment on discussion Help

Hi Ethan, Thank you for the suggestion. Yes I got a request for this already last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...
3 years ago
Martin Jericho modified ticket #95

Typo in StreamEncodingDetector.isDifinitive
3 years ago
Martin Jericho posted a comment on ticket #95

Haha thanks!
3 years ago
Martin Jericho posted a comment on ticket #93

Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin
3 years ago
Martin Jericho posted a comment on discussion Help

P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv="Content-Type" content="html; charset=UTF-8"> </head> </html>
3 years ago
Martin Jericho posted a comment on discussion Help

Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding and is therefore parsed correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...
3 years ago
Martin Jericho posted a comment on discussion Help

Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...
4 years ago
Martin Jericho posted a comment on ticket #94

No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the...
4 years ago
Martin Jericho modified ticket #93

please support java modules in the next release
4 years ago
Martin Jericho posted a comment on ticket #93

Hi Samael, Thanks for raising this issue. Sorry I'm not very familiar with the java modules system so I'd appreciate if you could give me some advice and assistance. I'm still targeting java 1.7 to maximise compatibility with older programs. In order to support modules I'd have to compile targeting java 9, right? I'm not sure whether that would mean the library doesn't work with projects targeting Java 8, which I believe is still very common. Do you know whether that would be an issue? Have you already...
4 years ago
Martin Jericho posted a comment on discussion Open Discussion

Hi Andrew, The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change. Cheers Martin
4 years ago
Martin Jericho modified a comment on ticket #94

Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...
4 years ago
Martin Jericho modified ticket #94

Incorrectly parsing attribute values containing tags containing quotes
4 years ago
Martin Jericho posted a comment on ticket #94

Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...
4 years ago
Martin Jericho posted a comment on discussion Open Discussion

Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet. The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years. The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute...
5 years ago
Martin Jericho modified ticket #22

Renderer Always Adds Brackets Around Links
5 years ago
Martin Jericho posted a comment on ticket #22

Hi Ryan, You can customise this by overriding the renderHyperlinkURL method. An example is provided in the documentation: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html#renderHyperlinkURL(net.htmlparser.jericho.StartTag) If you only want to remove the brackets if the URL is the same as the element contents, you can check for startTag.getAttributeValue("href").equals(startTag.getElement().getTextExtractor().toString()) or if you want it a bit more efficient and disregard...
5 years ago
Martin Jericho modified ticket #92

Query parameter names in hyperlinks being incorrectly decoded
5 years ago
Martin Jericho posted a comment on ticket #92

I just confirmed that character references are decoded inside link elements, at least in Chrome. But the character reference in your example is not terminated, meaning it is missing the final semicolon. When this parser was first written, most browsers still decoded unterminated character references, but each browser behaved differently. So I created a CompatibilityMode class to encapsulate the decoding behaviour of unterminated character references. To configure the parser not to decode any unterminated...
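To illustrate the difference between terminated and unterminated character references described above, here is a minimal standalone sketch. It is not the library's implementation and does not use the CompatibilityMode class: the class name `CharRefDemo`, the `decode` method, the `decodeUnterminated` flag, and the tiny entity table are all hypothetical, chosen only to show why a query parameter such as `&copy=2` can be altered when unterminated references are decoded.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class CharRefDemo {
    // Deliberately tiny entity table (assumption: enough for the illustration).
    private static final Map<String, String> ENTITIES = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "copy", "\u00A9", "nbsp", "\u00A0");

    // '&', a run of letters, then an optional terminating semicolon.
    private static final Pattern REF = Pattern.compile("&([A-Za-z]+)(;?)");

    // Decodes known references; unterminated ones only if the flag is set.
    static String decode(String text, boolean decodeUnterminated) {
        Matcher m = REF.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            boolean terminated = !m.group(2).isEmpty();
            if (replacement == null || (!terminated && !decodeUnterminated)) {
                replacement = m.group(); // leave the original text untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String href = "page?a=1&copy=2";
        System.out.println(decode(href, true));  // "&copy" is decoded
        System.out.println(decode(href, false)); // href is left as written
    }
}
```

With the flag off, only references that end in a semicolon are decoded, so the query string survives unchanged. Encoding the ampersand as `&amp;` in the HTML avoids the ambiguity altogether.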
5 years ago
Martin Jericho posted a comment on ticket #92

Hi Remi, I'm not aware that HTML character references shouldn't be decoded inside link elements. Do you have a source for that? Cheers Martin
5 years ago
Martin Jericho posted a comment on ticket #91

The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.
5 years ago
Martin Jericho posted a comment on ticket #91

Hi Daniel, In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method. Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered. I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler,...
5 years ago
Martin Jericho posted a comment on ticket #91

Hi Daniel, It is very uncommon for CSS to style a hyperlink to add whitespace around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. Therefore the behaviour of the Renderer class is correct and your end user probably does want to know about the lack of a space between the two words when running the spell checker in your example above. Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks....
5 years ago
Martin Jericho created ticket #364

Computer hangs while entering password at boot time
5 years ago
Martin Jericho modified ticket #90

Renderer class picks out content from within a script tag
5 years ago
Martin Jericho posted a comment on ticket #90

Thanks for the bug report! Fixed in version 3.5. Although the parser was already designed to ignore other tags inside SCRIPT elements, there was a bug triggered by the presence of server tags inside the script element. In your example it was the <%- data.price %> tag causing the problem. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been 5 years since the last official release, version...
5 years ago
Martin Jericho modified ticket #89

ArrayIndexOutOfBoundsException from Renderer: negative left margin.
5 years ago
Martin Jericho modified ticket #88

HTML5 parsing problems - links without quotes
6 years ago
Martin Jericho posted a comment on ticket #89

Thanks for the bug report! Fixed in version 3.5. Negative margins and padding are now treated as zero margin. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been almost 5 years since the last official release, version 3.5 includes a new major feature that requires significant time to document, and I don't envisage having spare time in the foreseeable future. So the official 3.5 release...
6 years ago
Martin Jericho modified a comment on discussion Help

If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
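The stack approach described above can be sketched roughly as follows. This is a simplified standalone illustration rather than code from the library: the class name `TagBalanceChecker`, the regex-based tag scanner, and the two element-name sets are assumptions made for the example. A real implementation would iterate over the tags found by a proper parser, and the regex here ignores comments, script content, and `>` characters inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class TagBalanceChecker {
    // Simplified tag scanner (assumption): group 1 is "/" for end tags, group 2 is the name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Elements that never have an end tag, so they are never pushed.
    private static final Set<String> VOID = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr");

    // Illustrative subset of elements whose end tag is optional in HTML.
    private static final Set<String> OPTIONAL_END = Set.of(
            "li", "p", "dt", "dd", "tr", "td", "th", "thead", "tbody",
            "tfoot", "option", "optgroup", "colgroup");

    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (VOID.contains(name)) continue;
            if (!isEndTag) {
                open.push(name);
                continue;
            }
            // End tag: skip unclosed elements on top whose end tag is optional.
            while (!open.isEmpty() && !open.peek().equals(name)
                    && OPTIONAL_END.contains(open.peek())) {
                open.pop();
            }
            if (open.isEmpty() || !open.pop().equals(name)) return false;
        }
        // Anything still open must have an optional end tag.
        for (String name : open) {
            if (!OPTIONAL_END.contains(name)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("<ul><li>a<li>b</ul>"));   // li end tags are optional
        System.out.println(isBalanced("<div><span>text</div>")); // span is never closed
    }
}
```

As the comment says, this only catches mismatched tags. It says nothing about the many other ways HTML can be broken, which is where an existing clean-up library is the better choice.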
6 years ago
Martin Jericho posted a comment on discussion Help

If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
6 years ago
Martin Jericho posted a comment on discussion Help

Hi Jiří, The Source Formatter is a tool for formatting valid HTML with indentation, not for fixing broken HTML. If you need to fix broken HTML you'll need to try a library that specialises in that task, such as HtmlCleaner https://sourceforge.net/projects/htmlcleaner/ (I haven't used it myself, but it's under active development) Cheers, Martin
7 years ago
Martin Jericho posted a comment on discussion Help

Hi Wise Mike, I'm not sure why it's not working for you, but I suspect it will work if you download the Java JDK instead of the Java JRE. If that doesn't work, download the latest development version of the parser from the link below: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Then open a command prompt in the folder samples/console and enter the following command: Encoding.bat arabic-test-file.html > out.txt (replace arabic-test-file.html with the full pathname of your arabic HTML...
7 years ago
Martin Jericho posted a comment on discussion Help

Hi Wise Mike, According to this page: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html the problem is that you have a "European languages" version of Java installed on your computer, which doesn't support the Windows-1256 encoding. This seems to be the default if the Java installer "recognizes that the host operating system only supports European languages". To fix the problem, you could either install the international version (apparently no need to download again, just...
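A quick way to confirm whether a particular Java installation supports a given encoding is to ask the runtime directly. The following is a minimal sketch using only the standard `java.nio.charset.Charset` API. The class name `CharsetCheck` and the helper `isCharsetAvailable` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class for checking charset availability.
class CharsetCheck {
    // Returns true if this JVM can handle the named charset.
    // Syntactically illegal names are reported as unavailable instead of throwing.
    static boolean isCharsetAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the encoding discussed above; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "windows-1256";
        System.out.println(name + " supported: " + isCharsetAvailable(name));
    }
}
```

If this prints `false` for windows-1256, the installed Java runtime lacks that charset, which would be consistent with the diagnosis above.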
7 years ago
Martin Jericho posted a comment on discussion Help

Hi Wise Mike, Could you please attach the HTML file that isn't doing what you expect? Cheers, Martin
7 years ago
Martin Jericho posted a comment on ticket #88

I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. Still probably months rather than weeks away though, I don't have a lot of spare time.
8 years ago
Martin Jericho posted a comment on ticket #88

You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.
8 years ago
Martin Jericho posted a comment on ticket #88

Fixed in version 3.5. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip
8 years ago
Martin Jericho posted a comment on ticket #88

Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years. There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.
8 years ago
Martin Jericho posted a comment on discussion Open Discussion

Your example code didn't make it into the post, but try something like this: OutputDocument outputDocument=new OutputDocument(source); outputDocument.remove(source.getAllElements(HTMLElementName.STYLE)); for (StartTag startTag : source.getAllStartTags("style",null)) { // iterate all tags with a style attribute outputDocument.remove(startTag.getAttributes().get("style")); } String newHTML=outputDocument.toString();
8 years ago
Martin Jericho posted a comment on discussion Help

Yes use the Reader constructor for either Source or StreamedSource. For Source you can also pass a String to the constructor that accepts a CharSequence argument. The Source class always generates a String containing the entire document anyway so there's no advantage using the Reader constructor instead.
8 years ago
Martin Jericho posted a comment on discussion Open Discussion

Hi Greg, Thanks for using the library and for taking an interest in hosting it on github. I'm curious as to why you thought it necessary to put it on github just to use it in an android app. Why not just use the jar file or the public maven repository? http://repo1.maven.org/maven2/net/htmlparser/jericho/jericho-html/3.4/ I haven't done any android development for a while but I believe it's still possible to specify a jar file as a dependency in an android project. If there is no real need for the...
9 years ago
Martin Jericho posted a comment on ticket #87

The program works fine for me. The likely reason it is "hanging" when you run it...
9 years ago
Martin Jericho modified ticket #87

getParentElement hangs
9 years ago
Martin Jericho modified ticket #86

&nbsp; in link results in space (0x20) rather than no break space (0xC2 0xA0)
9 years ago
Martin Jericho posted a comment on ticket #86

There is a static configuration variable to control this behaviour: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Config.html#ConvertNonBreakingSpaces...
9 years ago
Martin Jericho posted a comment on ticket #85

This has been fixed in version 3.5. Until version 3.5 is officially released, the...
9 years ago
Martin Jericho modified ticket #85

html comments not ignored inside style tags
9 years ago
Martin Jericho modified ticket #85

html comments not ignored inside style tags
9 years ago
Martin Jericho posted a comment on ticket #85

Hi Code Buddy, I think you're probably right, the parser should be ignoring the HTML...
9 years ago
Martin Jericho posted a comment on ticket #84

Hi Guislain, I have (hopefully) fixed this in version 3.5. Until version 3.5 is officially...
9 years ago
Martin Jericho modified ticket #84

Tag and StartTag static init deadlock under heavy load
9 years ago
Martin Jericho posted a comment on discussion Help

I've added the requested functionality to version 3.5. Until version 3.5 is officially...
9 years ago
Martin Jericho posted a comment on discussion Help

Hi Marco. This wouldn't be a simple tweak to the blockquote element handler, it would...
9 years ago
Martin Jericho posted a comment on discussion Help

No problem. I ran into the same issue myself!
9 years ago
Martin Jericho posted a comment on discussion Help

This bug will be fixed in version 3.5. Until version 3.5 is officially released,...
9 years ago
Martin Jericho posted a comment on discussion Help

Thanks for reporting the problem Damian. Could you please be a bit more specific...
10 years ago
Martin Jericho posted a comment on discussion Help

Dear CNTLM users, It appears that the author of this project David Kubíček is no...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

Hi Dan, Sorry for the delay responding, I've been on holidays! The normal approach...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

Maybe log output goes somewhere else by default on the mac. Probably not an encoding...
10 years ago
Martin Jericho posted a comment on discussion Help

See the documentation of the OutputDocument class http://jericho.htmlparser.net/...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

To avoid the error use the approach I already mentioned so that it doesn't attempt...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

sourceText.substring(startTag.getEnd(),endTag.getBegin())
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

Sorry when you said your ${...} tags require context I didn't realise you have other...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

You don't need to do the expression substitution before the tag parsing. Do all parsing...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

The concept of a document hierarchy is fundamentally problematic when you include...
10 years ago
Martin Jericho posted a comment on discussion Open Discussion

Hi David, Yes allowing tags with start characters other than "<" would severely impact...
1 decade ago
Martin Jericho posted a comment on ticket #84

Thanks for the bug report Guislain. I sometimes wonder how the jvm figures out these...
1 decade ago
Martin Jericho posted a comment on discussion Help

You can create a java 1.6 compatible jar easily by changing the target argument in...
1 decade ago
Martin Jericho posted a comment on discussion Help

Now released to maven. http://repo1.maven.org/maven2/net/htmlparser/jericho/jeri...
1 decade ago
Martin Jericho created a blog post

Jericho HTML Parser 3.4 released
1 decade ago
Martin Jericho modified ticket #80

Jericho throws a "position discarded" exception with unlimited buffer
1 decade ago
Martin Jericho posted a comment on ticket #80

This bug has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho posted a comment on ticket #72

This issue has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho modified ticket #72

Performance issues in handling attributes
1 decade ago
Martin Jericho modified ticket #71

setHRLineLength(0) for Rendered still display "-" should be ""
1 decade ago
Martin Jericho posted a comment on ticket #71

This bug has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho modified ticket #68

Unmatched <script> tag eats the rest of the markup in the document
1 decade ago
Martin Jericho modified ticket #64

newLogger() does not cache its value, resulting in contention in multithreaded environments
1 decade ago
Martin Jericho posted a comment on ticket #64

This bug has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho modified ticket #63

The parser uses exceptions for normal operations
1 decade ago
Martin Jericho posted a comment on ticket #63

This bug has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho modified ticket #62

CharSequence StreamedSource should not require finalization
1 decade ago
Martin Jericho posted a comment on ticket #62

This bug has now been closed - Version 3.4 released.
1 decade ago
Martin Jericho posted a comment on discussion Help

Hi Jim, Yes it's probably about time I did an official release. I'll see if I can...
1 decade ago
Martin Jericho posted a comment on discussion Help

The explanation is already in the comment. If tag.name() equals one of the strings...
1 decade ago
Martin Jericho posted a comment on discussion Help

Hi Sandro, You would have to do something like this: Element sourceElement = jspSource.getFirstElement("id",...
1 decade ago
Martin Jericho posted a comment on discussion Help

I don't understand why you can't build an index externally rather than internally....
1 decade ago
Martin Jericho posted a comment on discussion Help

Just search by tag name first, then check the resulting list for the attributes you...
1 decade ago
Martin Jericho posted a comment on discussion Help

Hi Jianjin, You are welcome to put the latest code onto github, but you would have...
1 decade ago
Martin Jericho posted a comment on discussion Help

Hi Sandro, The StreamedSource API doesn't provide a means of finding a matching end...
1 decade ago
Martin Jericho posted a comment on discussion Open Discussion

Hi Nathan, You might want to look at the Element.getChildElements() method instead...
1 decade ago
Martin Jericho modified ticket #83

Handling lang attributes
1 decade ago
Martin Jericho posted a comment on ticket #83

Hi Quang, I suspect you're confusing language with character encoding. Your file...
1 decade ago
Martin Jericho modified ticket #82

proxy connection fail
1 decade ago
Martin Jericho posted a comment on ticket #82

Hi Alex, I'd suggest you use your own code to manage connections and load the content,...
