Hi Andy. Thanks for reporting the issue. I see what you mean about sourceforge. I just noticed they removed all of the documentation from my project's website a couple of months ago without notification. I just fixed that. But no I don't have any intention of moving the project to github at this point in time. Firstly, you might like to try using the latest DEV version 3.5. There have been a few improvements and bug fixes to the Renderer class. You can download it here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...
Page Broken (500): https://sourceforge.net/p/jerichohtml/discussion/350025/moderate/save_moderation
Hi Ethan, Thank you for the suggestion. Yes I got a request for this already last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...
Typo in StreamEncodingDetector.isDifinitive
Haha thanks!
Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin
P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv="Content-Type" content="html; charset=UTF-8"> </head> </html>
Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding and is therefore parsed correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...
Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...
No, you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but a full sequential parse may never even be performed if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() on the...
please support java modules in the next release
Hi Samael, Thanks for raising this issue. Sorry I'm not very familiar with the java modules system so I'd appreciate if you could give me some advice and assistance. I'm still targeting java 1.7 to maximise compatibility with older programs. In order to support modules I'd have to compile targeting java 9, right? I'm not sure whether that would mean the library doesn't work with projects targeting Java 8, which I believe is still very common. Do you know whether that would be an issue? Have you already...
Hi Andrew, The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change. Cheers Martin
Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...
Incorrectly parsing attribute values containing tags containing quotes
Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet. The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years. The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute...
Renderer Always Adds Brackets Around Links
Hi Ryan, You can customise this by overriding the renderHyperlinkURL method. An example is provided in the documentation: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html#renderHyperlinkURL(net.htmlparser.jericho.StartTag) If you only want to remove the brackets if the URL is the same as the element contents, you can check for startTag.getAttributeValue("href").equals(startTag.getElement().getTextExtractor().toString()) or if you want it a bit more efficient and disregard...
Query parameter names in hyperlinks being incorrectly decoded
I just confirmed that character references are decoded inside link elements, at least in Chrome. But the character reference in your example is not terminated, meaning it is missing the final semicolon. When this parser was first written, most browsers still decoded unterminated character references, but each browser behaved differently. So I created a CompatibilityMode class to encapsulate the decoding behaviour of unterminated character references. To configure the parser not to decode any unterminated...
Hi Remi, I'm not aware that HTML character references shouldn't be decoded inside link elements. Do you have a source for that? Cheers Martin
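To make the terminated versus unterminated distinction in this thread concrete, here is a minimal sketch in plain Java. It is not the parser's CompatibilityMode implementation and does not use the Jericho API; the class name CharRefDemo, the tiny entity table and the decodeUnterminated flag are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the Jericho library.
class CharRefDemo {
    // Deliberately tiny entity table; a real decoder knows many more names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("copy", "\u00A9");
    }

    // Matches "&name" followed by an optional terminating semicolon.
    private static final Pattern REF = Pattern.compile("&([A-Za-z]+)(;?)");

    // Decodes known references; unterminated ones only if decodeUnterminated is true.
    static String decode(String text, boolean decodeUnterminated) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            boolean terminated = !m.group(2).isEmpty();
            if (replacement == null || (!terminated && !decodeUnterminated)) {
                replacement = m.group(); // leave the text exactly as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, a URL such as page?id=1&copy=2 is left untouched when decodeUnterminated is false, but the &copy part is turned into a copyright sign when it is true, which is the kind of surprise a query parameter name can cause.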
The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.
Hi Daniel, In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method. Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered. I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler,...
Hi Daniel, It is very uncommon for CSS to style a hyperlink to add whitespace around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. Therefore the behaviour of the Renderer class is correct and your end user probably does want to know about the lack of a space between the two words when running the spell checker in your example above. Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks....
Computer hangs while entering password at boot time
Renderer class picks out content from within a script tag
Thanks for the bug report! Fixed in version 3.5. Although the parser was already designed to ignore other tags inside SCRIPT elements, there was a bug triggered by the presence of server tags inside the script element. In your example it was the <%- data.price %> tag causing the problem. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been 5 years since the last official release, version...
ArrayIndexOutOfBoundsException from Renderer: negative left margin.
HTML5 parsing problems - links without quotes
Thanks for the bug report! Fixed in version 3.5. Negative margins and padding are now treated as zero margin. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been almost 5 years since the last official release, version 3.5 includes a new major feature that requires significant time to document, and I don't envisage having spare time in the foreseeable future. So the official 3.5 release...
If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
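As a rough illustration of the stack approach described above, here is a minimal sketch in plain Java. It does not use the Jericho API: the class name TagBalanceChecker, the simplified regex tag recogniser and the two partial element lists are assumptions made for the example, and it ignores comments, scripts and ">" characters inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the Jericho library.
class TagBalanceChecker {
    // Simplified tag recogniser: group 1 is "/" for end tags, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // Elements that never have an end tag (partial list, assumed for the example).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));
    // Elements whose end tag may be omitted (partial list, assumed for the example).
    private static final Set<String> OPTIONAL_END =
            new HashSet<>(Arrays.asList("p", "li", "td", "th", "tr", "option"));

    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean isEndTag = !m.group(1).isEmpty();
            if (!isEndTag) {
                // Start tag: remember it unless the element can never be closed.
                if (!VOID.contains(name)) open.push(name);
                continue;
            }
            // End tag: skip over unclosed elements whose end tag is optional.
            while (!open.isEmpty() && !open.peek().equals(name)
                    && OPTIONAL_END.contains(open.peek())) {
                open.pop();
            }
            // The element on top of the stack must now match the end tag.
            if (open.isEmpty() || !open.pop().equals(name)) return false;
        }
        // Anything still open at the end must be allowed to omit its end tag.
        while (!open.isEmpty()) {
            if (!OPTIONAL_END.contains(open.pop())) return false;
        }
        return true;
    }
}
```

For example, TagBalanceChecker.isBalanced("<ul><li>a<li>b</ul>") accepts the omitted LI end tags, while a DIV closed before its inner SPAN is rejected. A check like this only covers mismatched tags, which is why a dedicated library is still the better choice for anything more.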
Hi Jiří, The Source Formatter is a tool for formatting valid HTML with indentation, not for fixing broken HTML. If you need to fix broken HTML you'll need to try a library that specialises in that task, such as HtmlCleaner https://sourceforge.net/projects/htmlcleaner/ (I haven't used it myself, but it's under active development) Cheers, Martin
Hi Wise Mike, I'm not sure why it's not working for you, but I suspect it will work if you download the Java JDK instead of the Java JRE. If that doesn't work, download the latest development version of the parser from the link below: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Then open a command prompt in the folder samples/console and enter the following command: Encoding.bat arabic-test-file.html > out.txt (replace arabic-test-file.html with the full pathname of your arabic HTML...
Hi Wise Mike, According to this page: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html the problem is that you have a "European languages" version of Java installed on your computer, which doesn't support the Windows-1256 encoding. This seems to be the default if the Java installer "recognizes that the host operating system only supports European languages". To fix the problem, you could either install the international version (apparently no need to download again, just...
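One way to check whether a particular Java installation supports an encoding such as Windows-1256 is to ask the runtime directly. The following is a minimal sketch using only the standard java.nio.charset.Charset class; the class name CharsetCheck and the default argument are assumptions made for the example.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of the Jericho library.
class CharsetCheck {
    // Returns true if the running JVM can encode and decode the named charset.
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "windows-1256";
        System.out.println(name + " supported: " + isAvailable(name));
    }
}
```

If this reports that windows-1256 is not supported, the installed runtime is the likely cause rather than the parser.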
Hi Wise Mike, Could you please attach the HTML file that isn't doing what you expect? Cheers, Martin
I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. Still probably months rather than weeks away though, I don't have a lot of spare time.
You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.
Fixed in version 3.5. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip
Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years. There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.
Your example code didn't make it into the post, but try something like this:

    OutputDocument outputDocument = new OutputDocument(source);
    outputDocument.remove(source.getAllElements(HTMLElementName.STYLE));
    // iterate all tags with a style attribute
    for (StartTag startTag : source.getAllStartTags("style", null)) {
        outputDocument.remove(startTag.getAttributes().get("style"));
    }
    String newHTML = outputDocument.toString();
Yes use the Reader constructor for either Source or StreamedSource. For Source you can also pass a String to the constructor that accepts a CharSequence argument. The Source class always generates a String containing the entire document anyway so there's no advantage using the Reader constructor instead.
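To illustrate why holding the document in a String loses nothing compared with passing a Reader, here is a minimal sketch of draining a Reader into a String with the standard library. It is not the Source class's actual implementation; the class name ReaderToString and the buffer size are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the Jericho library.
class ReaderToString {
    // Reads every character from the reader and returns the whole text.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192];
        try {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Once the whole document has to end up in memory as a String anyway, supplying that String directly is just as good as supplying a Reader.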
Hi Greg, Thanks for using the library and for taking an interest in hosting it on github. I'm curious as to why you thought it necessary to put it on github just to use it in an android app. Why not just use the jar file or the public maven repository? http://repo1.maven.org/maven2/net/htmlparser/jericho/jericho-html/3.4/ I haven't done any android development for a while but I believe it's still possible to specify a jar file as a dependency in an android project. If there is no real need for the...
The program works fine for me. The likely reason it is "hanging" when you run it...
getParentElement hangs
&nbsp; in link results in space (0x20) rather than no-break space (0xC2 0xA0)
There is a static configuration variable to control this behaviour: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Config.html#ConvertNonBreakingSpaces...
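The byte values quoted in the ticket title can be reproduced with a few lines of plain Java. This is a minimal sketch using only the standard library, not the Jericho API; the class name NbspBytes and both helper methods are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Jericho library.
class NbspBytes {
    // Formats the UTF-8 bytes of a string as space-separated hex values.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Replaces every no-break space (U+00A0) with an ordinary space (U+0020).
    static String toPlainSpaces(String s) {
        return s.replace('\u00A0', ' ');
    }
}
```

Here hex("\u00A0") gives 0xC2 0xA0 and hex(" ") gives 0x20, so a conversion like toPlainSpaces accounts for the difference reported in the ticket.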
This has been fixed in version 3.5. Until version 3.5 is officially released, the...
html comments not ignored inside style tags
Hi Code Buddy, I think you're probably right, the parser should be ignoring the HTML...
Hi Guislain, I have (hopefully) fixed this in version 3.5. Until version 3.5 is officially...
Tag and StartTag static init deadlock under heavy load
I've added the requested functionality to version 3.5. Until version 3.5 is officially...
Hi Marco. This wouldn't be a simple tweak to the blockquote element handler, it would...
No problem. I ran into the same issue myself!
This bug will be fixed in version 3.5. Until version 3.5 is officially released,...
Thanks for reporting the problem Damian. Could you please be a bit more specific...
Dear CNTLM users, It appears that the author of this project David Kubíček is no...
Hi Dan, Sorry for the delay responding, I've been on holidays! The normal approach...
Maybe log output goes somewhere else by default on the mac. Probably not an encoding...
See the documentation of the OutputDocument class http://jericho.htmlparser.net/...
To avoid the error use the approach I already mentioned so that it doesn't attempt...
sourceText.substring(startTag.getEnd(),endTag.getBegin())
Sorry when you said your ${...} tags require context I didn't realise you have other...
You don't need to do the expression substitution before the tag parsing. Do all parsing...
The concept of a document hierarchy is fundamentally problematic when you include...
Hi David, Yes allowing tags with start characters other than "<" would severely impact...
Thanks for the bug report Guislain. I sometimes wonder how the jvm figures out these...
You can create a java 1.6 compatible jar easily by changing the target argument in...
Now released to maven. http://repo1.maven.org/maven2/net/htmlparser/jericho/jeri...
Jericho HTML Parser 3.4 released
Jericho throws a "position discarded" exception with unlimited buffer
This bug has now been closed - Version 3.4 released.
This issue has now been closed - Version 3.4 released.
Performance issues in handling attributes
setHRLineLength(0) for Renderer still displays "-", should be ""
This bug has now been closed - Version 3.4 released.
Unmatched <script> tag eats the rest of the markup in the document
newLogger() does not cache its value, resulting in contention in multithreaded environments
This bug has now been closed - Version 3.4 released.
The parser uses exceptions for normal operations
This bug has now been closed - Version 3.4 released.
CharSequence StreamedSource should not require finalization
This bug has now been closed - Version 3.4 released.
Hi Jim, Yes it's probably about time I did an official release. I'll see if I can...
The explanation is already in the comment. If tag.name() equals one of the strings...
Hi Sandro, You would have to do something like this: Element sourceElement = jspSource.getFirstElement("id",...
I don't understand why you can't build an index externally rather than internally....
Just search by tag name first, then check the resulting list for the attributes you...
Hi Jianjin, You are welcome to put the latest code onto github, but you would have...
Hi Sandro, The StreamedSource API doesn't provide a means of finding a matching end...
Hi Nathan, You might want to look at the Element.getChildElements() method instead...
Handling lang attributes
Hi Quang, I suspect you're confusing language with character encoding. Your file...
proxy connection fail
Hi Alex, I'd suggest you use your own code to manage connections and load the content,...