Hi,
I tried the latest version (1.6-20060527). I know that an issue (tracker id 1457371) is solved in this version concerning quotes and script tags. Great! However, now I'm facing another script issue. The following piece of html code:
<script>
navbar = "</A><A>";
document.write("This line of code is parsed as text which is not part of the script tag");
</script>
The problem is that the second line of the script is parsed as text, which is not part of the script tag. I'm using the parser to parse html to plain text and this way javascript code will be part of the plain text.
I realize that the unescaped forward slash in the </A> tag is invalid inside a script; it should have a backslash in front of it (<\/A>). Still, it would be nice if the parser were tolerant of invalid code.
Does anybody know how I can avoid this behaviour, or is this a bug?
Regards,
Ramon
As mentioned in the bug report:
The default for ScriptScanner.STRICT was set to true. If you want the older, more lax, script parsing, set it to false with code like:
org.htmlparser.scanners.ScriptScanner.STRICT = false;
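To illustrate why the two settings split the example differently, here is a rough, self-contained sketch of the two strategies. It is not the htmlparser implementation; the class and method names (ScriptEndSketch, strictEnd, laxEnd) are made up for this post. It assumes that strict mode follows the HTML 4.01 rule that script content ends at the first "</" followed by a letter, and that lax mode looks for "</script" outside quoted strings.

```java
// Simplified illustration only; NOT the htmlparser implementation.
// All names here are hypothetical.
final class ScriptEndSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Strict: content ends at the first "</" followed by an ASCII letter. */
    static int strictEnd(String s) {
        for (int i = 0; i + 2 < s.length(); i++) {
            if (s.charAt(i) == '<' && s.charAt(i + 1) == '/'
                    && isAsciiLetter(s.charAt(i + 2))) {
                return i;
            }
        }
        return s.length();
    }

    /** Lax: content ends at the first "</script" found outside a quoted string. */
    static int laxEnd(String s) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (s.regionMatches(true, i, "</script", 0, 8)) {
                return i;
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        // Text that follows the opening <script> tag in the example above.
        String body = "\nnavbar = \"</A><A>\";\ndocument.write(\"hello\");\n</script>";
        System.out.println("strict script text: [" + body.substring(0, strictEnd(body)) + "]");
        System.out.println("lax script text:    [" + body.substring(0, laxEnd(body)) + "]");
    }
}
```

With the strict rule the script text stops right before </A>, so the document.write line ends up outside the script. The quote-aware rule keeps both lines together, but it can be fooled by unbalanced quotes (an apostrophe in a comment, for instance).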
So I guess it's not possible to handle both situations in a way that the text belonging to the script is detected as a whole.
Too bad. I'm very pleased using the parser to extract content from html documents. I'm filtering the script and stylesheet tags, because no useful content will be in there. Is there a way to ignore the complete text between script/style tags?
Ramon
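One possible workaround, sketched here outside the parser itself, is to strip script and style blocks from the raw HTML before handing it to the parser. This is only a minimal sketch using java.util.regex; the class name ScriptStyleStripper and the method strip are hypothetical and not part of the htmlparser API.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter; not part of the htmlparser library.
final class ScriptStyleStripper {

    // (?is): case-insensitive, and '.' also matches line breaks.
    // The reluctant ".*?" stops at the first closing tag whose name
    // matches the opening tag (back-reference \1).
    private static final Pattern BLOCK =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    /** Returns the HTML with all script and style elements removed. */
    static String strip(String html) {
        return BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><script>\nnavbar = \"</A><A>\";\n"
                + "document.write(\"x\");\n</script><p>Bye</p>";
        System.out.println(strip(html));
    }
}
```

Because this only looks for the matching closing tag, the </A> inside the string does not end the block early. It will still cut a script short if the script text itself contains a literal </script>.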
This has been requested for more than two years. See RFE #886862, "parse ecmascript":
http://sourceforge.net/tracker/index.php?func=detail&aid=886862&group_id=24399&atid=381402
It's a tough problem because there is just so much rubbish out there in the wild. Unless the program is capable of saying "that's Javascript" or "that doesn't look like Javascript" it's almost impossible to come up with a correct parse in all cases. I mean, the javascript in a page doesn't have to run or even be correct, really.
That being said, most browsers do a better job of handling the errant code than the parser does, and my guess is this is because they are actually trying to interpret the javascript. This is something that would be nice to have if it didn't slow down the parser too much.
I can imagine this is a tough problem. For now I'll choose the STRICT = true variant. In that case all relevant text will be parsed. The unwanted javascript code I will just have to put up with. Anyway, thanks for your replies and the hard work you invested in this project.
Ramon