Test Case
Logged In: YES
user_id=605407
Thanks for the test case.
I'm not sure what the correct behaviour is though.
Script parsing is 'quote-smart', meaning it balances quotes.
In the case of your example:
<HTML><SCRIPT LANGUAGE="Javascript">//'</SCRIPT><BODY>
</BODY></HTML>
there is no closing quote.
We have test cases like:
<SCRIPT>document.write(\"</script>\");</SCRIPT>
that need to ignore the first </script>, and they do this by
balancing the quotes around the tag.
Can you think of a rule that would allow correct parsing of
both cases?
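To make the conflict concrete, here is a minimal sketch of what quote balancing amounts to. It is not the actual HTMLParser code; the class name NaiveQuoteBalancer and the method findScriptEnd are hypothetical and exist only for illustration. It ignores any </script that appears between a pair of quotes, which handles the document.write case but never finds the end of the script when a comment contains a lone apostrophe.

```java
public class NaiveQuoteBalancer {

    /**
     * Hypothetical illustration of quote balancing: returns the index of the
     * first "</script" (case-insensitive) that is not inside quotes, or -1.
     */
    public static int findScriptEnd(String text, int start) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                // Inside quotes: only the matching quote character matters.
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '\'' || c == '"') {
                quote = c;
            } else if (text.regionMatches(true, i, "</script", 0, 8)) {
                return i;
            }
        }
        return -1; // quotes never balanced, so no closing tag was accepted
    }
}
```

With this sketch the document.write example resolves to the second closing tag, while the //' example returns -1 because the apostrophe in the comment opens a "string" that is never closed.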
Logged In: YES
user_id=962414
I think you need to parse comments inside the SCRIPT tag, because I
had this JavaScript comment, which doesn't work in your parser:
<SCRIPT>
// It's problem
</SCRIPT>
This comment works in a browser, but not in HTMLParser.
I also found a problem with JavaScript code like this:
<SCRIPT>
var x='text with one apostrophe \' '
</SCRIPT>
I think you will need to ignore the sequence \' inside the SCRIPT tag.
If you have other questions, please ask me.
Jozef
P.S.: Maybe it is better not to parse the content of the SCRIPT
tag as quote-smart.
Logged In: YES
user_id=962414
Hmmm, the bug report system uses HTML in messages, so I am also
sending my comment as an attachment.
Logged In: YES
user_id=605407
was: Bug: One apostrophe in Javascript comment
The problem of handling quotes and tags embedded in
<SCRIPT> tags has come up over and over again. This
needs to be resolved satisfactorily, once and for all.
I propose adding code in the ScriptScanner to drop down into
an <A
HREF="http://www.ecma-international.org/publications/files/ecma-st/Ecma-262.pdf">ECMAScript</A>
parser and actually read the script to determine where the
lexer/parser should resume.
For htmllexer.jar this can be handled by a simple parser
that understands single-line (//) and multi-line (/* */) ECMAScript
comments, plus backslash escapes inside single- and double-quoted strings.
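A rough sketch of such a simple parser might look like the following. This is only an illustration under stated assumptions, not existing HTMLParser code: the class ScriptEndFinder and its methods are hypothetical names. It assumes that quotes inside comments are not string delimiters, that a closing tag inside a comment still ends the script (so the //'</SCRIPT> case works), that a closing tag inside a string is ignored, and that an unescaped newline ends an unterminated string. Regular expression literals are not handled.

```java
public class ScriptEndFinder {

    private enum State { CODE, SINGLE_QUOTED, DOUBLE_QUOTED, LINE_COMMENT, BLOCK_COMMENT }

    /**
     * Hypothetical helper: returns the index of the "</script" that closes
     * the script whose body begins at 'start', or -1 if none is found.
     */
    public static int findScriptEnd(String text, int start) {
        State state = State.CODE;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case CODE:
                    if (isCloseTag(text, i)) return i;
                    if (c == '\'') state = State.SINGLE_QUOTED;
                    else if (c == '"') state = State.DOUBLE_QUOTED;
                    else if (text.startsWith("//", i)) { state = State.LINE_COMMENT; i++; }
                    else if (text.startsWith("/*", i)) { state = State.BLOCK_COMMENT; i++; }
                    break;
                case SINGLE_QUOTED:
                case DOUBLE_QUOTED:
                    if (c == '\\') {
                        i++; // skip the escaped character, e.g. \' or \"
                    } else if (c == '\n') {
                        state = State.CODE; // assumption: strings do not span lines
                    } else if (c == (state == State.SINGLE_QUOTED ? '\'' : '"')) {
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    // Quotes are ignored here, but the closing tag is not.
                    if (isCloseTag(text, i)) return i;
                    if (c == '\n') state = State.CODE;
                    break;
                case BLOCK_COMMENT:
                    if (isCloseTag(text, i)) return i;
                    if (text.startsWith("*/", i)) { state = State.CODE; i++; }
                    break;
            }
        }
        return -1;
    }

    // Case-insensitive match of "</script" at position i.
    private static boolean isCloseTag(String text, int i) {
        return text.regionMatches(true, i, "</script", 0, 8);
    }
}
```

Under these assumptions, all four snippets from this thread (the //' comment, the document.write call, the "It's problem" comment, and the escaped apostrophe) resolve to the final closing tag.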
Consideration should be given to adding another type of
node, a 'CodeNode', so programs can differentiate between
StringNodes containing text the user would see in a browser
and script.
A full parser using <A HREF="http://antlr.org/">Antlr</A> or
<A HREF="https://javacc.dev.java.net/">JavaCC</A> can be
integrated into htmlparser.jar, to provide full script control.
A 'free' ECMAScript grammar for JavaCC is <A
HREF="http://www.lugrin.ch/fesi/index.html">FESI</A>.
An apparently aborted attempt to create a SableCC grammar
for ECMAScript is <A
HREF="http://sourceforge.net/projects/scriptonite/">Scriptonite</A>.
Logged In: YES
user_id=639492
Skipping over everything inside script comments would
probably solve it.