Re: [Htmlparser-user] parsing bug?
From: Derrick O. <der...@ro...> - 2007-10-26 02:21:06
You might want to try setting the ScriptScanner STRICT member to false:

public class ScriptScanner
    extends
        CompositeTagScanner
{
    /**
     * Strict parsing of CDATA flag.
     * If this flag is set true, the parsing of script is performed without
     * regard to quotes. This means that erroneous script such as:
     * <pre>
     * document.write("</script>");
     * </pre>
     * will be parsed in strict accordance with appendix
     * B.3.2 Specifying non-HTML data of the
     * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
     * hence will be split into two or more nodes. Correct javascript would
     * escape the ETAGO:
     * <pre>
     * document.write("<\/script>");
     * </pre>
     * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
     * whether it is quoted or not. If false, balanced quotes (either single or
     * double) will shield an ETAGO. Because of the possibility of quotes within
     * single or multiline comments, these are also parsed. In most cases,
     * users prefer non-strict handling since there is so much broken script
     * out in the wild.
     */
    public static boolean STRICT = true;


----- Original Message ----
From: Subramanya Sastry <sastry...@cs...>
To: htmlparser user list <htm...@li...e.net>
Sent: Thursday, October 25, 2007 2:00:56 PM
Subject: [Htmlparser-user] parsing bug?


I am writing to check if what I am observing is a parsing bug, and if
so, if there are any known workarounds.

When javascript is being parsed, at the start, my NodeVisitor's visitTag
method gets called. As expected, all starting html tags within the
javascript itself are being ignored since they are part of the
javascript and not the HTML. However, at the first closing tag that is
encountered within the javascript code (even within strings), the parser
triggers a close script tag and calls my visitor's visitEndTag method.

For example with "</textarea>" string within the javascript, I get 2
successive calls to visitEndTag, first with the SCRIPT tag and next with
the TEXTAREA tag. Or, with a "</he" + "ad>" string within javascript, I
get 2 successive calls to visitEndTag, first with the SCRIPT tag and
next with the TEXTAREA tag.

As an example, test the 2 urls:

1. http://youtube.com/?v=qf4tdOKKWic
2. http://www.deccanherald.com/Content/Oct202007/city2007102031594.asp

Any leads to fix this are much appreciated.

Thanks,
Subbu.
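
To make the difference between the two STRICT settings concrete, here is a minimal, self-contained sketch of the rule the Javadoc in the reply describes: in strict mode the script content ends at the first ETAGO ("</"), while in non-strict mode an ETAGO inside balanced quotes is shielded. This is not HTMLParser's actual implementation. The class EtagoScanDemo and the method findScriptEnd are hypothetical names chosen for illustration, and the sketch deliberately ignores comments, which the real scanner is said to parse as well.

```java
// Illustrative only: a simplified model of strict vs. non-strict ETAGO
// detection. Not the HTMLParser code; comments in the script are not handled.
class EtagoScanDemo {

    /**
     * Returns the index of the "</" that ends the script content,
     * or -1 if no such ETAGO is found.
     *
     * @param text   script content followed by the rest of the page
     * @param strict if true, quotes are ignored; if false, an ETAGO inside
     *               a single- or double-quoted string is skipped
     */
    static int findScriptEnd(String text, boolean strict) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;       // skip the escaped character inside the string
                } else if (c == quote) {
                    quote = 0; // closing quote ends the shielded region
                }
                continue;
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;     // opening quote starts a shielded region
                continue;
            }
            if (c == '<' && i + 1 < text.length() && text.charAt(i + 1) == '/') {
                return i;      // first unshielded ETAGO
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "document.write(\"</textarea>\");</script>";
        System.out.println(findScriptEnd(page, true));  // prints 16
        System.out.println(findScriptEnd(page, false)); // prints 30
    }
}
```

With the input from the original message, strict mode stops at the quoted "</textarea>" (index 16), which matches the reported behaviour of the script ending early, while non-strict mode skips the string and stops at the real "</script>" (index 30). With the library itself, the switch suggested in the reply is the static field shown in the quoted source, set before parsing: ScriptScanner.STRICT = false;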