Re: [Htmlparser-user] parsing bug?
From: Derrick O. <der...@ro...> - 2007-10-26 02:21:06
You might want to try setting the ScriptScanner STRICT member to false:

public class ScriptScanner
    extends
        CompositeTagScanner
{
    /**
     * Strict parsing of CDATA flag.
     * If this flag is set true, the parsing of script is performed without
     * regard to quotes. This means that erroneous script such as:
     * <pre>
     * document.write("</script>");
     * </pre>
     * will be parsed in strict accordance with appendix
     * B.3.2 Specifying non-HTML data of the
     * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
     * hence will be split into two or more nodes. Correct javascript would
     * escape the ETAGO:
     * <pre>
     * document.write("<\/script>");
     * </pre>
     * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
     * whether it is quoted or not. If false, balanced quotes (either single or
     * double) will shield an ETAGO. Because of the possibility of quotes within
     * single or multiline comments, these are also parsed. In most cases,
     * users prefer non-strict handling since there is so much broken script
     * out in the wild.
     */
    public static boolean STRICT = true;


----- Original Message ----
From: Subramanya Sastry <sastry...@cs...>
To: htmlparser user list <htm...@li...urceforge.net>
Sent: Thursday, October 25, 2007 2:00:56 PM
Subject: [Htmlparser-user] parsing bug?


I am writing to check if what I am observing is a parsing bug, and if
so, if there are any known workarounds.

When javascript is being parsed, at the start, my NodeVisitor's visitTag
method gets called. As expected, all starting html tags within the
javascript itself are being ignored since they are part of the
javascript and not the HTML. However, at the first closing tag that is
encountered within the javascript code (even within strings), the parser
triggers a close script tag and my visitor's visitEndTag method
gets called.

For example with "</textarea>" string within the javascript, I get 2
successive calls to visitEndTag, first with the SCRIPT tag and next with
the TEXTAREA tag. Or, with a "</he" + "ad>" string within javascript, I
get 2 successive calls to visitEndTag, first with the SCRIPT tag and
next with the TEXTAREA tag.

As an example, test the 2 urls:

1. http://youtube.com/?v=qf4tdOKKWic
2. http://www.deccanherald.com/Content/Oct202007/city2007102031594.asp

Any leads to fix this are much appreciated.

Thanks,
Subbu.
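
To make the difference between the two modes concrete, here is a small self-contained sketch of the scanning rule the STRICT Javadoc describes. It is not the ScriptScanner implementation: the class EtagoScanSketch and the method findScriptEnd are hypothetical names made up for illustration, and comment handling is deliberately left out. In a real program the reply's suggestion presumably amounts to assigning ScriptScanner.STRICT = false before parsing, since the field is declared public static.

```java
// Hypothetical illustration only; this is NOT code from the htmlparser library.
class EtagoScanSketch {

    /**
     * Returns the index of the "</" (ETAGO) that ends the script CDATA,
     * or -1 if none is found.
     * strict = true:  the first "</" wins, whether it is quoted or not.
     * strict = false: a "</" inside balanced single or double quotes is skipped.
     * Simplification (assumption): script comments are not parsed here.
     */
    static int findScriptEnd(String cdata, boolean strict) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < cdata.length(); i++) {
            char c = cdata.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;        // skip the escaped character
                } else if (c == quote) {
                    quote = 0;  // matching closing quote
                }
                continue;       // everything inside quotes is shielded
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;      // opening quote
            } else if (c == '<' && i + 1 < cdata.length()
                    && cdata.charAt(i + 1) == '/') {
                return i;       // ETAGO outside any string (or strict mode)
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String script = "document.write(\"</textarea>\");</script>";
        System.out.println("strict:     " + findScriptEnd(script, true));
        System.out.println("non-strict: " + findScriptEnd(script, false));
    }
}
```

With the sample string, the strict scan stops at the quoted "</textarea>", which matches the early visitEndTag call reported in the original message, while the non-strict scan skips it and stops at the real "</script>".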