RE: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagS
From: Marc N. <ma...@ke...> - 2003-05-27 22:55:27
Sure, I'll see if I can fix it.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
or remarks.
I guess the text scanner goes until it sees a <x... and then stops to
defer to a tag scanner. I hadn't thought about those in comments, or
about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the
remark scanner should be OK), so that it does the correct behaviour when
balance_quotes is true. Then the 'balance_quotes' flag could be called
'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the
>fact that its contents were not parsed as HTML. I'm still seeing cases
>where tags inside of <script> are recognised as "HTML" and modified
>(i.e. turned into uppercase, auto-closed, etc). For example, if there
>is an HTML tag in a Javascript comment. Also, using "\" to concatenate
>lines (which is valid in Javascript) is totally messed up now when I try
>to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested.
>What I think I'm going to do, though, is make a new scanner class that
>does what the old ScriptScanner did. That is, do a bare-bones "leave
>everything inside that tag as-is" parse of the HTML, searching only for
>the end tag with no knowledge of quotes or anything. I think there are
>cases where Javascript is written such that any modification at all will
>break it.
>
>I'll send a note to the list when this class is done (today sometime).
>I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>    CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>      private Set tagEnderSet;
>      private Set endTagEnderSet;
>+     private boolean balance_quotes;
>
>      public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>--- 126,130 ----
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>***************
>*** 131,138 ****
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!     /**
>!      * Constructor specifying all member fields.
>!      * @param filter A string that is used to match which tags are to be allowed
>!      * to pass through. This can be useful when one wishes to dynamically filter
>!      * out all tags except one type which may be programmed later than the parser.
>!      * @param nameOfTagToMatch The tag names recognized by this scanner.
>!      * @param tagEnders The non-endtag tag names which signal that no closing
>!      * end tag was found. For example, encountering <FORM> while
>!      * scanning a <A> link tag would mean that no </A> was found
>!      * and needs to be corrected.
>!      * @param endTagEnders The endtag names which signal that no closing end
>!      * tag was found. For example, encountering </HTML> while
>!      * scanning a <BODY> tag would mean that no </BODY> was found
>!      * and needs to be corrected. These items are not prefixed by a '/'.
>!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!      * allowed within this tag. Used to determine when an endtag is missing.
>!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>!      * within quotes.
>!      */
>!     public CompositeTagScanner(
>!         String filter,
>!         String [] nameOfTagToMatch,
>!         String [] tagEnders,
>!         String [] endTagEnders,
>!         boolean allowSelfChildren,
>!         boolean balance_quotes) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>          return helper.scan();
>      }
>--- 179,183 ----
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>          return helper.scan();
>      }
>***************
>*** 193,196 ****
>          return false;
>      }
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         super("",MATCH_NAME,ENDERS);
>      }
>
>      public ScriptScanner(String filter) {
>!         super(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>!         super(filter,nameOfTagToMatch,ENDERS);
>      }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         this("");
>      }
>
>      public ScriptScanner(String filter) {
>!         this(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>      }
>!
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>!     }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>***************
>*** 70,205 ****
>          return new ScriptTag(tagData,compositeTagData);
>      }
>-
>-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>-         throws ParserException {
>-         try {
>-             int startLine = reader.getLastLineNumber();
>-             String line = null;
>-             StringBuffer scriptContents =
>-                 new StringBuffer();
>-             boolean endTagFound = false;
>-             Tag startTag = tag;
>-             Tag endTag = null;
>-             line = currLine;
>-             boolean sameLine = true;
>-             int startingPos = startTag.elementEnd();
>-             do {
>-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>-                     startingPos = endTagLoc+getEndTag().length();
>-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>-                 }
>-
>-                 if (endTagLoc!=-1) {
>-                     endTagFound = true;
>-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             getCodeBetweenStartAndEndTags(
>-                                 line,
>-                                 startTag,
>-                                 endTagLoc)
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line.substring(0,endTagLoc));
>-                     }
>-
>-                     reader.setPosInLine(endTag.elementEnd());
>-                 } else {
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             line.substring(
>-                                 startTag.elementEnd()+1
>-                             )
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line);
>-                     }
>-                 }
>-                 if (!endTagFound) {
>-                     line = reader.getNextLine();
>-                     startingPos = 0;
>-                 }
>-                 if (sameLine)
>-                     sameLine = false;
>-             }
>-             while (line!=null && !endTagFound);
>-             if (endTag == null) {
>-                 // If end tag doesn't exist, create one
>-                 String endTagName = tag.getTagName();
>-                 int endTagBegin = reader.getLastReadPosition()+1 ;
>-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
>-                 endTag = new EndTag(
>-                     new TagData(
>-                         endTagBegin,
>-                         endTagEnd,
>-                         endTagName,
>-                         currLine
>-                     )
>-                 );
>-             }
>-             NodeList childrenNodeList = new NodeList();
>-             childrenNodeList.add(
>-                 new StringNode(
>-                     scriptContents,
>-                     startTag.elementEnd(),
>-                     endTag.elementBegin()-1
>-                 )
>-             );
>-             return createTag(
>-                 new TagData(
>-                     startTag.elementBegin(),
>-                     endTag.elementEnd(),
>-                     startLine,
>-                     reader.getLastLineNumber(),
>-                     startTag.getText(),
>-                     currLine,
>-                     url,
>-                     false
>-                 ), new CompositeTagData(
>-                     startTag,endTag,childrenNodeList
>-                 )
>-             );
>-
>-         }
>-         catch (Exception e) {
>-             throw new ParserException("Error in ScriptScanner: ",e);
>-         }
>-     }
>-
>-     public String getCodeBetweenStartAndEndTags(
>-         String line,
>-         Tag startTag,
>-         int endTagLoc) throws ParserException {
>-         try {
>-
>-             return line.substring(
>-                 startTag.elementEnd()+1,
>-                 endTagLoc
>-             );
>-         }
>-         catch (Exception e) {
>-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>-             msg.append("substring ends at: "+(endTagLoc));
>-             throw new ParserException(msg.toString(),e);
>-         }
>-     }
>-
>-     /**
>-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>-      * <code>ScriptScanner</code> you should override this method.
>-      * @return String containing the end tag to search for, i.e. </SCRIPT>
>-      */
>-     public String getEndTag() {
>-         return SCRIPT_END_TAG;
>-     }
>-
>-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>-         return line.charAt(endTagLoc+getEndTag().length())=='"';
>-     }
>-
>  }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software.
>http://www.objectstore.net/sourceforge
>_______________________________________________
>Htmlparser-cvs mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs

_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
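
For readers following the thread, the two end-tag strategies under discussion can be sketched roughly as below. This is a minimal, standalone illustration and not htmlparser code: the class name ScriptEndFinder and the methods findBare and findBalanced are hypothetical, and the quote handling is only an assumption about what "balance_quotes" is meant to achieve. findBare is the bare-bones search Marc describes (stop at the first end tag, no knowledge of quotes); findBalanced ignores end tags that sit inside single- or double-quoted strings.

```java
public class ScriptEndFinder {
    // Matched case-insensitively; the closing '>' is left out so "</SCRIPT >" also matches.
    private static final String END_TAG = "</script";

    /** Bare-bones search: index of the first end tag at or after 'from', or -1. */
    public static int findBare(String text, int from) {
        for (int i = Math.max(from, 0); i + END_TAG.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Quote-aware search: like findBare, but skips end tags inside '...' or "..." strings. */
    public static int findBalanced(String text, int from) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character (an escaped quote, a line continuation)
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote found
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>document.write(\"<script></script>\"); var a = 1;</script><p>";
        int start = "<script>".length();
        // First line: position of the quoted end tag; second line: position of the real one.
        System.out.println(findBare(html, start));
        System.out.println(findBalanced(html, start));
    }
}
```

Neither approach is safe on its own. The bare search stops at the quoted end tag from bug #741769, while the quote-aware search can miss the real end tag when a script comment contains a stray apostrophe (for example "// don't"), because the quote state never closes. That trade-off appears to be what the messages above are weighing.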