Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: Derrick O. <Der...@ro...> - 2003-05-27 21:46:44
Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks. I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ at the end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it behaves correctly when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>    CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>      private Set tagEnderSet;
>      private Set endTagEnderSet;
>+     private boolean balance_quotes;
>
>      public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>              String filter,
>--- 126,130 ----
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>              String filter,
>***************
>*** 131,138 ****
>              String [] tagEnders,
>              String [] endTagEnders,
>!             boolean allowSelfChildren) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>              String [] tagEnders,
>              String [] endTagEnders,
>!             boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!     /**
>!      * Constructor specifying all member fields.
>!      * @param filter A string that is used to match which tags are to be allowed
>!      * to pass through. This can be useful when one wishes to dynamically filter
>!      * out all tags except one type which may be programmed later than the parser.
>!      * @param nameOfTagToMatch The tag names recognized by this scanner.
>!      * @param tagEnders The non-endtag tag names which signal that no closing
>!      * end tag was found. For example, encountering <FORM> while
>!      * scanning a <A> link tag would mean that no </A> was found
>!      * and needs to be corrected.
>!      * @param endTagEnders The endtag names which signal that no closing end
>!      * tag was found. For example, encountering </HTML> while
>!      * scanning a <BODY> tag would mean that no </BODY> was found
>!      * and needs to be corrected. These items are not prefixed by a '/'.
>!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!      * allowed within this tag. Used to determine when an endtag is missing.
>!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>!      * within quotes.
>!      */
>!     public CompositeTagScanner(
>!             String filter,
>!             String [] nameOfTagToMatch,
>!             String [] tagEnders,
>!             String [] endTagEnders,
>!             boolean allowSelfChildren,
>!             boolean balance_quotes) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>          return helper.scan();
>      }
>--- 179,183 ----
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>          return helper.scan();
>      }
>***************
>*** 193,196 ****
>          return false;
>      }
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         super("",MATCH_NAME,ENDERS);
>      }
>
>      public ScriptScanner(String filter) {
>!         super(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>!         super(filter,nameOfTagToMatch,ENDERS);
>      }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         this("");
>      }
>
>      public ScriptScanner(String filter) {
>!         this(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>      }
>!
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>!     }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>***************
>*** 70,205 ****
>          return new ScriptTag(tagData,compositeTagData);
>      }
>-
>-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>-         throws ParserException {
>-         try {
>-             int startLine = reader.getLastLineNumber();
>-             String line = null;
>-             StringBuffer scriptContents =
>-                 new StringBuffer();
>-             boolean endTagFound = false;
>-             Tag startTag = tag;
>-             Tag endTag = null;
>-             line = currLine;
>-             boolean sameLine = true;
>-             int startingPos = startTag.elementEnd();
>-             do {
>-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>-                     startingPos = endTagLoc+getEndTag().length();
>-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>-                 }
>-
>-                 if (endTagLoc!=-1) {
>-                     endTagFound = true;
>-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             getCodeBetweenStartAndEndTags(
>-                                 line,
>-                                 startTag,
>-                                 endTagLoc)
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line.substring(0,endTagLoc));
>-                     }
>-
>-                     reader.setPosInLine(endTag.elementEnd());
>-                 } else {
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             line.substring(
>-                                 startTag.elementEnd()+1
>-                             )
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line);
>-                     }
>-                 }
>-                 if (!endTagFound) {
>-                     line = reader.getNextLine();
>-                     startingPos = 0;
>-                 }
>-                 if (sameLine)
>-                     sameLine = false;
>-             }
>-             while (line!=null && !endTagFound);
>-             if (endTag == null) {
>-                 // If end tag doesn't exist, create one
>-                 String endTagName = tag.getTagName();
>-                 int endTagBegin = reader.getLastReadPosition()+1 ;
>-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
>-                 endTag = new EndTag(
>-                     new TagData(
>-                         endTagBegin,
>-                         endTagEnd,
>-                         endTagName,
>-                         currLine
>-                     )
>-                 );
>-             }
>-             NodeList childrenNodeList = new NodeList();
>-             childrenNodeList.add(
>-                 new StringNode(
>-                     scriptContents,
>-                     startTag.elementEnd(),
>-                     endTag.elementBegin()-1
>-                 )
>-             );
>-             return createTag(
>-                 new TagData(
>-                     startTag.elementBegin(),
>-                     endTag.elementEnd(),
>-                     startLine,
>-                     reader.getLastLineNumber(),
>-                     startTag.getText(),
>-                     currLine,
>-                     url,
>-                     false
>-                 ), new CompositeTagData(
>-                     startTag,endTag,childrenNodeList
>-                 )
>-             );
>-
>-         }
>-         catch (Exception e) {
>-             throw new ParserException("Error in ScriptScanner: ",e);
>-         }
>-     }
>-
>-     public String getCodeBetweenStartAndEndTags(
>-         String line,
>-         Tag startTag,
>-         int endTagLoc) throws ParserException {
>-         try {
>-
>-             return line.substring(
>-                 startTag.elementEnd()+1,
>-                 endTagLoc
>-             );
>-         }
>-         catch (Exception e) {
>-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>-             msg.append("substring ends at: "+(endTagLoc));
>-             throw new ParserException(msg.toString(),e);
>-         }
>-     }
>-
>-     /**
>-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>-      * <code>ScriptScanner</code> you should override this method.
>-      * @return String containing the end tag to search for, i.e. </SCRIPT>
>-      */
>-     public String getEndTag() {
>-         return SCRIPT_END_TAG;
>-     }
>-
>-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>-         return line.charAt(endTagLoc+getEndTag().length())=='"';
>-     }
>-
>  }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software. http://www.objectstore.net/sourceforge
>_______________________________________________
>Htmlparser-cvs mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
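
For anyone following the thread, here is a rough, standalone sketch of the two strategies being discussed: a bare-bones search for the end tag (roughly what Marc describes for the old ScriptScanner) versus a search that honours single and double quotes (the idea behind balance_quotes). It is not code from the htmlparser tree; the class and method names (ScriptEndTagSketch, findEndTagRaw, findEndTagBalanced) are made up for illustration, and it works on a single String instead of going through NodeReader and StringScanner.

```java
// Hypothetical illustration only; none of these names exist in htmlparser.
final class ScriptEndTagSketch {
    private static final String END_TAG = "</SCRIPT>";

    /** Bare-bones search: first case-insensitive </SCRIPT> at or after 'from', ignoring quotes. */
    static int findEndTagRaw(String text, int from) {
        for (int i = from; i < text.length(); i++) {
            if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Quote-aware search: skips anything inside '...' or "...", honouring backslash escapes. */
    static int findEndTagBalanced(String text, int from) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;           // skip the escaped character
                } else if (c == quote) {
                    quote = 0;     // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c;         // opening quote
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>document.write(\"</script>\"); var x = 1;</script>";
        int start = "<script>".length();
        System.out.println(findEndTagRaw(html, start));      // 24: stops at the quoted end tag
        System.out.println(findEndTagBalanced(html, start)); // 47: skips it and finds the real one
    }
}
```

The quoted </script> case from bug #741769 is where the quote-aware search helps. The flip side, at least in this simplified sketch, is that it knows nothing about comments: a lone apostrophe in a script comment (say "// don't") is taken as an opening quote, so the real end tag is never found, while the raw search is unaffected. That is in the same family as the comment problems raised above, though whether StringScanner behaves exactly this way is an assumption on my part.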