RE: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: <dha...@po...> - 2003-05-28 05:23:17
Marc,

I agree with Derrick. Let's correct the existing scanner rather than write a new one. Users typically find it confusing to know which scanner to use and how the two scanners differ; it takes a lot of experience with the parser to understand the subtle difference between the two.

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of Der...@ro...
> Sent: Wednesday, May 28, 2003 3:09 AM
> To: htm...@li...
> Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
> htmlparser/src/org/htmlparser/scanners
> CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> Marc,
>
> The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
> or remarks.
> I guess the text scanner goes until it sees a <x... and then stops to
> defer to a tag scanner. I hadn't thought about those in comments, or
> about the \ end of lines.
>
> Perhaps, rather than write a new scanner, fix the StringScanner (the
> remark scanner should be OK), so that it does the correct behaviour when
> balance_quotes is true. Then the 'balance_quotes' flag could be called
> 'strict_script' or something.
>
> Derrick
>
> Marc Novakowski wrote:
>
> >Derrick,
> >
> >I was relying on some of the old behavior of ScriptScanner, mostly the
> >fact that its contents were not parsed as HTML. I'm still seeing cases
> >where tags inside of <script> are recognised as "HTML" and modified
> >(i.e. turned into uppercase, auto-closed, etc). For example, if there
> >is an HTML tag in a Javascript comment. Also, using "\" to concatenate
> >lines (which is valid in Javascript) is totally messed up now when I try
> >to get the script code using "toHtml()".
> >
> >However, I think your change was valid and fixes the bug as requested.
> >What I think I'm going to do, though, is make a new scanner class that
> >does what the old ScriptScanner did. That is, do a bare-bones "leave
> >everything inside that tag as-is" parse of the HTML, searching only for
> >the end tag with no knowledge of quotes or anything. I think there are
> >cases where Javascript is written such that any modification at all
> >will break it.
> >
> >I'll send a note to the list when this class is done (today sometime).
> >I'll call it StrictScriptScanner or something.
> >
> >Marc
> >
> >-----Original Message-----
> >From: der...@us...
> >[mailto:der...@us...]
> >Sent: Saturday, May 24, 2003 2:05 PM
> >To: htm...@li...
> >Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
> >CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
> >
> >
> >Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
> >In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
> >
> >Modified Files:
> >    CompositeTagScanner.java ScriptScanner.java
> >Log Message:
> >Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
> >Major overhaul of ScriptScanner.
> >It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
> >CompositeTagScanner now has a balance_quotes member field that dictates
> >whether strings tags are scanned honouring single and double quotes.
> >This affected the call chain through NodeReader and StringScanner which
> >now have this parameter.
> >StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
> >it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
> >and the member field would have been tromped.
> >Sorry to all those who have broken code because of this, but it's for the better. Really.
> >
> >Index: CompositeTagScanner.java
> >===================================================================
> >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
> >retrieving revision 1.52
> >retrieving revision 1.53
> >diff -C2 -d -r1.52 -r1.53
> >*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
> >--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
> >***************
> >*** 97,100 ****
> >--- 97,101 ----
> >      private Set tagEnderSet;
> >      private Set endTagEnderSet;
> >+     private boolean balance_quotes;
> >  
> >      public CompositeTagScanner(String [] nameOfTagToMatch) {
> >***************
> >*** 125,129 ****
> >          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
> >      }
> >! 
> >      public CompositeTagScanner(
> >          String filter,
> >--- 126,130 ----
> >          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
> >      }
> >! 
> >      public CompositeTagScanner(
> >          String filter,
> >***************
> >*** 131,138 ****
> >          String [] tagEnders,
> >          String [] endTagEnders,
> >!         boolean allowSelfChildren) {
> >          super(filter);
> >          this.nameOfTagToMatch = nameOfTagToMatch;
> >          this.allowSelfChildren = allowSelfChildren;
> >          this.tagEnderSet = new HashSet();
> >          for (int i=0;i<tagEnders.length;i++)
> >--- 132,172 ----
> >          String [] tagEnders,
> >          String [] endTagEnders,
> >!         boolean allowSelfChildren)
> >!     {
> >!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
> >!     }
> >! 
> >!     /**
> >!      * Constructor specifying all member fields.
> >!      * @param filter A string that is used to match which tags are to be allowed
> >!      * to pass through. This can be useful when one wishes to dynamically filter
> >!      * out all tags except one type which may be programmed later than the parser.
> >!      * @param nameOfTagToMatch The tag names recognized by this scanner.
> >!      * @param tagEnders The non-endtag tag names which signal that no closing
> >!      * end tag was found. For example, encountering <FORM> while
> >!      * scanning a <A> link tag would mean that no </A> was found
> >!      * and needs to be corrected.
> >!      * @param endTagEnders The endtag names which signal that no closing end
> >!      * tag was found. For example, encountering </HTML> while
> >!      * scanning a <BODY> tag would mean that no </BODY> was found
> >!      * and needs to be corrected. These items are not prefixed by a '/'.
> >!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
> >!      * allowed within this tag. Used to determine when an endtag is missing.
> >!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
> >!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
> >!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
> >!      * within quotes.
> >!      */
> >!     public CompositeTagScanner(
> >!         String filter,
> >!         String [] nameOfTagToMatch,
> >!         String [] tagEnders,
> >!         String [] endTagEnders,
> >!         boolean allowSelfChildren,
> >!         boolean balance_quotes) {
> >          super(filter);
> >          this.nameOfTagToMatch = nameOfTagToMatch;
> >          this.allowSelfChildren = allowSelfChildren;
> >+         this.balance_quotes = balance_quotes;
> >          this.tagEnderSet = new HashSet();
> >          for (int i=0;i<tagEnders.length;i++)
> >***************
> >*** 145,149 ****
> >      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
> >          CompositeTagScannerHelper helper =
> >!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
> >          return helper.scan();
> >      }
> >--- 179,183 ----
> >      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
> >          CompositeTagScannerHelper helper =
> >!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
> >          return helper.scan();
> >      }
> >***************
> >*** 193,196 ****
> >          return false;
> >      }
> >- 
> >  }
> >--- 227,229 ----
> >
> >Index: ScriptScanner.java
> >===================================================================
> >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
> >retrieving revision 1.21
> >retrieving revision 1.22
> >diff -C2 -d -r1.21 -r1.22
> >*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
> >--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
> >***************
> >*** 28,64 ****
> >  
> >  package org.htmlparser.scanners;
> >! /////////////////////////
> >! // HTML Parser Imports //
> >! /////////////////////////
> >! import org.htmlparser.Node;
> >! import org.htmlparser.NodeReader;
> >! import org.htmlparser.StringNode;
> >! import org.htmlparser.tags.EndTag;
> >  import org.htmlparser.tags.ScriptTag;
> >  import org.htmlparser.tags.Tag;
> >  import org.htmlparser.tags.data.CompositeTagData;
> >  import org.htmlparser.tags.data.TagData;
> >! import org.htmlparser.util.NodeList;
> >! import org.htmlparser.util.ParserException;
> >  /**
> >   * The HTMLScriptScanner identifies javascript code
> >   */
> >- 
> >  public class ScriptScanner extends CompositeTagScanner {
> >-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
> >      private static final String MATCH_NAME [] = {"SCRIPT"};
> >      private static final String ENDERS [] = {"BODY", "HTML"};
> >      public ScriptScanner() {
> >!         super("",MATCH_NAME,ENDERS);
> >      }
> >  
> >      public ScriptScanner(String filter) {
> >!         super(filter,MATCH_NAME,ENDERS);
> >      }
> >  
> >!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
> >!         super(filter,nameOfTagToMatch,ENDERS);
> >      }
> >! 
> >      public String [] getID() {
> >          return MATCH_NAME;
> >--- 28,59 ----
> >  
> >  package org.htmlparser.scanners;
> >! 
> >  import org.htmlparser.tags.ScriptTag;
> >  import org.htmlparser.tags.Tag;
> >  import org.htmlparser.tags.data.CompositeTagData;
> >  import org.htmlparser.tags.data.TagData;
> >! 
> >  /**
> >   * The HTMLScriptScanner identifies javascript code
> >   */
> >  public class ScriptScanner extends CompositeTagScanner {
> >      private static final String MATCH_NAME [] = {"SCRIPT"};
> >      private static final String ENDERS [] = {"BODY", "HTML"};
> >      public ScriptScanner() {
> >!         this("");
> >      }
> >  
> >      public ScriptScanner(String filter) {
> >!         this(filter,MATCH_NAME,ENDERS);
> >      }
> >  
> >!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
> >!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
> >      }
> >! 
> >!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
> >!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
> >!     }
> >! 
> >      public String [] getID() {
> >          return MATCH_NAME;
> >***************
> >*** 70,205 ****
> >          return new ScriptTag(tagData,compositeTagData);
> >      }
> >- 
> >-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
> >-         throws ParserException {
> >-         try {
> >-             int startLine = reader.getLastLineNumber();
> >-             String line = null;
> >-             StringBuffer scriptContents =
> >-                 new StringBuffer();
> >-             boolean endTagFound = false;
> >-             Tag startTag = tag;
> >-             Tag endTag = null;
> >-             line = currLine;
> >-             boolean sameLine = true;
> >-             int startingPos = startTag.elementEnd();
> >-             do {
> >-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
> >-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
> >-                     startingPos = endTagLoc+getEndTag().length();
> >-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
> >-                 }
> >- 
> >-                 if (endTagLoc!=-1) {
> >-                     endTagFound = true;
> >-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
> >-                     if (sameLine)
> >-                         scriptContents.append(
> >-                             getCodeBetweenStartAndEndTags(
> >-                                 line,
> >-                                 startTag,
> >-                                 endTagLoc)
> >-                         );
> >-                     else {
> >-                         scriptContents.append(Node.getLineSeparator());
> >-                         scriptContents.append(line.substring(0,endTagLoc));
> >-                     }
> >- 
> >-                     reader.setPosInLine(endTag.elementEnd());
> >-                 } else {
> >-                     if (sameLine)
> >-                         scriptContents.append(
> >-                             line.substring(
> >-                                 startTag.elementEnd()+1
> >-                             )
> >-                         );
> >-                     else {
> >-                         scriptContents.append(Node.getLineSeparator());
> >-                         scriptContents.append(line);
> >-                     }
> >-                 }
> >-                 if (!endTagFound) {
> >-                     line = reader.getNextLine();
> >-                     startingPos = 0;
> >-                 }
> >-                 if (sameLine)
> >-                     sameLine = false;
> >-             }
> >-             while (line!=null && !endTagFound);
> >-             if (endTag == null) {
> >-                 // If end tag doesn't exist, create one
> >-                 String endTagName = tag.getTagName();
> >-                 int endTagBegin = reader.getLastReadPosition()+1 ;
> >-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
> >-                 endTag = new EndTag(
> >-                     new TagData(
> >-                         endTagBegin,
> >-                         endTagEnd,
> >-                         endTagName,
> >-                         currLine
> >-                     )
> >-                 );
> >-             }
> >-             NodeList childrenNodeList = new NodeList();
> >-             childrenNodeList.add(
> >-                 new StringNode(
> >-                     scriptContents,
> >-                     startTag.elementEnd(),
> >-                     endTag.elementBegin()-1
> >-                 )
> >-             );
> >-             return createTag(
> >-                 new TagData(
> >-                     startTag.elementBegin(),
> >-                     endTag.elementEnd(),
> >-                     startLine,
> >-                     reader.getLastLineNumber(),
> >-                     startTag.getText(),
> >-                     currLine,
> >-                     url,
> >-                     false
> >-                 ), new CompositeTagData(
> >-                     startTag,endTag,childrenNodeList
> >-                 )
> >-             );
> >- 
> >-         }
> >-         catch (Exception e) {
> >-             throw new ParserException("Error in ScriptScanner: ",e);
> >-         }
> >-     }
> >- 
> >-     public String getCodeBetweenStartAndEndTags(
> >-         String line,
> >-         Tag startTag,
> >-         int endTagLoc) throws ParserException {
> >-         try {
> >- 
> >-             return line.substring(
> >-                 startTag.elementEnd()+1,
> >-                 endTagLoc
> >-             );
> >-         }
> >-         catch (Exception e) {
> >-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
> >-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
> >-             msg.append("substring ends at: "+(endTagLoc));
> >-             throw new ParserException(msg.toString(),e);
> >-         }
> >-     }
> >- 
> >-     /**
> >-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
> >-      * <code>ScriptScanner</code> you should override this method.
> >-      * @return String containing the end tag to search for, i.e. </SCRIPT>
> >-      */
> >-     public String getEndTag() {
> >-         return SCRIPT_END_TAG;
> >-     }
> >- 
> >-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
> >-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
> >-         return line.charAt(endTagLoc+getEndTag().length())=='"';
> >-     }
> >- 
> >  }
> >--- 65,67 ----
> >
> >
> >-------------------------------------------------------
> >This SF.net email is sponsored by: ObjectStore.
> >If flattening out C++ or Java code to make your application fit in a
> >relational database is painful, don't do it! Check out ObjectStore. Now
> >part of Progress Software. http://www.objectstore.net/sourceforge
> >_______________________________________________
> >Htmlparser-cvs mailing list Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
> >
> >_______________________________________________
> >Htmlparser-developer mailing list
> >Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
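
For anyone following the thread, the two behaviours under discussion (a bare-bones search for the end tag versus a search that honours quotes, as balance_quotes is described in the log message) can be contrasted with a small standalone sketch. This is not htmlparser code: the class ScriptEndFinder and both method names are hypothetical, and the real StringScanner works through NodeReader line by line rather than on one string. It assumes only single- and double-quoted strings with backslash escapes; it has no knowledge of Javascript comments or regular-expression literals, which is the gap Marc describes.

```java
// Hypothetical illustration only; not part of org.htmlparser.
final class ScriptEndFinder {
    private static final String END_TAG = "</script";

    // Bare-bones search: the first "</script" wins, quoted or not.
    static int findEndTagNaive(String text, int from) {
        for (int i = Math.max(from, 0); i + END_TAG.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    // Quote-honouring search: "</script" inside '...' or "..." is skipped.
    static int findEndTagBalanced(String text, int from) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character (also covers a trailing "\")
                } else if (c == quote) {
                    quote = 0;      // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;           // end tag outside any string literal
            }
        }
        return -1;                  // no unquoted end tag in this text
    }
}
```

With the input document.write("</script>"); var x = 1;</script> the naive search stops at the quoted tag, while the balanced search stops at the final one. An apostrophe inside a Javascript comment would still confuse the balanced version, so a strict mode would need comment handling as well.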