Thread: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScann


Derrick,

I was relying on some of the old behavior of ScriptScanner, mostly the
fact that its contents were not parsed as HTML.  I'm still seeing cases
where tags inside <script> are recognised as HTML and modified
(i.e. turned into uppercase, auto-closed, etc.), for example when there
is an HTML tag in a JavaScript comment.  Also, using "\" to concatenate
lines (which is valid in JavaScript) is totally messed up now when I try
to get the script code using toHtml().

However, I think your change was valid and fixes the bug as requested.
What I think I'm going to do, though, is write a new scanner class that
does what the old ScriptScanner did.  That is, do a bare-bones "leave
everything inside that tag as-is" parse of the HTML, searching only for
the end tag, with no knowledge of quotes or anything else.  I think there
are cases where JavaScript is written such that any modification at all
will break it.
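
A minimal sketch of such a bare-bones scan, for illustration only. It is
not the old ScriptScanner code and not the class announced here:
RawScriptExtractor and its two methods are hypothetical names, and the
sketch works on a plain String instead of going through NodeReader.

```java
/** Hypothetical helper (not part of htmlparser): returns script text untouched. */
final class RawScriptExtractor {
    // Assumption: the end tag is matched without its closing '>' so that
    // variants such as "</script >" are also found.
    private static final String END_TAG = "</script";

    /**
     * Returns everything between 'from' (the index just past the opening
     * script tag) and the first end tag, exactly as written in the input.
     * If no end tag exists, the rest of the input is returned.
     */
    static String extract(String html, int from) {
        int end = indexOfIgnoreCase(html, END_TAG, from);
        return end < 0 ? html.substring(from) : html.substring(from, end);
    }

    /** Case-insensitive indexOf that does not copy or upper-case the input. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        int last = text.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Because nothing is interpreted, comments and backslash line continuations
come back byte for byte. The cost is that a literal </script> inside a
quoted string ends the scan early, which is the situation bug #741769
describes.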

I'll send a note to the list when this class is done (today sometime).
I'll call it StrictScriptScanner or something.

Marc

-----Original Message-----
From: der...@us...
[mailto:der...@us...]
Sent: Saturday, May 24, 2003 2:05 PM
To: htm...@li...
Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners

Modified Files:
	CompositeTagScanner.java ScriptScanner.java
Log Message:
Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
Major overhaul of ScriptScanner.
It now uses the scan() method of CompositeTagScanner (i.e. doesn't override it).
CompositeTagScanner now has a balance_quotes member field that dictates
whether string nodes are scanned honouring single and double quotes.
This affected the call chain through NodeReader and StringScanner, which
now have this parameter.
StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed;
it didn't work anyway, since a single StringScanner is used recursively by the NodeReader
and the member field would have been tromped.
Sorry to all those who have broken code because of this, but it's for the better. Really.
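
To illustrate what honouring quotes means when looking for an end tag, a
minimal sketch follows. It is not the StringScanner code from this
commit: QuoteAwareEndTagFinder and find are hypothetical names, and the
backslash-escape handling is an assumption made for the example.

```java
/** Hypothetical illustration of quote-balanced end-tag search. */
final class QuoteAwareEndTagFinder {
    /**
     * Returns the index of the first occurrence of endTag at or after 'from',
     * or -1 if there is none. When balanceQuotes is true, text inside single
     * or double quotes is skipped, so a quoted end tag is not reported.
     */
    static int find(String text, String endTag, int from, boolean balanceQuotes) {
        char quote = 0; // 0 means "not inside a quoted string"
        int last = text.length() - endTag.length();
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            char c = text.charAt(i);
            if (balanceQuotes && quote != 0) {
                if (c == '\\') {
                    i++; // assumption: a backslash escapes the next character
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
                continue;
            }
            if (balanceQuotes && (c == '"' || c == '\'')) {
                quote = c; // opening quote found
                continue;
            }
            if (i <= last && text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

With balanceQuotes set to false this reduces to a plain case-insensitive
search. With it set to true, an unmatched apostrophe (for instance in a
script comment) leaves the scanner inside a "string", so a real end tag
can be skipped.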

Index: CompositeTagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** CompositeTagScanner.java	19 May 2003 02:49:57 -0000	1.52
--- CompositeTagScanner.java	24 May 2003 21:04:44 -0000	1.53
***************
*** 97,100 ****
--- 97,101 ----
  	private Set tagEnderSet;
  	private Set endTagEnderSet;
+ 	private boolean balance_quotes;
  			
  	public CompositeTagScanner(String [] nameOfTagToMatch) {
***************
*** 125,129 ****
  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
  	}
! 	
  	public CompositeTagScanner(
  		String filter, 
--- 126,130 ----
  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
  	}
! 
  	public CompositeTagScanner(
  		String filter, 
***************
*** 131,138 ****
  		String [] tagEnders, 
  		String [] endTagEnders,
! 		boolean allowSelfChildren) {
  		super(filter);
  		this.nameOfTagToMatch = nameOfTagToMatch;
  		this.allowSelfChildren = allowSelfChildren;
  		this.tagEnderSet = new HashSet();
  		for (int i=0;i<tagEnders.length;i++)
--- 132,172 ----
  		String [] tagEnders, 
  		String [] endTagEnders,
! 		boolean allowSelfChildren)
!     {
!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
!     }
! 
!    /**
!     * Constructor specifying all member fields.
!     * @param filter A string that is used to match which tags are to be allowed
!     * to pass through. This can be useful when one wishes to dynamically filter
!     * out all tags except one type which may be programmed later than the parser.
!     * @param nameOfTagToMatch The tag names recognized by this scanner.
!     * @param tagEnders The non-endtag tag names which signal that no closing
!     * end tag was found. For example, encountering &lt;FORM&gt; while
!     * scanning a &lt;A&gt; link tag would mean that no &lt;/A&gt; was found
!     * and needs to be corrected.
!     * @param endTagEnders The endtag names which signal that no closing end
!     * tag was found. For example, encountering &lt;/HTML&gt; while
!     * scanning a &lt;BODY&gt; tag would mean that no &lt;/BODY&gt; was found
!     * and needs to be corrected. These items are not prefixed by a '/'.
!     * @param allowSelfChildren If <code>true</code> a tag of the same name is
!     * allowed within this tag. Used to determine when an endtag is missing.
!     * @param balance_quotes <code>true</code> if scanning string nodes needs to
!     * honour quotes. For example, ScriptScanner defines this <code>true</code>
!     * so that text within &lt;SCRIPT&gt;&lt;/SCRIPT&gt; ignores tag-like text
!     * within quotes.
!     */
! 	public CompositeTagScanner(
! 		String filter, 
! 		String [] nameOfTagToMatch, 
! 		String [] tagEnders, 
! 		String [] endTagEnders,
! 		boolean allowSelfChildren,
!         boolean balance_quotes) {
  		super(filter);
  		this.nameOfTagToMatch = nameOfTagToMatch;
  		this.allowSelfChildren = allowSelfChildren;
+         this.balance_quotes = balance_quotes;
  		this.tagEnderSet = new HashSet();
  		for (int i=0;i<tagEnders.length;i++)
***************
*** 145,149 ****
  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
  		CompositeTagScannerHelper helper = 
! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine);
  		return helper.scan();
  	}
--- 179,183 ----
  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
  		CompositeTagScannerHelper helper = 
! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
  		return helper.scan();
  	}
***************
*** 193,196 ****
  		return false;
  	}
- 
  }
--- 227,229 ----

Index: ScriptScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** ScriptScanner.java	19 May 2003 02:49:57 -0000	1.21
--- ScriptScanner.java	24 May 2003 21:04:44 -0000	1.22
***************
*** 28,64 ****
  
  package org.htmlparser.scanners;
! /////////////////////////
! // HTML Parser Imports //
! /////////////////////////
! import org.htmlparser.Node;
! import org.htmlparser.NodeReader;
! import org.htmlparser.StringNode;
! import org.htmlparser.tags.EndTag;
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
! import org.htmlparser.util.NodeList;
! import org.htmlparser.util.ParserException;
  /**
   * The HTMLScriptScanner identifies javascript code
   */
- 
  public class ScriptScanner extends CompositeTagScanner {
- 	private static final String SCRIPT_END_TAG = "</SCRIPT>";
  	private static final String MATCH_NAME [] = {"SCRIPT"};
  	private static final String ENDERS [] = {"BODY", "HTML"};
  	public ScriptScanner() {
! 		super("",MATCH_NAME,ENDERS);
  	}
  
  	public ScriptScanner(String filter) {
! 		super(filter,MATCH_NAME,ENDERS);
  	}
  
! 	public ScriptScanner(String filter, String[] nameOfTagToMatch) {
! 		super(filter,nameOfTagToMatch,ENDERS);
  	}
! 	
  	public String [] getID() {
  		return MATCH_NAME;
--- 28,59 ----
  
  package org.htmlparser.scanners;
! 
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
! 
  /**
   * The HTMLScriptScanner identifies javascript code
   */
  public class ScriptScanner extends CompositeTagScanner {
  	private static final String MATCH_NAME [] = {"SCRIPT"};
  	private static final String ENDERS [] = {"BODY", "HTML"};
  	public ScriptScanner() {
! 		this("");
  	}
  
  	public ScriptScanner(String filter) {
! 		this(filter,MATCH_NAME,ENDERS);
  	}
  
! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
! 		this(filter,nameOfTagToMatch,enders, new String[0], true, true);
  	}
! 
! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
! 		super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
! 	}
! 
  	public String [] getID() {
  		return MATCH_NAME;
***************
*** 70,205 ****
  		return new ScriptTag(tagData,compositeTagData);
  	}
- 
- 	public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
- 		throws ParserException {
- 		try {
- 			int startLine = reader.getLastLineNumber();
- 			String line = null;
- 			StringBuffer scriptContents = 
- 				new StringBuffer();
- 			boolean endTagFound = false;
- 			Tag startTag = tag;
- 			Tag endTag = null;
- 			line = currLine;
- 			boolean sameLine = true;
- 			int startingPos = startTag.elementEnd();
- 			do {
- 				int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
- 				while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
- 					startingPos = endTagLoc+getEndTag().length();
- 					endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos); 	
- 				}
- 				 
- 				if (endTagLoc!=-1) {
- 					endTagFound = true;
- 					endTag = (EndTag)EndTag.find(line,endTagLoc);
- 					if (sameLine) 
- 						scriptContents.append(
- 							getCodeBetweenStartAndEndTags(
- 								line,
- 								startTag,
- 								endTagLoc)
- 						);
- 					else {
- 						scriptContents.append(Node.getLineSeparator());
- 						scriptContents.append(line.substring(0,endTagLoc));
- 					}
- 					
- 					reader.setPosInLine(endTag.elementEnd());
- 				} else {
- 					if (sameLine) 
- 						scriptContents.append(
- 							line.substring(
- 								startTag.elementEnd()+1
- 							)
- 						);
- 					else {
- 						scriptContents.append(Node.getLineSeparator());
- 						scriptContents.append(line);
- 					}
- 				}
- 				if (!endTagFound) {
- 					line = reader.getNextLine();
- 					startingPos = 0;
- 				}
- 				if (sameLine) 
- 					sameLine = false;
- 			}
- 			while (line!=null && !endTagFound);
- 			if (endTag == null) {
- 				// If end tag doesn't exist, create one
- 				String endTagName = tag.getTagName();
- 				int endTagBegin = reader.getLastReadPosition()+1 ;
- 				int endTagEnd = endTagBegin + endTagName.length() + 2; 
- 				endTag = new EndTag(
- 					new TagData(
- 						endTagBegin,
- 						endTagEnd,
- 						endTagName,
- 						currLine
- 					)
- 				);
- 			}
- 			NodeList childrenNodeList = new NodeList();
- 			childrenNodeList.add(
- 				new StringNode(
- 					scriptContents,
- 					startTag.elementEnd(),
- 					endTag.elementBegin()-1
- 				)
- 			);
- 			return createTag(
- 				new TagData(
- 					startTag.elementBegin(),
- 					endTag.elementEnd(),
- 					startLine,
- 					reader.getLastLineNumber(),
- 					startTag.getText(),
- 					currLine,
- 					url,
- 					false
- 				), new CompositeTagData(
- 					startTag,endTag,childrenNodeList
- 				)
- 			);
- 			
- 		}
- 		catch (Exception e) {
- 			throw new ParserException("Error in ScriptScanner: ",e);
- 		}
- 	}
- 
- 	public String getCodeBetweenStartAndEndTags(
- 		String line,
- 		Tag startTag,
- 		int endTagLoc) throws ParserException {
- 		try {
- 			
- 			return line.substring(
- 				startTag.elementEnd()+1,
- 				endTagLoc
- 			);
- 		}
- 		catch (Exception e) {
- 			StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
- 			msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
- 			msg.append("substring ends at: "+(endTagLoc));
- 			throw new ParserException(msg.toString(),e);
- 		}
- 	}
- 
- 	/**
- 	 * Gets the end tag that the scanner uses to stop scanning. Subclasses of
- 	 * <code>ScriptScanner</code> you should override this method.
- 	 * @return String containing the end tag to search for, i.e. &lt;/SCRIPT&gt;
- 	 */ 
- 	public String getEndTag() {
- 		return SCRIPT_END_TAG;
- 	}
- 	
- 	private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
- 		if (endTagLoc+getEndTag().length() > line.length()-1) return false;
- 		return line.charAt(endTagLoc+getEndTag().length())=='"';
- 	}
- 
  }
--- 65,67 ----

