Thread: RE: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagS

Sure, I'll see if I can fix it.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
htmlparser/src/org/htmlparser/scanners
CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
or remarks.
I guess the text scanner goes until it sees a <x... and then stops to
defer to a tag scanner. I hadn't thought about those in comments, or
about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the
remark scanner should be OK), so that it does the correct behaviour when
balance_quotes is true. Then the 'balance_quotes' flag could be called
'strict_script' or something.
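
To illustrate what honouring quotes means here, the following is a minimal sketch, not the StringScanner code. The class and method names (QuoteAwareScriptEnd, findEndTag) are hypothetical. It looks for the first "</script" that is not inside a single- or double-quoted string and treats a backslash inside a string as an escape.

```java
public class QuoteAwareScriptEnd {
    /**
     * Returns the index of the first "</script" (case-insensitive) that is not
     * inside a single- or double-quoted string, or -1 if there is none.
     * Hypothetical helper for illustration only.
     */
    static int findEndTag(String text, int from) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character, e.g. \" or a trailing \
                } else if (c == quote) {
                    quote = 0; // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote
            } else if (text.regionMatches(true, i, "</script", 0, 8)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>document.write(\"</script>\");</script>";
        int start = "<script>".length();
        int end = findEndTag(html, start);
        // Prints the script body, including the quoted end tag.
        System.out.println(html.substring(start, end));
    }
}
```

This sketch knows nothing about Javascript comments, so an apostrophe or a tag inside a comment would still confuse it, which is the case raised in this thread.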

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the
>fact that its contents were not parsed as HTML.  I'm still seeing cases
>where tags inside of <script> are recognised as "HTML" and modified
>(i.e. turned into uppercase, auto-closed, etc).  For example, if there
>is an HTML tag in a Javascript comment.  Also, using "\" to concatenate
>lines (which is valid in Javascript) is totally messed up now when I try
>to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested.
>What I think I'm going to do, though, is make a new scanner class that
>does what the old ScriptScanner did.  That is, do a bare-bones "leave
>everything inside that tag as-is" parse of the HTML, searching only for
>the end tag with no knowledge of quotes or anything.  I think there are
>cases where Javascript is written such that any modification at all will
>break it.
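
The bare-bones approach described above can be sketched as follows. This is an illustrative assumption, not the planned StrictScriptScanner. The names VerbatimScriptBody and rawBody are hypothetical. The body is copied verbatim up to the first end tag, so tags in comments and backslash line continuations are left untouched.

```java
public class VerbatimScriptBody {
    /**
     * Returns the text between bodyStart (the end of the opening tag) and the
     * first "</script" found by a plain case-insensitive search. If no end tag
     * exists, everything up to the end of the input is returned.
     * Hypothetical helper for illustration only.
     */
    static String rawBody(String page, int bodyStart) {
        for (int i = bodyStart; i + 8 <= page.length(); i++) {
            if (page.regionMatches(true, i, "</script", 0, 8)) {
                return page.substring(bodyStart, i);
            }
        }
        return page.substring(bodyStart);
    }

    public static void main(String[] args) {
        // A tag inside a comment and a backslash line continuation.
        String page = "<script>// <b>bold</b> in a comment\n"
                + "var s = \"one \\\n"
                + "two\";</script>";
        System.out.print(rawBody(page, "<script>".length()));
    }
}
```

The trade-off is that a quoted end tag, the case reported as bug #741769, would end the body early with this kind of search.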
>
>I'll send a note to the list when this class is done (today sometime).
>I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>	CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
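
The point about the member field being tromped can be shown with a small sketch. The classes below (SharedFlagDemo, FieldScanner, ParamScanner) are hypothetical stand-ins, not the htmlparser classes. A flag kept in a field of one shared instance is overwritten by a nested call, while a flag passed as a parameter is not.

```java
public class SharedFlagDemo {
    /** Mimics a scanner that keeps per-scan state in a member field. */
    static class FieldScanner {
        private boolean ignoreState; // hypothetical stand-in for the removed state

        boolean scan(int depth, boolean ignore) {
            ignoreState = ignore;
            if (depth > 0) {
                scan(depth - 1, false); // nested scan on the SAME instance
            }
            return ignoreState; // the outer call now sees the inner value
        }
    }

    /** Same call pattern, but the flag travels down the call chain as an argument. */
    static class ParamScanner {
        boolean scan(int depth, boolean balanceQuotes) {
            if (depth > 0) {
                scan(depth - 1, false);
            }
            return balanceQuotes; // unaffected by the nested call
        }
    }

    public static void main(String[] args) {
        System.out.println(new FieldScanner().scan(1, true)); // prints false
        System.out.println(new ParamScanner().scan(1, true)); // prints true
    }
}
```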
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java	19 May 2003 02:49:57 -0000	1.52
>--- CompositeTagScanner.java	24 May 2003 21:04:44 -0000	1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>  	private Set tagEnderSet;
>  	private Set endTagEnderSet;
>+ 	private boolean balance_quotes;
>
>  	public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>  	}
>!
>  	public CompositeTagScanner(
>  		String filter,
>--- 126,130 ----
>  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>  	}
>!
>  	public CompositeTagScanner(
>  		String filter,
>***************
>*** 131,138 ****
>  		String [] tagEnders,
>  		String [] endTagEnders,
>! 		boolean allowSelfChildren) {
>  		super(filter);
>  		this.nameOfTagToMatch = nameOfTagToMatch;
>  		this.allowSelfChildren = allowSelfChildren;
>  		this.tagEnderSet = new HashSet();
>  		for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>  		String [] tagEnders,
>  		String [] endTagEnders,
>! 		boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!    /**
>!     * Constructor specifying all member fields.
>!     * @param filter A string that is used to match which tags are to be allowed
>!     * to pass through. This can be useful when one wishes to dynamically filter
>!     * out all tags except one type which may be programmed later than the parser.
>!     * @param nameOfTagToMatch The tag names recognized by this scanner.
>!     * @param tagEnders The non-endtag tag names which signal that no closing
>!     * end tag was found. For example, encountering &lt;FORM&gt; while
>!     * scanning a &lt;A&gt; link tag would mean that no &lt;/A&gt; was found
>!     * and needs to be corrected.
>!     * @param endTagEnders The endtag names which signal that no closing end
>!     * tag was found. For example, encountering &lt;/HTML&gt; while
>!     * scanning a &lt;BODY&gt; tag would mean that no &lt;/BODY&gt; was found
>!     * and needs to be corrected. These items are not prefixed by a '/'.
>!     * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!     * allowed within this tag. Used to determine when an endtag is missing.
>!     * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!     * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!     * so that text within &lt;SCRIPT&gt;&lt;/SCRIPT&gt; ignores tag-like text
>!     * within quotes.
>!     */
>! 	public CompositeTagScanner(
>! 		String filter,
>! 		String [] nameOfTagToMatch,
>! 		String [] tagEnders,
>! 		String [] endTagEnders,
>! 		boolean allowSelfChildren,
>!         boolean balance_quotes) {
>  		super(filter);
>  		this.nameOfTagToMatch = nameOfTagToMatch;
>  		this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>  		this.tagEnderSet = new HashSet();
>  		for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>  		CompositeTagScannerHelper helper =
>! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>  		return helper.scan();
>  	}
>--- 179,183 ----
>  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>  		CompositeTagScannerHelper helper =
>! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>  		return helper.scan();
>  	}
>***************
>*** 193,196 ****
>  		return false;
>  	}
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java	19 May 2003 02:49:57 -0000	1.21
>--- ScriptScanner.java	24 May 2003 21:04:44 -0000	1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>- 	private static final String SCRIPT_END_TAG = "</SCRIPT>";
>  	private static final String MATCH_NAME [] = {"SCRIPT"};
>  	private static final String ENDERS [] = {"BODY", "HTML"};
>  	public ScriptScanner() {
>! 		super("",MATCH_NAME,ENDERS);
>  	}
>
>  	public ScriptScanner(String filter) {
>! 		super(filter,MATCH_NAME,ENDERS);
>  	}
>
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>! 		super(filter,nameOfTagToMatch,ENDERS);
>  	}
>!
>  	public String [] getID() {
>  		return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>  	private static final String MATCH_NAME [] = {"SCRIPT"};
>  	private static final String ENDERS [] = {"BODY", "HTML"};
>  	public ScriptScanner() {
>! 		this("");
>  	}
>
>  	public ScriptScanner(String filter) {
>! 		this(filter,MATCH_NAME,ENDERS);
>  	}
>
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>! 		this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>  	}
>!
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>! 		super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>! 	}
>!
>  	public String [] getID() {
>  		return MATCH_NAME;
>***************
>*** 70,205 ****
>  		return new ScriptTag(tagData,compositeTagData);
>  	}
>-
>- 	public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>- 		throws ParserException {
>- 		try {
>- 			int startLine = reader.getLastLineNumber();
>- 			String line = null;
>- 			StringBuffer scriptContents =
>- 				new StringBuffer();
>- 			boolean endTagFound = false;
>- 			Tag startTag = tag;
>- 			Tag endTag = null;
>- 			line = currLine;
>- 			boolean sameLine = true;
>- 			int startingPos = startTag.elementEnd();
>- 			do {
>- 				int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>- 				while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>- 					startingPos = endTagLoc+getEndTag().length();
>- 					endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>- 				}
>-
>- 				if (endTagLoc!=-1) {
>- 					endTagFound = true;
>- 					endTag = (EndTag)EndTag.find(line,endTagLoc);
>- 					if (sameLine)
>- 						scriptContents.append(
>- 							getCodeBetweenStartAndEndTags(
>- 								line,
>- 								startTag,
>- 								endTagLoc)
>- 						);
>- 					else {
>- 						scriptContents.append(Node.getLineSeparator());
>- 						scriptContents.append(line.substring(0,endTagLoc));
>- 					}
>-
>- 					reader.setPosInLine(endTag.elementEnd());
>- 				} else {
>- 					if (sameLine)
>- 						scriptContents.append(
>- 							line.substring(
>- 								startTag.elementEnd()+1
>- 							)
>- 						);
>- 					else {
>- 						scriptContents.append(Node.getLineSeparator());
>- 						scriptContents.append(line);
>- 					}
>- 				}
>- 				if (!endTagFound) {
>- 					line = reader.getNextLine();
>- 					startingPos = 0;
>- 				}
>- 				if (sameLine)
>- 					sameLine = false;
>- 			}
>- 			while (line!=null && !endTagFound);
>- 			if (endTag == null) {
>- 				// If end tag doesn't exist, create one
>- 				String endTagName = tag.getTagName();
>- 				int endTagBegin = reader.getLastReadPosition()+1 ;
>- 				int endTagEnd = endTagBegin + endTagName.length() + 2;
>- 				endTag = new EndTag(
>- 					new TagData(
>- 						endTagBegin,
>- 						endTagEnd,
>- 						endTagName,
>- 						currLine
>- 					)
>- 				);
>- 			}
>- 			NodeList childrenNodeList = new NodeList();
>- 			childrenNodeList.add(
>- 				new StringNode(
>- 					scriptContents,
>- 					startTag.elementEnd(),
>- 					endTag.elementBegin()-1
>- 				)
>- 			);
>- 			return createTag(
>- 				new TagData(
>- 					startTag.elementBegin(),
>- 					endTag.elementEnd(),
>- 					startLine,
>- 					reader.getLastLineNumber(),
>- 					startTag.getText(),
>- 					currLine,
>- 					url,
>- 					false
>- 				), new CompositeTagData(
>- 					startTag,endTag,childrenNodeList
>- 				)
>- 			);
>-
>- 		}
>- 		catch (Exception e) {
>- 			throw new ParserException("Error in ScriptScanner: ",e);
>- 		}
>- 	}
>-
>- 	public String getCodeBetweenStartAndEndTags(
>- 		String line,
>- 		Tag startTag,
>- 		int endTagLoc) throws ParserException {
>- 		try {
>-
>- 			return line.substring(
>- 				startTag.elementEnd()+1,
>- 				endTagLoc
>- 			);
>- 		}
>- 		catch (Exception e) {
>- 			StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>- 			msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>- 			msg.append("substring ends at: "+(endTagLoc));
>- 			throw new ParserException(msg.toString(),e);
>- 		}
>- 	}
>-
>- 	/**
>- 	 * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>- 	 * <code>ScriptScanner</code> you should override this method.
>- 	 * @return String containing the end tag to search for, i.e. &lt;/SCRIPT&gt;
>- 	 */
>- 	public String getEndTag() {
>- 		return SCRIPT_END_TAG;
>- 	}
>-
>- 	private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>- 		if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>- 		return line.charAt(endTagLoc+getEndTag().length())=='"';
>- 	}
>-
>  }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software. http://www.objectstore.net/sourceforge
>_______________________________________________
>Htmlparser-cvs mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
>
>
