Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag


Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text 
or remarks.
I guess the text scanner runs until it sees a <x... and then stops to 
defer to a tag scanner. I hadn't thought about tags inside comments, or 
about lines that end with a \ continuation.

Perhaps, rather than writing a new scanner, we could fix the StringScanner 
(the remark scanner should be OK) so that it behaves correctly when 
balance_quotes is true. The 'balance_quotes' flag could then be renamed 
'strict_script' or something.
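
Roughly what that stricter behaviour might look like, as a minimal sketch 
only. StrictScriptSketch and findEndTag are made-up names for a standalone 
helper, not part of StringScanner or anything else in htmlparser. It looks 
for the closing tag while skipping quoted strings, comments and backslash 
line continuations. It makes no attempt to handle regular expression 
literals.

```java
/** Hypothetical helper, for illustration only; not htmlparser API. */
public class StrictScriptSketch {
    private static final int NORMAL = 0;
    private static final int SINGLE = 1;        // inside '...'
    private static final int DOUBLE = 2;        // inside "..."
    private static final int LINE_COMMENT = 3;  // after //
    private static final int BLOCK_COMMENT = 4; // inside /* ... */
    private static final String END_TAG = "</script";

    /**
     * Returns the index of the first "</script" (any case) at or after
     * start that is outside quotes and comments, or -1 if there is none.
     */
    public static int findEndTag(String text, int start) {
        int state = NORMAL;
        int len = text.length();
        for (int i = start; i < len; i++) {
            char c = text.charAt(i);
            char next = (i + 1 < len) ? text.charAt(i + 1) : '\0';
            switch (state) {
                case NORMAL:
                    if (c == '\'') {
                        state = SINGLE;
                    } else if (c == '"') {
                        state = DOUBLE;
                    } else if (c == '/' && next == '/') {
                        state = LINE_COMMENT;
                        i++;
                    } else if (c == '/' && next == '*') {
                        state = BLOCK_COMMENT;
                        i++;
                    } else if (c == '<'
                            && text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                        return i;
                    }
                    break;
                case SINGLE:
                case DOUBLE:
                    if (c == '\\') {
                        i++; // skip the escaped character (e.g. a quote or newline)
                        if (i + 1 < len && text.charAt(i) == '\r'
                                && text.charAt(i + 1) == '\n') {
                            i++; // a CRLF continuation is two characters
                        }
                    } else if (c == (state == SINGLE ? '\'' : '"')) {
                        state = NORMAL;
                    } else if (c == '\n') {
                        state = NORMAL; // unterminated string: give up at end of line
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') {
                        state = NORMAL;
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = NORMAL;
                        i++;
                    }
                    break;
            }
        }
        return -1;
    }
}
```

Everything before the returned index would then be kept verbatim as the 
script text, so nothing inside it gets upper-cased or auto-closed.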

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML.  I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc).  For example, if there is an HTML tag in a Javascript comment.  Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested.  What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did.  That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything.  I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime).  I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>	CompositeTagScanner.java ScriptScanner.java 
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether string nodes are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java	19 May 2003 02:49:57 -0000	1.52
>--- CompositeTagScanner.java	24 May 2003 21:04:44 -0000	1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>  	private Set tagEnderSet;
>  	private Set endTagEnderSet;
>+ 	private boolean balance_quotes;
>  			
>  	public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>  	}
>! 	
>  	public CompositeTagScanner(
>  		String filter, 
>--- 126,130 ----
>  		this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>  	}
>! 
>  	public CompositeTagScanner(
>  		String filter, 
>***************
>*** 131,138 ****
>  		String [] tagEnders, 
>  		String [] endTagEnders,
>! 		boolean allowSelfChildren) {
>  		super(filter);
>  		this.nameOfTagToMatch = nameOfTagToMatch;
>  		this.allowSelfChildren = allowSelfChildren;
>  		this.tagEnderSet = new HashSet();
>  		for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>  		String [] tagEnders, 
>  		String [] endTagEnders,
>! 		boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>! 
>!    /**
>!     * Constructor specifying all member fields.
>!     * @param filter A string that is used to match which tags are to be allowed
>!     * to pass through. This can be useful when one wishes to dynamically filter
>!     * out all tags except one type which may be programmed later than the parser.
>!     * @param nameOfTagToMatch The tag names recognized by this scanner.
>!     * @param tagEnders The non-endtag tag names which signal that no closing
>!     * end tag was found. For example, encountering &lt;FORM&gt; while
>!     * scanning a &lt;A&gt; link tag would mean that no &lt;/A&gt; was found
>!     * and needs to be corrected.
>!     * @param endTagEnders The endtag names which signal that no closing end
>!     * tag was found. For example, encountering &lt;/HTML&gt; while
>!     * scanning a &lt;BODY&gt; tag would mean that no &lt;/BODY&gt; was found
>!     * and needs to be corrected. These items are not prefixed by a '/'.
>!     * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!     * allowed within this tag. Used to determine when an endtag is missing.
>!     * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!     * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!     * so that text within &lt;SCRIPT&gt;&lt;/SCRIPT&gt; ignores tag-like text
>!     * within quotes.
>!     */
>! 	public CompositeTagScanner(
>! 		String filter, 
>! 		String [] nameOfTagToMatch, 
>! 		String [] tagEnders, 
>! 		String [] endTagEnders,
>! 		boolean allowSelfChildren,
>!         boolean balance_quotes) {
>  		super(filter);
>  		this.nameOfTagToMatch = nameOfTagToMatch;
>  		this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>  		this.tagEnderSet = new HashSet();
>  		for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>  		CompositeTagScannerHelper helper = 
>! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>  		return helper.scan();
>  	}
>--- 179,183 ----
>  	public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>  		CompositeTagScannerHelper helper = 
>! 			new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>  		return helper.scan();
>  	}
>***************
>*** 193,196 ****
>  		return false;
>  	}
>- 
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java	19 May 2003 02:49:57 -0000	1.21
>--- ScriptScanner.java	24 May 2003 21:04:44 -0000	1.22
>***************
>*** 28,64 ****
>  
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>- 
>  public class ScriptScanner extends CompositeTagScanner {
>- 	private static final String SCRIPT_END_TAG = "</SCRIPT>";
>  	private static final String MATCH_NAME [] = {"SCRIPT"};
>  	private static final String ENDERS [] = {"BODY", "HTML"};
>  	public ScriptScanner() {
>! 		super("",MATCH_NAME,ENDERS);
>  	}
>  
>  	public ScriptScanner(String filter) {
>! 		super(filter,MATCH_NAME,ENDERS);
>  	}
>  
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>! 		super(filter,nameOfTagToMatch,ENDERS);
>  	}
>! 	
>  	public String [] getID() {
>  		return MATCH_NAME;
>--- 28,59 ----
>  
>  package org.htmlparser.scanners;
>! 
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! 
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>  	private static final String MATCH_NAME [] = {"SCRIPT"};
>  	private static final String ENDERS [] = {"BODY", "HTML"};
>  	public ScriptScanner() {
>! 		this("");
>  	}
>  
>  	public ScriptScanner(String filter) {
>! 		this(filter,MATCH_NAME,ENDERS);
>  	}
>  
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>! 		this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>  	}
>! 
>! 	public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>! 		super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>! 	}
>! 
>  	public String [] getID() {
>  		return MATCH_NAME;
>***************
>*** 70,205 ****
>  		return new ScriptTag(tagData,compositeTagData);
>  	}
>- 
>- 	public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>- 		throws ParserException {
>- 		try {
>- 			int startLine = reader.getLastLineNumber();
>- 			String line = null;
>- 			StringBuffer scriptContents = 
>- 				new StringBuffer();
>- 			boolean endTagFound = false;
>- 			Tag startTag = tag;
>- 			Tag endTag = null;
>- 			line = currLine;
>- 			boolean sameLine = true;
>- 			int startingPos = startTag.elementEnd();
>- 			do {
>- 				int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>- 				while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>- 					startingPos = endTagLoc+getEndTag().length();
>- 					endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos); 	
>- 				}
>- 				 
>- 				if (endTagLoc!=-1) {
>- 					endTagFound = true;
>- 					endTag = (EndTag)EndTag.find(line,endTagLoc);
>- 					if (sameLine) 
>- 						scriptContents.append(
>- 							getCodeBetweenStartAndEndTags(
>- 								line,
>- 								startTag,
>- 								endTagLoc)
>- 						);
>- 					else {
>- 						scriptContents.append(Node.getLineSeparator());
>- 						scriptContents.append(line.substring(0,endTagLoc));
>- 					}
>- 					
>- 					reader.setPosInLine(endTag.elementEnd());
>- 				} else {
>- 					if (sameLine) 
>- 						scriptContents.append(
>- 							line.substring(
>- 								startTag.elementEnd()+1
>- 							)
>- 						);
>- 					else {
>- 						scriptContents.append(Node.getLineSeparator());
>- 						scriptContents.append(line);
>- 					}
>- 				}
>- 				if (!endTagFound) {
>- 					line = reader.getNextLine();
>- 					startingPos = 0;
>- 				}
>- 				if (sameLine) 
>- 					sameLine = false;
>- 			}
>- 			while (line!=null && !endTagFound);
>- 			if (endTag == null) {
>- 				// If end tag doesn't exist, create one
>- 				String endTagName = tag.getTagName();
>- 				int endTagBegin = reader.getLastReadPosition()+1 ;
>- 				int endTagEnd = endTagBegin + endTagName.length() + 2; 
>- 				endTag = new EndTag(
>- 					new TagData(
>- 						endTagBegin,
>- 						endTagEnd,
>- 						endTagName,
>- 						currLine
>- 					)
>- 				);
>- 			}
>- 			NodeList childrenNodeList = new NodeList();
>- 			childrenNodeList.add(
>- 				new StringNode(
>- 					scriptContents,
>- 					startTag.elementEnd(),
>- 					endTag.elementBegin()-1
>- 				)
>- 			);
>- 			return createTag(
>- 				new TagData(
>- 					startTag.elementBegin(),
>- 					endTag.elementEnd(),
>- 					startLine,
>- 					reader.getLastLineNumber(),
>- 					startTag.getText(),
>- 					currLine,
>- 					url,
>- 					false
>- 				), new CompositeTagData(
>- 					startTag,endTag,childrenNodeList
>- 				)
>- 			);
>- 			
>- 		}
>- 		catch (Exception e) {
>- 			throw new ParserException("Error in ScriptScanner: ",e);
>- 		}
>- 	}
>- 
>- 	public String getCodeBetweenStartAndEndTags(
>- 		String line,
>- 		Tag startTag,
>- 		int endTagLoc) throws ParserException {
>- 		try {
>- 			
>- 			return line.substring(
>- 				startTag.elementEnd()+1,
>- 				endTagLoc
>- 			);
>- 		}
>- 		catch (Exception e) {
>- 			StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>- 			msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>- 			msg.append("substring ends at: "+(endTagLoc));
>- 			throw new ParserException(msg.toString(),e);
>- 		}
>- 	}
>- 
>- 	/**
>- 	 * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>- 	 * <code>ScriptScanner</code> you should override this method.
>- 	 * @return String containing the end tag to search for, i.e. &lt;/SCRIPT&gt;
>- 	 */ 
>- 	public String getEndTag() {
>- 		return SCRIPT_END_TAG;
>- 	}
>- 	
>- 	private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>- 		if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>- 		return line.charAt(endTagLoc+getEndTag().length())=='"';
>- 	}
>- 
>  }
>--- 65,67 ----
>
>
>
>
>
