Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: Derrick O. <Der...@ro...> - 2003-05-27 21:46:44
Marc,
The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
or remarks.
I guess the text scanner runs until it sees a <x... and then stops to
defer to a tag scanner. I hadn't thought about tags inside comments, or
about lines that end with a \ continuation.
Perhaps, rather than writing a new scanner, we could fix the StringScanner
(the remark scanner should be OK) so that it behaves correctly when
balance_quotes is true. The 'balance_quotes' flag could then be renamed
'strict_script' or something.
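
To make the two options concrete, here is a rough sketch, not the actual StringScanner or ScriptScanner code. The class and method names (ScriptEndSketch, findScriptEnd, findScriptEndRaw) are made up for illustration. The first method looks for the closing tag while skipping quoted strings, comments and backslash escapes; the second is the bare-bones "first end tag wins" search. Both assume the whole script text is already in one String, and the quote-aware one does not try to understand regular expression literals.

```java
final class ScriptEndSketch {

    private static final String END_TAG = "</script";

    /**
     * Quote-aware search (hypothetical helper): index of the first
     * "</script" outside string literals and comments, or -1 if none.
     */
    static int findScriptEnd(String text, int start) {
        char quote = 0;               // 0 means "not inside a string literal"
        boolean lineComment = false;  // inside a // comment
        boolean blockComment = false; // inside a /* */ comment
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (lineComment) {
                if (c == '\n') lineComment = false;
            } else if (blockComment) {
                if (text.startsWith("*/", i)) { blockComment = false; i++; }
            } else if (quote != 0) {
                if (c == '\\') i++;   // skip the escaped char, e.g. a \ before a newline
                else if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (text.startsWith("//", i)) {
                lineComment = true; i++;
            } else if (text.startsWith("/*", i)) {
                blockComment = true; i++;
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Bare-bones search (hypothetical helper): index of the first
     * "</script", with no knowledge of quotes or comments, or -1 if none.
     */
    static int findScriptEndRaw(String text, int start) {
        for (int i = Math.max(start, 0); i + END_TAG.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) return i;
        }
        return -1;
    }
}
```

With input such as document.write("</script>"); x=1;</script> the quote-aware search stops at the second end tag while the raw search stops at the one inside the string. In both cases everything before the returned index would be kept untouched as the script text.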
Derrick
Marc Novakowski wrote:
>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc.), for example when there is an HTML tag in a JavaScript comment. Also, using "\" to concatenate lines (which is valid in JavaScript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where JavaScript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
> CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether string nodes are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
> private Set tagEnderSet;
> private Set endTagEnderSet;
>+ private boolean balance_quotes;
>
> public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
> this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
> }
>!
> public CompositeTagScanner(
> String filter,
>--- 126,130 ----
> this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
> }
>!
> public CompositeTagScanner(
> String filter,
>***************
>*** 131,138 ****
> String [] tagEnders,
> String [] endTagEnders,
>! boolean allowSelfChildren) {
> super(filter);
> this.nameOfTagToMatch = nameOfTagToMatch;
> this.allowSelfChildren = allowSelfChildren;
> this.tagEnderSet = new HashSet();
> for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
> String [] tagEnders,
> String [] endTagEnders,
>! boolean allowSelfChildren)
>! {
>! this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>! }
>!
>! /**
>! * Constructor specifying all member fields.
>! * @param filter A string that is used to match which tags are to be allowed
>! * to pass through. This can be useful when one wishes to dynamically filter
>! * out all tags except one type which may be programmed later than the parser.
>! * @param nameOfTagToMatch The tag names recognized by this scanner.
>! * @param tagEnders The non-endtag tag names which signal that no closing
>! * end tag was found. For example, encountering <FORM> while
>! * scanning a <A> link tag would mean that no </A> was found
>! * and needs to be corrected.
>! * @param endTagEnders The endtag names which signal that no closing end
>! * tag was found. For example, encountering </HTML> while
>! * scanning a <BODY> tag would mean that no </BODY> was found
>! * and needs to be corrected. These items are not prefixed by a '/'.
>! * @param allowSelfChildren If <code>true</code> a tag of the same name is
>! * allowed within this tag. Used to determine when an endtag is missing.
>! * @param balance_quotes <code>true</code> if scanning string nodes needs to
>! * honour quotes. For example, ScriptScanner defines this <code>true</code>
>! * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>! * within quotes.
>! */
>! public CompositeTagScanner(
>! String filter,
>! String [] nameOfTagToMatch,
>! String [] tagEnders,
>! String [] endTagEnders,
>! boolean allowSelfChildren,
>! boolean balance_quotes) {
> super(filter);
> this.nameOfTagToMatch = nameOfTagToMatch;
> this.allowSelfChildren = allowSelfChildren;
>+ this.balance_quotes = balance_quotes;
> this.tagEnderSet = new HashSet();
> for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
> public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
> CompositeTagScannerHelper helper =
>! new CompositeTagScannerHelper(this,tag,url,reader,currLine);
> return helper.scan();
> }
>--- 179,183 ----
> public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
> CompositeTagScannerHelper helper =
>! new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
> return helper.scan();
> }
>***************
>*** 193,196 ****
> return false;
> }
>-
> }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
> package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
> import org.htmlparser.tags.ScriptTag;
> import org.htmlparser.tags.Tag;
> import org.htmlparser.tags.data.CompositeTagData;
> import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
> /**
> * The HTMLScriptScanner identifies javascript code
> */
>-
> public class ScriptScanner extends CompositeTagScanner {
>- private static final String SCRIPT_END_TAG = "</SCRIPT>";
> private static final String MATCH_NAME [] = {"SCRIPT"};
> private static final String ENDERS [] = {"BODY", "HTML"};
> public ScriptScanner() {
>! super("",MATCH_NAME,ENDERS);
> }
>
> public ScriptScanner(String filter) {
>! super(filter,MATCH_NAME,ENDERS);
> }
>
>! public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>! super(filter,nameOfTagToMatch,ENDERS);
> }
>!
> public String [] getID() {
> return MATCH_NAME;
>--- 28,59 ----
>
> package org.htmlparser.scanners;
>!
> import org.htmlparser.tags.ScriptTag;
> import org.htmlparser.tags.Tag;
> import org.htmlparser.tags.data.CompositeTagData;
> import org.htmlparser.tags.data.TagData;
>!
> /**
> * The HTMLScriptScanner identifies javascript code
> */
> public class ScriptScanner extends CompositeTagScanner {
> private static final String MATCH_NAME [] = {"SCRIPT"};
> private static final String ENDERS [] = {"BODY", "HTML"};
> public ScriptScanner() {
>! this("");
> }
>
> public ScriptScanner(String filter) {
>! this(filter,MATCH_NAME,ENDERS);
> }
>
>! public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>! this(filter,nameOfTagToMatch,enders, new String[0], true, true);
> }
>!
>! public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>! super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>! }
>!
> public String [] getID() {
> return MATCH_NAME;
>***************
>*** 70,205 ****
> return new ScriptTag(tagData,compositeTagData);
> }
>-
>- public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>- throws ParserException {
>- try {
>- int startLine = reader.getLastLineNumber();
>- String line = null;
>- StringBuffer scriptContents =
>- new StringBuffer();
>- boolean endTagFound = false;
>- Tag startTag = tag;
>- Tag endTag = null;
>- line = currLine;
>- boolean sameLine = true;
>- int startingPos = startTag.elementEnd();
>- do {
>- int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>- while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>- startingPos = endTagLoc+getEndTag().length();
>- endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>- }
>-
>- if (endTagLoc!=-1) {
>- endTagFound = true;
>- endTag = (EndTag)EndTag.find(line,endTagLoc);
>- if (sameLine)
>- scriptContents.append(
>- getCodeBetweenStartAndEndTags(
>- line,
>- startTag,
>- endTagLoc)
>- );
>- else {
>- scriptContents.append(Node.getLineSeparator());
>- scriptContents.append(line.substring(0,endTagLoc));
>- }
>-
>- reader.setPosInLine(endTag.elementEnd());
>- } else {
>- if (sameLine)
>- scriptContents.append(
>- line.substring(
>- startTag.elementEnd()+1
>- )
>- );
>- else {
>- scriptContents.append(Node.getLineSeparator());
>- scriptContents.append(line);
>- }
>- }
>- if (!endTagFound) {
>- line = reader.getNextLine();
>- startingPos = 0;
>- }
>- if (sameLine)
>- sameLine = false;
>- }
>- while (line!=null && !endTagFound);
>- if (endTag == null) {
>- // If end tag doesn't exist, create one
>- String endTagName = tag.getTagName();
>- int endTagBegin = reader.getLastReadPosition()+1 ;
>- int endTagEnd = endTagBegin + endTagName.length() + 2;
>- endTag = new EndTag(
>- new TagData(
>- endTagBegin,
>- endTagEnd,
>- endTagName,
>- currLine
>- )
>- );
>- }
>- NodeList childrenNodeList = new NodeList();
>- childrenNodeList.add(
>- new StringNode(
>- scriptContents,
>- startTag.elementEnd(),
>- endTag.elementBegin()-1
>- )
>- );
>- return createTag(
>- new TagData(
>- startTag.elementBegin(),
>- endTag.elementEnd(),
>- startLine,
>- reader.getLastLineNumber(),
>- startTag.getText(),
>- currLine,
>- url,
>- false
>- ), new CompositeTagData(
>- startTag,endTag,childrenNodeList
>- )
>- );
>-
>- }
>- catch (Exception e) {
>- throw new ParserException("Error in ScriptScanner: ",e);
>- }
>- }
>-
>- public String getCodeBetweenStartAndEndTags(
>- String line,
>- Tag startTag,
>- int endTagLoc) throws ParserException {
>- try {
>-
>- return line.substring(
>- startTag.elementEnd()+1,
>- endTagLoc
>- );
>- }
>- catch (Exception e) {
>- StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>- msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>- msg.append("substring ends at: "+(endTagLoc));
>- throw new ParserException(msg.toString(),e);
>- }
>- }
>-
>- /**
>- * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>- * <code>ScriptScanner</code> you should override this method.
>- * @return String containing the end tag to search for, i.e. </SCRIPT>
>- */
>- public String getEndTag() {
>- return SCRIPT_END_TAG;
>- }
>-
>- private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>- if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>- return line.charAt(endTagLoc+getEndTag().length())=='"';
>- }
>-
> }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software. http://www.objectstore.net/sourceforge
>_______________________________________________
>Htmlparser-cvs mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
>