Thread: RE: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagS
From: Marc N. <ma...@ke...> - 2003-05-27 22:55:27
Sure, I'll see if I can fix it.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
> CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>      private Set tagEnderSet;
>      private Set endTagEnderSet;
>+     private boolean balance_quotes;
>
>      public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>--- 126,130 ----
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>***************
>*** 131,138 ****
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!     /**
>!      * Constructor specifying all member fields.
>!      * @param filter A string that is used to match which tags are to be allowed
>!      * to pass through. This can be useful when one wishes to dynamically filter
>!      * out all tags except one type which may be programmed later than the parser.
>!      * @param nameOfTagToMatch The tag names recognized by this scanner.
>!      * @param tagEnders The non-endtag tag names which signal that no closing
>!      * end tag was found. For example, encountering <FORM> while
>!      * scanning a <A> link tag would mean that no </A> was found
>!      * and needs to be corrected.
>!      * @param endTagEnders The endtag names which signal that no closing end
>!      * tag was found. For example, encountering </HTML> while
>!      * scanning a <BODY> tag would mean that no </BODY> was found
>!      * and needs to be corrected. These items are not prefixed by a '/'.
>!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!      * allowed within this tag. Used to determine when an endtag is missing.
>!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>!      * within quotes.
>!      */
>!     public CompositeTagScanner(
>!         String filter,
>!         String [] nameOfTagToMatch,
>!         String [] tagEnders,
>!         String [] endTagEnders,
>!         boolean allowSelfChildren,
>!         boolean balance_quotes) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>          return helper.scan();
>      }
>--- 179,183 ----
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>          return helper.scan();
>      }
>***************
>*** 193,196 ****
>          return false;
>      }
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         super("",MATCH_NAME,ENDERS);
>      }
>
>      public ScriptScanner(String filter) {
>!         super(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>!         super(filter,nameOfTagToMatch,ENDERS);
>      }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         this("");
>      }
>
>      public ScriptScanner(String filter) {
>!         this(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>      }
>!
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>!     }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>***************
>*** 70,205 ****
>          return new ScriptTag(tagData,compositeTagData);
>      }
>-
>-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>-         throws ParserException {
>-         try {
>-             int startLine = reader.getLastLineNumber();
>-             String line = null;
>-             StringBuffer scriptContents =
>-                 new StringBuffer();
>-             boolean endTagFound = false;
>-             Tag startTag = tag;
>-             Tag endTag = null;
>-             line = currLine;
>-             boolean sameLine = true;
>-             int startingPos = startTag.elementEnd();
>-             do {
>-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>-                     startingPos = endTagLoc+getEndTag().length();
>-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>-                 }
>-
>-                 if (endTagLoc!=-1) {
>-                     endTagFound = true;
>-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             getCodeBetweenStartAndEndTags(
>-                                 line,
>-                                 startTag,
>-                                 endTagLoc)
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line.substring(0,endTagLoc));
>-                     }
>-
>-                     reader.setPosInLine(endTag.elementEnd());
>-                 } else {
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             line.substring(
>-                                 startTag.elementEnd()+1
>-                             )
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line);
>-                     }
>-                 }
>-                 if (!endTagFound) {
>-                     line = reader.getNextLine();
>-                     startingPos = 0;
>-                 }
>-                 if (sameLine)
>-                     sameLine = false;
>-             }
>-             while (line!=null && !endTagFound);
>-             if (endTag == null) {
>-                 // If end tag doesn't exist, create one
>-                 String endTagName = tag.getTagName();
>-                 int endTagBegin = reader.getLastReadPosition()+1 ;
>-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
>-                 endTag = new EndTag(
>-                     new TagData(
>-                         endTagBegin,
>-                         endTagEnd,
>-                         endTagName,
>-                         currLine
>-                     )
>-                 );
>-             }
>-             NodeList childrenNodeList = new NodeList();
>-             childrenNodeList.add(
>-                 new StringNode(
>-                     scriptContents,
>-                     startTag.elementEnd(),
>-                     endTag.elementBegin()-1
>-                 )
>-             );
>-             return createTag(
>-                 new TagData(
>-                     startTag.elementBegin(),
>-                     endTag.elementEnd(),
>-                     startLine,
>-                     reader.getLastLineNumber(),
>-                     startTag.getText(),
>-                     currLine,
>-                     url,
>-                     false
>-                 ), new CompositeTagData(
>-                     startTag,endTag,childrenNodeList
>-                 )
>-             );
>-
>-         }
>-         catch (Exception e) {
>-             throw new ParserException("Error in ScriptScanner: ",e);
>-         }
>-     }
>-
>-     public String getCodeBetweenStartAndEndTags(
>-         String line,
>-         Tag startTag,
>-         int endTagLoc) throws ParserException {
>-         try {
>-
>-             return line.substring(
>-                 startTag.elementEnd()+1,
>-                 endTagLoc
>-             );
>-         }
>-         catch (Exception e) {
>-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>-             msg.append("substring ends at: "+(endTagLoc));
>-             throw new ParserException(msg.toString(),e);
>-         }
>-     }
>-
>-     /**
>-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>-      * <code>ScriptScanner</code> you should override this method.
>-      * @return String containing the end tag to search for, i.e. </SCRIPT>
>-      */
>-     public String getEndTag() {
>-         return SCRIPT_END_TAG;
>-     }
>-
>-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>-         return line.charAt(endTagLoc+getEndTag().length())=='"';
>-     }
>-
>  }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software. http://www.objectstore.net/sourceforge
>_______________________________________________
>Htmlparser-cvs mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
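For readers following the thread, the quote-balancing idea described in the commit's javadoc (tag-like text inside quotes is ignored) can be sketched roughly as below. This is only an illustration under stated assumptions: the class QuoteAwareSketch and the method findEndTagOutsideQuotes are hypothetical names, not part of htmlparser, and the real StringScanner may work quite differently.

```java
/**
 * Hypothetical illustration only; not htmlparser code.
 * Finds an end tag while skipping anything inside single or double quotes.
 */
final class QuoteAwareSketch {

    /**
     * Returns the index of the first case-insensitive occurrence of endTag
     * at or after 'from' that is not inside a quoted string, or -1 if none.
     */
    static int findEndTagOutsideQuotes(String text, String endTag, int from) {
        char quote = 0; // 0 means "not currently inside a quoted string"
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote found
            } else if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;           // end tag outside of any quotes
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "document.write(\"</script>\");</SCRIPT>";
        System.out.println(findEndTagOutsideQuotes(text, "</script>", 0));
    }
}
```

A scan of this kind knows nothing about Javascript comments, so an apostrophe or a tag inside a comment can still confuse it, which is the sort of case raised earlier in the message.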
From: Marc N. <ma...@ke...> - 2003-05-28 00:30:59
I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.

Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).

The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.

Marc
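The "leave everything as-is" behaviour described in the message above (search only for a case-insensitive match of the end tag, never look for "<", never defer to other scanners) might look something like the following. It is a minimal sketch, not the old ScriptScanner: RawContentSketch, indexOfIgnoreCase and rawContents are hypothetical names, the input is treated as one string rather than read line by line through a NodeReader, and the old scanner's special case for an end tag followed by a double quote is left out.

```java
/**
 * Hypothetical illustration only; not htmlparser code.
 * Extracts tag contents verbatim by searching for the end tag alone.
 */
final class RawContentSketch {

    /** Case-insensitive indexOf that never changes the length of the text. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns everything between 'from' and the first occurrence of endTag,
     * untouched, or null when no end tag is present.
     */
    static String rawContents(String text, int from, String endTag) {
        int end = indexOfIgnoreCase(text, endTag, from);
        return end < 0 ? null : text.substring(from, end);
    }

    public static void main(String[] args) {
        String page = "<script>var s = \"<b>x</b>\"; // <i>note\n</ScRiPt>tail";
        System.out.println(rawContents(page, "<script>".length(), "</script>"));
    }
}
```

Because the contents are never tokenised, XML or comment text inside the tag comes back byte for byte; the trade-off is that a literal end tag inside a quoted string ends the scan early, which is the case bug #741769 was about.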
From: Derrick O. <Der...@ro...> - 2003-05-28 01:34:33
You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.

I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag, and I thought the two scanners were very close then (apart from the tags-in-quotes thing). My only excuse is that it passed all the unit tests. Well, to be truthful, I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.

The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?

Marc Novakowski wrote:

>I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
>
>Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
>
>The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
>
>Marc
>
>-----Original Message-----
>From: Derrick Oswald [mailto:Der...@ro...]
>Sent: Tuesday, May 27, 2003 2:39 PM
>To: htm...@li...
>Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
>htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>Marc,
>
>The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
>or remarks.
>I guess the text scanner goes until it sees a <x... and then stops to
>defer to a tag scanner. I hadn't thought about those in comments, or
>about the \ end of lines.
>
>Perhaps, rather than write a new scanner, fix the StringScanner (the
>remark scanner should be OK), so that it does the correct behaviour when
>balance_quotes is true. Then the 'balance_quotes' flag could be called
>'strict_script' or something.
>
>Derrick
>
>Marc Novakowski wrote:
> |
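For readers following the thread, the behaviour Marc attributes to the old ScriptScanner (no tag scanning, just a case-insensitive search for the literal end tag) can be sketched roughly as below. This is not the htmlparser code: the class `RawContentExtractor`, the method `extract`, and the `contentStart` parameter are hypothetical names chosen for illustration, and the sketch assumes the end tag is written without internal whitespace.

```java
/** Hypothetical helper for illustration only; not part of htmlparser. */
final class RawContentExtractor {

    private RawContentExtractor() {
    }

    /**
     * Returns the characters between contentStart and the first
     * case-insensitive occurrence of "</tagName>", completely unmodified.
     * Nothing in between is interpreted: no "<" handling, no quote
     * balancing, no deferring to other scanners.
     * Returns null if the end tag never appears.
     */
    static String extract(String page, int contentStart, String tagName) {
        String endTag = "</" + tagName + ">";
        int len = endTag.length();
        for (int i = contentStart; i + len <= page.length(); i++) {
            // regionMatches with ignoreCase=true compares without copying
            // or case-converting the page text itself.
            if (page.regionMatches(true, i, endTag, 0, len)) {
                return page.substring(contentStart, i);
            }
        }
        return null;
    }
}
```

Because the content is returned as a plain substring of the input, tags inside comments, arbitrary XML, and backslash line continuations all come back exactly as written.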
From: Marc N. <ma...@ke...> - 2003-05-28 14:59:52
|
Derrick, if it's anybody's fault that my code is failing because of your change, it's mine. I should have checked in specific test cases that exercise my usage of the library. I apologise for not doing that earlier...

Here are the main things that the new ScriptScanner does that break my code:
1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
2) uppercases and auto-closes tags that aren't in quotes

I have some specific test cases that demonstrate these. I'll check them in if you'd like. I have to admit that after playing with the internals of NodeReader, TagScanner, etc., I'm not 100% clear on how some of this low-level scanning code works. Nor is it always clear from reading the code. That's why I am not confident that I will be able to refactor the existing code to handle my specific problems.

I realize my usage of the parser may be quite different from that of 95% of the people who use the library, so if there isn't a solution that fits into the existing architecture I'll be happy to just make some local changes to fix things. I can always make my own scanner and not check it into the codeline (or just copy the old version of ScriptScanner into my code). However, if I'm running into this now, chances are somebody in the future will, also.

Marc |
From: Derrick O. <Der...@ro...> - 2003-05-28 22:32:24
|
Marc,

I've been thinking about your problem and I think I have a solution. I'll re-write the node reader.

OK, that's the bottom line, but I've said before that the lowest level should return a contiguous stream of nodes that have the original characters (not case converted) and include the formatting, like line endings and other whitespace, so that toHtml() gives you the exact same page that you started with.

I should make a picture, but see if you can follow me here.

The lowest level is a byte stream, right off the wire. This needs to support mark and reset in case the character set changes.

The second level is a character stream, after applying the decoding for a particular charset.

The third level is a string, which is a char array. The chars are copied from the second level, so that can be discarded, but only after the entire stream has been drained. If we want to do threaded access to the socket to provide for parallel parsing while reading, the characters need to be kept around to create whole new strings.

The fourth level is a stream of tags. Instead of keeping substrings, though, the tags just keep character positions, start and end, within the entire page, like a cursor, and a pointer to a new 'Page' object. That way, as the Page reads more bytes from the stream, it accumulates more characters, which make a bigger string that represents the page read so far, and there's nothing preventing the older strings from being garbage collected.

The upper case thing goes away since the tags point to the original characters via their offsets. The end of line thing goes away because the reader just treats a newline as any other whitespace.

So what you have after a parse is a single (very large) string with a parallel stream of tag objects with a whole bunch of cursors pointing into the string.

I've experimented with reading all the characters up front and that breaks 67 test cases. If you erroneously substitute "\n" for "\r\n" (or vice versa) there are only 47 failed cases left. The reset on character set change test case is one of them. If you erroneously consume newlines at the front of string nodes, the number of failing tests is only 33. And if you erroneously return no string nodes if that consumption leaves nothing left in the string, there are only 15 failing cases. These would have to be examined in detail for correctness, according to the HTML spec.

So it's doable. I just have to find the time. For now just include the entire original ScriptScanner.scan() code in a base class for your script scanners so that the evil CompositeTagScanner.scan() is overridden.

Derrick

Marc wrote:

>Here are the main things that the new ScriptScanner does that break my code:
>1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
>2) uppercases and auto-closes tags that aren't in quotes
> |
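The "fourth level" Derrick describes, where nodes hold only start and end offsets plus a pointer to a page object, might look something like the following minimal sketch. It is an assumption-laden illustration and not the design that was actually implemented: the class names `Page` and `NodeSpan` and all of their methods are hypothetical, and byte decoding and charset resets are left out entirely.

```java
/** Hypothetical sketch of the accumulating page text; not an htmlparser class. */
final class Page {
    private final StringBuilder text = new StringBuilder();

    /** Appends newly decoded characters; offsets handed out earlier stay valid. */
    void append(CharSequence chars) {
        text.append(chars);
    }

    /** Returns the original characters in [start, end), with no case or newline changes. */
    String getText(int start, int end) {
        return text.substring(start, end);
    }

    int length() {
        return text.length();
    }
}

/** Hypothetical node that stores a cursor into the page instead of its own substring. */
final class NodeSpan {
    private final Page page;
    private final int start;
    private final int end;

    NodeSpan(Page page, int start, int end) {
        this.page = page;
        this.start = start;
        this.end = end;
    }

    /** Reproduces exactly what was read, because nothing was copied or normalized. */
    String toHtml() {
        return page.getText(start, end);
    }
}
```

With this shape, concatenating `toHtml()` over a contiguous run of spans gives back the input unchanged, which is the round-trip property asked for earlier in the thread; mixed-case tag names and "\r\n" line endings survive because the spans never hold a converted copy.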
From: Marc N. <ma...@ke...> - 2003-05-28 22:44:45
|
Derrick,

I like your ideas, and I think that your suggested refactoring would make the lower-level code in htmlparser much less mysterious and (hopefully) easier to maintain and extend.

Marc |