htmlparser-developer Mailing List for HTML Parser (Page 10)
Brought to you by:
derrickoswald
From: <dha...@po...> - 2003-05-28 05:23:17
Marc, I agree with Derrick. Let's correct the existing scanner rather than write something new, since it typically gets confusing for users to know which one to use and how the two scanners differ. It takes a lot of experience with the parser to understand the subtle difference between the two.

> -----Original Message-----
> From: htm...@li... [mailto:htm...@li...] On Behalf Of Der...@ro...
> Sent: Wednesday, May 28, 2003 3:09 AM
> To: htm...@li...
> Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
> Marc,
>
> The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
> I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.
>
> Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.
>
> Derrick
>
> Marc Novakowski wrote:
>
> >Derrick,
> >
> >I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
> >
> >However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
> >
> >I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
> >
> >Marc
> >
> >-----Original Message-----
> >From: der...@us...
> >[mailto:der...@us...]
> >Sent: Saturday, May 24, 2003 2:05 PM
> >To: htm...@li...
> >Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
> >CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
> >
> >
> >Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
> >In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
> >
> >Modified Files:
> >    CompositeTagScanner.java ScriptScanner.java
> >Log Message:
> >Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
> >Major overhaul of ScriptScanner.
> >It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
> >CompositeTagScanner now has a balance_quotes member field that dictates
> >whether strings tags are scanned honouring single and double quotes.
> >This affected the call chain through NodeReader and StringScanner which
> >now have this parameter.
> >StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
> >it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
> >and the member field would have been tromped.
> >Sorry to all those who have broken code because of this, but it's for the better. Really.
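The quote-balancing idea behind the `balance_quotes` flag (the fix for bug #741769 quoted above) can be illustrated with a small standalone sketch. It is not the htmlparser implementation: the class `QuoteAwareEndTagFinder` and its method are hypothetical names, and the sketch works on one string rather than on the NodeReader's lines. It skips any end tag that sits inside a single- or double-quoted string. As Marc's report suggests, this alone does not cover tag-like text in Javascript comments, and an unbalanced apostrophe in a comment would confuse it.

```java
class QuoteAwareEndTagFinder {
    /**
     * Returns the index of the first case-insensitive occurrence of endTag
     * that is not inside a single- or double-quoted string, or -1 if there
     * is none. Comments are NOT recognised (assumption: quotes only).
     */
    static int findEndTag(String text, String endTag) {
        char quote = 0; // 0 means "not inside a quoted string"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the character escaped by the backslash
                } else if (c == quote) {
                    quote = 0; // closing quote found
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote found
            } else if (text.regionMatches(true, i, endTag, 0, endTag.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String script = "document.write(\"</script>\"); var x = 1;</SCRIPT>";
        // Prints the position of the real end tag, not the quoted one.
        System.out.println(findEndTag(script, "</script>"));
    }
}
```

A scanner built this way still inspects the script text character by character, which is why content that is not Javascript (such as the arbitrary XML Marc mentions) may need a different approach.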
From: Derrick O. <Der...@ro...> - 2003-05-28 01:34:33
You may need to back out the change, or at a minimum get the old code by going back a version and putting it in your ScriptScanner base class.

I guess I screwed up. I saw your drop that allowed all the lines to be accumulated in a tag, and I thought the two scanners were very close then (apart from the tags-in-quotes thing). My only excuse is that it passed all the unit tests. Well, to be truthful, I changed two of the tests, but it was only extraneous newline stuff at the start and end of text.

The script scanner is breaking your code because of uppercasing tags (not just within comments) and removing newlines after \, right?

Marc Novakowski wrote:

>I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.
>
>Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).
>
>The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.
>
>Marc
>
>-----Original Message-----
>From: Derrick Oswald [mailto:Der...@ro...]
>Sent: Tuesday, May 27, 2003 2:39 PM
>To: htm...@li...
>Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
>htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Marc,
>
>The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
>or remarks.
>I guess the text scanner goes until it sees a <x... and then stops to
>defer to a tag scanner. I hadn't thought about those in comments, or
>about the \ end of lines.
>
>Perhaps, rather than write a new scanner, fix the StringScanner (the
>remark scanner should be OK), so that it does the correct behaviour when
>balance_quotes is true. Then the 'balance_quotes' flag could be called
>'strict_script' or something.
>
>Derrick
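The old ScriptScanner behaviour Marc describes above (match the end tag as a plain case-insensitive string and leave everything before it untouched) could be sketched roughly as follows. This is a standalone illustration, not the htmlparser code: `RawTagContent`, `indexOfIgnoreCase` and `extract` are hypothetical names, and the sketch assumes the whole document is available as one string instead of being read line by line.

```java
class RawTagContent {
    /** Case-insensitive indexOf that never changes the length of the text. */
    static int indexOfIgnoreCase(String text, String target, int from) {
        for (int i = Math.max(from, 0); i + target.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, target, 0, target.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the text between contentStart (the index just past the start
     * tag) and the first case-insensitive match of endTag, copied verbatim.
     * If no end tag is found, everything after contentStart is returned.
     */
    static String extract(String html, int contentStart, String endTag) {
        int end = indexOfIgnoreCase(html, endTag, contentStart);
        return end < 0 ? html.substring(contentStart) : html.substring(contentStart, end);
    }

    public static void main(String[] args) {
        String html = "<script>var s = 'a' + \\\n'<b>x</b>'; // <p>\n</Script><p>after";
        // The lowercase <b> tag, the comment and the backslash continuation
        // all come back exactly as written.
        System.out.print(extract(html, "<script>".length(), "</SCRIPT>"));
    }
}
```

Because nothing between the tags is parsed, this kind of scan cannot tell a quoted `</script>` from the real one, which is the case bug #741769 was about.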
From: Marc N. <ma...@ke...> - 2003-05-28 00:30:59
I just realized that it's more complicated than that (for me, at least). In my application that uses htmlparser, I am extending certain scanners and tags (such as ScriptScanner but mostly CompositeTagScanner) to allow for "custom" tags in an HTML page. When the "HTML + custom tags" are run through my custom parser, the custom tags are converted into an object model which is then turned into dynamic javascript code.

Long story short: some of these custom tags (i.e. the ones that extend ScriptScanner) _absolutely_ need the inner contents of the tag to remain unchanged. Also, since it's not always Javascript that is inside of the tags, adding extra rules to ignore tags in comments or strings won't always work. For example, one tag allows for arbitrary XML innards. Currently, the scanner will UPPERCASE all tags inside unless they're in quotes (which messes up the XML).

The old ScriptScanner did exactly what I needed -- that is, it didn't scan for tags at all. It just looked for the exact (case-insensitive) string match of the end tag. It didn't look for "<" and it didn't defer to scanners. I took a look at the current code and I can't see any easy way to do this.

Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs]
htmlparser/src/org/htmlparser/scanners
CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text
or remarks.
I guess the text scanner goes until it sees a <x... and then stops to
defer to a tag scanner. I hadn't thought about those in comments, or
about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the
remark scanner should be OK), so that it does the correct behaviour when
balance_quotes is true. Then the 'balance_quotes' flag could be called
'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners
>CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22
>
>
>Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
>In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners
>
>Modified Files:
>    CompositeTagScanner.java ScriptScanner.java
>Log Message:
>Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
>Major overhaul of ScriptScanner.
>It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
>CompositeTagScanner now has a balance_quotes member field that dictates
>whether strings tags are scanned honouring single and double quotes.
>This affected the call chain through NodeReader and StringScanner which
>now have this parameter.
>StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed,
>it didn't work anyway since a single StringScanner is used recursively by the NodeReader,
>and the member field would have been tromped.
>Sorry to all those who have broken code because of this, but it's for the better. Really.
>
>
>
>Index: CompositeTagScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
>retrieving revision 1.52
>retrieving revision 1.53
>diff -C2 -d -r1.52 -r1.53
>*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
>--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
>***************
>*** 97,100 ****
>--- 97,101 ----
>      private Set tagEnderSet;
>      private Set endTagEnderSet;
>+     private boolean balance_quotes;
>
>      public CompositeTagScanner(String [] nameOfTagToMatch) {
>***************
>*** 125,129 ****
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>--- 126,130 ----
>          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
>      }
>!
>      public CompositeTagScanner(
>          String filter,
>***************
>*** 131,138 ****
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>--- 132,172 ----
>          String [] tagEnders,
>          String [] endTagEnders,
>!         boolean allowSelfChildren)
>!     {
>!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
>!     }
>!
>!     /**
>!      * Constructor specifying all member fields.
>!      * @param filter A string that is used to match which tags are to be allowed
>!      * to pass through. This can be useful when one wishes to dynamically filter
>!      * out all tags except one type which may be programmed later than the parser.
>!      * @param nameOfTagToMatch The tag names recognized by this scanner.
>!      * @param tagEnders The non-endtag tag names which signal that no closing
>!      * end tag was found. For example, encountering <FORM> while
>!      * scanning a <A> link tag would mean that no </A> was found
>!      * and needs to be corrected.
>!      * @param endTagEnders The endtag names which signal that no closing end
>!      * tag was found. For example, encountering </HTML> while
>!      * scanning a <BODY> tag would mean that no </BODY> was found
>!      * and needs to be corrected. These items are not prefixed by a '/'.
>!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
>!      * allowed within this tag. Used to determine when an endtag is missing.
>!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
>!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
>!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
>!      * within quotes.
>!      */
>!     public CompositeTagScanner(
>!         String filter,
>!         String [] nameOfTagToMatch,
>!         String [] tagEnders,
>!         String [] endTagEnders,
>!         boolean allowSelfChildren,
>!         boolean balance_quotes) {
>          super(filter);
>          this.nameOfTagToMatch = nameOfTagToMatch;
>          this.allowSelfChildren = allowSelfChildren;
>+         this.balance_quotes = balance_quotes;
>          this.tagEnderSet = new HashSet();
>          for (int i=0;i<tagEnders.length;i++)
>***************
>*** 145,149 ****
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
>          return helper.scan();
>      }
>--- 179,183 ----
>      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
>          CompositeTagScannerHelper helper =
>!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
>          return helper.scan();
>      }
>***************
>*** 193,196 ****
>          return false;
>      }
>-
>  }
>--- 227,229 ----
>
>Index: ScriptScanner.java
>===================================================================
>RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
>retrieving revision 1.21
>retrieving revision 1.22
>diff -C2 -d -r1.21 -r1.22
>*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
>--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
>***************
>*** 28,64 ****
>
>  package org.htmlparser.scanners;
>! /////////////////////////
>! // HTML Parser Imports //
>! /////////////////////////
>! import org.htmlparser.Node;
>! import org.htmlparser.NodeReader;
>! import org.htmlparser.StringNode;
>! import org.htmlparser.tags.EndTag;
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>! import org.htmlparser.util.NodeList;
>! import org.htmlparser.util.ParserException;
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>-
>  public class ScriptScanner extends CompositeTagScanner {
>-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         super("",MATCH_NAME,ENDERS);
>      }
>
>      public ScriptScanner(String filter) {
>!         super(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
>!         super(filter,nameOfTagToMatch,ENDERS);
>      }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>--- 28,59 ----
>
>  package org.htmlparser.scanners;
>!
>  import org.htmlparser.tags.ScriptTag;
>  import org.htmlparser.tags.Tag;
>  import org.htmlparser.tags.data.CompositeTagData;
>  import org.htmlparser.tags.data.TagData;
>!
>  /**
>   * The HTMLScriptScanner identifies javascript code
>   */
>  public class ScriptScanner extends CompositeTagScanner {
>      private static final String MATCH_NAME [] = {"SCRIPT"};
>      private static final String ENDERS [] = {"BODY", "HTML"};
>      public ScriptScanner() {
>!         this("");
>      }
>
>      public ScriptScanner(String filter) {
>!         this(filter,MATCH_NAME,ENDERS);
>      }
>
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
>!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
>      }
>!
>!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
>!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
>!     }
>!
>      public String [] getID() {
>          return MATCH_NAME;
>***************
>*** 70,205 ****
>          return new ScriptTag(tagData,compositeTagData);
>      }
>-
>-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
>-         throws ParserException {
>-         try {
>-             int startLine = reader.getLastLineNumber();
>-             String line = null;
>-             StringBuffer scriptContents =
>-                 new StringBuffer();
>-             boolean endTagFound = false;
>-             Tag startTag = tag;
>-             Tag endTag = null;
>-             line = currLine;
>-             boolean sameLine = true;
>-             int startingPos = startTag.elementEnd();
>-             do {
>-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
>-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
>-                     startingPos = endTagLoc+getEndTag().length();
>-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
>-                 }
>-
>-                 if (endTagLoc!=-1) {
>-                     endTagFound = true;
>-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             getCodeBetweenStartAndEndTags(
>-                                 line,
>-                                 startTag,
>-                                 endTagLoc)
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line.substring(0,endTagLoc));
>-                     }
>-
>-                     reader.setPosInLine(endTag.elementEnd());
>-                 } else {
>-                     if (sameLine)
>-                         scriptContents.append(
>-                             line.substring(
>-                                 startTag.elementEnd()+1
>-                             )
>-                         );
>-                     else {
>-                         scriptContents.append(Node.getLineSeparator());
>-                         scriptContents.append(line);
>-                     }
>-                 }
>-                 if (!endTagFound) {
>-                     line = reader.getNextLine();
>-                     startingPos = 0;
>-                 }
>-                 if (sameLine)
>-                     sameLine = false;
>-             }
>-             while (line!=null && !endTagFound);
>-             if (endTag == null) {
>-                 // If end tag doesn't exist, create one
>-                 String endTagName = tag.getTagName();
>-                 int endTagBegin = reader.getLastReadPosition()+1 ;
>-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
>-                 endTag = new EndTag(
>-                     new TagData(
>-                         endTagBegin,
>-                         endTagEnd,
>-                         endTagName,
>-                         currLine
>-                     )
>-                 );
>-             }
>-             NodeList childrenNodeList = new NodeList();
>-             childrenNodeList.add(
>-                 new StringNode(
>-                     scriptContents,
>-                     startTag.elementEnd(),
>-                     endTag.elementBegin()-1
>-                 )
>-             );
>-             return createTag(
>-                 new TagData(
>-                     startTag.elementBegin(),
>-                     endTag.elementEnd(),
>-                     startLine,
>-                     reader.getLastLineNumber(),
>-                     startTag.getText(),
>-                     currLine,
>-                     url,
>-                     false
>-                 ), new CompositeTagData(
>-                     startTag,endTag,childrenNodeList
>-                 )
>-             );
>-
>-         }
>-         catch (Exception e) {
>-             throw new ParserException("Error in ScriptScanner: ",e);
>-         }
>-     }
>-
>-     public String getCodeBetweenStartAndEndTags(
>-         String line,
>-         Tag startTag,
>-         int endTagLoc) throws ParserException {
>-         try {
>-
>-             return line.substring(
>-                 startTag.elementEnd()+1,
>-                 endTagLoc
>-             );
>-         }
>-         catch (Exception e) {
>-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
>-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
>-             msg.append("substring ends at: "+(endTagLoc));
>-             throw new ParserException(msg.toString(),e);
>-         }
>-     }
>-
>-     /**
>-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
>-      * <code>ScriptScanner</code> you should override this method.
>-      * @return String containing the end tag to search for, i.e. </SCRIPT>
>-      */
>-     public String getEndTag() {
>-         return SCRIPT_END_TAG;
>-     }
>-
>-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
>-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
>-         return line.charAt(endTagLoc+getEndTag().length())=='"';
>-     }
>-
>  }
>--- 65,67 ----
>
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: ObjectStore.
>If flattening out C++ or Java code to make your application fit in a
>relational database is painful, don't do it! Check out ObjectStore.
>Now part of Progress Software.
http://www.objectstore.net/sourceforge >_______________________________________________ >Htmlparser-cvs mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs > > >------------------------------------------------------- >This SF.net email is sponsored by: ObjectStore. >If flattening out C++ or Java code to make your application fit in a >relational database is painful, don't do it! Check out ObjectStore. >Now part of Progress Software. http://www.objectstore.net/sourceforge >_______________________________________________ >Htmlparser-developer mailing list >Htm...@li... >https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > =20 > ------------------------------------------------------- This SF.net email is sponsored by: ObjectStore. If flattening out C++ or Java code to make your application fit in a relational database is painful, don't do it! Check out ObjectStore. Now part of Progress Software. http://www.objectstore.net/sourceforge _______________________________________________ Htmlparser-developer mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
From: Marc N. <ma...@ke...> - 2003-05-27 22:55:27
|
Sure, I'll see if I can fix it.

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Tuesday, May 27, 2003 2:39 PM
To: htm...@li...
Subject: Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick |
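Derrick's suggestion above (have the text scanner honour quotes, and ideally comments and trailing backslashes, when balance_quotes is true) can be sketched outside the parser as a small standalone helper. This is only an illustration under stated assumptions: ScriptEndFinder, findEndTag and the two skip helpers are hypothetical names, not part of the htmlparser API, and the real StringScanner works line by line through NodeReader rather than on a single string. The sketch does not recognise regular-expression literals, and browsers generally end a script element at the first </script regardless of quoting, so this models the "strict" reading discussed in the thread rather than browser behaviour.

```java
/** Hypothetical helper for illustration; not part of the htmlparser API. */
final class ScriptEndFinder {
    private static final String END_TAG = "</script";

    private ScriptEndFinder() {}

    /**
     * Returns the index of the first "</script" (any case) that lies outside
     * JavaScript string literals and comments, or -1 if there is none.
     */
    static int findEndTag(String text, int from) {
        int n = text.length();
        int i = Math.max(from, 0);
        while (i < n) {
            char c = text.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipString(text, i);            // tag-like text in quotes is ignored
            } else if (text.startsWith("//", i)) {
                i = skipLineComment(text, i);       // so is text in a line comment
            } else if (text.startsWith("/*", i)) {
                int close = text.indexOf("*/", i + 2);
                i = close < 0 ? n : close + 2;      // and in a block comment
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }

    /** Skips a quoted literal starting at 'start'; returns the index just past it. */
    private static int skipString(String text, int start) {
        char quote = text.charAt(start);
        int n = text.length();
        int i = start + 1;
        while (i < n) {
            char c = text.charAt(i);
            if (c == '\\') {
                // Skip the escaped character; a backslash before a line break
                // continues the literal onto the next line.
                boolean crlf = i + 2 < n && text.charAt(i + 1) == '\r' && text.charAt(i + 2) == '\n';
                i += crlf ? 3 : 2;
            } else if (c == quote) {
                return i + 1;
            } else if (c == '\n' || c == '\r') {
                return i;                           // unterminated literal ends at the line break
            } else {
                i++;
            }
        }
        return n;
    }

    /** Skips to the end of the current line. */
    private static int skipLineComment(String text, int start) {
        int i = start;
        while (i < text.length() && text.charAt(i) != '\n' && text.charAt(i) != '\r') {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(findEndTag("var s = \"</script>\";</SCRIPT>", 0)); // prints 20
        System.out.println(findEndTag("// <b>x</script>\n</script>", 0));     // prints 17
        System.out.println(findEndTag("\"a\\\n</script>b\"</script>", 0));    // prints 15
    }
}
```

Everything before the returned index can then be kept verbatim as the script text, which is what Marc needs for toHtml() to round-trip the code unchanged.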
From: Derrick O. <Der...@ro...> - 2003-05-27 21:46:44
|
Marc,

The text within <SCRIPT></SCRIPT> is supposed to be parsed as pure text or remarks.
I guess the text scanner goes until it sees a <x... and then stops to defer to a tag scanner. I hadn't thought about those in comments, or about the \ end of lines.

Perhaps, rather than write a new scanner, fix the StringScanner (the remark scanner should be OK), so that it does the correct behaviour when balance_quotes is true. Then the 'balance_quotes' flag could be called 'strict_script' or something.

Derrick

Marc Novakowski wrote:

>Derrick,
>
>I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, if there is an HTML tag in a Javascript comment. Also, using "\" to concatenate lines (which is valid in Javascript) is totally messed up now when I try to get the script code using "toHtml()".
>
>However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where Javascript is written such that any modification at all will break it.
>
>I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.
>
>Marc
>
>-----Original Message-----
>From: der...@us...
>[mailto:der...@us...]
>Sent: Saturday, May 24, 2003 2:05 PM
>To: htm...@li...
>Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners >CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22 > > >Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners >In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners > >Modified Files: > CompositeTagScanner.java ScriptScanner.java >Log Message: >Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags >Major overhaul of ScriptScanner. >It now uses the scan() method of CompositeTagScanner (i.e. doesn't override). >CompositeTagScanner now has a balance_quotes member field that dictates >whether strings tags are scanned honouring single and double quotes. >This affected the call chain through NodeReader and StringScanner which >now have this parameter. >StringScanner now correctly handles quotes if asked. The ignoreState stuff is removed, >it didn't work anyway since a single StringScanner is used recursively by the NodeReader, >and the member field would have been tromped. >Sorry to all those who have broken code because of this, but it's for the better. Really. > > > >Index: CompositeTagScanner.java >=================================================================== >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v >retrieving revision 1.52 >retrieving revision 1.53 >diff -C2 -d -r1.52 -r1.53 >*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52 >--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53 >*************** >*** 97,100 **** >--- 97,101 ---- > private Set tagEnderSet; > private Set endTagEnderSet; >+ private boolean balance_quotes; > > public CompositeTagScanner(String [] nameOfTagToMatch) { >*************** >*** 125,129 **** > this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren); > } >! > public CompositeTagScanner( > String filter, >--- 126,130 ---- > this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren); > } >! 
> public CompositeTagScanner( > String filter, >*************** >*** 131,138 **** > String [] tagEnders, > String [] endTagEnders, >! boolean allowSelfChildren) { > super(filter); > this.nameOfTagToMatch = nameOfTagToMatch; > this.allowSelfChildren = allowSelfChildren; > this.tagEnderSet = new HashSet(); > for (int i=0;i<tagEnders.length;i++) >--- 132,172 ---- > String [] tagEnders, > String [] endTagEnders, >! boolean allowSelfChildren) >! { >! this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false); >! } >! >! /** >! * Constructor specifying all member fields. >! * @param filter A string that is used to match which tags are to be allowed >! * to pass through. This can be useful when one wishes to dynamically filter >! * out all tags except one type which may be programmed later than the parser. >! * @param nameOfTagToMatch The tag names recognized by this scanner. >! * @param tagEnders The non-endtag tag names which signal that no closing >! * end tag was found. For example, encountering <FORM> while >! * scanning a <A> link tag would mean that no </A> was found >! * and needs to be corrected. >! * @param endTagEnders The endtag names which signal that no closing end >! * tag was found. For example, encountering </HTML> while >! * scanning a <BODY> tag would mean that no </BODY> was found >! * and needs to be corrected. These items are not prefixed by a '/'. >! * @param allowSelfChildren If <code>true</code> a tag of the same name is >! * allowed within this tag. Used to determine when an endtag is missing. >! * @param balance_quotes <code>true</code> if scanning string nodes needs to >! * honour quotes. For example, ScriptScanner defines this <code>true</code> >! * so that text within <SCRIPT></SCRIPT> ignores tag-like text >! * within quotes. >! */ >! public CompositeTagScanner( >! String filter, >! String [] nameOfTagToMatch, >! String [] tagEnders, >! String [] endTagEnders, >! boolean allowSelfChildren, >! 
boolean balance_quotes) { > super(filter); > this.nameOfTagToMatch = nameOfTagToMatch; > this.allowSelfChildren = allowSelfChildren; >+ this.balance_quotes = balance_quotes; > this.tagEnderSet = new HashSet(); > for (int i=0;i<tagEnders.length;i++) >*************** >*** 145,149 **** > public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException { > CompositeTagScannerHelper helper = >! new CompositeTagScannerHelper(this,tag,url,reader,currLine); > return helper.scan(); > } >--- 179,183 ---- > public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException { > CompositeTagScannerHelper helper = >! new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes); > return helper.scan(); > } >*************** >*** 193,196 **** > return false; > } >- > } >--- 227,229 ---- > >Index: ScriptScanner.java >=================================================================== >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v >retrieving revision 1.21 >retrieving revision 1.22 >diff -C2 -d -r1.21 -r1.22 >*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21 >--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22 >*************** >*** 28,64 **** > > package org.htmlparser.scanners; >! ///////////////////////// >! // HTML Parser Imports // >! ///////////////////////// >! import org.htmlparser.Node; >! import org.htmlparser.NodeReader; >! import org.htmlparser.StringNode; >! import org.htmlparser.tags.EndTag; > import org.htmlparser.tags.ScriptTag; > import org.htmlparser.tags.Tag; > import org.htmlparser.tags.data.CompositeTagData; > import org.htmlparser.tags.data.TagData; >! import org.htmlparser.util.NodeList; >! 
import org.htmlparser.util.ParserException; > /** > * The HTMLScriptScanner identifies javascript code > */ >- > public class ScriptScanner extends CompositeTagScanner { >- private static final String SCRIPT_END_TAG = "</SCRIPT>"; > private static final String MATCH_NAME [] = {"SCRIPT"}; > private static final String ENDERS [] = {"BODY", "HTML"}; > public ScriptScanner() { >! super("",MATCH_NAME,ENDERS); > } > > public ScriptScanner(String filter) { >! super(filter,MATCH_NAME,ENDERS); > } > >! public ScriptScanner(String filter, String[] nameOfTagToMatch) { >! super(filter,nameOfTagToMatch,ENDERS); > } >! > public String [] getID() { > return MATCH_NAME; >--- 28,59 ---- > > package org.htmlparser.scanners; >! > import org.htmlparser.tags.ScriptTag; > import org.htmlparser.tags.Tag; > import org.htmlparser.tags.data.CompositeTagData; > import org.htmlparser.tags.data.TagData; >! > /** > * The HTMLScriptScanner identifies javascript code > */ > public class ScriptScanner extends CompositeTagScanner { > private static final String MATCH_NAME [] = {"SCRIPT"}; > private static final String ENDERS [] = {"BODY", "HTML"}; > public ScriptScanner() { >! this(""); > } > > public ScriptScanner(String filter) { >! this(filter,MATCH_NAME,ENDERS); > } > >! public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) { >! this(filter,nameOfTagToMatch,enders, new String[0], true, true); > } >! >! public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) { >! super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes); >! } >! 
> public String [] getID() { > return MATCH_NAME; >*************** >*** 70,205 **** > return new ScriptTag(tagData,compositeTagData); > } >- >- public Tag scan(Tag tag, String url, NodeReader reader, String currLine) >- throws ParserException { >- try { >- int startLine = reader.getLastLineNumber(); >- String line = null; >- StringBuffer scriptContents = >- new StringBuffer(); >- boolean endTagFound = false; >- Tag startTag = tag; >- Tag endTag = null; >- line = currLine; >- boolean sameLine = true; >- int startingPos = startTag.elementEnd(); >- do { >- int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos); >- while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) { >- startingPos = endTagLoc+getEndTag().length(); >- endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos); >- } >- >- if (endTagLoc!=-1) { >- endTagFound = true; >- endTag = (EndTag)EndTag.find(line,endTagLoc); >- if (sameLine) >- scriptContents.append( >- getCodeBetweenStartAndEndTags( >- line, >- startTag, >- endTagLoc) >- ); >- else { >- scriptContents.append(Node.getLineSeparator()); >- scriptContents.append(line.substring(0,endTagLoc)); >- } >- >- reader.setPosInLine(endTag.elementEnd()); >- } else { >- if (sameLine) >- scriptContents.append( >- line.substring( >- startTag.elementEnd()+1 >- ) >- ); >- else { >- scriptContents.append(Node.getLineSeparator()); >- scriptContents.append(line); >- } >- } >- if (!endTagFound) { >- line = reader.getNextLine(); >- startingPos = 0; >- } >- if (sameLine) >- sameLine = false; >- } >- while (line!=null && !endTagFound); >- if (endTag == null) { >- // If end tag doesn't exist, create one >- String endTagName = tag.getTagName(); >- int endTagBegin = reader.getLastReadPosition()+1 ; >- int endTagEnd = endTagBegin + endTagName.length() + 2; >- endTag = new EndTag( >- new TagData( >- endTagBegin, >- endTagEnd, >- endTagName, >- currLine >- ) >- ); >- } >- NodeList childrenNodeList = new NodeList(); >- childrenNodeList.add( 
>- new StringNode( >- scriptContents, >- startTag.elementEnd(), >- endTag.elementBegin()-1 >- ) >- ); >- return createTag( >- new TagData( >- startTag.elementBegin(), >- endTag.elementEnd(), >- startLine, >- reader.getLastLineNumber(), >- startTag.getText(), >- currLine, >- url, >- false >- ), new CompositeTagData( >- startTag,endTag,childrenNodeList >- ) >- ); >- >- } >- catch (Exception e) { >- throw new ParserException("Error in ScriptScanner: ",e); >- } >- } >- >- public String getCodeBetweenStartAndEndTags( >- String line, >- Tag startTag, >- int endTagLoc) throws ParserException { >- try { >- >- return line.substring( >- startTag.elementEnd()+1, >- endTagLoc >- ); >- } >- catch (Exception e) { >- StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n"); >- msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n"); >- msg.append("substring ends at: "+(endTagLoc)); >- throw new ParserException(msg.toString(),e); >- } >- } >- >- /** >- * Gets the end tag that the scanner uses to stop scanning. Subclasses of >- * <code>ScriptScanner</code> you should override this method. >- * @return String containing the end tag to search for, i.e. </SCRIPT> >- */ >- public String getEndTag() { >- return SCRIPT_END_TAG; >- } >- >- private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) { >- if (endTagLoc+getEndTag().length() > line.length()-1) return false; >- return line.charAt(endTagLoc+getEndTag().length())=='"'; >- } >- > } >--- 65,67 ---- > > > > >------------------------------------------------------- >This SF.net email is sponsored by: ObjectStore. >If flattening out C++ or Java code to make your application fit in a >relational database is painful, don't do it! Check out ObjectStore. >Now part of Progress Software. http://www.objectstore.net/sourceforge >_______________________________________________ >Htmlparser-cvs mailing list >Htm...@li... 
|
From: Marc N. <ma...@ke...> - 2003-05-27 18:23:03
|
Derrick,

I was relying on some of the old behavior of ScriptScanner, mostly the fact that its contents were not parsed as HTML. I'm still seeing cases where tags inside of <script> are recognised as "HTML" and modified (i.e. turned into uppercase, auto-closed, etc). For example, this happens when there is an HTML tag in a JavaScript comment. Also, using "\" to concatenate lines (which is valid in JavaScript) is totally messed up now when I try to get the script code using "toHtml()".

However, I think your change was valid and fixes the bug as requested. What I think I'm going to do, though, is make a new scanner class that does what the old ScriptScanner did. That is, do a bare-bones "leave everything inside that tag as-is" parse of the HTML, searching only for the end tag with no knowledge of quotes or anything. I think there are cases where JavaScript is written such that any modification at all will break it.

I'll send a note to the list when this class is done (today sometime). I'll call it StrictScriptScanner or something.

Marc

-----Original Message-----
From: der...@us... [mailto:der...@us...]
Sent: Saturday, May 24, 2003 2:05 PM
To: htm...@li...
Subject: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv7741/org/htmlparser/scanners

Modified Files:
CompositeTagScanner.java ScriptScanner.java
Log Message:
Fixed bug #741769 ScriptScanner doesn't handle quoted </script> tags
Major overhaul of ScriptScanner. It now uses the scan() method of CompositeTagScanner (i.e. doesn't override).
CompositeTagScanner now has a balance_quotes member field that dictates whether strings tags are scanned honouring single and double quotes.
This affected the call chain through NodeReader and StringScanner which now have this parameter.
StringScanner now correctly handles quotes if asked.
The ignoreState stuff is removed, it didn't work anyway since a single StringScanner is used recursively by the NodeReader, and the member field would have been tromped.
Sorry to all those who have broken code because of this, but it's for the better. Really.

Index: CompositeTagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** CompositeTagScanner.java 19 May 2003 02:49:57 -0000 1.52
--- CompositeTagScanner.java 24 May 2003 21:04:44 -0000 1.53
***************
*** 97,100 ****
--- 97,101 ----
      private Set tagEnderSet;
      private Set endTagEnderSet;
+     private boolean balance_quotes;

      public CompositeTagScanner(String [] nameOfTagToMatch) {
***************
*** 125,129 ****
          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
      }
!
      public CompositeTagScanner(
          String filter,
--- 126,130 ----
          this(filter,nameOfTagToMatch,tagEnders,new String[] {}, allowSelfChildren);
      }
!
      public CompositeTagScanner(
          String filter,
***************
*** 131,138 ****
          String [] tagEnders,
          String [] endTagEnders,
!         boolean allowSelfChildren) {
          super(filter);
          this.nameOfTagToMatch = nameOfTagToMatch;
          this.allowSelfChildren = allowSelfChildren;
          this.tagEnderSet = new HashSet();
          for (int i=0;i<tagEnders.length;i++)
--- 132,172 ----
          String [] tagEnders,
          String [] endTagEnders,
!         boolean allowSelfChildren)
!     {
!         this(filter,nameOfTagToMatch,tagEnders,endTagEnders, allowSelfChildren, false);
!     }
!
!     /**
!      * Constructor specifying all member fields.
!      * @param filter A string that is used to match which tags are to be allowed
!      * to pass through. This can be useful when one wishes to dynamically filter
!      * out all tags except one type which may be programmed later than the parser.
!      * @param nameOfTagToMatch The tag names recognized by this scanner.
!      * @param tagEnders The non-endtag tag names which signal that no closing
!      * end tag was found. For example, encountering <FORM> while
!      * scanning a <A> link tag would mean that no </A> was found
!      * and needs to be corrected.
!      * @param endTagEnders The endtag names which signal that no closing end
!      * tag was found. For example, encountering </HTML> while
!      * scanning a <BODY> tag would mean that no </BODY> was found
!      * and needs to be corrected. These items are not prefixed by a '/'.
!      * @param allowSelfChildren If <code>true</code> a tag of the same name is
!      * allowed within this tag. Used to determine when an endtag is missing.
!      * @param balance_quotes <code>true</code> if scanning string nodes needs to
!      * honour quotes. For example, ScriptScanner defines this <code>true</code>
!      * so that text within <SCRIPT></SCRIPT> ignores tag-like text
!      * within quotes.
!      */
!     public CompositeTagScanner(
!         String filter,
!         String [] nameOfTagToMatch,
!         String [] tagEnders,
!         String [] endTagEnders,
!         boolean allowSelfChildren,
!         boolean balance_quotes) {
          super(filter);
          this.nameOfTagToMatch = nameOfTagToMatch;
          this.allowSelfChildren = allowSelfChildren;
+         this.balance_quotes = balance_quotes;
          this.tagEnderSet = new HashSet();
          for (int i=0;i<tagEnders.length;i++)
***************
*** 145,149 ****
      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
          CompositeTagScannerHelper helper =
!             new CompositeTagScannerHelper(this,tag,url,reader,currLine);
          return helper.scan();
      }
--- 179,183 ----
      public Tag scan(Tag tag, String url, NodeReader reader,String currLine) throws ParserException {
          CompositeTagScannerHelper helper =
!             new CompositeTagScannerHelper(this,tag,url,reader,currLine,balance_quotes);
          return helper.scan();
      }
***************
*** 193,196 ****
          return false;
      }
-
  }
--- 227,229 ----

Index: ScriptScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** ScriptScanner.java 19 May 2003 02:49:57 -0000 1.21
--- ScriptScanner.java 24 May 2003 21:04:44 -0000 1.22
***************
*** 28,64 ****

  package org.htmlparser.scanners;
! /////////////////////////
! // HTML Parser Imports //
! /////////////////////////
! import org.htmlparser.Node;
! import org.htmlparser.NodeReader;
! import org.htmlparser.StringNode;
! import org.htmlparser.tags.EndTag;
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
! import org.htmlparser.util.NodeList;
! import org.htmlparser.util.ParserException;
  /**
   * The HTMLScriptScanner identifies javascript code
   */
-
  public class ScriptScanner extends CompositeTagScanner {
-     private static final String SCRIPT_END_TAG = "</SCRIPT>";
      private static final String MATCH_NAME [] = {"SCRIPT"};
      private static final String ENDERS [] = {"BODY", "HTML"};
      public ScriptScanner() {
!         super("",MATCH_NAME,ENDERS);
      }

      public ScriptScanner(String filter) {
!         super(filter,MATCH_NAME,ENDERS);
      }

!     public ScriptScanner(String filter, String[] nameOfTagToMatch) {
!         super(filter,nameOfTagToMatch,ENDERS);
      }
!
      public String [] getID() {
          return MATCH_NAME;
--- 28,59 ----

  package org.htmlparser.scanners;
!
  import org.htmlparser.tags.ScriptTag;
  import org.htmlparser.tags.Tag;
  import org.htmlparser.tags.data.CompositeTagData;
  import org.htmlparser.tags.data.TagData;
!
  /**
   * The HTMLScriptScanner identifies javascript code
   */
  public class ScriptScanner extends CompositeTagScanner {
      private static final String MATCH_NAME [] = {"SCRIPT"};
      private static final String ENDERS [] = {"BODY", "HTML"};
      public ScriptScanner() {
!         this("");
      }

      public ScriptScanner(String filter) {
!         this(filter,MATCH_NAME,ENDERS);
      }

!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders) {
!         this(filter,nameOfTagToMatch,enders, new String[0], true, true);
      }
!
!     public ScriptScanner(String filter, String[] nameOfTagToMatch, String[] enders, String[] endtagenders, boolean allowSelfChildren, boolean balance_quotes) {
!         super(filter,nameOfTagToMatch,enders, new String[0], allowSelfChildren, balance_quotes);
!     }
!
      public String [] getID() {
          return MATCH_NAME;
***************
*** 70,205 ****
          return new ScriptTag(tagData,compositeTagData);
      }
-
-     public Tag scan(Tag tag, String url, NodeReader reader, String currLine)
-         throws ParserException {
-         try {
-             int startLine = reader.getLastLineNumber();
-             String line = null;
-             StringBuffer scriptContents =
-                 new StringBuffer();
-             boolean endTagFound = false;
-             Tag startTag = tag;
-             Tag endTag = null;
-             line = currLine;
-             boolean sameLine = true;
-             int startingPos = startTag.elementEnd();
-             do {
-                 int endTagLoc = line.toUpperCase().indexOf(getEndTag(),startingPos);
-                 while (endTagLoc>0 && isScriptEmbeddedInDocumentWrite(line, endTagLoc)) {
-                     startingPos = endTagLoc+getEndTag().length();
-                     endTagLoc = line.toUpperCase().indexOf(getEndTag(), startingPos);
-                 }
-
-                 if (endTagLoc!=-1) {
-                     endTagFound = true;
-                     endTag = (EndTag)EndTag.find(line,endTagLoc);
-                     if (sameLine)
-                         scriptContents.append(
-                             getCodeBetweenStartAndEndTags(
-                                 line,
-                                 startTag,
-                                 endTagLoc)
-                         );
-                     else {
-                         scriptContents.append(Node.getLineSeparator());
-                         scriptContents.append(line.substring(0,endTagLoc));
-                     }
-
-                     reader.setPosInLine(endTag.elementEnd());
-                 } else {
-                     if (sameLine)
-                         scriptContents.append(
-                             line.substring(
-                                 startTag.elementEnd()+1
-                             )
-                         );
-                     else {
-                         scriptContents.append(Node.getLineSeparator());
-                         scriptContents.append(line);
-                     }
-                 }
-                 if (!endTagFound) {
-                     line = reader.getNextLine();
-                     startingPos = 0;
-                 }
-                 if (sameLine)
-                     sameLine = false;
-             }
-             while (line!=null && !endTagFound);
-             if (endTag == null) {
-                 // If end tag doesn't exist, create one
-                 String endTagName = tag.getTagName();
-                 int endTagBegin = reader.getLastReadPosition()+1 ;
-                 int endTagEnd = endTagBegin + endTagName.length() + 2;
-                 endTag = new EndTag(
-                     new TagData(
-                         endTagBegin,
-                         endTagEnd,
-                         endTagName,
-                         currLine
-                     )
-                 );
-             }
-             NodeList childrenNodeList = new NodeList();
-             childrenNodeList.add(
-                 new StringNode(
-                     scriptContents,
-                     startTag.elementEnd(),
-                     endTag.elementBegin()-1
-                 )
-             );
-             return createTag(
-                 new TagData(
-                     startTag.elementBegin(),
-                     endTag.elementEnd(),
-                     startLine,
-                     reader.getLastLineNumber(),
-                     startTag.getText(),
-                     currLine,
-                     url,
-                     false
-                 ), new CompositeTagData(
-                     startTag,endTag,childrenNodeList
-                 )
-             );
-
-         }
-         catch (Exception e) {
-             throw new ParserException("Error in ScriptScanner: ",e);
-         }
-     }
-
-     public String getCodeBetweenStartAndEndTags(
-         String line,
-         Tag startTag,
-         int endTagLoc) throws ParserException {
-         try {
-
-             return line.substring(
-                 startTag.elementEnd()+1,
-                 endTagLoc
-             );
-         }
-         catch (Exception e) {
-             StringBuffer msg = new StringBuffer("Error in getCodeBetweenStartAndEndTags():\n");
-             msg.append("substring starts at: "+(startTag.elementEnd()+1)).append("\n");
-             msg.append("substring ends at: "+(endTagLoc));
-             throw new ParserException(msg.toString(),e);
-         }
-     }
-
-     /**
-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
-      * <code>ScriptScanner</code> you should override this method.
-      * @return String containing the end tag to search for, i.e. </SCRIPT>
-      */
-     public String getEndTag() {
-         return SCRIPT_END_TAG;
-     }
-
-     private boolean isScriptEmbeddedInDocumentWrite(String line, int endTagLoc) {
-         if (endTagLoc+getEndTag().length() > line.length()-1) return false;
-         return line.charAt(endTagLoc+getEndTag().length())=='"';
-     }
-
  }
--- 65,67 ----

-------------------------------------------------------
This SF.net email is sponsored by: ObjectStore.
If flattening out C++ or Java code to make your application fit in a
relational database is painful, don't do it! Check out ObjectStore.
Now part of Progress Software. http://www.objectstore.net/sourceforge
_______________________________________________
Htmlparser-cvs mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-cvs
|
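The two strategies discussed in this message (a bare-bones search for the end tag versus a search that honours quotes) can be compared with a small stand-alone sketch. This is only an illustration under stated assumptions: the class `ScriptEndFinder` and both method names are hypothetical, and nothing here is the HTML Parser's actual scanner code or API.

```java
/** Illustrative only; this is not the HTML Parser's scanner code. */
final class ScriptEndFinder {
    private static final String END_TAG = "</script>";

    /** Returns the index of the first END_TAG at or after 'from' (case-insensitive), or -1. */
    static int findEndTagNaive(String text, int from) {
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Like findEndTagNaive, but ignores END_TAG inside '...' or "..." literals. */
    static int findEndTagBalanced(String text, int from) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = Math.max(from, 0); i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // opening quote
            } else if (text.regionMatches(true, i, END_TAG, 0, END_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>document.write(\"</script>\");</script>";
        int start = "<script>".length();
        System.out.println(findEndTagNaive(html, start));    // 24: inside the string literal
        System.out.println(findEndTagBalanced(html, start)); // 36: the real end tag
    }
}
```

Neither approach is safe for every script. In this sketch the quote-aware version never finds the end tag in `<script>// don't</script>`, because the apostrophe in the comment opens a "string" that is never closed, while the naive version stops too early on the document.write example.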
From: Derrick O. <Der...@ro...> - 2003-05-25 23:46:02
|
Version 1.3 of the most popular HTML parser on sourceforge is now available. Four weeks of candidate testing have culminated in a very stable, production level product, with many new user requested features.

Features added since 1.2 include:
- constructor(URLConnection) for POST and exotic GET
- improved character set handling
- hierarchically nested tags, i.e. tables
- scanners for each type of tag
- java beans for easy integration of text and link fetching
- 'visitor' patterns
- Wiki page documentation
- improved script scanning
- improved whitespace handling

The developers of the HTML Parser hope you enjoy it. |
From: <dha...@po...> - 2003-05-23 11:59:59
|
Hi,

I wrote the following test case:

    public void testUnClosed () throws ParserException {
        createParser("<TABLE><TR><TR></TR></TABLE>");
        parseAndAssertNodeCount(1);
        assertEquals("Unclosed","<TABLE><TR></TR><TR></TR></TABLE>",node[0].toHtml());
    }

I was expecting one node, since <TABLE> would be the main node and <TR> would be its child, but I got 5! Because of that, the assert also failed.

Since nesting is allowed by default, if there is code as follows:

    ...
    <TR>
    <TD>blah blah</TD>
    <TR>
    <TD>blah blah</TD>
    </TR>
    ....

then the second <TR> is incorrectly considered a child of the first <TR>, whereas in reality a closing </TR> was missed and should have been inserted. Hence nesting should be disallowed for <TR> through the scanner. The same holds for the <TD> tag. I think this is a bug.

Dhaval |
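The correction asked for above (insert the missing end tag when a tag that may not contain itself is opened again) can be sketched in isolation. This is a toy illustration, not the HTML Parser's scanner logic: the class `SelfNestingFixer`, its methods, and the choice of TR and TD as the only non-nesting tags are assumptions made for the example, and it only looks at the innermost open tag.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

/** Illustrative only; a real scanner would also consult tag-ender lists. */
final class SelfNestingFixer {
    /** Tags assumed, for this sketch, not to nest directly inside themselves. */
    static boolean forbidsSelfNesting(String name) {
        return name.equals("TR") || name.equals("TD");
    }

    /** Inserts a missing end tag when a non-nesting tag is opened inside itself. */
    static String closeUnclosed(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            int gt = lt < 0 ? -1 : html.indexOf('>', lt);
            if (gt < 0) {                       // no further complete tag
                out.append(html, pos, html.length());
                break;
            }
            out.append(html, pos, lt);          // text before the tag
            String tag = html.substring(lt, gt + 1);
            boolean closing = tag.startsWith("</");
            String inner = tag.substring(closing ? 2 : 1, tag.length() - 1).trim();
            String name = inner.split("\\s+")[0].toUpperCase(Locale.ROOT);
            if (closing) {
                if (name.equals(open.peek())) {
                    open.pop();                 // matches the innermost open tag
                }
            } else {
                if (forbidsSelfNesting(name) && name.equals(open.peek())) {
                    out.append("</").append(name).append('>'); // the missing end tag
                    open.pop();
                }
                open.push(name);
            }
            out.append(tag);
            pos = gt + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <TABLE><TR></TR><TR></TR></TABLE>
        System.out.println(closeUnclosed("<TABLE><TR><TR></TR></TABLE>"));
    }
}
```

Well-formed input passes through unchanged. A case such as `<TR><TD>a<TR>` is not repaired here, because the innermost open tag is TD rather than TR.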
From: Somik R. <so...@ya...> - 2003-05-21 23:44:19
|
You should not be using setParsed. Instead, all you have to do is use setAttribute on TableTag, like so:

    tableTag.setAttribute("BORDER",1);

Then, make a call to tableTag.toHtml(), and it should show up.

Regards,
Somik

----- Original Message -----
From: "Terry Alexis Lurie" <tez...@ya...>
To: <htm...@li...>
Sent: Wednesday, May 21, 2003 5:10 AM
Subject: Re: [Htmlparser-developer] HTMLTag patch

> Yes, I'd like to be able to programmatically set
> certain attributes. Its for a highlighted step-by-step
> through a web-rip, so the focus table is border=1 or
> whatever [very uncommon these days], the rest is as
> is.
>
> I've been doing this in Perl's HTML::Parse for a
> while, but now shifting to Java because of work.
>
> Terry.
|
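The setParsed() patch discussed in this thread rebuilds the tag text from the attribute table and warns that attribute order and quoting are left to the caller. A minimal sketch of that rebuilding step, assuming an insertion-ordered map and always-quoted values, might look like the following. `TagRenderer` and `render` are hypothetical names for illustration and are not part of the HTML Parser API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; not the library's toHtml() implementation. */
final class TagRenderer {
    /** Renders an opening tag, quoting every value and keeping map order. */
    static String render(String tagName, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ')
              .append(e.getKey())
              .append("=\"")
              .append(e.getValue().replace("\"", "&quot;")) // keep the value inside its quotes
              .append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("BORDER", "1");
        attrs.put("WIDTH", "100%");
        System.out.println(render("TABLE", attrs)); // <TABLE BORDER="1" WIDTH="100%">
    }
}
```

A LinkedHashMap is used here because a Hashtable, as in the patch, gives no ordering guarantee, which is the first caveat the patch's comment mentions.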
From: <tez...@ya...> - 2003-05-21 12:15:28
|
Right. That was definitely the answer I was looking for. Hopefully I'll be able to use my talents for good rather than evil. I'm just averse to using bleeding edge in production, but now that I'm somewhat familiar with the scope of the project, I think it is worth the small risk.

Terry.

--- Derrick Oswald <Der...@ro...> wrote:
> Terry,
>
> You should really switch to the 1.3 codebase,
> version 1.2 is very long in the tooth and a final
> release of 1.3 is imminent.
> These problems you are encountering don't seem to be
> present any more and you would have a more
> sympathetic ear.
>
> Derrick

=====
------------------------------------------------------------
Terry Alexis Lurie | 'Something witty that doesn't
Freelance Computer Engineer | look good with variable
United Kingdom | width fonts' - Most nerds
|
From: <dha...@po...> - 2003-05-21 12:12:29
|
I was just going to say that, Derrick ;) I too believe that the problems you mentioned no longer exist in 1.3.

> -----Original Message-----
> From: htm...@li... [mailto:htm...@li...] On Behalf Of Der...@ro...
> Sent: Wednesday, May 21, 2003 5:28 PM
> To: htm...@li...
> Subject: Re: [Htmlparser-developer] HTMLTag patch
>
> Terry,
>
> You should really switch to the 1.3 codebase, version 1.2 is very long
> in the tooth and a final release of 1.3 is imminent.
> These problems you are encountering don't seem to be present any more
> and you would have a more sympathetic ear.
>
> Derrick
|
From: Derrick O. <Der...@ro...> - 2003-05-21 12:07:24
|
Terry,

You should really switch to the 1.3 codebase, version 1.2 is very long in the tooth and a final release of 1.3 is imminent. These problems you are encountering don't seem to be present any more and you would have a more sympathetic ear.

Derrick

Terry Alexis Lurie wrote:

>Yes, I'd like to be able to programmatically set
>certain attributes. Its for a highlighted step-by-step
>through a web-rip, so the focus table is border=1 or
>whatever [very uncommon these days], the rest is as
>is.
>
>I've been doing this in Perl's HTML::Parse for a
>while, but now shifting to Java because of work.
>
>Terry.
>
> --- Somik Raha <so...@ya...> wrote:
>> Hi Terry
>> Just curious - why do you need to call
>>setParsed() ?
>> Are you trying to take all tables and ensure
>>that they have a border "1"
>>?
>>
>>Regards,
>>Somik
>>----- Original Message -----
>>From: "Terry Alexis Lurie" <tez...@ya...>
>>To: <htm...@li...>
>>Sent: Tuesday, May 20, 2003 11:03 AM
>>Subject: [Htmlparser-developer] HTMLTag patch
|
From: <tez...@ya...> - 2003-05-21 10:32:08
|
A patch for HTMLTagTest.java.

When you call registerScanners(), the attributes are not printed properly. In this test case you get

    <A EN="" =="" HREF="http://www.google.com/webhp?hl"></A>

from

    <a href=http://www.google.com/webhp?hl=en>

See how you get the bogus attributes EN="" and =="" ? This doesn't occur if you don't call registerScanners();

Terry

-------

    public void testHTMLOutputOfDifficultLinksWithRegisterScanners() throws HTMLParserException {
        createParser("<a href=http://www.google.com/webhp?hl=en>"); //Straight out of a real world example
        // assertTrue("Node should be a HTMLLinkTag",node[0] instanceof HTMLLinkTag);
        parser.registerScanners(); // Register standard scanners (Very Important)
        String stringTemp="";
        for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
            HTMLNode newNode = e.nextHTMLNode(); // Get the next HTML Node
            stringTemp = newNode.toHTML();
            System.out.println(stringTemp);
        }
        assertEquals("Parsed text should be","<a href=http://www.google.com/webhp?hl=en>",stringTemp);
    }

=====
------------------------------------------------------------
Terry Alexis Lurie | 'Something witty that doesn't
Freelance Computer Engineer | look good with variable
United Kingdom | width fonts' - Most nerds

__________________________________________________
It's Samaritans' Week. Help Samaritans help others.
Call 08709 000032 to give or donate online now at http://www.samaritans.org/support/donations.shtm
|
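The symptom reported above looks like what one would get if an unquoted attribute value were split at every '=' rather than only the first; that is a guess, and the thread does not establish the actual cause. A minimal sketch of an attribute reader that treats an unquoted value as everything up to the next whitespace is shown below. `AttributeSketch` and `parse` are hypothetical names, the input is the tag text without its angle brackets, and this is not the HTML Parser's own attribute code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only; not the HTML Parser's attribute parsing. */
final class AttributeSketch {
    /** Parses "name attr=value ..." (no angle brackets) into upper-cased keys and raw values. */
    static Map<String, String> parse(String tagText) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = tagText.length();
        int i = 0;
        while (i < n && !Character.isWhitespace(tagText.charAt(i))) i++; // skip the tag name
        while (i < n) {
            while (i < n && Character.isWhitespace(tagText.charAt(i))) i++;
            if (i == n) break;
            int nameStart = i;
            while (i < n && tagText.charAt(i) != '='
                    && !Character.isWhitespace(tagText.charAt(i))) i++;
            String name = tagText.substring(nameStart, i).toUpperCase(Locale.ROOT);
            while (i < n && Character.isWhitespace(tagText.charAt(i))) i++;
            String value = "";
            if (i < n && tagText.charAt(i) == '=') {
                i++; // consume only this first '='
                while (i < n && Character.isWhitespace(tagText.charAt(i))) i++;
                if (i < n && (tagText.charAt(i) == '"' || tagText.charAt(i) == '\'')) {
                    char q = tagText.charAt(i++);
                    int end = tagText.indexOf(q, i);
                    if (end < 0) end = n;          // unterminated quote: take the rest
                    value = tagText.substring(i, end);
                    i = Math.min(end + 1, n);
                } else {
                    int valueStart = i;            // unquoted: runs to the next whitespace,
                    while (i < n && !Character.isWhitespace(tagText.charAt(i))) i++;
                    value = tagText.substring(valueStart, i); // so later '=' stay in the value
                }
            }
            attrs.put(name, value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        // prints {HREF=http://www.google.com/webhp?hl=en}
        System.out.println(parse("a href=http://www.google.com/webhp?hl=en"));
    }
}
```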
From: <tez...@ya...> - 2003-05-21 09:10:46
|
Yes, I'd like to be able to programmatically set certain attributes. It's for a highlighted step-by-step through a web-rip, so the focus table is border=1 or whatever [very uncommon these days], the rest is as is.

I've been doing this in Perl's HTML::Parse for a while, but am now shifting to Java because of work.

Terry.

--- Somik Raha <so...@ya...> wrote:
> Hi Terry
> Just curious - why do you need to call setParsed() ?
> Are you trying to take all tables and ensure
> that they have a border "1" ?
>
> Regards,
> Somik

=====
------------------------------------------------------------
Terry Alexis Lurie | 'Something witty that doesn't
Freelance Computer Engineer | look good with variable
United Kingdom | width fonts' - Most nerds
|
From: Somik R. <so...@ya...> - 2003-05-21 03:03:57
|
Hi Terry,

Just curious: why do you need to call setParsed()? Are you trying to take all tables and ensure that they have a border of "1"?

Regards,
Somik

----- Original Message -----
From: "Terry Alexis Lurie" <tez...@ya...>
To: <htm...@li...>
Sent: Tuesday, May 20, 2003 11:03 AM
Subject: [Htmlparser-developer] HTMLTag patch

> Hi, this is further to my bug report via the SF site.
>
> Basically, setParsed() wasn't affecting the actual
> output of the Node thereafter. This made it a real
> pain to highlight HTML, the example here being making
> tables have a border of 1 to show them.
>
> Patch attached. It has some debugging commented out;
> you'll want to get rid of this. I put a patch for the
> testing code on the SourceForge bug report.
>
> Cheers,
>
> Terry. |
From: <tez...@ya...> - 2003-05-20 16:14:38
|
Hmm, well that breaks everything under the sun.

I have re-corrected it on my side by moving this addition into a new method, resetParsed(). So it is more of a helper function than a major change.

Obviously I've blundered in here half-cocked. Should I submit further stuff against CVS or against the 1.2 code base? I'm a bit loath to use CVS in production, so any patches I do I'm inclined to do against 1.2.

Thoughts?

If you want the diff that implements resetParsed() and the appropriate test, just email me.

Cheers,

Terry.

--- Terry Alexis Lurie <tez...@ya...> wrote:
> Hi, this is further to my bug report via the SF site.
>
> Basically, setParsed() wasn't affecting the actual
> output of the Node thereafter. This made it a real
> pain to highlight HTML, the example here being making
> tables have a border of 1 to show them.
>
> Patch attached. It has some debugging commented out;
> you'll want to get rid of this. I put a patch for the
> testing code on the SourceForge bug report.
>
> Cheers,
>
> Terry. |
From: <tez...@ya...> - 2003-05-20 15:03:46
|
Hi, this is further to my bug report via the SF site.

Basically, setParsed() wasn't affecting the actual output of the Node thereafter. This made it a real pain to highlight HTML, the example here being making tables have a border of 1 to show them.

Patch attached. It has some debugging commented out; you'll want to get rid of this. I put a patch for the testing code on the SourceForge bug report.

Cheers,

Terry.

--------------------

*** HTMLTag.java 2003/05/20 12:33:42 1.1
--- HTMLTag.java 2003/05/20 14:52:42
***************
*** 273,283 ****
      }
      /**
       * Sets the parsed.
!      * @param parsed The parsed to set
       */
      public void setParsed(Hashtable parsed) {
          this.parsed = parsed;
      }
      /**
       * Sets the strictTags.
       * @param strictTags The strictTags to set
--- 273,306 ----
      }
      /**
       * Sets the parsed.
!      * Note: There is no guarantee that the attributes will be:
!      *       in the same order or case as originally.
!      *       This isn't expected to be a problem, but then again
!      *       it never is, is it?
!      * Also: This currently makes no effort to place the attribute
!      *       in quotes if necessary. You have to take care of that
!      *       yourself
!      * @param parsed The hash of (key,value) attribute pairs to set
       */
      public void setParsed(Hashtable parsed) {
          this.parsed = parsed;
+
+         setText((String) parsed.get(this.TAGNAME)); //Set the tag first
+         for(Enumeration e = parsed.keys(); e.hasMoreElements();) {
+             String temp = (String) e.nextElement();
+             if (!temp.equals(this.TAGNAME)) { //Don't add the tagname again
+                 append(" " + temp + '=' + ((String) parsed.get(temp)));
+
+                 //Debug
+                 //System.out.println("setParsed appending key: " + temp + " to value: " + ((String) parsed.get(temp)));
+             }
+         }
+
+         //Debug
+         //System.out.println("setParsed: completed, now text is:" + getText());
+
      }
+
      /**
       * Sets the strictTags.
       * @param strictTags The strictTags to set

=====
------------------------------------------------------------
Terry Alexis Lurie          | 'Something witty that doesn't
Freelance Computer Engineer |  look good with variable
United Kingdom              |  width fonts' - Most nerds |
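The patch's javadoc notes that it makes no effort to quote attribute values. A minimal standalone sketch of the same rebuild step with quoting added might look like the following. It is not HTMLParser code: the class `TagTextSketch`, the methods `rebuildText` and `quote`, and the value of `TAGNAME_KEY` (standing in for the `TAGNAME` constant used in the patch) are all assumptions made for illustration, and a `LinkedHashMap` is used only so the output order is predictable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagTextSketch {
    // Hypothetical key under which the tag name is stored; it stands in for
    // the TAGNAME constant referenced by the patch and is an assumption here.
    public static final String TAGNAME_KEY = "$<TAGNAME>$";

    // Rebuilds the tag text (without angle brackets) from the attribute table:
    // tag name first, then every other entry as name=quoted-value.
    public static String rebuildText(Map<String, String> parsed) {
        String name = parsed.get(TAGNAME_KEY);
        if (name == null) {
            throw new IllegalArgumentException("attribute table has no tag name entry");
        }
        StringBuilder text = new StringBuilder(name);
        for (Map.Entry<String, String> entry : parsed.entrySet()) {
            if (entry.getKey().equals(TAGNAME_KEY)) {
                continue; // don't add the tag name again
            }
            text.append(' ').append(entry.getKey()).append('=').append(quote(entry.getValue()));
        }
        return text.toString();
    }

    // Wraps a value in double quotes, or in single quotes if the value itself
    // contains a double quote. Values containing both kinds are not handled.
    public static String quote(String value) {
        char q = value.indexOf('"') >= 0 ? '\'' : '"';
        return q + value + q;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put(TAGNAME_KEY, "TABLE");
        parsed.put("BORDER", "1");
        parsed.put("WIDTH", "100%");
        System.out.println(rebuildText(parsed)); // prints: TABLE BORDER="1" WIDTH="100%"
    }
}
```

With a `Hashtable`, as in the patch, the attribute order would depend on hashing, which is the caveat the patch's javadoc already states.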
From: Somik R. <so...@ya...> - 2003-05-20 02:19:18
|
> Do you mean start tags as well as end tags?

Yes.

> As I mentioned above, I see some discrepancy between the code and documentation,
> hence I was asking for an explanation. MATCHIDS has been explained but the above
> 2 were not, and hence I was asking.

np. The fact that you had the question indicates the class needs further refactoring.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-05-20 02:18:02
|
> Actually I thought that if a particular Tag is not registered via the Scanner
> then it will be parsed as just a Node (not of any specific type, that is). But I
> guess if it's going to get parsed as a Tag then there is no problem. But I just
> wanted to know what the design consideration is for not having visitNode(),
> since Node is the superclass of Tag.

Well, the first part of what you say answers the second part :). I couldn't see why I'd want to have a visitNode().

Regards,
Somik |
From: <dha...@or...> - 2003-05-16 05:15:23
|
> > In the documentation of CompositeTagScanner there is mention of ENDERS &
> > END_TAG_ENDERS string arrays. Can someone tell me the difference between the
> > two?
>
> Ah - END_TAG_ENDERS is the array of tags/endtags that would signal that the
> current tag is not closed but should be.

Do you mean start tags as well as end tags?

> > Also the documentation makes a call to a 4 argument constructor with first
> > argument as string array. I don't see any such constructor in the code.
>
> The code is self-documenting. Besides, usage is explained in the javadoc of
> the class CompositeTagScanner.

As I mentioned above, I see some discrepancy between the code and documentation, hence I was asking for an explanation. MATCHIDS has been explained but the above 2 were not, and hence I was asking. |
From: <dha...@or...> - 2003-05-16 05:12:12
|
> Dhaval Udani wrote:
> > I thought that since Tag extends from Node, a visitNode() method would be
> > appropriate. Apart from that, if I want to search for some tags that I have not
> > registered (say a <HTML> tag or a <HEAD> tag) I could use the visitNode
> > mechanism to do it.
>
> No - if you want to search for a "Tag" - you would use visitTag().
> Everything is a Tag. An EndTag is a Tag. All Tags are nodes. A StringNode
> and RemarkNode are also nodes. What node is it that you wish to search for?

Actually I thought that if a particular Tag is not registered via the Scanner then it will be parsed as just a Node (not of any specific type, that is). But I guess if it's going to get parsed as a Tag then there is no problem. But I just wanted to know what the design consideration is for not having visitNode(), since Node is the superclass of Tag.

Dhaval |
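The point that tags without a registered scanner still arrive as tags, so a visitTag() callback is enough to find them, can be sketched with a toy visitor. Everything below is hypothetical and self-contained: `VisitorSketch`, its nested `Node`, `Tag`, `StringNode` and `Visitor` types, and `collectTagNames` only mimic the shape of the discussion and are not HTMLParser's classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisitorSketch {
    // Toy visitor with one callback per concrete node kind; there is no
    // generic visitNode() because every node is one of the concrete kinds.
    public interface Visitor {
        void visitTag(Tag tag);
        void visitStringNode(StringNode node);
    }

    public abstract static class Node {
        public abstract void accept(Visitor visitor);
    }

    public static class Tag extends Node {
        public final String name;
        public Tag(String name) { this.name = name; }
        @Override public void accept(Visitor visitor) { visitor.visitTag(this); }
    }

    public static class StringNode extends Node {
        public final String text;
        public StringNode(String text) { this.text = text; }
        @Override public void accept(Visitor visitor) { visitor.visitStringNode(this); }
    }

    // Collects the names of all tags, including ones such as HEAD for which
    // no specialised handling exists in this toy model.
    public static List<String> collectTagNames(List<Node> nodes) {
        final List<String> names = new ArrayList<>();
        Visitor visitor = new Visitor() {
            @Override public void visitTag(Tag tag) { names.add(tag.name); }
            @Override public void visitStringNode(StringNode node) { /* not a tag: ignore */ }
        };
        for (Node node : nodes) {
            node.accept(visitor);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Tag("HTML"), new Tag("HEAD"), new StringNode("hello"), new Tag("BODY"));
        System.out.println(collectTagNames(nodes)); // prints: [HTML, HEAD, BODY]
    }
}
```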
From: <dha...@or...> - 2003-05-16 05:03:44
|
> Dhaval Udani wrote:
> > Well the situation just came up.
> >
> > Assume a <HEAD> tag which is not closed. It needs to be closed when a <BODY>
> > tag is encountered. Hence BODY would be in the STARTERS array for HEAD.
>
> I don't see a HeadScanner. If <HEAD> is not closed, it should be no problem.

I wrote a HEAD scanner and have sent it to Derrick for inclusion in the next version. In ENDERS I put BODY & in END_TAG_ENDERS I put HTML. Works well. |
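One reading of the HEAD configuration described above is that ENDERS lists opening tags that implicitly close the current tag, while END_TAG_ENDERS lists closing tags that do the same. The sketch below illustrates only that reading. `EnderSketch`, `closedByStartTag` and `closedByEndTag` are hypothetical names, and nothing here reproduces the actual CompositeTagScanner constructor or its behaviour.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EnderSketch {
    private final Set<String> enders;        // opening tags that end the current tag
    private final Set<String> endTagEnders;  // closing tags that end the current tag

    public EnderSketch(String[] enders, String[] endTagEnders) {
        this.enders = new HashSet<>(Arrays.asList(enders));
        this.endTagEnders = new HashSet<>(Arrays.asList(endTagEnders));
    }

    // True if seeing <name> means the current, unclosed tag should be closed.
    public boolean closedByStartTag(String name) {
        return enders.contains(name.toUpperCase(Locale.ROOT));
    }

    // True if seeing </name> means the current, unclosed tag should be closed.
    public boolean closedByEndTag(String name) {
        return endTagEnders.contains(name.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The HEAD configuration from the message above: BODY in ENDERS,
        // HTML in END_TAG_ENDERS.
        EnderSketch head = new EnderSketch(new String[] {"BODY"}, new String[] {"HTML"});
        System.out.println(head.closedByStartTag("body")); // prints: true
        System.out.println(head.closedByEndTag("html"));   // prints: true
        System.out.println(head.closedByEndTag("body"));   // prints: false
    }
}
```

Under this reading, an unclosed <HEAD> ends either when <BODY> opens or when </HTML> is reached, whichever comes first.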
From: <dha...@or...> - 2003-05-16 05:02:29
|
> I'd suggest the user-story driven approach. Do we have a "real" scenario
> where someone would benefit from this? No speculations :)

Somik,

This is my "user" story. I have code as follows:

<!-- Some HTML header comments like copyright, standard function blocks etc... -->
<% JSP Code for non-caching as well as initialization %>
<HTML>
blah blah blah
</HTML>

I've attached a file for your reference.

Thanx,
Dhaval |
From: Somik R. <so...@ya...> - 2003-05-16 02:22:38
|
> In the documentation of CompositeTagScanner there is mention of ENDERS &
> END_TAG_ENDERS string arrays. Can someone tell me the difference between the
> two?

Ah - END_TAG_ENDERS is the array of tags/endtags that would signal that the current tag is not closed but should be.

> Also the documentation makes a call to a 4 argument constructor with first
> argument as string array. I don't see any such constructor in the code.

The code is self-documenting. Besides, usage is explained in the javadoc of the class CompositeTagScanner.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-05-16 02:17:17
|
Dhaval Udani wrote:
> Well the situation just came up.
>
> Assume a <HEAD> tag which is not closed. It needs to be closed when a <BODY>
> tag is encountered. Hence BODY would be in the STARTERS array for HEAD.

I don't see a HeadScanner. If <HEAD> is not closed, it should be no problem.

Regards,
Somik |