htmlparser-cvs Mailing List for HTML Parser (Page 30)
Brought to you by:
derrickoswald
From: <der...@us...> - 2003-11-08 21:31:01
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/filters Added Files: AndFilter.java HasAttributeFilter.java HasChildFilter.java NodeClassFilter.java NotFilter.java OrFilter.java StringFilter.java TagNameFilter.java package.html Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. --- NEW FILE: AndFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/AndFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes matching both filters (AND operation). 
*/ public class AndFilter implements NodeFilter { /** * The left hand side. */ protected NodeFilter mLeft; /** * The right hand side. */ protected NodeFilter mRight; /** * Creates a new instance of AndFilter that accepts nodes acceptable to both filters. * @param left One filter. * @param right The other filter. */ public AndFilter (NodeFilter left, NodeFilter right) { mLeft = left; mRight = right; } /** * Accept nodes that are acceptable to both filters. * @param node The node to check. */ public boolean accept (Node node) { return (mLeft.accept (node) && mRight.accept (node)); } } --- NEW FILE: HasAttributeFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasAttributeFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.TagNode; /** * This class accepts all tags that have a certain attribute. 
*/ public class HasAttributeFilter implements NodeFilter { /** * The attribute to check for. */ protected String mAttribute; /** * Creates a new instance of HasAttributeFilter that accepts tags with the given attribute. * @param attribute The attribute to search for. */ public HasAttributeFilter (String attribute) { mAttribute = attribute.toUpperCase (); } /** * Accept tags with a certain attribute. * @param node The node to check. */ public boolean accept (Node node) { TagNode tag; boolean ret; ret = false; if (node instanceof TagNode) { tag = (TagNode)node; ret = null != tag.getAttributeEx (mAttribute); } return (ret); } } --- NEW FILE: HasChildFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasChildFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. 
// // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.tags.CompositeTag; import org.htmlparser.util.NodeList; /** * This class accepts all tags that have a child acceptable to the filter. */ public class HasChildFilter implements NodeFilter { /** * The filter to apply to children. */ protected NodeFilter mFilter; /** * Creates a new instance of HasChildFilter that accepts tags with children acceptable to the filter. * Similar to asking for the parent of a node returned by the given * filter, but where multiple children may be acceptable, this class * will only accept the parent once. * @param filter The filter to apply to children. */ public HasChildFilter (NodeFilter filter) { mFilter = filter; } /** * Accept tags with children acceptable to the filter. * @param node The node to check. 
*/ public boolean accept (Node node) { CompositeTag tag; NodeList children; boolean ret; ret = false; if (node instanceof CompositeTag) { tag = (CompositeTag)node; children = tag.getChildren (); for (int i = 0; i < children.size (); i++) if (mFilter.accept (children.elementAt (i))) { ret = true; break; } } return (ret); } } --- NEW FILE: NodeClassFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NodeClassFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all tags of a given class. */ public class NodeClassFilter implements NodeFilter { /** * The class to match. */ protected Class mClass; /** * Creates a new instance of NodeClassFilter that accepts tags of the given class. * @param cls The cls to match. */ public NodeClassFilter (Class cls) { mClass = cls; } /** * Accept nodes that are assignable from the class provided in the constructor. 
* @param node The node to check. */ public boolean accept (Node node) { return (mClass.isAssignableFrom (node.getClass ())); } } --- NEW FILE: NotFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NotFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes not acceptable to the filter. */ public class NotFilter implements NodeFilter { /** * The filter to gainsay. */ protected NodeFilter mFilter; /** * Creates a new instance of NotFilter that accepts nodes not acceptable to the filter. * @param filter The filter to consult. */ public NotFilter (NodeFilter filter) { mFilter = filter; } /** * Accept nodes that are not acceptable to the filter. * @param node The node to check. 
*/ public boolean accept (Node node) { return (!mFilter.accept (node)); } } --- NEW FILE: OrFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/OrFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes matching either filter (OR operation). */ public class OrFilter implements NodeFilter { /** * The left hand side. */ protected NodeFilter mLeft; /** * The right hand side. */ protected NodeFilter mRight; /** * Creates a new instance of OrFilter that accepts nodes acceptable to either filter. * @param left One filter. * @param right The other filter. */ public OrFilter (NodeFilter left, NodeFilter right) { mLeft = left; mRight = right; } /** * Accept nodes that are acceptable to either filter. * @param node The node to check. 
*/ public boolean accept (Node node) { return (mLeft.accept (node) || mRight.accept (node)); } } --- NEW FILE: StringFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/StringFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.StringNode; /** * This class accepts all string nodes containing the given string. */ public class StringFilter implements NodeFilter { /** * The string to search for. */ protected String mPattern; /** * Case sensitive toggle. */ protected boolean mCaseSensitive; /** * Creates a new instance of StringFilter that accepts string nodes containing a certain string. * The comparison is case insensitive. * @param pattern The pattern to search for. */ public StringFilter (String pattern) { this (pattern, false); } /** * Creates a new instance of StringFilter that accepts string nodes containing a certain string. 
* @param pattern The pattern to search for. * @param case_sensitive If <code>true</code>, comparisons are performed * respecting case. */ public StringFilter (String pattern, boolean case_sensitive) { mCaseSensitive = case_sensitive; if (mCaseSensitive) mPattern = pattern; else mPattern = pattern.toUpperCase (); } /** * Accept string nodes that contain the string. * @param node The node to check. */ public boolean accept (Node node) { String string; boolean ret; ret = false; if (node instanceof StringNode) { string = ((StringNode)node).getText (); if (!mCaseSensitive) string = string.toUpperCase (); ret = -1 != string.indexOf (mPattern); } return (ret); } } --- NEW FILE: TagNameFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/TagNameFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.TagNode; /** * This class accepts all tags matching the tag name. 
*/ public class TagNameFilter implements NodeFilter { /** * The tag name to match. */ protected String mName; /** * Creates a new instance of TagNameFilter that accepts tags with the given name. * @param name The tag name to match. */ public TagNameFilter (String name) { mName = name.toUpperCase (); } /** * Accept nodes that are tags and have a matching tag name. * This discards non-tag nodes and end tags. * The end tags are available on the enclosing non-end tag. * @param node The node to check. */ public boolean accept (Node node) { return ((node instanceof TagNode) && !((TagNode)node).isEndTag () && ((TagNode)node).getTagName ().equals (mName)); } } --- NEW FILE: package.html --- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <!-- @(#)package.html 1.60 98/01/27 HTMLParser Library v1_4_20031026 - A java-based parser for HTML Copyright (C) Dec 31, 2000 Somik Raha This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA For any questions or suggestions, you can write to me at : Email :so...@in... Postal Address : Somik Raha Extreme Programmer & Coach Industrial Logic Corporation 2583 Cedar Street, Berkeley, CA 94708, USA Website : http://www.industriallogic.com --> <TITLE>Filters Package</TITLE> </HEAD> <BODY> The filters package contains example filters to select only desired nodes. 
For example, to display tags having the "id" attribute, you could use:
<pre>
Parser parser = new Parser ("http://yadda");
parser.parse (new HasAttributeFilter ("id"));
</pre>
These filters can be combined to yield powerful extraction capabilities.
For example, to get a list of links whose content is an image, you could use:
<pre>
NodeList list = new NodeList ();
NodeFilter filter =
    new AndFilter (
        new TagNameFilter ("A"),
        new HasChildFilter (
            new TagNameFilter ("IMG")));
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
    e.nextNode ().collectInto (list, filter);
</pre>
</BODY>
</HTML>
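The filter composition described in package.html can be tried without the library on the classpath. The following is only a sketch under stated assumptions: `SimpleNode`, `SimpleFilter`, and the helper methods are hypothetical stand-ins for `Node`, `NodeFilter`, `TagNameFilter`, `HasChildFilter`, `AndFilter`, and `collectInto`, and the depth-first traversal is an assumption about how `collectInto` walks the tree, not a copy of the committed code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FilterCompositionSketch {
    /** Hypothetical stand-in for org.htmlparser.Node: a tag name plus children. */
    public static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        public SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Hypothetical stand-in for org.htmlparser.NodeFilter. */
    public interface SimpleFilter {
        boolean accept(SimpleNode node);
    }

    /** Like TagNameFilter: matches the tag name, ignoring case. */
    public static SimpleFilter tagName(String name) {
        final String upper = name.toUpperCase(Locale.ROOT);
        return node -> node.name.toUpperCase(Locale.ROOT).equals(upper);
    }

    /** Like HasChildFilter: accepts a node when any direct child passes the filter. */
    public static SimpleFilter hasChild(SimpleFilter filter) {
        return node -> {
            for (SimpleNode child : node.children) {
                if (filter.accept(child)) {
                    return true;
                }
            }
            return false;
        };
    }

    /** Like AndFilter: both sides must accept. */
    public static SimpleFilter and(SimpleFilter left, SimpleFilter right) {
        return node -> left.accept(node) && right.accept(node);
    }

    /** Assumed depth-first collection: test the node itself, then recurse into children. */
    public static void collectInto(SimpleNode node, List<SimpleNode> out, SimpleFilter filter) {
        if (filter.accept(node)) {
            out.add(node);
        }
        for (SimpleNode child : node.children) {
            collectInto(child, out, filter);
        }
    }

    /** Counts links (A tags) that have an IMG as a direct child. */
    public static int countImageLinks(SimpleNode root) {
        List<SimpleNode> out = new ArrayList<>();
        collectInto(root, out, and(tagName("A"), hasChild(tagName("IMG"))));
        return out.size();
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("BODY",
            new SimpleNode("A", new SimpleNode("IMG")),   // link wrapping an image
            new SimpleNode("A", new SimpleNode("B")),     // link wrapping bold text
            new SimpleNode("P", new SimpleNode("IMG")));  // image outside a link
        System.out.println(countImageLinks(root));
    }
}
```

Each combinator returns another filter, so arbitrarily deep expressions can be built and passed to a single traversal.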
From: <der...@us...> - 2003-11-08 21:31:01
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/util Modified Files: IteratorImpl.java ParserUtils.java PeekingIterator.java Added Files: PeekingIteratorImpl.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. --- NEW FILE: PeekingIteratorImpl.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/PeekingIteratorImpl.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:57 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.util; import java.util.Vector; import org.htmlparser.Node; import org.htmlparser.lexer.Lexer; /** * @deprecated shouldn't need to pre-read tags. 
*/ public class PeekingIteratorImpl implements PeekingIterator { Lexer mLexer; Vector preRead; ParserFeedback feedback; public PeekingIteratorImpl (Lexer lexer, ParserFeedback fb) { mLexer = lexer; preRead = new Vector (25); feedback = fb; } public Node peek () throws ParserException { Node ret; if (null == mLexer) ret = null; else try { ret = mLexer.nextNode (); if (null != ret) { // kick off recursion for the top level node if (ret instanceof org.htmlparser.tags.Tag) { org.htmlparser.tags.Tag tag; String name; org.htmlparser.scanners.TagScanner scanner; tag = (org.htmlparser.tags.Tag)ret; if (!tag.isEndTag ()) { // now recurse if there is a scanner for this type of tag scanner = tag.getThisScanner (); if ((null != scanner) && scanner.evaluate (tag, null)) ret = scanner.scan (tag, mLexer.getPage ().getUrl (), mLexer); } } preRead.addElement (ret); } } catch (Exception e) { StringBuffer msgBuffer = new StringBuffer(); msgBuffer.append("Unexpected Exception occurred while reading "); msgBuffer.append(mLexer.getPage ().getUrl ()); msgBuffer.append(", in nextHTMLNode"); // reader.appendLineDetails(msgBuffer); ParserException ex = new ParserException(msgBuffer.toString(),e); feedback.error(msgBuffer.toString(),ex); throw ex; } return (ret); } /** * Makes <code>node</code> the next <code>Node</code> that will be returned. * @param node The node to return next. */ public void push (Node node) { preRead.insertElementAt (node, 0); } /** * Check if more nodes are available. * @return <code>true</code> if a call to <code>nextNode()</code> will succeed. */ public boolean hasMoreNodes() throws ParserException { boolean ret; if (null == mLexer) ret = false; else if (0 != preRead.size ()) ret = true; else ret = !(null == peek ()); return (ret); } /** * Get the next node. * @return The next node in the HTML stream, or null if there are no more nodes. 
*/ public Node nextNode() throws ParserException { Node ret; if (hasMoreNodes ()) ret = (Node)preRead.remove (0); else // should perhaps throw an exception? ret = null; return (ret); } } Index: IteratorImpl.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/IteratorImpl.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** IteratorImpl.java 6 Nov 2003 03:00:40 -0000 1.34 --- IteratorImpl.java 8 Nov 2003 21:30:57 -0000 1.35 *************** *** 29,118 **** package org.htmlparser.util; - import java.util.Vector; - import org.htmlparser.Node; import org.htmlparser.lexer.Lexer; ! public class IteratorImpl implements PeekingIterator { Lexer mLexer; ! Vector preRead; ! ParserFeedback feedback; public IteratorImpl (Lexer lexer, ParserFeedback fb) { mLexer = lexer; ! preRead = new Vector (25); ! feedback = fb; ! } ! ! public Node peek () throws ParserException ! { ! Node ret; ! ! if (null == mLexer) ! ret = null; ! else ! try ! { ! ret = mLexer.nextNode (); ! if (null != ret) ! { ! // kick off recursion for the top level node ! if (ret instanceof org.htmlparser.tags.Tag) ! { ! org.htmlparser.tags.Tag tag; ! String name; ! org.htmlparser.scanners.TagScanner scanner; ! ! tag = (org.htmlparser.tags.Tag)ret; ! if (!tag.isEndTag ()) ! { ! // now recurse if there is a scanner for this type of tag ! scanner = tag.getThisScanner (); ! if ((null != scanner) && scanner.evaluate (tag, null)) ! ret = scanner.scan (tag, mLexer.getPage ().getUrl (), mLexer); ! } ! } ! ! preRead.addElement (ret); ! } ! } ! catch (Exception e) { ! StringBuffer msgBuffer = new StringBuffer(); ! msgBuffer.append("Unexpected Exception occurred while reading "); ! msgBuffer.append(mLexer.getPage ().getUrl ()); ! msgBuffer.append(", in nextHTMLNode"); ! // reader.appendLineDetails(msgBuffer); ! ParserException ex = new ParserException(msgBuffer.toString(),e); ! 
feedback.error(msgBuffer.toString(),ex); ! throw ex; ! } ! ! return (ret); ! } ! ! /** ! * Makes <code>node</code> the next <code>Node</code> that will be returned. ! * @param node The node to return next. ! */ ! public void push (Node node) ! { ! preRead.insertElementAt (node, 0); } /** * Check if more nodes are available. ! * @return <code>true</code> if a call to <code>nextHTMLNode()</code> will succeed. */ ! public boolean hasMoreNodes() throws ParserException { boolean ret; ! if (null == mLexer) ! ret = false; ! else if (0 != preRead.size ()) ! ret = true; ! else ! ret = !(null == peek ()); return (ret); --- 29,62 ---- package org.htmlparser.util; import org.htmlparser.Node; + import org.htmlparser.lexer.Cursor; import org.htmlparser.lexer.Lexer; + import org.htmlparser.scanners.TagScanner; + import org.htmlparser.tags.Tag; + import org.htmlparser.util.NodeIterator; ! public class IteratorImpl implements NodeIterator { Lexer mLexer; ! ParserFeedback mFeedback; ! Cursor mCursor; public IteratorImpl (Lexer lexer, ParserFeedback fb) { mLexer = lexer; ! mFeedback = fb; ! mCursor = new Cursor (mLexer.getPage (), 0); } /** * Check if more nodes are available. ! * @return <code>true</code> if a call to <code>nextNode()</code> will succeed. */ ! public boolean hasMoreNodes() throws ParserException ! { boolean ret; ! mCursor.setPosition (mLexer.getPosition ()); ! ret = 0 != mLexer.getPage ().getCharacter (mCursor); // more characters? return (ret); *************** *** 123,135 **** * @return The next node in the HTML stream, or null if there are no more nodes. */ ! public Node nextNode() throws ParserException { Node ret; ! if (hasMoreNodes ()) ! ret = (Node)preRead.remove (0); ! else ! // should perhaps throw an exception? ! ret = null; return (ret); } --- 67,109 ---- * @return The next node in the HTML stream, or null if there are no more nodes. */ ! public Node nextNode() throws ParserException ! { Node ret; ! try ! { ! ret = mLexer.nextNode (); ! if (null != ret) ! 
{ ! // kick off recursion for the top level node ! if (ret instanceof Tag) ! { ! Tag tag; ! String name; ! TagScanner scanner; + tag = (Tag)ret; + if (!tag.isEndTag ()) + { + // now recurse if there is a scanner for this type of tag + scanner = tag.getThisScanner (); + if ((null != scanner) && scanner.evaluate (tag, null)) + ret = scanner.scan (tag, mLexer.getPage ().getUrl (), mLexer); + } + } + } + } + catch (Exception e) + { + StringBuffer msgBuffer = new StringBuffer(); + msgBuffer.append("Unexpected Exception occurred while reading "); + msgBuffer.append(mLexer.getPage ().getUrl ()); + msgBuffer.append(", in nextHTMLNode"); + // reader.appendLineDetails(msgBuffer); + ParserException ex = new ParserException(msgBuffer.toString(),e); + mFeedback.error(msgBuffer.toString(),ex); + throw ex; + } + return (ret); } Index: ParserUtils.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/ParserUtils.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** ParserUtils.java 26 Oct 2003 19:46:28 -0000 1.33 --- ParserUtils.java 8 Nov 2003 21:30:57 -0000 1.34 *************** *** 34,38 **** --- 34,40 ---- import org.htmlparser.Node; + import org.htmlparser.NodeFilter; import org.htmlparser.Parser; + import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.Tag; *************** *** 105,119 **** /** ! * Search given node and pick up any objects of given type, return ! * Node array. ! * @param node ! * @param type ! * @return Node[] */ ! public static Node[] findTypeInNode(Node node, Class type) { ! NodeList nodeList = new NodeList(); ! node.collectInto(nodeList, type); ! Node spans[] = nodeList.toNodeArray(); ! return spans; } --- 107,125 ---- /** ! * Search given node and pick up any objects of given type. ! * @param node The node to search. ! * @param type The class to search for. ! * @return A node array with the matching nodes. */ ! 
public static Node[] findTypeInNode(Node node, Class type) ! { ! NodeFilter filter; ! NodeList ret; ! ! ret = new NodeList (); ! filter = new NodeClassFilter (type); ! node.collectInto (ret, filter); ! ! return (ret.toNodeArray ()); } Index: PeekingIterator.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/PeekingIterator.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** PeekingIterator.java 26 Oct 2003 19:46:28 -0000 1.17 --- PeekingIterator.java 8 Nov 2003 21:30:57 -0000 1.18 *************** *** 31,34 **** --- 31,37 ---- import org.htmlparser.Node; + /** + * @deprecated shouldn't need to pre-read tags. + */ public interface PeekingIterator extends NodeIterator{ /** |
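For readers unfamiliar with the pre-read pattern that this commit deprecates, here is a minimal sketch of the buffer behaviour seen in PeekingIteratorImpl: `peek` reads one more item and appends it to the buffer, `push` puts an item at the front, and `next` drains the buffer before touching the source. `PeekingSketch` and its method names are hypothetical, it wraps a plain `java.util.Iterator` rather than a `Lexer`, and it assumes the source never yields `null`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PeekingSketch<T> {
    private final Iterator<T> source;
    private final List<T> preRead = new ArrayList<>();

    public PeekingSketch(Iterator<T> source) {
        this.source = source;
    }

    /** Reads one more item from the source into the buffer; returns null when exhausted. */
    public T peek() {
        if (!source.hasNext()) {
            return null;
        }
        T item = source.next();
        preRead.add(item);
        return item;
    }

    /** Makes the given item the next one returned. */
    public void push(T item) {
        preRead.add(0, item);
    }

    /** True if the buffer holds an item or one more can be read from the source. */
    public boolean hasMore() {
        return !preRead.isEmpty() || peek() != null;
    }

    /** Returns the next buffered item, or null when nothing is left. */
    public T next() {
        return hasMore() ? preRead.remove(0) : null;
    }

    /** Peeks twice, pushes one item to the front, then drains everything in order. */
    public static List<String> demo() {
        PeekingSketch<String> it = new PeekingSketch<>(Arrays.asList("a", "b", "c").iterator());
        it.peek();      // buffers "a"
        it.peek();      // buffers "b"
        it.push("x");   // "x" goes ahead of the buffered items
        List<String> out = new ArrayList<>();
        while (it.hasMore()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The revised IteratorImpl in the diff above drops the buffer: it keeps only the lexer, the feedback object, and a cursor, and reads a node only when `nextNode()` is called.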
From: <der...@us...> - 2003-11-08 21:31:00
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tests/utilTests Modified Files: NodeListTest.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: NodeListTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/NodeListTest.java,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** NodeListTest.java 26 Oct 2003 19:46:27 -0000 1.21 --- NodeListTest.java 8 Nov 2003 21:30:57 -0000 1.22 *************** *** 126,132 **** } - public void collectInto(NodeList collectionList, String filter) { - } - public String toHtml() { return null; --- 126,129 ---- |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tests/scannersTests Modified Files: BodyScannerTest.java FormScannerTest.java HtmlTest.java InputTagScannerTest.java LabelScannerTest.java LinkScannerTest.java MetaTagScannerTest.java OptionTagScannerTest.java SelectTagScannerTest.java TextareaTagScannerTest.java TitleScannerTest.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: BodyScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/BodyScannerTest.java,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** BodyScannerTest.java 26 Oct 2003 19:46:26 -0000 1.16 --- BodyScannerTest.java 8 Nov 2003 21:30:57 -0000 1.17 *************** *** 59,63 **** assertEquals("Body","This is a body tag",bodyTag.getBody()); assertEquals("Body","<body>This is a body tag</body>",bodyTag.toHtml()); - assertEquals("Body Scanner",bodyScanner,bodyTag.getThisScanner()); } --- 59,62 ---- *************** *** 73,77 **** BodyTag bodyTag = (BodyTag) node[4]; assertStringEquals("Body",body,bodyTag.toHtml()); - assertEquals("Body Scanner",bodyScanner,bodyTag.getThisScanner()); } --- 72,75 ---- *************** *** 87,91 **** BodyTag bodyTag = (BodyTag) node[4]; assertEquals("Body",body,bodyTag.toHtml()); - assertEquals("Body Scanner",bodyScanner,bodyTag.getThisScanner()); } --- 85,88 ---- *************** *** 101,105 **** BodyTag bodyTag = (BodyTag) node[1]; assertEquals("Body",body + "</body>",bodyTag.toHtml()); - 
assertEquals("Body Scanner",bodyScanner,bodyTag.getThisScanner()); } --- 98,101 ---- Index: FormScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/FormScannerTest.java,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** FormScannerTest.java 1 Nov 2003 21:55:43 -0000 1.40 --- FormScannerTest.java 8 Nov 2003 21:30:57 -0000 1.41 *************** *** 309,313 **** for (NodeIterator e = parser.elements(); e.hasMoreNodes();) nodes[i++] = e.nextNode(); ! assertEquals ("Expected nodes", 40, i); } } --- 309,313 ---- for (NodeIterator e = parser.elements(); e.hasMoreNodes();) nodes[i++] = e.nextNode(); ! assertEquals ("Expected nodes", 39, i); } } Index: HtmlTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/HtmlTest.java,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** HtmlTest.java 26 Oct 2003 19:46:26 -0000 1.13 --- HtmlTest.java 8 Nov 2003 21:30:57 -0000 1.14 *************** *** 30,33 **** --- 30,34 ---- import org.htmlparser.Node; + import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.scanners.HtmlScanner; import org.htmlparser.scanners.TitleScanner; *************** *** 64,68 **** Html html = (Html)node[0]; NodeList nodeList = new NodeList(); ! html.collectInto(nodeList, TitleTag.class); assertEquals("nodelist size",1,nodeList.size()); Node node = nodeList.elementAt(0); --- 65,70 ---- Html html = (Html)node[0]; NodeList nodeList = new NodeList(); ! NodeClassFilter filter = new NodeClassFilter (TitleTag.class); ! 
html.collectInto(nodeList, filter); assertEquals("nodelist size",1,nodeList.size()); Node node = nodeList.elementAt(0); Index: InputTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/InputTagScannerTest.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** InputTagScannerTest.java 26 Oct 2003 19:46:26 -0000 1.29 --- InputTagScannerTest.java 8 Nov 2003 21:30:57 -0000 1.30 *************** *** 59,63 **** // check the input node InputTag inputTag = (InputTag) node[0]; - assertEquals("Input Scanner",scanner,inputTag.getThisScanner()); assertEquals("Type","text",inputTag.getAttribute("TYPE")); assertEquals("Name","Google",inputTag.getAttribute("NAME")); --- 59,62 ---- Index: LabelScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/LabelScannerTest.java,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** LabelScannerTest.java 1 Nov 2003 04:03:21 -0000 1.41 --- LabelScannerTest.java 8 Nov 2003 21:30:57 -0000 1.42 *************** *** 62,66 **** assertEquals("Label","This is a label tag",labelTag.getLabel()); assertStringEquals("Label", html, labelTag.toHtml()); - assertEquals("Label Scanner",labelScanner,labelTag.getThisScanner()); } --- 62,65 ---- *************** *** 76,80 **** LabelTag labelTag = (LabelTag) node[0]; assertStringEquals("Label",label,labelTag.toHtml()); - assertEquals("Label Scanner",labelScanner,labelTag.getThisScanner()); } --- 75,78 ---- *************** *** 92,96 **** assertEquals("Label value","Span within label",labelTag.getLabel()); assertStringEquals("Label", html, labelTag.toHtml()); - assertEquals("Label Scanner",labelScanner,labelTag.getThisScanner()); } --- 90,93 ---- *************** *** 108,112 **** assertEquals("Label value","Jane Doe 
Smith",labelTag.getLabel()); assertStringEquals("Label",html,labelTag.toHtml()); - assertEquals("Label Scanner",labelScanner,labelTag.getThisScanner()); } --- 105,108 ---- Index: LinkScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/LinkScannerTest.java,v retrieving revision 1.47 retrieving revision 1.48 diff -C2 -d -r1.47 -r1.48 *** LinkScannerTest.java 29 Oct 2003 03:31:18 -0000 1.47 --- LinkScannerTest.java 8 Nov 2003 21:30:57 -0000 1.48 *************** *** 299,319 **** * tag - <A>Revision<\a> * Reported by Mazlan Mat */ ! public void testFreshMeatBug() throws ParserException { ! createParser("<a>Revision</a>","http://www.yahoo.com"); ! // Register the image scanner parser.addScanner(new LinkScanner("-l")); ! parseAndAssertNodeCount(3); assertTrue("Node 0 should be a tag",node[0] instanceof Tag); Tag tag = (Tag)node[0]; ! assertEquals("Tag Contents","a",tag.getText()); ! assertTrue("Node 1 should be a string node",node[1] instanceof StringNode); ! StringNode stringNode = (StringNode)node[1]; assertEquals("StringNode Contents","Revision",stringNode.getText()); - assertTrue("Node 2 should be an end tag",node[2] instanceof Tag); - tag = (Tag)node[2]; - assertTrue("Node 2 should be an end tag",tag.isEndTag ()); - assertEquals("End Tag Contents","/a",tag.getText()); } --- 299,319 ---- * tag - <A>Revision<\a> * Reported by Mazlan Mat + * Note: Actually, this is completely legal HTML - Derrick */ ! public void testFreshMeatBug() throws ParserException ! { ! String html = "<a>Revision</a>"; ! createParser(html,"http://www.yahoo.com"); ! // Register the link scanner parser.addScanner(new LinkScanner("-l")); ! parseAndAssertNodeCount(1); assertTrue("Node 0 should be a tag",node[0] instanceof Tag); Tag tag = (Tag)node[0]; ! assertEquals("Tag Contents",html,tag.toHtml()); ! assertEquals("Node 0 should have one child", 1, tag.getChildren ().size ()); ! 
assertTrue("The child should be a string node", tag.getChildren ().elementAt (0) instanceof StringNode); ! StringNode stringNode = (StringNode)tag.getChildren ().elementAt (0); assertEquals("StringNode Contents","Revision",stringNode.getText()); } Index: MetaTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/MetaTagScannerTest.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** MetaTagScannerTest.java 26 Oct 2003 19:46:26 -0000 1.34 --- MetaTagScannerTest.java 8 Nov 2003 21:30:57 -0000 1.35 *************** *** 95,100 **** assertEquals("Meta Tag 18 Contents","text/html; charset=ISO-8859-1",metaTag.getMetaContent()); assertEquals("Meta Tag 18 Http-Equiv","content-type",metaTag.getHttpEquiv()); - - assertEquals("This Scanner",scanner,metaTag.getThisScanner()); } --- 95,98 ---- Index: OptionTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/OptionTagScannerTest.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** OptionTagScannerTest.java 28 Oct 2003 03:04:19 -0000 1.33 --- OptionTagScannerTest.java 8 Nov 2003 21:30:57 -0000 1.34 *************** *** 74,79 **** continue; assertTrue("Node " + j + " should be Option Tag",node[j] instanceof OptionTag); - OptionTag OptionTag = (OptionTag) node[j]; - assertEquals("Option Scanner",scanner,OptionTag.getThisScanner()); } } --- 74,77 ---- Index: SelectTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/SelectTagScannerTest.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** SelectTagScannerTest.java 28 Oct 2003 03:04:19 -0000 1.32 --- SelectTagScannerTest.java 8 Nov 2003 21:30:57 
-0000 1.33 *************** *** 75,90 **** parseAndAssertNodeCount(5); - assertTrue(node[0] instanceof SelectTag); - assertTrue(node[1] instanceof SelectTag); - assertTrue(node[2] instanceof SelectTag); - assertTrue(node[3] instanceof SelectTag); - assertTrue(node[4] instanceof SelectTag); // check the Select node for(int j=0;j<nodeCount;j++) ! { ! SelectTag SelectTag = (SelectTag) node[j]; ! assertEquals("Select Scanner",scanner,SelectTag.getThisScanner()); ! } SelectTag selectTag = (SelectTag)node[0]; --- 75,82 ---- parseAndAssertNodeCount(5); // check the Select node for(int j=0;j<nodeCount;j++) ! assertTrue(node[j] instanceof SelectTag); SelectTag selectTag = (SelectTag)node[0]; Index: TextareaTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/TextareaTagScannerTest.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** TextareaTagScannerTest.java 28 Oct 2003 03:04:19 -0000 1.30 --- TextareaTagScannerTest.java 8 Nov 2003 21:30:57 -0000 1.31 *************** *** 64,79 **** parser.addScanner(scanner); parseAndAssertNodeCount(5); - assertTrue(node[0] instanceof TextareaTag); - assertTrue(node[1] instanceof TextareaTag); - assertTrue(node[2] instanceof TextareaTag); - assertTrue(node[3] instanceof TextareaTag); - assertTrue(node[4] instanceof TextareaTag); // check the Textarea node for(int j=0;j<nodeCount;j++) ! { ! TextareaTag TextareaTag = (TextareaTag) node[j]; ! assertEquals("Textarea Scanner",scanner,TextareaTag.getThisScanner()); ! } } } --- 64,71 ---- parser.addScanner(scanner); parseAndAssertNodeCount(5); // check the Textarea node for(int j=0;j<nodeCount;j++) ! 
assertTrue(node[j] instanceof TextareaTag); } } Index: TitleScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/TitleScannerTest.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** TitleScannerTest.java 26 Oct 2003 19:46:27 -0000 1.33 --- TitleScannerTest.java 8 Nov 2003 21:30:57 -0000 1.34 *************** *** 63,67 **** TitleTag titleTag = (TitleTag) node[2]; assertEquals("Title","Yahoo!",titleTag.getTitle()); - assertEquals("Title Scanner",titleScanner,titleTag.getThisScanner()); } --- 63,66 ---- |
From: <der...@us...> - 2003-11-08 21:31:00
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tests/tagTests Modified Files: CompositeTagTest.java ImageTagTest.java ObjectCollectionTest.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: CompositeTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/CompositeTagTest.java,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** CompositeTagTest.java 26 Oct 2003 19:46:27 -0000 1.11 --- CompositeTagTest.java 8 Nov 2003 21:30:57 -0000 1.12 *************** *** 100,104 **** CompositeTag parent = (CompositeTag)stringNode[0].getParent(); int pos = parent.findPositionOf(stringNode[0]); ! assertEquals("position",5,pos); } } --- 100,106 ---- CompositeTag parent = (CompositeTag)stringNode[0].getParent(); int pos = parent.findPositionOf(stringNode[0]); ! /* a(b(),string("sdsd"),/b(),string("Hello World")) */ ! /* 0 1 2 3 */ ! 
assertEquals("position",3,pos); } } Index: ImageTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/ImageTagTest.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** ImageTagTest.java 29 Oct 2003 03:31:18 -0000 1.35 --- ImageTagTest.java 8 Nov 2003 21:30:57 -0000 1.36 *************** *** 37,40 **** --- 37,41 ---- import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; + import org.htmlparser.util.ParserUtils; import org.htmlparser.util.SimpleNodeIterator; *************** *** 168,182 **** public ImageTag extractLinkImage (LinkTag link) { ! NodeList subElements = new NodeList (); ! link.collectInto (subElements, ImageTag.class); ! SimpleNodeIterator subScan = subElements.elements (); ! while (subScan.hasMoreNodes ()) ! { ! Node subNode = subScan.nextNode (); ! if (subNode instanceof ImageTag) ! return (ImageTag) subNode; ! } ! ! return null; } --- 169,174 ---- public ImageTag extractLinkImage (LinkTag link) { ! Node[] list = ParserUtils.findTypeInNode (link, ImageTag.class); ! return (0 == list.length ? null : (ImageTag)list[0]); } Index: ObjectCollectionTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/ObjectCollectionTest.java,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** ObjectCollectionTest.java 26 Oct 2003 19:46:27 -0000 1.15 --- ObjectCollectionTest.java 8 Nov 2003 21:30:57 -0000 1.16 *************** *** 39,42 **** --- 39,43 ---- import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; + import org.htmlparser.util.ParserUtils; public class ObjectCollectionTest extends ParserTestCase { *************** *** 89,95 **** parseAndAssertNodeCount(1); Div div = (Div)node[0]; ! NodeList nodeList = new NodeList(); ! 
div.collectInto(nodeList,Span.class); ! Node[] spans = nodeList.toNodeArray(); assertSpanContent(spans); } --- 90,94 ---- parseAndAssertNodeCount(1); Div div = (Div)node[0]; ! Node[] spans = ParserUtils.findTypeInNode (div, Span.class); assertSpanContent(spans); } *************** *** 111,116 **** TableTag tableTag = (TableTag)node[0]; NodeList nodeList = new NodeList(); ! tableTag.collectInto(nodeList,Span.class); ! Node [] spans = nodeList.toNodeArray(); assertSpanContent(spans); } --- 110,114 ---- TableTag tableTag = (TableTag)node[0]; NodeList nodeList = new NodeList(); ! Node[] spans = ParserUtils.findTypeInNode (tableTag, Span.class); assertSpanContent(spans); } |
From: <der...@us...> - 2003-11-08 21:31:00
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tests Modified Files: AllTests.java ParserTest.java ParserTestCase.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: AllTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/AllTests.java,v retrieving revision 1.55 retrieving revision 1.56 diff -C2 -d -r1.55 -r1.56 *** AllTests.java 26 Oct 2003 19:46:24 -0000 1.55 --- AllTests.java 8 Nov 2003 21:30:57 -0000 1.56 *************** *** 63,66 **** --- 63,67 ---- suite.addTest (org.htmlparser.tests.parserHelperTests.AllTests.suite ()); suite.addTest (org.htmlparser.tests.nodeDecoratorTests.AllTests.suite ()); + suite.addTestSuite (org.htmlparser.tests.filterTests.FilterTest.class); return (suite); Index: ParserTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTest.java,v retrieving revision 1.47 retrieving revision 1.48 diff -C2 -d -r1.47 -r1.48 *** ParserTest.java 26 Oct 2003 19:46:25 -0000 1.47 --- ParserTest.java 8 Nov 2003 21:30:57 -0000 1.48 *************** *** 41,48 **** --- 41,51 ---- import org.htmlparser.Parser; import org.htmlparser.StringNode; + import org.htmlparser.filters.NodeClassFilter; + import org.htmlparser.filters.TagNameFilter; import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; import org.htmlparser.scanners.FormScanner; import org.htmlparser.scanners.TagScanner; + import org.htmlparser.tags.BodyTag; import 
org.htmlparser.tags.ImageTag; import org.htmlparser.tags.LinkTag; *************** *** 401,406 **** --- 404,414 ---- { parser = new Parser("http://www.sony.co.jp", Parser.noFeedback); + parser.registerScanners (); assertEquals("Character set by default is ISO-8859-1", "ISO-8859-1", parser.getEncoding ()); enumeration = parser.elements(); + // search for the <BODY> tag + while (enumeration.hasMoreNodes ()) + if (enumeration.nextNode () instanceof BodyTag) + break; assertTrue("Character set should be Shift_JIS", parser.getEncoding ().equalsIgnoreCase ("Shift_JIS")); } *************** *** 446,449 **** --- 454,458 ---- parser = new Parser(url); + parser.registerScanners (); for (NodeIterator e = parser.elements();e.hasMoreNodes();) e.nextNode(); *************** *** 466,469 **** --- 475,479 ---- parser = new Parser(url); + parser.registerScanners (); for (NodeIterator e = parser.elements();e.hasMoreNodes();) e.nextNode(); *************** *** 534,537 **** --- 544,548 ---- page.setConnection (connection); parser = new Parser (new Lexer (page)); + parser.registerScanners (); // must be the default assertTrue ("Wrong encoding", parser.getEncoding ().equals ("ISO-8859-1")); *************** *** 627,635 **** parser.registerScanners(); NodeList collectionList = new NodeList(); ! ! for (NodeIterator e = parser.elements();e.hasMoreNodes();) { ! Node node = e.nextNode(); ! node.collectInto(collectionList,LinkTag.class); ! } assertEquals("Size of collection vector should be 11",11,collectionList.size()); // All items in collection vector should be links --- 638,644 ---- parser.registerScanners(); NodeList collectionList = new NodeList(); ! NodeClassFilter filter = new NodeClassFilter (LinkTag.class); ! for (NodeIterator e = parser.elements();e.hasMoreNodes();) ! 
e.nextNode().collectInto(collectionList,filter); assertEquals("Size of collection vector should be 11",11,collectionList.size()); // All items in collection vector should be links *************** *** 683,691 **** parser.registerScanners(); NodeList collectionList = new NodeList(); ! ! for (NodeIterator e = parser.elements();e.hasMoreNodes();) { ! Node node = e.nextNode(); ! node.collectInto(collectionList,ImageTag.IMAGE_TAG_FILTER); ! } assertEquals("Size of collection vector should be 5",5,collectionList.size()); // All items in collection vector should be links --- 692,698 ---- parser.registerScanners(); NodeList collectionList = new NodeList(); ! TagNameFilter filter = new TagNameFilter ("IMG"); ! for (NodeIterator e = parser.elements();e.hasMoreNodes();) ! e.nextNode().collectInto(collectionList,filter); assertEquals("Size of collection vector should be 5",5,collectionList.size()); // All items in collection vector should be links Index: ParserTestCase.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTestCase.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** ParserTestCase.java 1 Nov 2003 21:55:43 -0000 1.38 --- ParserTestCase.java 8 Nov 2003 21:30:57 -0000 1.39 *************** *** 48,52 **** import org.htmlparser.tags.Tag; import org.htmlparser.util.DefaultParserFeedback; - import org.htmlparser.util.IteratorImpl; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.ParserException; --- 48,51 ---- *************** *** 226,230 **** } ! public void assertXmlEquals(String displayMessage, String expected, String actual) throws Exception { expected = removeEscapeCharacters(expected); actual = removeEscapeCharacters(actual); --- 225,235 ---- } ! public void assertXmlEquals(String displayMessage, String expected, String actual) throws Exception ! { ! Node nextExpectedNode; ! Node nextActualNode; ! Tag tag1; ! Tag tag2; ! 
expected = removeEscapeCharacters(expected); actual = removeEscapeCharacters(actual); *************** *** 237,245 **** displayMessage = createGenericFailureMessage(displayMessage, expected, actual); ! Node nextExpectedNode = null, nextActualNode = null; do { ! nextExpectedNode = getNextNodeUsing(expectedIterator); ! nextActualNode = getNextNodeUsing(actualIterator); assertNotNull (nextActualNode); assertStringValueMatches( displayMessage, --- 242,261 ---- displayMessage = createGenericFailureMessage(displayMessage, expected, actual); ! nextExpectedNode = null; ! nextActualNode = null; ! tag1 = null; ! tag2 = null; do { ! if (null != tag1) ! nextExpectedNode = tag1; ! else ! nextExpectedNode = getNextNodeUsing (expectedIterator); ! if (null != tag2) ! nextActualNode = tag2; ! else ! nextActualNode = getNextNodeUsing (actualIterator); assertNotNull (nextActualNode); + tag1 = fixIfXmlEndTag (nextExpectedNode); + tag2 = fixIfXmlEndTag (nextActualNode); assertStringValueMatches( displayMessage, *************** *** 247,256 **** nextActualNode ); - fixIfXmlEndTag(actualIterator, nextActualNode); - fixIfXmlEndTag(expectedIterator, nextExpectedNode); assertSameType(displayMessage, nextExpectedNode, nextActualNode); assertTagEquals(displayMessage, nextExpectedNode, nextActualNode); } ! while (expectedIterator.hasMoreNodes()); assertActualXmlHasNoMoreNodes(displayMessage, actualIterator); } --- 263,270 ---- nextActualNode ); assertSameType(displayMessage, nextExpectedNode, nextActualNode); assertTagEquals(displayMessage, nextExpectedNode, nextActualNode); } ! while (expectedIterator.hasMoreNodes() || (null != tag1)); assertActualXmlHasNoMoreNodes(displayMessage, actualIterator); } *************** *** 305,312 **** } ! // TODO: ! // Man, this is really screwed up. ! private void fixIfXmlEndTag (NodeIterator iterator, Node node) { if (node instanceof Tag) { --- 319,330 ---- } ! /** ! * Return a following tag if node is an empty XML tag. ! */ ! 
private Tag fixIfXmlEndTag (Node node) { + Tag ret; + + ret = null; if (node instanceof Tag) { *************** *** 315,323 **** { tag.setEmptyXmlTag (false); ! node = new Tag (tag.getPage (), tag.getStartPosition (), tag.getEndPosition (), tag.getAttributesEx ()); ! // cheat here and poink the new node into the iterator ! ((IteratorImpl)iterator).push (node); } } } --- 333,341 ---- { tag.setEmptyXmlTag (false); ! ret = new Tag (tag.getPage (), tag.getStartPosition (), tag.getEndPosition (), tag.getAttributesEx ()); } } + + return (ret); } *************** *** 392,402 **** } public void assertType( String message, Class expectedType, ! Object object) { String expectedTypeName = expectedType.getName(); String actualTypeName = object.getClass().getName(); ! if (!actualTypeName.equals(expectedTypeName)) { fail( message+" should have been of type\n"+ --- 410,438 ---- } + public void assertSuperType( + String message, + Class expectedType, + Object object) + { + String expectedTypeName = expectedType.getName(); + String actualTypeName = object.getClass().getName(); + if (!expectedType.isAssignableFrom (object.getClass ())) + fail( + message+" should have been of type\n"+ + expectedTypeName+ + " but was of type \n"+ + actualTypeName+"\n and is :"+((Node)object).toHtml() + ); + } + public void assertType( String message, Class expectedType, ! Object object) ! { ! String expectedTypeName = expectedType.getName(); String actualTypeName = object.getClass().getName(); ! if (!actualTypeName.equals(expectedTypeName)) fail( message+" should have been of type\n"+ *************** *** 405,409 **** actualTypeName+"\n and is :"+((Node)object).toHtml() ); - } } --- 441,444 ---- |
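Note on the ParserTestCase.java change above: the new assertSuperType differs from assertType only in the comparison it makes. assertType compares class names for an exact match, while assertSuperType uses Class.isAssignableFrom, so it also passes for subclasses of the expected type. A minimal sketch of the two checks, using JDK classes in place of the parser's node classes (TypeCheckSketch and its method names are made up for illustration and are not part of the library):

```java
class TypeCheckSketch {
    // Exact match on the class name, the comparison assertType performs.
    static boolean sameType(Class<?> expected, Object object) {
        return object.getClass().getName().equals(expected.getName());
    }

    // Accepts the expected class or any subclass of it,
    // the comparison assertSuperType performs.
    static boolean superType(Class<?> expected, Object object) {
        return expected.isAssignableFrom(object.getClass());
    }
}
```

With these, an Integer passes superType against Number.class but fails sameType against it, which is presumably why the looser check was added once generic tags can stand in for scanner-specific ones.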
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tags Modified Files: BaseHrefTag.java CompositeTag.java LinkTag.java MetaTag.java ScriptTag.java Tag.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: BaseHrefTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BaseHrefTag.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** BaseHrefTag.java 6 Nov 2003 03:00:27 -0000 1.31 --- BaseHrefTag.java 8 Nov 2003 21:30:56 -0000 1.32 *************** *** 32,35 **** --- 32,36 ---- import org.htmlparser.lexer.Page; import org.htmlparser.util.LinkProcessor; + import org.htmlparser.util.ParserException; /** *************** *** 92,96 **** * This sets the base URL to use for the rest of the page. */ ! public void doSemanticAction () { Page page; --- 93,97 ---- * This sets the base URL to use for the rest of the page. */ ! 
public void doSemanticAction () throws ParserException { Page page; Index: CompositeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v retrieving revision 1.64 retrieving revision 1.65 diff -C2 -d -r1.64 -r1.65 *** CompositeTag.java 6 Nov 2003 03:00:28 -0000 1.64 --- CompositeTag.java 8 Nov 2003 21:30:57 -0000 1.65 *************** *** 30,33 **** --- 30,34 ---- import org.htmlparser.Node; + import org.htmlparser.NodeFilter; import org.htmlparser.StringNode; import org.htmlparser.AbstractNode; *************** *** 289,314 **** } ! public void collectInto (NodeList collectionList, String filter) { ! Node node; ! ! super.collectInto (collectionList, filter); for (SimpleNodeIterator e = children(); e.hasMoreNodes ();) ! { ! node = e.nextNode (); ! node.collectInto (collectionList, filter); ! } ! } ! ! public void collectInto (NodeList collectionList, Class nodeType) ! { ! Node node; ! ! super.collectInto (collectionList,nodeType); ! for (SimpleNodeIterator e = children(); e.hasMoreNodes (); ) ! { ! node = e.nextNode (); ! node.collectInto (collectionList, nodeType); ! } } --- 290,333 ---- } ! /** ! * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node ! * satisfies the filtering criteria.<P> ! * ! * This mechanism allows powerful filtering code to be written very easily, ! * without bothering about collection of embedded tags separately. ! * e.g. when we try to get all the links on a page, it is not possible to ! * get it at the top-level, as many tags (like form tags), can contain ! * links embedded in them. We could get the links out by checking if the ! * current node is a {@link CompositeTag}, and going through its children. ! * So this method provides a convenient way to do this.<P> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to ! 
* extract all links from a page would look like: ! * <pre> ! * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagNameFilter ("A"); ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); ! * </pre> ! * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded.<P> ! * ! * Another way to accomplish the same objective is: ! * <pre> ! * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagClassFilter (LinkTag.class); ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); ! * </pre> ! * This is slightly less specific because the LinkTag class may be ! * registered for more than one node name, e.g. <LINK> tags too. ! */ ! public void collectInto (NodeList list, NodeFilter filter) { ! super.collectInto (list, filter); for (SimpleNodeIterator e = children(); e.hasMoreNodes ();) ! e.nextNode ().collectInto (list, filter); ! if (null != getEndTag ()) ! getEndTag ().collectInto (list, filter); } Index: LinkTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LinkTag.java,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** LinkTag.java 6 Nov 2003 03:00:29 -0000 1.42 --- LinkTag.java 8 Nov 2003 21:30:57 -0000 1.43 *************** *** 51,60 **** * The set of tag names that indicate the end of this tag. */ ! private static final String[] mEnders = new String[] {"A", "TD", "TR", "FORM", "LI"}; /** * The set of end tag names that indicate the end of this tag. */ ! private static final String[] mEndTagEnders = new String[] {"TD", "TR", "FORM", "LI", "BODY", "HTML"}; /** --- 51,60 ---- * The set of tag names that indicate the end of this tag. */ ! 
private static final String[] mEnders = new String[] {"A", "P", "DIV", "TD", "TR", "FORM", "LI"}; /** * The set of end tag names that indicate the end of this tag. */ ! private static final String[] mEndTagEnders = new String[] {"P", "DIV", "TD", "TR", "FORM", "LI", "BODY", "HTML"}; /** *************** *** 92,107 **** * } * </pre> - * There is another mechanism available that allows for uniform extraction - * of images. You could do this to get all images from a web page : - * <pre> - * Node node; - * Vector imageCollectionVector = new Vector(); - * for (NodeIterator e = parser.elements();e.hasMoreNode();) { - * node = e.nextHTMLNode(); - * node.collectInto(imageCollectionVector,ImageTag.IMAGE_FILTER); - * } - * </pre> - * The link tag processes all its contents in collectInto(). - * @see #linkData() */ public LinkTag () --- 92,95 ---- Index: MetaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/MetaTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** MetaTag.java 6 Nov 2003 03:00:29 -0000 1.30 --- MetaTag.java 8 Nov 2003 21:30:57 -0000 1.31 *************** *** 30,33 **** --- 30,34 ---- import org.htmlparser.lexer.nodes.Attribute; + import org.htmlparser.util.ParserException; /** *************** *** 100,103 **** --- 101,120 ---- else getAttributesEx ().add (new Attribute ("NAME", metaTagName)); + } + + /** + * Check for a charset directive, and if found, set the charset for the page. 
+ */ + public void doSemanticAction () throws ParserException + { + String httpEquiv; + String charset; + + httpEquiv = getHttpEquiv (); + if ("Content-Type".equalsIgnoreCase (httpEquiv)) + { + charset = getPage ().getCharset (getAttribute ("CONTENT")); + getPage ().setEncoding (charset); + } } Index: ScriptTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ScriptTag.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** ScriptTag.java 6 Nov 2003 03:00:31 -0000 1.31 --- ScriptTag.java 8 Nov 2003 21:30:57 -0000 1.32 *************** *** 29,32 **** --- 29,34 ---- package org.htmlparser.tags; + import org.htmlparser.scanners.ScriptScanner; + /** * A script tag. *************** *** 49,52 **** --- 51,55 ---- public ScriptTag () { + setThisScanner (new ScriptScanner ()); } Index: Tag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Tag.java,v retrieving revision 1.56 retrieving revision 1.57 diff -C2 -d -r1.56 -r1.57 *** Tag.java 6 Nov 2003 03:00:35 -0000 1.56 --- Tag.java 8 Nov 2003 21:30:57 -0000 1.57 *************** *** 137,153 **** /** - * This method verifies that the current tag matches the provided - * filter. The match is based on the string object and not its contents, - * so ensure that you are using static final filter strings provided - * in the tag classes. - * @see org.htmlparser.Node#collectInto(NodeList, String) - */ - public void collectInto(NodeList collectionList, String filter) - { - if (null != getThisScanner () && getThisScanner ().getFilter () == filter) - collectionList.add (this); - } - - /** * Handle a visitor. * <em>NOTE: This currently defers to accept(NodeVisitor). If --- 137,140 ---- |
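Note on the CompositeTag.java change above: collectInto now takes a NodeFilter and recurses into the children, so one call at the top level gathers matching nodes at any depth. The sketch below shows that shape with stand-in types only. Node, NodeFilter and FilterSketch here are simplified illustrations written for this note, not the org.htmlparser classes, and the sketch leaves out the end-tag handling that the real method performs. The Javadoc's second example names TagClassFilter, while the files added by this commit list NodeClassFilter, which is also what the updated tests construct, so that is probably the intended class.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the filter interface: a single accept-or-reject decision per node.
interface NodeFilter {
    boolean accept(Node node);
}

// Stand-in node type; "name" is a tag name, or "#text" for a string node.
class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name, Node... kids) {
        this.name = name;
        for (Node kid : kids) children.add(kid);
    }

    // Add this node if the filter accepts it, then recurse into the children.
    void collectInto(List<Node> list, NodeFilter filter) {
        if (filter.accept(this)) list.add(this);
        for (Node child : children) child.collectInto(list, filter);
    }
}

class FilterSketch {
    // Match on tag name, similar in spirit to new TagNameFilter ("A").
    static NodeFilter byName(String name) {
        return node -> node.name.equalsIgnoreCase(name);
    }

    // Accept only when both filters accept, similar in spirit to AndFilter.
    static NodeFilter and(NodeFilter left, NodeFilter right) {
        return node -> left.accept(node) && right.accept(node);
    }

    // A small tree with one link directly in BODY and one nested inside a FORM.
    static Node samplePage() {
        return new Node("HTML",
            new Node("BODY",
                new Node("A", new Node("#text")),
                new Node("FORM", new Node("A"), new Node("INPUT"))));
    }

    // Collect every link in the sample tree, however deeply it is nested.
    static List<Node> collectLinks() {
        List<Node> links = new ArrayList<>();
        samplePage().collectInto(links, byName("a"));
        return links;
    }
}
```

Because the filter is an ordinary object, combinations such as and(byName("A"), someOtherFilter) can be passed to the same collectInto call without touching the traversal code, which appears to be the point of replacing the old String and Class overloads.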
From: <der...@us...> - 2003-11-08 21:31:00
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests
In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/tests/lexerTests

Modified Files:
    TagTests.java
Log Message:
Implement generic node filtering.
Added the NodeFilter interface and the filter package.
Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner).
Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated.

Index: TagTests.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/TagTests.java,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** TagTests.java 1 Nov 2003 04:03:21 -0000 1.4
--- TagTests.java 8 Nov 2003 21:30:57 -0000 1.5
***************
*** 154,157 ****
--- 154,158 ----
          String html = "<meta name=\"foo\" content=\"foo<bar>\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 168,171 ****
--- 169,173 ----
          String html = "<meta name=\"foo\" content=\"foo<bar\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 182,185 ****
--- 184,188 ----
          String html = "<meta name=\"foo\" content=\"foobar>\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 196,199 ****
--- 199,203 ----
          String html = "<meta name=\"foo\" content=\"foo\nbar>\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 216,219 ****
--- 220,224 ----
          String html = "<meta name=\"foo\" content=\"<foo>\nbar\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 236,239 ****
--- 241,245 ----
          String html = "<meta name=\"foo\" content=\"foo>\nbar\">";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
***************
*** 256,259 ****
--- 262,266 ----
          String html = "<meta name=\"foo\" content=\"<foo\nbar\"";
          createParser(html);
+         parser.registerScanners ();
          parseAndAssertNodeCount (1);
          assertType ("should be MetaTag", MetaTag.class, node[0]);
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser Modified Files: AbstractNode.java Node.java Parser.java RemarkNode.java StringNode.java Added Files: NodeFilter.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. --- NEW FILE: NodeFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/NodeFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:56 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser; /** * Implement this interface to select particular nodes. */ public interface NodeFilter { /** * Predicate to determine whether or not to keep the given node. 
* The behaviour based on this outcome is determined by the context * in which it is called. It may lead to the node being added to a list * or printed out. See the calling routine for details. * @return <code>true</code> if the node is to be kept, <code>false</code> * if it is to be discarded. */ boolean accept (Node node); } Index: AbstractNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/AbstractNode.java,v retrieving revision 1.19 retrieving revision 1.20 diff -C2 -d -r1.19 -r1.20 *** AbstractNode.java 1 Nov 2003 21:55:42 -0000 1.19 --- AbstractNode.java 8 Nov 2003 21:30:56 -0000 1.20 *************** *** 30,36 **** import java.io.Serializable; - import org.htmlparser.lexer.Page; import org.htmlparser.util.NodeList; /** --- 30,37 ---- import java.io.Serializable; + import org.htmlparser.lexer.Page; import org.htmlparser.util.NodeList; + import org.htmlparser.util.ParserException; /** *************** *** 110,174 **** /** ! * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node ! * satisfies the filtering criteria. <P/> * ! * This mechanism allows powerful filtering code to be written very easily, without bothering about collection ! * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it ! * at the top-level, as many tags (like form tags), can contain links embedded in them. We could get the links ! * out by checking if the current node is a form tag, and going through its contents. However, this ties us down ! * to specific tags, and is not a very clean approach. <P/> * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look ! * like : * <pre> * NodeList collectionList = new NodeList(); ! * Node node; ! * String filter = LinkTag.LINK_TAG_FILTER; ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { ! 
* node = e.nextNode(); ! * node.collectInto (collectionVector, filter); ! * } * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. This of course implies that tags must ! * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as ! * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : ! * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : ! * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * ! * To find out if your desired tag has filtering support, check the API of the tag. ! */ ! public abstract void collectInto(NodeList collectionList, String filter); ! ! /** ! * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node ! * satisfies the filtering criteria. <P/> ! * ! * This mechanism allows powerful filtering code to be written very easily, without bothering about collection ! * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it ! * at the top-level, as many tags (like form tags), can contain links embedded in them. We could get the links ! * out by checking if the current node is a form tag, and going through its contents. However, this ties us down ! * to specific tags, and is not a very clean approach. <P/> * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look ! * like : * <pre> * NodeList collectionList = new NodeList(); ! * Node node; ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { ! * node = e.nextNode(); ! * node.collectInto (collectionVector, LinkTag.class); ! * } * </pre> ! * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ ! 
public void collectInto(NodeList collectionList, Class nodeType) { ! if (nodeType.getName().equals(this.getClass().getName())) ! collectionList.add(this); } --- 111,150 ---- /** ! * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node ! * satisfies the filtering criteria.<P> * ! * This mechanism allows powerful filtering code to be written very easily, ! * without bothering about collection of embedded tags separately. ! * e.g. when we try to get all the links on a page, it is not possible to ! * get it at the top-level, as many tags (like form tags), can contain ! * links embedded in them. We could get the links out by checking if the ! * current node is a {@link CompositeTag}, and going through its children. ! * So this method provides a convenient way to do this.<P> * ! * Using collectInto(), programs get a lot shorter. Now, the code to ! * extract all links from a page would look like: * <pre> * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagNameFilter ("A"); ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded.<P> * ! * Another way to accomplish the same objective is: * <pre> * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagClassFilter (LinkTag.class); ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); * </pre> ! * This is slightly less specific because the LinkTag class may be ! * registered for more than one node name, e.g. <LINK> tags too. */ ! public void collectInto (NodeList list, NodeFilter filter) { ! if (filter.accept (this)) ! list.add (this); } *************** *** 312,316 **** * The default action is to do nothing. */ ! 
public void doSemanticAction () { } --- 288,292 ---- * The default action is to do nothing. */ ! public void doSemanticAction () throws ParserException { } Index: Node.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Node.java,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** Node.java 1 Nov 2003 21:55:42 -0000 1.43 --- Node.java 8 Nov 2003 21:30:56 -0000 1.44 *************** *** 32,37 **** import org.htmlparser.util.NodeList; ! public interface Node { /** * Returns a string representation of the node. This is an important method, it allows a simple string transformation --- 32,39 ---- import org.htmlparser.util.NodeList; + import org.htmlparser.util.ParserException; ! public interface Node ! { /** * Returns a string representation of the node. This is an important method, it allows a simple string transformation *************** *** 47,50 **** --- 49,53 ---- */ public abstract String toPlainTextString(); + /** * This method will make it easier when using html parser to reproduce html pages (with or without modifications) *************** *** 53,56 **** --- 56,60 ---- */ public abstract String toHtml(); + /** * Return the string representation of the node. *************** *** 60,124 **** */ public abstract String toString(); /** ! * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node ! * satisfies the filtering criteria. <P/> * ! * This mechanism allows powerful filtering code to be written very easily, without bothering about collection ! * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it ! * at the top-level, as many tags (like form tags), can contain links embedded in them. We could get the links ! * out by checking if the current node is a form tag, and going through its contents. However, this ties us down ! 
* to specific tags, and is not a very clean approach. <P/> * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look ! * like : * <pre> * NodeList collectionList = new NodeList(); ! * Node node; ! * String filter = LinkTag.LINK_TAG_FILTER; ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { ! * node = e.nextNode(); ! * node.collectInto (collectionVector, filter); ! * } * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. This of course implies that tags must ! * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as ! * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : ! * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : ! * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * ! * To find out if your desired tag has filtering support, check the API of the tag. ! */ ! public abstract void collectInto(NodeList collectionList, String filter); ! /** ! * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node ! * satisfies the filtering criteria. <P/> ! * ! * This mechanism allows powerful filtering code to be written very easily, without bothering about collection ! * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it ! * at the top-level, as many tags (like form tags), can contain links embedded in them. We could get the links ! * out by checking if the current node is a form tag, and going through its contents. However, this ties us down ! * to specific tags, and is not a very clean approach. <P/> * ! * Using collectInto(), programs get a lot shorter. 
Now, the code to extract all links from a page would look ! * like : * <pre> * NodeList collectionList = new NodeList(); ! * Node node; ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { ! * node = e.nextNode(); ! * node.collectInto (collectionVector, LinkTag.class); ! * } * </pre> ! * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ ! public abstract void collectInto(NodeList collectionList, Class nodeType); /** * Returns the beginning position of the tag. ! * <br>deprecated Use {@link #getEndPosition} */ public abstract int elementBegin(); --- 64,106 ---- */ public abstract String toString(); + /** ! * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node ! * satisfies the filtering criteria.<P> * ! * This mechanism allows powerful filtering code to be written very easily, ! * without bothering about collection of embedded tags separately. ! * e.g. when we try to get all the links on a page, it is not possible to ! * get it at the top-level, as many tags (like form tags), can contain ! * links embedded in them. We could get the links out by checking if the ! * current node is a {@link CompositeTag}, and going through its children. ! * So this method provides a convenient way to do this.<P> * ! * Using collectInto(), programs get a lot shorter. Now, the code to ! * extract all links from a page would look like: * <pre> * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagNameFilter ("A"); ! * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded.<P> * ! * Another way to accomplish the same objective is: * <pre> * NodeList collectionList = new NodeList(); ! * NodeFilter filter = new TagClassFilter (LinkTag.class); ! 
* for (NodeIterator e = parser.elements(); e.hasMoreNodes();) ! * e.nextNode().collectInto(collectionList, filter); * </pre> ! * This is slightly less specific because the LinkTag class may be ! * registered for more than one node name, e.g. <LINK> tags too. */ ! public abstract void collectInto(NodeList collectionList, NodeFilter filter); ! /** * Returns the beginning position of the tag. ! * <br>deprecated Use {@link #getStartPosition} */ public abstract int elementBegin(); *************** *** 154,157 **** --- 136,142 ---- public abstract void setEndPosition (int position); + /** + * Apply the visitor object (of type NodeVisitor) to this node. + */ public abstract void accept(Object visitor); *************** *** 184,188 **** /** ! * Returns the text of the string line */ public String getText(); --- 169,173 ---- /** ! * Returns the text of the node. */ public String getText(); *************** *** 193,197 **** */ public void setText(String text); ! /** * Perform the meaning of this tag. --- 178,182 ---- */ public void setText(String text); ! /** * Perform the meaning of this tag. *************** *** 201,206 **** * with the character set to use (<META>), the base URL to use * (<BASE>). Other than that, the semantic meaning is up to the ! * application. */ ! public void doSemanticAction (); } --- 186,191 ---- * with the character set to use (<META>), the base URL to use * (<BASE>). Other than that, the semantic meaning is up to the ! * application and it's custom nodes. */ ! 
public void doSemanticAction () throws ParserException; } Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.73 retrieving revision 1.74 diff -C2 -d -r1.73 -r1.74 *** Parser.java 1 Nov 2003 21:55:42 -0000 1.73 --- Parser.java 8 Nov 2003 21:30:56 -0000 1.74 *************** *** 35,39 **** import java.net.URL; import java.net.URLConnection; - import java.util.HashMap; import java.util.Hashtable; import java.util.Iterator; --- 35,38 ---- *************** *** 41,50 **** import java.util.Vector; ! import org.htmlparser.Node; import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.lexer.nodes.NodeFactory; - import org.htmlparser.lexer.nodes.TagNode; import org.htmlparser.nodeDecorators.DecodingNode; import org.htmlparser.nodeDecorators.EscapeCharacterRemovingNode; --- 40,49 ---- import java.util.Vector; ! import org.htmlparser.filters.TagNameFilter; ! 
import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.lexer.nodes.NodeFactory; import org.htmlparser.nodeDecorators.DecodingNode; import org.htmlparser.nodeDecorators.EscapeCharacterRemovingNode; *************** *** 72,76 **** import org.htmlparser.tags.ImageTag; import org.htmlparser.tags.LinkTag; - import org.htmlparser.tags.MetaTag; import org.htmlparser.tags.Tag; import org.htmlparser.util.DefaultParserFeedback; --- 71,74 ---- *************** *** 573,577 **** { tag = ((CompositeTagScanner)scanner).createTag (null, 0, 0, null, null, null, null); - tag.setThisScanner (scanner); mBlastocyst.put (ids[i], tag); } --- 571,574 ---- *************** *** 579,583 **** { tag = scanner.createTag (null, 0, 0, null, null, null); - tag.setThisScanner (scanner); mBlastocyst.put (ids[i], tag); } --- 576,579 ---- *************** *** 612,677 **** * } * </pre> */ ! public NodeIterator elements() throws ParserException { ! boolean remove_scanner; ! Node node; ! TagNode tag; ! MetaTag meta; ! String httpEquiv; ! String charset; ! String original; ! IteratorImpl ret; ! ! ret = new IteratorImpl (getLexer (), feedback); ! original = getLexer ().getPage ().getEncoding (); ! remove_scanner = false; ! try ! { ! if (null == mScanners.get ("META")) ! { ! addScanner (new MetaTagScanner ("-m")); ! remove_scanner = true; ! } ! ! /* pre-read up to </HEAD> looking for charset directive */ ! while (null != (node = ret.peek ())) ! { ! if (node instanceof TagNode) ! { ! tag = (TagNode)node; ! if (tag instanceof MetaTag) ! { // check for charset on Content-Type ! meta = (MetaTag)node; ! httpEquiv = meta.getAttribute ("HTTP-EQUIV"); ! if ("Content-Type".equalsIgnoreCase (httpEquiv)) ! { ! charset = getLexer ().getPage ().getCharset (meta.getAttribute ("CONTENT")); ! if (!charset.equalsIgnoreCase (original)) ! { // oops, different character set, restart ! 
getLexer ().getPage ().setEncoding (charset); ! getLexer ().setPosition (0); ! ret = new IteratorImpl (getLexer (), feedback); ! } ! // once we see the Content-Type meta tag we're finished the pre-read ! break; ! } ! } ! else if (tag.isEndTag ()) ! { ! if (tag.getTagName ().equalsIgnoreCase ("HEAD")) ! // or, once we see the </HEAD> tag we're finished the pre-read ! break; ! } ! } ! } ! } ! finally ! { ! if (remove_scanner) ! mScanners.remove ("META"); ! } ! ! return ret; } --- 608,616 ---- * } * </pre> + * @param filter The filter to apply to the nodes. */ ! public NodeIterator elements () throws ParserException { ! return (new IteratorImpl (getLexer (), feedback)); } *************** *** 707,743 **** /** ! * Parse the given resource, using the filter provided */ ! public void parse(String filter) throws Exception { Node node; ! for (NodeIterator e=elements();e.hasMoreNodes();) { ! node = e.nextNode(); ! if (node!=null) { ! if (filter==null) ! System.out.println(node.toString()); ! else ! { ! // There is a filter. Find if the associated filter of this node ! // matches the specified filter ! if (!(node instanceof Tag)) ! continue; ! Tag tag=(Tag)node; ! TagScanner scanner = tag.getThisScanner(); ! if (scanner==null) ! continue; ! ! String tagFilter = scanner.getFilter(); ! if (tagFilter==null) ! continue; ! if (tagFilter.equals(filter)) ! System.out.println(node.toString()); ! } } ! else System.out.println("Node is null"); } - } --- 646,672 ---- /** ! * Parse the given resource, using the filter provided. ! * @param filter The filter to apply to the parsed nodes. */ ! public void parse (NodeFilter filter) throws ParserException { + NodeIterator e; Node node; ! NodeList list; ! ! list = new NodeList (); ! for (e = elements (); e.hasMoreNodes (); ) { ! node = e.nextNode (); ! if (null != filter) { ! node.collectInto (list, filter); ! for (int i = 0; i < list.size (); i++) ! System.out.println (list.elementAt (i)); ! list.removeAll (); } ! else ! 
System.out.println (node); } } *************** *** 928,966 **** { System.out.println(); ! System.out.println("Syntax : java -jar htmlparser.jar <resourceLocn/website> -l"); ! System.out.println(" <resourceLocn> the name of the file to be parsed (with complete path if not in current directory)"); ! System.out.println(" -l Show only the link tags extracted from the document"); ! System.out.println(" -i Show only the image tags extracted from the document"); ! System.out.println(" -s Show only the Javascript code extracted from the document"); ! System.out.println(" -t Show only the Style code extracted from the document"); ! System.out.println(" -a Show only the Applet tag extracted from the document"); ! System.out.println(" -j Parse JSP tags"); ! System.out.println(" -m Parse Meta tags"); ! System.out.println(" -T Extract the Title"); ! System.out.println(" -f Extract forms"); ! System.out.println(" -r Extract frameset"); ! System.out.println(" -help This screen"); ! System.out.println(); ! System.out.println("HTML Parser home page : http://htmlparser.sourceforge.net"); System.out.println(); System.out.println("Example : java -jar htmlparser.jar http://www.yahoo.com"); System.out.println(); ! System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. "); System.exit(-1); } ! try { ! Parser parser = new Parser(args[0]); ! System.out.println("Parsing " + parser.getURL ()); ! parser.registerScanners(); ! try { ! if (args.length==2) ! { ! parser.parse(args[1]); ! } else ! parser.parse(null); ! } ! catch (Exception e) { ! e.printStackTrace(); ! } } catch (ParserException e) { --- 857,885 ---- { System.out.println(); ! System.out.println("Syntax : java -jar htmlparser.jar <resourceLocn/website> [node_type]"); ! 
System.out.println(" <resourceLocn/website> the URL or file to be parsed"); ! System.out.println(" node_type an optional node name, for example:"); ! System.out.println(" A - Show only the link tags extracted from the document"); ! System.out.println(" IMG - Show only the image tags extracted from the document"); ! System.out.println(" TITLE - Extract the title from the document"); System.out.println(); System.out.println("Example : java -jar htmlparser.jar http://www.yahoo.com"); System.out.println(); ! System.out.println("For support, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page..."); ! System.out.println("HTML Parser home page : http://htmlparser.sourceforge.net"); ! System.out.println(); System.exit(-1); } ! try ! { ! Parser parser = new Parser (args[0]); ! parser.registerScanners (); ! System.out.println ("Parsing " + parser.getURL ()); ! NodeFilter filter; ! if (1 < args.length) ! filter = new TagNameFilter (args[1]); ! else ! filter = null; ! parser.parse (filter); } catch (ParserException e) { *************** *** 993,1002 **** } ! public Node [] extractAllNodesThatAre(Class nodeType) throws ParserException { ! NodeList nodeList = new NodeList(); ! for (NodeIterator e = elements();e.hasMoreNodes();) { ! e.nextNode().collectInto(nodeList,nodeType); ! } ! return nodeList.toNodeArray(); } --- 912,942 ---- } ! /** ! * Extract all nodes matching the given filter. ! * @see Node#collectInto() ! */ ! public NodeList extractAllNodesThatMatch (NodeFilter filter) throws ParserException ! { ! NodeIterator e; ! NodeList ret; ! ! ret = new NodeList (); ! for (e = elements (); e.hasMoreNodes (); ) ! e.nextNode ().collectInto (ret, filter); ! ! return (ret); ! } ! ! /** ! * Convenience method to extract all nodes of a given class type. ! * @see Node#collectInto() ! */ ! public Node [] extractAllNodesThatAre (Class nodeType) throws ParserException ! { ! NodeList ret; ! ! 
ret = extractAllNodesThatMatch (new NodeClassFilter (nodeType)); ! ! return (ret.toNodeArray ()); } Index: RemarkNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/RemarkNode.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** RemarkNode.java 1 Nov 2003 01:36:56 -0000 1.35 --- RemarkNode.java 8 Nov 2003 21:30:56 -0000 1.36 *************** *** 91,98 **** } - public void collectInto(NodeList collectionList, String filter) { - if (filter==REMARK_NODE_FILTER) collectionList.add(this); - } - /** * Remark visiting code. --- 91,94 ---- Index: StringNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/StringNode.java,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** StringNode.java 1 Nov 2003 01:36:56 -0000 1.43 --- StringNode.java 8 Nov 2003 21:30:56 -0000 1.44 *************** *** 85,92 **** } - public void collectInto(NodeList collectionList, String filter) { - if (filter==STRING_FILTER) collectionList.add(this); - } - /** * String visiting code. --- 85,88 ---- |
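The NodeFilter contract and the collectInto() javadoc in the diff above can be illustrated with a small self-contained sketch. Everything in it is a simplified stand-in: the Node and NodeFilter interfaces below are reduced, hypothetical versions rather than the org.htmlparser types, SimpleNode and Filters are invented for this example, and the recursion into children is an assumption about how composite nodes are expected to honour collectInto() (the AbstractNode version shown in the diff only tests the node itself).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, reduced stand-ins for org.htmlparser.Node and
// org.htmlparser.NodeFilter; not the library's real interfaces.
interface Node {
    String name();
    List<Node> children();
}

interface NodeFilter {
    boolean accept(Node node);
}

// Minimal tree node used only to drive the sketch.
final class SimpleNode implements Node {
    private final String name;
    private final List<Node> children;

    SimpleNode(String name, Node... children) {
        this.name = name;
        this.children = new ArrayList<>(Arrays.asList(children));
    }

    public String name() { return name; }
    public List<Node> children() { return children; }
}

final class Filters {
    // Accepts nodes whose name matches, ignoring case (cf. TagNameFilter).
    static NodeFilter tagName(String name) {
        return node -> name.equalsIgnoreCase(node.name());
    }

    // Accepts nodes accepted by both filters (cf. AndFilter).
    static NodeFilter and(NodeFilter left, NodeFilter right) {
        return node -> left.accept(node) && right.accept(node);
    }

    // Accepts nodes rejected by the given filter (cf. NotFilter).
    static NodeFilter not(NodeFilter filter) {
        return node -> !filter.accept(node);
    }

    // Adds the node if the filter keeps it, then visits its children,
    // so matches are found however deeply they are nested.
    static void collectInto(Node node, List<Node> list, NodeFilter filter) {
        if (filter.accept(node))
            list.add(node);
        for (Node child : node.children())
            collectInto(child, list, filter);
    }
}
```

The point of the design is that the traversal knows nothing about link tags, form tags or any other specific type: the predicate alone decides what is kept, and predicates compose without touching the node classes.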
From: <der...@us...> - 2003-11-08 21:30:59
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/scanners

Modified Files:
    LinkScanner.java
Log Message:
Implement generic node filtering.
Added the NodeFilter interface and the filter package.
Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner).
Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated.

Index: LinkScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/LinkScanner.java,v
retrieving revision 1.60
retrieving revision 1.61
diff -C2 -d -r1.60 -r1.61
*** LinkScanner.java 31 Oct 2003 12:56:08 -0000 1.60
--- LinkScanner.java 8 Nov 2003 21:30:56 -0000 1.61
***************
*** 94,97 ****
--- 94,98 ----
      public boolean evaluate (Tag tag, TagScanner previousOpenScanner)
      {
+         // actually, this is a bogus test, A tags can just have a name or id and be a destination anchor
          return (null != tag.getAttributeEx ("HREF"));
      }
|
From: <der...@us...> - 2003-11-08 21:30:59
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes
In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/lexer/nodes

Modified Files:
    RemarkNode.java StringNode.java TagNode.java
Log Message:
Implement generic node filtering.
Added the NodeFilter interface and the filter package.
Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner).
Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated.

Index: RemarkNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/RemarkNode.java,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** RemarkNode.java 1 Nov 2003 01:36:56 -0000 1.11
--- RemarkNode.java 8 Nov 2003 21:30:56 -0000 1.12
***************
*** 94,101 ****
      }

-     public void collectInto(NodeList collectionList, String filter) {
-         if (filter==REMARK_NODE_FILTER) collectionList.add(this);
-     }
-
      public void accept(Object visitor) {
      }
--- 94,97 ----
Index: StringNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/StringNode.java,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** StringNode.java 1 Nov 2003 01:36:56 -0000 1.12
--- StringNode.java 8 Nov 2003 21:30:56 -0000 1.13
***************
*** 103,113 ****
      }

-
-     public void collectInto (NodeList collectionList, String filter)
-     {
-         if (STRING_FILTER == filter)
-             collectionList.add (this);
-     }
-
      public void accept (Object visitor)
      {
--- 103,106 ----
Index: TagNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/TagNode.java,v
retrieving revision 1.22
retrieving revision 1.23
diff -C2 -d -r1.22 -r1.23
*** TagNode.java 1 Nov 2003 21:55:42 -0000 1.22
--- TagNode.java 8 Nov 2003 21:30:56 -0000 1.23
***************
*** 661,675 ****

      /**
-      * This method verifies that the current tag matches the provided
-      * filter. The match is based on the string object and not its contents,
-      * so ensure that you are using static final filter strings provided
-      * in the tag classes.
-      * @see org.htmlparser.Node#collectInto(NodeList, String)
-      */
-     public void collectInto (NodeList collectionList, String filter)
-     {
-     }
-
-     /**
       * Returns table of attributes in the tag
       * @return Hashtable
--- 661,664 ----
|
From: <der...@us...> - 2003-11-08 21:30:59
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodeDecorators
In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/nodeDecorators

Modified Files:
    AbstractNodeDecorator.java
Log Message:
Implement generic node filtering.
Added the NodeFilter interface and the filter package.
Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner).
Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated.

Index: AbstractNodeDecorator.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodeDecorators/AbstractNodeDecorator.java,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** AbstractNodeDecorator.java 1 Nov 2003 21:55:42 -0000 1.14
--- AbstractNodeDecorator.java 8 Nov 2003 21:30:56 -0000 1.15
***************
*** 32,36 ****
--- 32,38 ----

  import org.htmlparser.Node;
+ import org.htmlparser.NodeFilter;
  import org.htmlparser.util.NodeList;
+ import org.htmlparser.util.ParserException;

  public abstract class AbstractNodeDecorator implements Node {
***************
*** 45,54 ****
      }

!     public void collectInto(NodeList collectionList, Class nodeType) {
!         delegate.collectInto(collectionList, nodeType);
!     }
!
!     public void collectInto(NodeList collectionList, String filter) {
!         delegate.collectInto(collectionList, filter);
      }

--- 47,52 ----
      }

!     public void collectInto(NodeList list, NodeFilter filter) {
!         delegate.collectInto(list, filter);
      }

***************
*** 147,151 ****
      }

!     public void doSemanticAction () {
          delegate.doSemanticAction ();
      }
--- 145,149 ----
      }

!     public void doSemanticAction () throws ParserException {
          delegate.doSemanticAction ();
      }
|
From: <der...@us...> - 2003-11-08 21:30:59
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/lexer Modified Files: Page.java Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: Page.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** Page.java 4 Nov 2003 01:25:02 -0000 1.24 --- Page.java 8 Nov 2003 21:30:56 -0000 1.25 *************** *** 645,652 **** /** ! * Resets this page and begins reading from the source with the ! * given character set. * @param character_set The character set to use to convert bytes into * characters. */ public void setEncoding (String character_set) --- 645,667 ---- /** ! * Begins reading from the source with the given character set. ! * If the current encoding is the same as the requested encoding, ! * this method is a no-op. Otherwise any subsequent characters read from ! * this page will have been decoded using the given character set.<p> ! * Some magic happens here to obtain this result if characters have already ! * been consumed from this page. ! * Since a Reader cannot be dynamically altered to use a different character ! * set, the underlying stream is reset, a new Source is constructed ! * and a comparison made of the characters read so far with the newly ! * read characters up to the current position. ! * If a difference is encountered, or some other problem occurs, ! * an exception is thrown. * @param character_set The character set to use to convert bytes into * characters. 
+ * @exception ParserException If a character mismatch occurs between + * characters already provided and those that would have been returned + * had the new character set been in effect from the beginning. An + * exception is also thrown if the underlying stream won't put up with + * these shenanigans. */ public void setEncoding (String character_set) *************** *** 655,672 **** { InputStream stream; ! ! stream = getSource ().getStream (); ! try { ! stream.reset (); ! if (!getEncoding ().equals (character_set)) { mSource = new Source (stream, character_set); ! mIndex = new PageIndex (this); } - } - catch (IOException ioe) - { - throw new ParserException (ioe.getMessage (), ioe); } } --- 670,705 ---- { InputStream stream; ! char[] buffer; ! int offset; ! char[] new_chars; ! ! if (!getEncoding ().equals (character_set)) { ! stream = getSource ().getStream (); ! try { + buffer = mSource.mBuffer; + offset = mSource.mOffset; + stream.reset (); mSource = new Source (stream, character_set); ! if (0 != offset) ! { ! new_chars = new char[offset]; ! if (offset != mSource.read (new_chars)) ! throw new ParserException ("reset stream failed"); ! for (int i = 0; i < offset; i++) ! if (new_chars[i] != buffer[i]) ! throw new ParserException ("character mismatch (new: " ! + new_chars[i] ! + " != old: " ! + buffer[i] ! + ") for encoding at offset " ! + offset); ! } ! } ! catch (IOException ioe) ! { ! throw new ParserException (ioe.getMessage (), ioe); } } } |
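Note on the change above: the new setEncoding comment describes resetting the underlying stream, decoding it again with the requested character set, and comparing the re-read characters against those already handed out. The sketch below illustrates that technique with plain java.io classes only. It is an assumption-laden illustration, not the Page/Source implementation: EncodingSwitchSketch, switchEncoding and tryDemo are hypothetical names, the demo always starts in ISO-8859-1 and switches to UTF-8, and the sketch reports the index of the first mismatching character, which is a choice made here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSwitchSketch {

    /**
     * Rewind the stream, decode it again with the new character set, and
     * check that the characters already consumed would have been the same.
     * Returns a reader positioned just after the consumed characters.
     * Assumes the stream supports reset() back to its start.
     */
    static Reader switchEncoding(InputStream stream, char[] consumed, Charset charset)
            throws IOException {
        stream.reset();
        Reader reader = new InputStreamReader(stream, charset);
        for (int i = 0; i < consumed.length; i++) {
            int c = reader.read();
            if (c == -1) {
                throw new IOException("reset stream failed");
            }
            if ((char) c != consumed[i]) {
                throw new IOException("character mismatch at offset " + i);
            }
        }
        return reader;
    }

    /**
     * Demo helper: read 'count' characters as ISO-8859-1, switch to UTF-8,
     * and return the remaining text, or "error: ..." if the switch fails.
     */
    static String tryDemo(byte[] bytes, int count) {
        try {
            // ByteArrayInputStream.reset() returns to the start when mark() was never called.
            InputStream stream = new ByteArrayInputStream(bytes);
            Reader first = new InputStreamReader(stream, StandardCharsets.ISO_8859_1);
            char[] consumed = new char[count];
            for (int i = 0; i < count; i++) {
                consumed[i] = (char) first.read();
            }
            Reader second = switchEncoding(stream, consumed, StandardCharsets.UTF_8);
            StringBuilder rest = new StringBuilder();
            for (int c = second.read(); c != -1; c = second.read()) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException ioe) {
            return "error: " + ioe.getMessage();
        }
    }

    public static void main(String[] args) {
        // ASCII prefix decodes identically in both character sets, so the switch succeeds.
        System.out.println(tryDemo("<meta>\u00e9".getBytes(StandardCharsets.UTF_8), 6));
        // A non-ASCII character was already consumed with the wrong character set.
        System.out.println(tryDemo("\u00e9<meta>".getBytes(StandardCharsets.UTF_8), 1));
    }
}
```

The second call fails because the first UTF-8 byte was already delivered as a separate ISO-8859-1 character, which is the "character mismatch" situation the @exception text describes.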
From: <der...@us...> - 2003-11-08 21:30:59
|
Update of /cvsroot/htmlparser/htmlparser In directory sc8-pr-cvs1:/tmp/cvs-serv18855 Modified Files: build.xml Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. Index: build.xml =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v retrieving revision 1.52 retrieving revision 1.53 diff -C2 -d -r1.52 -r1.53 *** build.xml 29 Oct 2003 03:31:17 -0000 1.52 --- build.xml 8 Nov 2003 21:30:55 -0000 1.53 *************** *** 262,265 **** --- 262,266 ---- <include name="org/htmlparser/AbstractNode.class"/> <include name="org/htmlparser/Node.class"/> + <include name="org/htmlparser/NodeFilter.class"/> <include name="org/htmlparser/util/ParserException.class"/> <include name="org/htmlparser/util/ChainedException.class"/> *************** *** 269,272 **** --- 270,274 ---- <include name="org/htmlparser/util/SpecialHashtable.class"/> <include name="org/htmlparser/util/LinkProcessor.class"/> + <include name="org/htmlparser/util/Translate.class"/> <include name="org/htmlparser/util/sort/**/*.class"/> <include name="org/htmlparser/parserHelper/SpecialHashtable.class"/> |
From: <der...@us...> - 2003-11-08 20:45:08
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/filterTests In directory sc8-pr-cvs1:/tmp/cvs-serv10983/filterTests Log Message: Directory /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/filterTests added to the repository |
From: <der...@us...> - 2003-11-08 19:37:27
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters In directory sc8-pr-cvs1:/tmp/cvs-serv31330/filters Log Message: Directory /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters added to the repository |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags In directory sc8-pr-cvs1:/tmp/cvs-serv6198/tags Modified Files: AppletTag.java BaseHrefTag.java BodyTag.java Bullet.java BulletList.java CompositeTag.java Div.java DoctypeTag.java FormTag.java FrameSetTag.java FrameTag.java HeadTag.java Html.java ImageTag.java InputTag.java JspTag.java LabelTag.java LinkTag.java MetaTag.java OptionTag.java ScriptTag.java SelectTag.java Span.java StyleTag.java TableColumn.java TableRow.java TableTag.java Tag.java TextareaTag.java TitleTag.java Log Message: The tags now own their ids, enders and end tag enders. The isTagToBeEndedFor logic now uses information from the tags, not the scanners. The kludge to get the scanner from the NodeFactory is now gone too; this also comes from the tag. Index: AppletTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/AppletTag.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** AppletTag.java 26 Oct 2003 19:46:21 -0000 1.34 --- AppletTag.java 6 Nov 2003 03:00:27 -0000 1.35 *************** *** 46,52 **** public class AppletTag extends CompositeTag { public AppletTag () { ! setTagName ("APPLET"); } --- 46,82 ---- public class AppletTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"APPLET"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"BODY", "HTML"}; + + /** + * Create a new applet tag. + */ public AppletTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! 
* @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: BaseHrefTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BaseHrefTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** BaseHrefTag.java 1 Nov 2003 21:55:43 -0000 1.30 --- BaseHrefTag.java 6 Nov 2003 03:00:27 -0000 1.31 *************** *** 30,34 **** package org.htmlparser.tags; - import java.util.Vector; import org.htmlparser.lexer.Page; import org.htmlparser.util.LinkProcessor; --- 30,33 ---- *************** *** 40,46 **** public class BaseHrefTag extends Tag { public BaseHrefTag () { ! setTagName ("BASE"); } --- 39,61 ---- public class BaseHrefTag extends Tag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"BASE"}; + + /** + * Create a new base tag. + */ public BaseHrefTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: BodyTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BodyTag.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** BodyTag.java 26 Oct 2003 19:46:23 -0000 1.17 --- BodyTag.java 6 Nov 2003 03:00:28 -0000 1.18 *************** *** 35,41 **** public class BodyTag extends CompositeTag { public BodyTag () { ! setTagName ("BODY"); } --- 35,80 ---- public class BodyTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"BODY"}; + + /** + * The set of end tag names that indicate the end of this tag. 
+ */ + private static final String[] mEndTagEnders = new String[] {"HTML"}; + + /** + * Create a new body tag. + */ public BodyTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: Bullet.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Bullet.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** Bullet.java 26 Oct 2003 19:46:23 -0000 1.17 --- Bullet.java 6 Nov 2003 03:00:28 -0000 1.18 *************** *** 32,40 **** * A bullet tag. */ ! public class Bullet extends CompositeTag { public Bullet () { ! setTagName ("LI"); } } --- 32,79 ---- * A bullet tag. */ ! public class Bullet extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"LI"}; ! ! /** ! * The set of end tag names that indicate the end of this tag. ! */ ! private static final String[] mEndTagEnders = new String[] {"UL", "OL", "BODY", "HTML"}; + /** + * Create a new bullet tag. + */ public Bullet () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! 
public String[] getEnders () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } } Index: BulletList.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BulletList.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** BulletList.java 26 Oct 2003 19:46:23 -0000 1.17 --- BulletList.java 6 Nov 2003 03:00:28 -0000 1.18 *************** *** 33,41 **** * Either <UL> or <OL>. */ ! public class BulletList extends CompositeTag { public BulletList () { ! setTagName ("UL"); // could be "OL" too } } --- 33,71 ---- * Either <UL> or <OL>. */ ! public class BulletList extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"UL", "OL"}; + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"BODY", "HTML"}; + + /** + * Create a new bullet list (ordered or unordered) tag. + */ public BulletList () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! 
return (mEndTagEnders); } } Index: CompositeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v retrieving revision 1.63 retrieving revision 1.64 diff -C2 -d -r1.63 -r1.64 *** CompositeTag.java 4 Nov 2003 01:25:02 -0000 1.63 --- CompositeTag.java 6 Nov 2003 03:00:28 -0000 1.64 *************** *** 29,37 **** package org.htmlparser.tags; ! import java.util.Vector; ! import org.htmlparser.*; import org.htmlparser.AbstractNode; - import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.TagNode; import org.htmlparser.util.NodeList; import org.htmlparser.util.SimpleNodeIterator; --- 29,37 ---- package org.htmlparser.tags; ! import org.htmlparser.Node; ! import org.htmlparser.StringNode; import org.htmlparser.AbstractNode; import org.htmlparser.lexer.nodes.TagNode; + import org.htmlparser.scanners.CompositeTagScanner; import org.htmlparser.util.NodeList; import org.htmlparser.util.SimpleNodeIterator; *************** *** 44,52 **** * the {@link #toHtml toHtml} method. */ ! public abstract class CompositeTag extends Tag { protected TagNode mEndTag; public CompositeTag () { } --- 44,63 ---- * the {@link #toHtml toHtml} method. */ ! public class CompositeTag extends Tag ! { ! /** ! * The tag that causes this tag to finish. ! * May be a virtual tag generated by the scanning logic. ! */ protected TagNode mEndTag; + /** + * The default scanner for non-composite tags. 
+ */ + protected final static CompositeTagScanner mDefaultScanner = new CompositeTagScanner (); + public CompositeTag () { + setThisScanner (mDefaultScanner); } Index: Div.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Div.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** Div.java 26 Oct 2003 19:46:23 -0000 1.17 --- Div.java 6 Nov 2003 03:00:28 -0000 1.18 *************** *** 34,40 **** public class Div extends CompositeTag { public Div () { ! setTagName ("DIV"); } } --- 34,70 ---- public class Div extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"DIV"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"BODY", "HTML"}; + + /** + * Create a new div tag. + */ public Div () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } } Index: DoctypeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/DoctypeTag.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** DoctypeTag.java 1 Nov 2003 01:36:57 -0000 1.32 --- DoctypeTag.java 6 Nov 2003 03:00:28 -0000 1.33 *************** *** 35,41 **** public class DoctypeTag extends Tag { public DoctypeTag () { ! setTagName ("!DOCTYPE"); } --- 35,57 ---- public class DoctypeTag extends Tag { + /** + * The set of names handled by this tag. 
+ */ + private static final String[] mIds = new String[] {"!DOCTYPE"}; + + /** + * Create a new !doctype tag. + */ public DoctypeTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: FormTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FormTag.java,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** FormTag.java 1 Nov 2003 21:55:43 -0000 1.39 --- FormTag.java 6 Nov 2003 03:00:28 -0000 1.40 *************** *** 48,55 **** protected String mFormLocation; public FormTag () { - setTagName ("FORM"); mFormLocation = null; } --- 48,94 ---- protected String mFormLocation; + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"FORM"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"HTML", "BODY"}; + + /** + * Create a new form tag. + */ public FormTag () { mFormLocation = null; + } + + /** + * Return the set of names handled by this tag. + * @return The names to be matched that create tags of this type. + */ + public String[] getIds () + { + return (mIds); + } + + /** + * Return the set of tag names that cause this tag to finish. + * @return The names of following tags that stop further scanning. + */ + public String[] getEnders () + { + return (mIds); + } + + /** + * Return the set of end tag names that cause this tag to finish. + * @return The names of following end tags that stop further scanning. 
+ */ + public String[] getEndTagEnders () + { + return (mEndTagEnders); } Index: FrameSetTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameSetTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** FrameSetTag.java 1 Nov 2003 01:36:57 -0000 1.30 --- FrameSetTag.java 6 Nov 2003 03:00:28 -0000 1.31 *************** *** 38,44 **** public class FrameSetTag extends CompositeTag { public FrameSetTag () { ! setTagName ("FRAMESET"); } --- 38,74 ---- public class FrameSetTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"FRAMESET"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"HTML"}; + + /** + * Create a new frame set tag. + */ public FrameSetTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: FrameTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameTag.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** FrameTag.java 1 Nov 2003 01:36:57 -0000 1.29 --- FrameTag.java 6 Nov 2003 03:00:28 -0000 1.30 *************** *** 36,42 **** public class FrameTag extends Tag { public FrameTag () { ! setTagName ("FRAME"); } --- 36,58 ---- public class FrameTag extends Tag { + /** + * The set of names handled by this tag. 
+ */ + private static final String[] mIds = new String[] {"FRAME"}; + + /** + * Create a new frame tag. + */ public FrameTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: HeadTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/HeadTag.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** HeadTag.java 26 Oct 2003 19:46:24 -0000 1.17 --- HeadTag.java 6 Nov 2003 03:00:28 -0000 1.18 *************** *** 37,44 **** public class HeadTag extends CompositeTag { public HeadTag () { ! setTagName ("HEAD"); } --- 37,87 ---- public class HeadTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"HEAD"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"HEAD", "BODY"}; + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"HTML"}; + + /** + * Create a new head tag. + */ public HeadTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! 
return (mEndTagEnders); } Index: Html.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Html.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** Html.java 26 Oct 2003 19:46:24 -0000 1.29 --- Html.java 6 Nov 2003 03:00:28 -0000 1.30 *************** *** 34,41 **** public class Html extends CompositeTag { public Html () { ! setTagName ("HTML"); } } --- 34,56 ---- public class Html extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"HTML"}; + /** + * Create a new html tag. + */ public Html () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } } Index: ImageTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ImageTag.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** ImageTag.java 1 Nov 2003 21:55:43 -0000 1.34 --- ImageTag.java 6 Nov 2003 03:00:28 -0000 1.35 *************** *** 44,47 **** --- 44,52 ---- /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"IMG"}; + + /** * Holds the set value of the SRC attribute, since this can differ * from the attribute value due to relative references resolved by *************** *** 50,57 **** protected String imageURL; public ImageTag () { - setTagName ("IMG"); imageURL = null; } --- 55,73 ---- protected String imageURL; + /** + * Create a new image tag. + */ public ImageTag () { imageURL = null; + } + + /** + * Return the set of names handled by this tag. + * @return The names to be matched that create tags of this type. 
+ */ + public String[] getIds () + { + return (mIds); } Index: InputTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/InputTag.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** InputTag.java 26 Oct 2003 19:46:24 -0000 1.29 --- InputTag.java 6 Nov 2003 03:00:28 -0000 1.30 *************** *** 36,42 **** public class InputTag extends Tag { public InputTag () { ! setTagName ("INPUT"); } --- 36,58 ---- public class InputTag extends Tag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"INPUT"}; + + /** + * Create a new input tag. + */ public InputTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: JspTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/JspTag.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** JspTag.java 1 Nov 2003 01:36:57 -0000 1.34 --- JspTag.java 6 Nov 2003 03:00:28 -0000 1.35 *************** *** 37,43 **** public class JspTag extends Tag { public JspTag () { ! setTagName ("%"); } --- 37,59 ---- public class JspTag extends Tag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"%", "%=", "%@"}; + + /** + * Create a new jsp tag. + */ public JspTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! 
return (mIds); } Index: LabelTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LabelTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** LabelTag.java 26 Oct 2003 19:46:24 -0000 1.30 --- LabelTag.java 6 Nov 2003 03:00:28 -0000 1.31 *************** *** 37,43 **** public class LabelTag extends CompositeTag { public LabelTag () { ! setTagName ("LABEL"); } --- 37,68 ---- public class LabelTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"LABEL"}; + + /** + * Create a new label tag. + */ public LabelTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mIds); } Index: LinkTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LinkTag.java,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** LinkTag.java 1 Nov 2003 21:55:43 -0000 1.41 --- LinkTag.java 6 Nov 2003 03:00:29 -0000 1.42 *************** *** 42,45 **** --- 42,61 ---- { public static final String LINK_TAG_FILTER="-l"; + + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"A"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"A", "TD", "TR", "FORM", "LI"}; + + /** + * The set of end tag names that indicate the end of this tag. 
+ */ + private static final String[] mEndTagEnders = new String[] {"TD", "TR", "FORM", "LI", "BODY", "HTML"}; + /** * The URL where the link points to *************** *** 58,62 **** /** ! * Constructor creates an LinkNode object, which basically stores the location * where the link points to, and the text it contains. * <p> --- 74,78 ---- /** ! * Constructor creates an LinkTag object, which basically stores the location * where the link points to, and the text it contains. * <p> *************** *** 91,95 **** public LinkTag () { ! setTagName ("A"); } --- 107,137 ---- public LinkTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: MetaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/MetaTag.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** MetaTag.java 26 Oct 2003 19:46:24 -0000 1.29 --- MetaTag.java 6 Nov 2003 03:00:29 -0000 1.30 *************** *** 36,43 **** public class MetaTag extends Tag { ! public MetaTag () { ! setTagName ("META"); } --- 36,58 ---- public class MetaTag extends Tag { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"META"}; ! ! /** ! * Create a new meta tag. ! */ public MetaTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! 
* @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: OptionTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/OptionTag.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** OptionTag.java 26 Oct 2003 19:46:24 -0000 1.32 --- OptionTag.java 6 Nov 2003 03:00:30 -0000 1.33 *************** *** 34,40 **** public class OptionTag extends CompositeTag { public OptionTag () { ! setTagName ("OPTION"); } --- 34,84 ---- public class OptionTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"OPTION"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"INPUT", "TEXTAREA", "SELECT", "OPTION"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"SELECT", "FORM", "BODY", "HTML"}; + + /** + * Create a new option tag. + */ public OptionTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! 
return (mEndTagEnders); } Index: ScriptTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ScriptTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** ScriptTag.java 26 Oct 2003 19:46:24 -0000 1.30 --- ScriptTag.java 6 Nov 2003 03:00:31 -0000 1.31 *************** *** 34,40 **** public class ScriptTag extends CompositeTag { public ScriptTag () { ! setTagName ("SCRIPT"); } --- 34,70 ---- public class ScriptTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"SCRIPT"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"BODY", "HTML"}; + + /** + * Create a new script tag. + */ public ScriptTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: SelectTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/SelectTag.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** SelectTag.java 26 Oct 2003 19:46:24 -0000 1.31 --- SelectTag.java 6 Nov 2003 03:00:31 -0000 1.32 *************** *** 39,45 **** public class SelectTag extends CompositeTag { public SelectTag () { ! setTagName ("SELECT"); } --- 39,89 ---- public class SelectTag extends CompositeTag { + /** + * The set of names handled by this tag. 
+ */ + private static final String[] mIds = new String[] {"SELECT"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"INPUT", "TEXTAREA", "SELECT"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"FORM", "BODY", "HTML"}; + + /** + * Create a new select tag. + */ public SelectTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: Span.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Span.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** Span.java 26 Oct 2003 19:46:24 -0000 1.31 --- Span.java 6 Nov 2003 03:00:31 -0000 1.32 *************** *** 32,40 **** * A span tag. */ ! public class Span extends CompositeTag { public Span () { ! setTagName ("SPAN"); } } --- 32,56 ---- * A span tag. */ ! public class Span extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"SPAN"}; + /** + * Create a new span tag. + */ public Span () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! 
return (mIds); } } Index: StyleTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/StyleTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** StyleTag.java 26 Oct 2003 19:46:24 -0000 1.30 --- StyleTag.java 6 Nov 2003 03:00:32 -0000 1.31 *************** *** 32,40 **** * A StyleTag represents a <style> tag. */ ! public class StyleTag extends CompositeTag { ! public StyleTag () { ! setTagName ("STYLE"); } --- 32,56 ---- * A StyleTag represents a <style> tag. */ ! public class StyleTag extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"STYLE"}; ! ! /** ! * Create a new style tag. ! */ public StyleTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); } Index: TableColumn.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableColumn.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** TableColumn.java 26 Oct 2003 19:46:24 -0000 1.31 --- TableColumn.java 6 Nov 2003 03:00:33 -0000 1.32 *************** *** 32,40 **** * A table column tag. */ ! public class TableColumn extends CompositeTag { public TableColumn () { ! setTagName ("TD"); } } --- 32,65 ---- * A table column tag. */ ! public class TableColumn extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"TD"}; + /** + * Create a new table column tag. + */ public TableColumn () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! 
* Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mIds); } } Index: TableRow.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableRow.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** TableRow.java 26 Oct 2003 19:46:24 -0000 1.33 --- TableRow.java 6 Nov 2003 03:00:34 -0000 1.34 *************** *** 34,42 **** * A table row tag. */ ! public class TableRow extends CompositeTag { ! public TableRow () { ! setTagName ("TR"); } --- 34,67 ---- * A table row tag. */ ! public class TableRow extends CompositeTag ! { ! /** ! * The set of names handled by this tag. ! */ ! private static final String[] mIds = new String[] {"TR"}; ! ! /** ! * Create a new table row tag. ! */ public TableRow () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mIds); } Index: TableTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableTag.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** TableTag.java 26 Oct 2003 19:46:24 -0000 1.34 --- TableTag.java 6 Nov 2003 03:00:34 -0000 1.35 *************** *** 34,41 **** public class TableTag extends CompositeTag { public TableTag () { ! setTagName ("TABLE"); } --- 34,70 ---- public class TableTag extends CompositeTag { + /** + * The set of names handled by this tag. 
+ */ + private static final String[] mIds = new String[] {"TABLE"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"BODY", "HTML"}; + /** + * Create a new table tag. + */ public TableTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: Tag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/Tag.java,v retrieving revision 1.55 retrieving revision 1.56 diff -C2 -d -r1.55 -r1.56 *** Tag.java 1 Nov 2003 21:55:43 -0000 1.55 --- Tag.java 6 Nov 2003 03:00:35 -0000 1.56 *************** *** 29,46 **** package org.htmlparser.tags; - import java.lang.CloneNotSupportedException; - import java.util.Enumeration; - import java.util.HashSet; - import java.util.Hashtable; - import java.util.Map; import java.util.Vector; - import org.htmlparser.AbstractNode; import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.TagNode; import org.htmlparser.scanners.TagScanner; import org.htmlparser.util.NodeList; - import org.htmlparser.util.ParserException; - import org.htmlparser.util.SpecialHashtable; import org.htmlparser.visitors.NodeVisitor; --- 29,38 ---- *************** *** 53,60 **** --- 45,71 ---- public class Tag extends TagNode implements Cloneable { + /** + * An empty set of tag names. + */ + private final static String[] NONE = new String[0]; + + /** + * The scanner for this tag. + */ private TagScanner mScanner; + + /** + * The default scanner for non-composite tags. 
+ */ + protected final static TagScanner mDefaultScanner = new TagScanner (); public Tag () { + String[] names; + + names = getIds (); + if ((null != names) && (0 != names.length)) + setTagName (names[0]); + setThisScanner (mDefaultScanner); } *************** *** 74,77 **** --- 85,124 ---- { return (super.clone ()); + } + + /** + * Return the set of names handled by this tag. + * Since this a a generic tag, it has no ids. + * @return The names to be matched that create tags of this type. + */ + public String[] getIds () + { + return (NONE); + } + + /** + * Return the set of tag names that cause this tag to finish. + * These are the normal (non end tags) that if encountered while + * scanning (a composite tag) will cause the generation of a virtual + * tag. + * Since this a a non-composite tag, the default is no enders. + * @return The names of following tags that stop further scanning. + */ + public String[] getEnders () + { + return (NONE); + } + + /** + * Return the set of end tag names that cause this tag to finish. + * These are the end tags that if encountered while + * scanning (a composite tag) will cause the generation of a virtual + * tag. + * Since this a a non-composite tag, it has no end tag enders. + * @return The names of following end tags that stop further scanning. + */ + public String[] getEndTagEnders () + { + return (NONE); } Index: TextareaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TextareaTag.java,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** TextareaTag.java 26 Oct 2003 19:46:24 -0000 1.28 --- TextareaTag.java 6 Nov 2003 03:00:36 -0000 1.29 *************** *** 36,42 **** public class TextareaTag extends CompositeTag { public TextareaTag () { ! setTagName ("TEXTAREA"); } --- 36,86 ---- public class TextareaTag extends CompositeTag { + /** + * The set of names handled by this tag. 
+ */ + private static final String[] mIds = new String[] {"TEXTAREA"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"INPUT", "TEXTAREA", "SELECT", "OPTION"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"FORM", "BODY", "HTML"}; + + /** + * Create a new text area tag. + */ public TextareaTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } Index: TitleTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TitleTag.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** TitleTag.java 26 Oct 2003 19:46:24 -0000 1.29 --- TitleTag.java 6 Nov 2003 03:00:37 -0000 1.30 *************** *** 36,42 **** public class TitleTag extends CompositeTag { public TitleTag () { ! setTagName ("TITLE"); } --- 36,86 ---- public class TitleTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"TITLE"}; + + /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"TITLE","BODY"}; + + /** + * The set of end tag names that indicate the end of this tag. 
+ */ + private static final String[] mEndTagEnders = new String[] {"HEAD", "HTML"}; + + /** + * Create a new title tag. + */ public TitleTag () { ! } ! ! /** ! * Return the set of names handled by this tag. ! * @return The names to be matched that create tags of this type. ! */ ! public String[] getIds () ! { ! return (mIds); ! } ! ! /** ! * Return the set of tag names that cause this tag to finish. ! * @return The names of following tags that stop further scanning. ! */ ! public String[] getEnders () ! { ! return (mEnders); ! } ! ! /** ! * Return the set of end tag names that cause this tag to finish. ! * @return The names of following end tags that stop further scanning. ! */ ! public String[] getEndTagEnders () ! { ! return (mEndTagEnders); } |
From: <der...@us...> - 2003-11-06 03:01:15
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests In directory sc8-pr-cvs1:/tmp/cvs-serv6198/tests/scannersTests Modified Files: CompositeTagScannerTest.java Log Message: The tags now own their ids, enders and end tag enders. The isTagToBeEndedFor logic is now uses information from the tags, not the scanners. The kludge to get the scanner from the NodeFactory is now gone too, this also comes from the tag. Index: CompositeTagScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/CompositeTagScannerTest.java,v retrieving revision 1.51 retrieving revision 1.52 diff -C2 -d -r1.51 -r1.52 *** CompositeTagScannerTest.java 1 Nov 2003 21:55:43 -0000 1.51 --- CompositeTagScannerTest.java 6 Nov 2003 03:00:40 -0000 1.52 *************** *** 559,562 **** --- 559,563 ---- public static class CustomScanner extends CompositeTagScanner { private static final String MATCH_NAME [] = { "CUSTOM" }; + private boolean selfChildrenAllowed; public CustomScanner() { this(true); *************** *** 564,568 **** public CustomScanner(boolean selfChildrenAllowed) { ! super("", selfChildrenAllowed ? new String[] {} : MATCH_NAME); } --- 565,570 ---- public CustomScanner(boolean selfChildrenAllowed) { ! // super("", selfChildrenAllowed ? new String[] {} : MATCH_NAME); ! this.selfChildrenAllowed = selfChildrenAllowed; } *************** *** 575,579 **** CustomTag ret; ! ret = new CustomTag (); ret.setPage (page); ret.setStartPosition (start); --- 577,581 ---- CustomTag ret; ! ret = new CustomTag (selfChildrenAllowed); ret.setPage (page); ret.setStartPosition (start); *************** *** 590,599 **** public static class AnotherScanner extends CompositeTagScanner { private static final String MATCH_NAME [] = { "ANOTHER" }; public AnotherScanner() { ! super("", new String[] {"CUSTOM"}); } public AnotherScanner(boolean acceptCustomTagsButDontAcceptCustomEndTags) { ! 
super("", new String[] {}, new String[] {"CUSTOM"}); } --- 592,604 ---- public static class AnotherScanner extends CompositeTagScanner { private static final String MATCH_NAME [] = { "ANOTHER" }; + private boolean acceptCustomTagsButDontAcceptCustomEndTags; public AnotherScanner() { ! // super("", new String[] {"CUSTOM"}); ! acceptCustomTagsButDontAcceptCustomEndTags = false; } public AnotherScanner(boolean acceptCustomTagsButDontAcceptCustomEndTags) { ! // super("", new String[] {}, new String[] {"CUSTOM"}); ! this.acceptCustomTagsButDontAcceptCustomEndTags = acceptCustomTagsButDontAcceptCustomEndTags; } *************** *** 606,610 **** AnotherTag ret; ! ret = new AnotherTag (); ret.setPage (page); ret.setStartPosition (start); --- 611,615 ---- AnotherTag ret; ! ret = new AnotherTag (acceptCustomTagsButDontAcceptCustomEndTags); ret.setPage (page); ret.setStartPosition (start); *************** *** 625,632 **** --- 630,729 ---- public static class CustomTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"CUSTOM"}; + + protected String[] mEnders; + + public CustomTag () + { + this (true); + } + + public CustomTag (boolean selfChildrenAllowed) + { + if (selfChildrenAllowed) + mEnders = new String[0]; + else + mEnders = mIds; + } + + /** + * Return the set of names handled by this tag. + * @return The names to be matched that create tags of this type. + */ + public String[] getIds () + { + return (mIds); + } + + /** + * Return the set of tag names that cause this tag to finish. + * @return The names of following tags that stop further scanning. + */ + public String[] getEnders () + { + return (mEnders); + } } public static class AnotherTag extends CompositeTag { + /** + * The set of names handled by this tag. + */ + private static final String[] mIds = new String[] {"ANOTHER"}; + + /** + * The set of tag names that indicate the end of this tag. 
+ */ + private final String[] mEnders; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private final String[] mEndTagEnders; + + public AnotherTag (boolean acceptCustomTagsButDontAcceptCustomEndTags) + { + if (acceptCustomTagsButDontAcceptCustomEndTags) + { + mEnders = new String[0]; + mEndTagEnders = new String[] {"CUSTOM"}; + } + else + { + mEnders = new String[] {"CUSTOM"}; + mEndTagEnders = new String[] {"CUSTOM"}; + } + } + + /** + * Return the set of names handled by this tag. + * @return The names to be matched that create tags of this type. + */ + public String[] getIds () + { + return (mIds); + } + + /** + * Return the set of tag names that cause this tag to finish. + * @return The names of following tags that stop further scanning. + */ + public String[] getEnders () + { + return (mEnders); + } + + /** + * Return the set of end tag names that cause this tag to finish. + * @return The names of following end tags that stop further scanning. + */ + public String[] getEndTagEnders () + { + return (mEndTagEnders); + } } } |
From: <der...@us...> - 2003-11-06 03:01:03
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners In directory sc8-pr-cvs1:/tmp/cvs-serv6198/scanners Modified Files: CompositeTagScanner.java ScriptScanner.java TagScanner.java Log Message: The tags now own their ids, enders and end tag enders. The isTagToBeEndedFor logic is now uses information from the tags, not the scanners. The kludge to get the scanner from the NodeFactory is now gone too, this also comes from the tag. Index: CompositeTagScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v retrieving revision 1.79 retrieving revision 1.80 diff -C2 -d -r1.79 -r1.80 *** CompositeTagScanner.java 1 Nov 2003 21:55:43 -0000 1.79 --- CompositeTagScanner.java 6 Nov 2003 03:00:22 -0000 1.80 *************** *** 91,95 **** * Inside the scanner, use createTag() to specify what tag needs to be created. */ ! public abstract class CompositeTagScanner extends TagScanner { protected Set tagEnderSet; --- 91,95 ---- * Inside the scanner, use createTag() to specify what tag needs to be created. */ ! public class CompositeTagScanner extends TagScanner { protected Set tagEnderSet; *************** *** 216,220 **** node = null; } ! else if (isTagToBeEndedFor (next)) // check DTD { // insert a virtual end tag and backup one node --- 216,220 ---- node = null; } ! else if (isTagToBeEndedFor (tag, next)) // check DTD { // insert a virtual end tag and backup one node *************** *** 226,234 **** { // now recurse if there is a scanner for this type of tag ! // whoah! really cheat here to get the parser ! // maybe eventually the tag will know it's own scanner eh ! org.htmlparser.Parser parser = (org.htmlparser.Parser)lexer.getNodeFactory (); ! scanner = parser.getScanner (name); ! 
if ((null != scanner) && scanner.evaluate (next, this)) node = scanner.scan (next, lexer.getPage ().getUrl (), lexer); } --- 226,231 ---- { // now recurse if there is a scanner for this type of tag ! scanner = next.getThisScanner (); ! if ((null != scanner) && scanner.evaluate (next, null)) node = scanner.scan (next, lexer.getPage ().getUrl (), lexer); } *************** *** 304,312 **** * @param children The list of nodes contained within the ebgin end tag pair. */ ! public abstract Tag createTag(Page page, int start, int end, Vector attributes, Tag startTag, Tag endTag, NodeList children) throws ParserException; ! public final boolean isTagToBeEndedFor(Tag tag) { String name; boolean ret; --- 301,324 ---- * @param children The list of nodes contained within the ebgin end tag pair. */ ! public Tag createTag(Page page, int start, int end, Vector attributes, Tag startTag, Tag endTag, NodeList children) throws ParserException ! { ! CompositeTag ret; ! ret = new CompositeTag (); ! ret.setPage (page); ! ret.setStartPosition (start); ! ret.setEndPosition (end); ! ret.setAttributesEx (attributes); ! ret.setStartTag (startTag); ! ret.setEndTag (endTag); ! ret.setChildren (children); ! ! return (ret); ! } ! ! public final boolean isTagToBeEndedFor (Tag current, Tag tag) { String name; + String[] ends; boolean ret; *************** *** 315,321 **** name = tag.getTagName (); if (tag.isEndTag ()) ! ret = endTagEnderSet.contains (name); else ! ret = tagEnderSet.contains (name); return (ret); --- 327,339 ---- name = tag.getTagName (); if (tag.isEndTag ()) ! ends = current.getEndTagEnders (); else ! ends = current.getEnders (); ! for (int i = 0; i < ends.length; i++) ! if (name.equalsIgnoreCase (ends[i])) ! { ! ret = true; ! break; ! 
} return (ret); Index: ScriptScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** ScriptScanner.java 1 Nov 2003 21:55:43 -0000 1.49 --- ScriptScanner.java 6 Nov 2003 03:00:24 -0000 1.50 *************** *** 118,122 **** done = true; } ! else if (isTagToBeEndedFor ((Tag)node)) { lexer.setPosition (position); --- 118,122 ---- done = true; } ! else if (isTagToBeEndedFor (tag, (Tag)node)) { lexer.setPosition (position); Index: TagScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/TagScanner.java,v retrieving revision 1.48 retrieving revision 1.49 diff -C2 -d -r1.48 -r1.49 *** TagScanner.java 1 Nov 2003 21:55:43 -0000 1.48 --- TagScanner.java 6 Nov 2003 03:00:25 -0000 1.49 *************** *** 42,45 **** --- 42,46 ---- import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; + import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.tags.Tag; import org.htmlparser.util.NodeIterator; *************** *** 66,70 **** * */ ! public abstract class TagScanner implements Serializable --- 67,71 ---- * */ ! public class TagScanner implements Serializable *************** *** 149,154 **** * @throws ParserException */ ! public abstract Tag createTag(Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException; ! public abstract String [] getID(); } --- 150,171 ---- * @throws ParserException */ ! public Tag createTag(Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException ! { ! Tag ret; ! ret = null; ! ! ret = new Tag (); ! ret.setPage (page); ! ret.setStartPosition (start); ! ret.setEndPosition (end); ! ret.setAttributesEx (attributes); ! ! return (ret); ! } ! ! public String [] getID () ! { ! 
return (new String[0]); ! } } |
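The log message above says the isTagToBeEndedFor check now reads its ender lists from the tag being scanned rather than from the scanner. A minimal standalone sketch of that lookup follows. SketchTag and EnderCheckSketch are hypothetical stand-ins, not the library's classes; only the OPTION ender lists are taken from the OptionTag diff.

```java
class EnderCheckSketch {
    /** Hypothetical stand-in for a parser tag; it owns its own ender lists. */
    static class SketchTag {
        private static final String[] NONE = new String[0];
        private final String name;
        private final boolean endTag;
        private final String[] enders;
        private final String[] endTagEnders;

        SketchTag(String name, boolean endTag, String[] enders, String[] endTagEnders) {
            this.name = name;
            this.endTag = endTag;
            this.enders = enders;
            this.endTagEnders = endTagEnders;
        }

        /** A tag with no enders of its own, like the generic Tag default. */
        SketchTag(String name, boolean endTag) {
            this(name, endTag, NONE, NONE);
        }

        String getTagName() { return name; }
        boolean isEndTag() { return endTag; }
        String[] getEnders() { return enders; }
        String[] getEndTagEnders() { return endTagEnders; }
    }

    /**
     * Decide whether 'next' should close 'current'.
     * End tags are checked against the end tag enders, start tags against
     * the plain enders; the comparison ignores case, as in the diff.
     */
    static boolean isTagToBeEndedFor(SketchTag current, SketchTag next) {
        String[] ends = next.isEndTag() ? current.getEndTagEnders() : current.getEnders();
        for (String end : ends) {
            if (next.getTagName().equalsIgnoreCase(end)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Ender lists copied from the OptionTag diff.
        SketchTag option = new SketchTag("OPTION", false,
                new String[] {"INPUT", "TEXTAREA", "SELECT", "OPTION"},
                new String[] {"SELECT", "FORM", "BODY", "HTML"});

        System.out.println(isTagToBeEndedFor(option, new SketchTag("option", false))); // true
        System.out.println(isTagToBeEndedFor(option, new SketchTag("FORM", true)));    // true
        System.out.println(isTagToBeEndedFor(option, new SketchTag("B", false)));      // false
    }
}
```

Because the lists live on the tag, a new tag type only has to override its getters; the scanner needs no per-tag configuration.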
From: <der...@us...> - 2003-11-06 03:00:48
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util
In directory sc8-pr-cvs1:/tmp/cvs-serv6198/util

Modified Files:
IteratorImpl.java
Log Message:
The tags now own their ids, enders and end tag enders.
The isTagToBeEndedFor logic now uses information from the tags, not the scanners.
The kludge to get the scanner from the NodeFactory is now gone too, this also comes from the tag.

Index: IteratorImpl.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/IteratorImpl.java,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** IteratorImpl.java 28 Oct 2003 12:54:22 -0000 1.33
--- IteratorImpl.java 6 Nov 2003 03:00:40 -0000 1.34
***************
*** 70,78 ****
  {
      // now recurse if there is a scanner for this type of tag
!     name = tag.getTagName ();
!     // whoah! really cheat here to get the parser
!     // maybe eventually the tag will know it's own scanner eh
!     org.htmlparser.Parser parser = (org.htmlparser.Parser)mLexer.getNodeFactory ();
!     scanner = parser.getScanner (name);
      if ((null != scanner) && scanner.evaluate (tag, null))
          ret = scanner.scan (tag, mLexer.getPage ().getUrl (), mLexer);
--- 70,74 ----
  {
      // now recurse if there is a scanner for this type of tag
!     scanner = tag.getThisScanner ();
      if ((null != scanner) && scanner.evaluate (tag, null))
          ret = scanner.scan (tag, mLexer.getPage ().getUrl (), mLexer); |
From: <der...@us...> - 2003-11-04 01:25:07
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors In directory sc8-pr-cvs1:/tmp/cvs-serv25697/visitors Modified Files: ObjectFindingVisitor.java UrlModifyingVisitor.java Log Message: Made visiting order the same order as on the page. The 'shouldRecurseSelf' boolean of NodeVisitor could probably be removed since it doesn't make much sense any more. Fixed StringBean, which was still looking for end tags with names starting with a slash, i.e. "/SCRIPT", silly beany. Added some debugging support to the lexer, you can easily base a breakpoint on line number. Index: ObjectFindingVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/ObjectFindingVisitor.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** ObjectFindingVisitor.java 26 Oct 2003 19:46:28 -0000 1.35 --- ObjectFindingVisitor.java 4 Nov 2003 01:25:02 -0000 1.36 *************** *** 37,49 **** public class ObjectFindingVisitor extends NodeVisitor { private Class classTypeToFind; - private int count = 0; private NodeList tags; public ObjectFindingVisitor(Class classTypeToFind) { ! this(classTypeToFind,false); } public ObjectFindingVisitor(Class classTypeToFind,boolean recurse) { ! super(recurse); this.classTypeToFind = classTypeToFind; this.tags = new NodeList(); --- 37,48 ---- public class ObjectFindingVisitor extends NodeVisitor { private Class classTypeToFind; private NodeList tags; public ObjectFindingVisitor(Class classTypeToFind) { ! this(classTypeToFind,true); } public ObjectFindingVisitor(Class classTypeToFind,boolean recurse) { ! super(recurse, true); this.classTypeToFind = classTypeToFind; this.tags = new NodeList(); *************** *** 51,62 **** public int getCount() { ! return count; } public void visitTag(Tag tag) { ! if (tag.getClass().getName().equals(classTypeToFind.getName())) { ! count++; tags.add(tag); - } } --- 50,59 ---- public int getCount() { ! 
return (tags.size ()); } public void visitTag(Tag tag) { ! if (tag.getClass().equals(classTypeToFind)) tags.add(tag); } Index: UrlModifyingVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/UrlModifyingVisitor.java,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** UrlModifyingVisitor.java 1 Nov 2003 21:55:44 -0000 1.37 --- UrlModifyingVisitor.java 4 Nov 2003 01:25:03 -0000 1.38 *************** *** 30,37 **** --- 30,40 ---- package org.htmlparser.visitors; + + import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.StringNode; import org.htmlparser.scanners.ImageScanner; import org.htmlparser.scanners.LinkScanner; + import org.htmlparser.tags.CompositeTag; import org.htmlparser.tags.ImageTag; import org.htmlparser.tags.LinkTag; *************** *** 65,70 **** public void visitTag(Tag tag) ! { ! if (null == tag.getParent ()) modifiedResult.append(tag.toHtml()); } --- 68,77 ---- public void visitTag(Tag tag) ! { // process only those nodes that won't be processed by an end tag, ! // nodes without parents or parents without an end tag, since ! // the complete processing of all children should happen before ! // we turn this node back into html text ! if (null == tag.getParent () ! && (!(tag instanceof CompositeTag) || null == ((CompositeTag)tag).getEndTag ())) modifiedResult.append(tag.toHtml()); } *************** *** 72,77 **** public void visitEndTag(Tag tag) { ! if (null == tag.getParent ()) modifiedResult.append(tag.toHtml()); } --- 79,89 ---- public void visitEndTag(Tag tag) { ! Node parent; ! ! parent = tag.getParent (); ! if (null == parent) modifiedResult.append(tag.toHtml()); + else + modifiedResult.append(parent.toHtml()); } |
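The ObjectFindingVisitor change above compares Class objects directly and derives the count from the collected list. A small sketch of what that exact-class comparison means in practice follows. The Tag, LinkTag and ImageTag classes here are hypothetical stand-ins declared only for the example, not the library's own types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClassMatchSketch {
    // Hypothetical tag hierarchy used only for this illustration.
    static class Tag {}
    static class LinkTag extends Tag {}
    static class ImageTag extends Tag {}

    /** Collect tags whose runtime class is exactly 'wanted' (subclasses excluded). */
    static List<Tag> findExact(List<Tag> tags, Class<?> wanted) {
        List<Tag> found = new ArrayList<>();
        for (Tag tag : tags) {
            if (tag.getClass().equals(wanted)) {
                found.add(tag);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Tag> tags = Arrays.asList(new Tag(), new LinkTag(), new LinkTag(), new ImageTag());

        // The count is just the size of the result list; no separate counter is kept.
        System.out.println(findExact(tags, LinkTag.class).size()); // 2
        System.out.println(findExact(tags, Tag.class).size());     // 1, subclasses are not matched
    }
}
```

If matching subclasses were wanted instead, `wanted.isInstance(tag)` would be the usual alternative; the diff above does not do that.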
From: <der...@us...> - 2003-11-04 01:25:06
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexerapplications/thumbelina In directory sc8-pr-cvs1:/tmp/cvs-serv25697/lexerapplications/thumbelina Modified Files: Thumbelina.java Log Message: Made visiting order the same order as on the page. The 'shouldRecurseSelf' boolean of NodeVisitor could probably be removed since it doesn't make much sense any more. Fixed StringBean, which was still looking for end tags with names starting with a slash, i.e. "/SCRIPT", silly beany. Added some debugging support to the lexer, you can easily base a breakpoint on line number. Index: Thumbelina.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexerapplications/thumbelina/Thumbelina.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** Thumbelina.java 26 Oct 2003 16:44:01 -0000 1.2 --- Thumbelina.java 4 Nov 2003 01:25:02 -0000 1.3 *************** *** 1076,1080 **** urls = getImageLinks (link); fetch (urls[0]); ! append (filter (urls[1])); setCurrentURL (null); } --- 1076,1087 ---- urls = getImageLinks (link); fetch (urls[0]); ! //append (filter (urls[1])); ! synchronized (mEnqueuers) ! { ! Enqueuer enqueuer = new Enqueuer (urls[1]); ! enqueuer.setPriority (Thread.MIN_PRIORITY); ! mEnqueuers.add (enqueuer); ! enqueuer.start (); ! } setCurrentURL (null); } *************** *** 1092,1095 **** --- 1099,1122 ---- } + static ArrayList mEnqueuers = new ArrayList (); + + class Enqueuer extends Thread + { + URL[] mList; + + public Enqueuer (URL[] list) + { + mList = list; + } + + public void run () + { + append (filter (mList)); + synchronized (mEnqueuers) + { + mEnqueuers.remove (this); + } + } + } // // ItemListener interface *************** *** 1427,1430 **** --- 1454,1466 ---- * * $Log$ + * Revision 1.3 2003/11/04 01:25:02 derrickoswald + * Made visiting order the same order as on the page. 
+ * The 'shouldRecurseSelf' boolean of NodeVisitor could probably + * be removed since it doesn't make much sense any more. + * Fixed StringBean, which was still looking for end tags with names starting with + * a slash, i.e. "/SCRIPT", silly beany. + * Added some debugging support to the lexer, you can easily base a breakpoint on + * line number. + * * Revision 1.2 2003/10/26 16:44:01 derrickoswald * Get thumbelina working again. The tag.getName() method doesn't include the / of end tags. |
From: <der...@us...> - 2003-11-04 01:25:06
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1:/tmp/cvs-serv25697/tags

Modified Files:
CompositeTag.java
Log Message:
Made visiting order the same order as on the page.
The 'shouldRecurseSelf' boolean of NodeVisitor could probably be removed since it doesn't make much sense any more.
Fixed StringBean, which was still looking for end tags with names starting with a slash, i.e. "/SCRIPT", silly beany.
Added some debugging support to the lexer, you can easily base a breakpoint on line number.

Index: CompositeTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v
retrieving revision 1.62
retrieving revision 1.63
diff -C2 -d -r1.62 -r1.63
*** CompositeTag.java 1 Nov 2003 21:55:43 -0000 1.62
--- CompositeTag.java 4 Nov 2003 01:25:02 -0000 1.63
***************
*** 326,329 ****
--- 326,331 ----
  Node child;
+ if (visitor.shouldRecurseSelf ())
+     visitor.visitTag (this);
  if (visitor.shouldRecurseChildren ())
  {
***************
*** 340,345 ****
          getEndTag ().accept (visitor);
  }
- if (visitor.shouldRecurseSelf ())
-     visitor.visitTag (this);
  }
--- 342,345 ---- |
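The CompositeTag change above moves the visitTag call ahead of the child traversal, so a visitor sees nodes in page order. The following is a simplified sketch of that ordering, assuming a hypothetical Visitor/Node pair; it leaves out the shouldRecurseSelf and shouldRecurseChildren switches and is not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitOrderSketch {
    /** Hypothetical visitor with one callback per kind of node. */
    interface Visitor {
        void visitTag(String name);
        void visitText(String text);
        void visitEndTag(String name);
    }

    interface Node {
        void accept(Visitor visitor);
    }

    static class Text implements Node {
        private final String text;
        Text(String text) { this.text = text; }
        public void accept(Visitor visitor) { visitor.visitText(text); }
    }

    static class Composite implements Node {
        private final String name;
        private final List<Node> children;

        Composite(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        public void accept(Visitor visitor) {
            visitor.visitTag(name);          // the tag itself first, as it appears on the page
            for (Node child : children) {
                child.accept(visitor);       // then its children, in order
            }
            visitor.visitEndTag(name);       // the end tag last
        }
    }

    /** Record the order in which a visitor's callbacks fire. */
    static List<String> trace(Node root) {
        final List<String> events = new ArrayList<>();
        root.accept(new Visitor() {
            public void visitTag(String name) { events.add("<" + name + ">"); }
            public void visitText(String text) { events.add(text); }
            public void visitEndTag(String name) { events.add("</" + name + ">"); }
        });
        return events;
    }

    public static void main(String[] args) {
        Node page = new Composite("DIV", new Text("hello"), new Composite("B", new Text("x")));
        System.out.println(trace(page)); // [<DIV>, hello, <B>, x, </B>, </DIV>]
    }
}
```

With the old ordering, where the tag was visited after its children, `<DIV>` would have been reported after `</DIV>`, which is why anything that rebuilds the page from visit callbacks benefits from the change.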
From: <der...@us...> - 2003-11-04 01:25:06
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer
In directory sc8-pr-cvs1:/tmp/cvs-serv25697/lexer

Modified Files:
    Lexer.java Page.java PageIndex.java
Log Message:
Made visiting order the same order as on the page.
The 'shouldRecurseSelf' boolean of NodeVisitor could probably
be removed since it doesn't make much sense any more.
Fixed StringBean, which was still looking for end tags with names starting with
a slash, i.e. "/SCRIPT", silly beany.
Added some debugging support to the lexer, you can easily base a breakpoint on
line number.

Index: Lexer.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** Lexer.java 28 Oct 2003 03:04:18 -0000 1.18
--- Lexer.java 4 Nov 2003 01:25:02 -0000 1.19
***************
*** 79,89 ****
  
      /**
       * Creates a new instance of a Lexer.
       */
      public Lexer ()
      {
!         setPage (new Page (""));
!         setCursor (new Cursor (getPage (), 0));
!         setNodeFactory (this);
      }
  
--- 79,96 ----
  
      /**
+      * Line number to trigger on.
+      * This is tested on each <code>nextNode()</code> call, as an aid to debugging.
+      * Alter this value and set a breakpoint on the line after the test.
+      * Remember, these line numbers are zero based, while most editors are one based.
+      * @see #nextNode
+      */
+     static protected int mDebugLineTrigger = -1;
+ 
+     /**
       * Creates a new instance of a Lexer.
       */
      public Lexer ()
      {
!         this (new Page (""));
      }
  
***************
*** 247,250 ****
--- 254,265 ----
          Node ret;
  
+         // debugging suppport
+         if (-1 != mDebugLineTrigger)
+         {
+             Page page = getPage ();
+             int lineno = page.row (mCursor);
+             if (mDebugLineTrigger < lineno)
+                 mDebugLineTrigger = lineno + 1; // trigger on subsequent lines too
+         }
          probe = mCursor.dup ();
          ch = mPage.getCharacter (probe);

Index: Page.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** Page.java 29 Oct 2003 03:31:17 -0000 1.23
--- Page.java 4 Nov 2003 01:25:02 -0000 1.24
***************
*** 804,823 ****
      {
          int line;
          int start;
          int end;
  
          line = row (cursor);
!         start = mIndex.elementAt (line);
!         line++;
!         end = mIndex.last ();
!         if (end <= line)
!             end = mIndex.elementAt (end);
!         else
              end = mSource.mOffset;
          return (getText (start, end));
      }
  
-     // todo refactor into common code method:
- 
      /**
       * Get the text line the position of the cursor lies on.
--- 804,832 ----
      {
          int line;
+         int size;
          int start;
          int end;
  
          line = row (cursor);
!         size = mIndex.size ();
!         if (line < size)
!         {
!             start = mIndex.elementAt (line);
!             line++;
!             if (line <= size)
!                 end = mIndex.elementAt (line);
!             else
!                 end = mSource.mOffset;
!         }
!         else // current line
!         {
!             start = mIndex.elementAt (line - 1);
              end = mSource.mOffset;
+         }
+ 
+ 
          return (getText (start, end));
      }
  
      /**
       * Get the text line the position of the cursor lies on.
***************
*** 828,844 ****
      public String getLine (int position)
      {
!         int line;
          int start;
!         int end;
  
!         line = row (position);
!         start = mIndex.elementAt (line);
!         line++;
!         end = mIndex.last ();
!         if (end <= line)
!             end = mIndex.elementAt (end);
          else
!             end = mSource.mOffset;
!         return (getText (start, end));
      }
  }
--- 837,868 ----
      public String getLine (int position)
      {
!         return (getLine (new Cursor (this, position)));
!     }
! 
!     /**
!      * Display some of this page as a string.
!      * @return The last few characters the source read in.
!      */
!     public String toString ()
!     {
!         StringBuffer buffer;
          int start;
!         String ret;
  
!         if (mSource.mOffset > 0)
!         {
!             buffer = new StringBuffer (43);
!             start = mSource.mOffset - 40;
!             if (0 > start)
!                 start = 0;
!             else
!                 buffer.append ("...");
!             getText (buffer, start, mSource.mOffset);
!             ret = buffer.toString ();
!         }
          else
!             ret = super.toString ();
! 
!         return (ret);
      }
  }

Index: PageIndex.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/PageIndex.java,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** PageIndex.java 26 Oct 2003 19:46:18 -0000 1.12
--- PageIndex.java 4 Nov 2003 01:25:02 -0000 1.13
***************
*** 198,202 ****
      public int elementAt (int index)
      {
!         return (mIndices[index]);
      }
  
--- 198,205 ----
      public int elementAt (int index)
      {
!         if (index >= mCount) // negative index is handled by array.. below
!             throw new IndexOutOfBoundsException ("index " + index + " beyond current limit");
!         else
!             return (mIndices[index]);
      }
  
***************
*** 353,356 ****
--- 356,360 ----
       * @return The index of the last element.
       * If this were an array object this would be (object.length - 1).
+      * For an empty index this will return -1.
       */
      public int last ()
|
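The Page and PageIndex changes both concern one idea: an index of line-start offsets, a bounds-checked lookup into it, and a getLine that takes the next line start (or the end of the text read so far) as the end of the current line. A minimal, self-contained sketch of that idea follows. LineIndexSketch and its fields are hypothetical, it works on a complete String rather than a streaming source, and it uses a strict "is there a next line" test, which is a choice of this sketch rather than a transcription of the diff.

```java
class LineIndexSketch
{
    private final String text;
    private final int[] starts;   // offset of the first character of each line
    private final int count;      // number of valid entries in starts

    LineIndexSketch (String text)
    {
        int[] tmp = new int[text.length () + 1];
        int n = 0;

        tmp[n++] = 0;             // line 0 always starts at offset 0
        for (int i = 0; i < text.length (); i++)
            if ('\n' == text.charAt (i) && i + 1 < text.length ())
                tmp[n++] = i + 1; // a new line begins after each newline
        this.text = text;
        this.starts = tmp;
        this.count = n;
    }

    /** Bounds-checked lookup, so a stale slot of the backing array is never returned. */
    int elementAt (int index)
    {
        if (index < 0 || index >= count)
            throw new IndexOutOfBoundsException (
                "index " + index + " outside 0.." + (count - 1));
        return (starts[index]);
    }

    /** Zero-based line number containing the given character position. */
    int row (int position)
    {
        int line = 0;

        while (line + 1 < count && starts[line + 1] <= position)
            line++;
        return (line);
    }

    /** Text of the line containing position, including its newline if any. */
    String getLine (int position)
    {
        int line = row (position);
        int start = elementAt (line);
        int end = (line + 1 < count) ? elementAt (line + 1) : text.length ();

        return (text.substring (start, end));
    }

    public static void main (String[] args)
    {
        LineIndexSketch index = new LineIndexSketch ("ab\ncd\nef");
        System.out.print (index.getLine (4));    // cd
        System.out.println (index.getLine (7));  // ef
    }
}
```

As in the diff, the single-argument convenience (there, getLine (int) delegating to getLine (Cursor)) keeps the boundary logic in one place. The sketch has only one getLine for that reason.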
From: <der...@us...> - 2003-11-04 01:25:06
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans
In directory sc8-pr-cvs1:/tmp/cvs-serv25697/beans

Modified Files:
    BeanyBaby.java StringBean.java
Log Message:
Made visiting order the same order as on the page.
The 'shouldRecurseSelf' boolean of NodeVisitor could probably
be removed since it doesn't make much sense any more.
Fixed StringBean, which was still looking for end tags with names starting with
a slash, i.e. "/SCRIPT", silly beany.
Added some debugging support to the lexer, you can easily base a breakpoint on
line number.

Index: BeanyBaby.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/BeanyBaby.java,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** BeanyBaby.java 26 Oct 2003 19:46:17 -0000 1.17
--- BeanyBaby.java 4 Nov 2003 01:25:02 -0000 1.18
***************
*** 371,375 ****
          BeanyBaby bb = new BeanyBaby ();
          bb.show ();
!         bb.setURL ("http://www.netbeans.org");
      }
  }
--- 371,375 ----
          BeanyBaby bb = new BeanyBaby ();
          bb.show ();
!         bb.setURL ("http://www.slashdot.org");
      }
  }

Index: StringBean.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/StringBean.java,v
retrieving revision 1.30
retrieving revision 1.31
diff -C2 -d -r1.30 -r1.31
*** StringBean.java 26 Oct 2003 19:46:17 -0000 1.30
--- StringBean.java 4 Nov 2003 01:25:02 -0000 1.31
***************
*** 162,166 ****
      public StringBean ()
      {
!         super (true, false);
          mPropertySupport = new PropertyChangeSupport (this);
          mParser = new Parser ();
--- 162,166 ----
      public StringBean ()
      {
!         super (true, true);
          mPropertySupport = new PropertyChangeSupport (this);
          mParser = new Parser ();
***************
*** 624,630 ****
  
          name = tag.getTagName ();
!         if (name.equalsIgnoreCase ("/PRE"))
              mIsPre = false;
!         else if (name.equalsIgnoreCase ("/SCRIPT"))
              mIsScript = false;
      }
--- 624,630 ----
  
          name = tag.getTagName ();
!         if (name.equalsIgnoreCase ("PRE"))
              mIsPre = false;
!         else if (name.equalsIgnoreCase ("SCRIPT"))
              mIsScript = false;
      }
|
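The StringBean fix follows from the tag name no longer carrying the leading slash of an end tag: a comparison against "/SCRIPT" can then never match, so the script flag would never be cleared. A small sketch of that behaviour, using hypothetical helpers (EndTagSketch, tagName, isEndTag, visit) in place of the real Tag and StringBean classes:

```java
class EndTagSketch
{
    boolean inPre;
    boolean inScript;

    /** Name of a tag given the raw text between '<' and '>', without any leading slash. */
    static String tagName (String raw)
    {
        String s = raw.trim ();
        int end = 0;

        if (s.startsWith ("/"))
            s = s.substring (1);
        while (end < s.length () && !Character.isWhitespace (s.charAt (end)))
            end++;
        return (s.substring (0, end));
    }

    /** True if the raw tag text denotes an end tag. */
    static boolean isEndTag (String raw)
    {
        return (raw.trim ().startsWith ("/"));
    }

    /** Track PRE and SCRIPT state; names are compared without the slash. */
    void visit (String raw)
    {
        String name = tagName (raw);
        boolean end = isEndTag (raw);

        if (name.equalsIgnoreCase ("PRE"))
            inPre = !end;
        else if (name.equalsIgnoreCase ("SCRIPT"))
            inScript = !end;
    }

    public static void main (String[] args)
    {
        EndTagSketch bean = new EndTagSketch ();

        bean.visit ("script type=\"text/javascript\"");
        System.out.println (bean.inScript);  // true
        bean.visit ("/SCRIPT");
        System.out.println (bean.inScript);  // false
    }
}
```

Had visit compared the name against "/SCRIPT", the second call would have left inScript set, which matches the symptom the log message describes.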