[Htmlparser-cvs] htmlparser/src/org/htmlparser/filters AndFilter.java,NONE,1.1 HasAttributeFilter.ja
From: <der...@us...> - 2003-11-08 21:31:01
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/filters Added Files: AndFilter.java HasAttributeFilter.java HasChildFilter.java NodeClassFilter.java NotFilter.java OrFilter.java StringFilter.java TagNameFilter.java package.html Log Message: Implement generic node filtering. Added the NodeFilter interface and the filter package. Sideline tag specific scanners; tags now use only one scanner of each type, TagScanner or CompositeTagScanner (except for ScriptScanner). Obviated PeekingIterator by moving the META tag semantics to doSemanticAction, much simpler, old IteratorImpl is now PeekingIteratorImpl but deprecated. --- NEW FILE: AndFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/AndFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes matching both filters (AND operation). 
*/ public class AndFilter implements NodeFilter { /** * The left hand side. */ protected NodeFilter mLeft; /** * The right hand side. */ protected NodeFilter mRight; /** * Creates a new instance of AndFilter that accepts nodes acceptable to both filters. * @param left One filter. * @param right The other filter. */ public AndFilter (NodeFilter left, NodeFilter right) { mLeft = left; mRight = right; } /** * Accept nodes that are acceptable to both filters. * @param node The node to check. */ public boolean accept (Node node) { return (mLeft.accept (node) && mRight.accept (node)); } } --- NEW FILE: HasAttributeFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasAttributeFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.TagNode; /** * This class accepts all tags that have a certain attribute. 
*/ public class HasAttributeFilter implements NodeFilter { /** * The attribute to check for. */ protected String mAttribute; /** * Creates a new instance of HasAttributeFilter that accepts tags with the given attribute. * @param attribute The attribute to search for. */ public HasAttributeFilter (String attribute) { mAttribute = attribute.toUpperCase (); } /** * Accept tags with a certain attribute. * @param node The node to check. */ public boolean accept (Node node) { TagNode tag; boolean ret; ret = false; if (node instanceof TagNode) { tag = (TagNode)node; ret = null != tag.getAttributeEx (mAttribute); } return (ret); } } --- NEW FILE: HasChildFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasChildFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. 
// // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.tags.CompositeTag; import org.htmlparser.util.NodeList; /** * This class accepts all tags that have a child acceptable to the filter. */ public class HasChildFilter implements NodeFilter { /** * The filter to apply to children. */ protected NodeFilter mFilter; /** * Creates a new instance of HasChildFilter that accepts tags with children acceptable to the filter. * Similar to asking for the parent of a node returned by the given * filter, but where multiple children may be acceptable, this class * will only accept the parent once. * @param filter The filter to apply to children. */ public HasChildFilter (NodeFilter filter) { mFilter = filter; } /** * Accept tags with children acceptable to the filter. * @param node The node to check. 
*/ public boolean accept (Node node) { CompositeTag tag; NodeList children; boolean ret; ret = false; if (node instanceof CompositeTag) { tag = (CompositeTag)node; children = tag.getChildren (); for (int i = 0; i < children.size (); i++) if (mFilter.accept (children.elementAt (i))) { ret = true; break; } } return (ret); } } --- NEW FILE: NodeClassFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NodeClassFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes of a given class. */ public class NodeClassFilter implements NodeFilter { /** * The class to match. */ protected Class mClass; /** * Creates a new instance of NodeClassFilter that accepts nodes of the given class. * @param cls The class to match. */ public NodeClassFilter (Class cls) { mClass = cls; } /** * Accept nodes that are instances of the class provided in the constructor. 
* @param node The node to check. */ public boolean accept (Node node) { return (mClass.isAssignableFrom (node.getClass ())); } } --- NEW FILE: NotFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NotFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes not acceptable to the filter. */ public class NotFilter implements NodeFilter { /** * The filter to gainsay. */ protected NodeFilter mFilter; /** * Creates a new instance of NotFilter that accepts nodes not acceptable to the filter. * @param filter The filter to consult. */ public NotFilter (NodeFilter filter) { mFilter = filter; } /** * Accept nodes that are not acceptable to the filter. * @param node The node to check. 
*/ public boolean accept (Node node) { return (!mFilter.accept (node)); } } --- NEW FILE: OrFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/OrFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts all nodes matching either filter (OR operation). */ public class OrFilter implements NodeFilter { /** * The left hand side. */ protected NodeFilter mLeft; /** * The right hand side. */ protected NodeFilter mRight; /** * Creates a new instance of OrFilter that accepts nodes acceptable to either filter. * @param left One filter. * @param right The other filter. */ public OrFilter (NodeFilter left, NodeFilter right) { mLeft = left; mRight = right; } /** * Accept nodes that are acceptable to either filter. * @param node The node to check. 
*/ public boolean accept (Node node) { return (mLeft.accept (node) || mRight.accept (node)); } } --- NEW FILE: StringFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/StringFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.StringNode; /** * This class accepts all string nodes containing the given string. */ public class StringFilter implements NodeFilter { /** * The string to search for. */ protected String mPattern; /** * Case sensitive toggle. */ protected boolean mCaseSensitive; /** * Creates a new instance of StringFilter that accepts string nodes containing a certain string. * The comparison is case insensitive. * @param pattern The pattern to search for. */ public StringFilter (String pattern) { this (pattern, false); } /** * Creates a new instance of StringFilter that accepts string nodes containing a certain string. 
* @param pattern The pattern to search for. * @param case_sensitive If <code>true</code>, comparisons are performed * respecting case. */ public StringFilter (String pattern, boolean case_sensitive) { mCaseSensitive = case_sensitive; if (mCaseSensitive) mPattern = pattern; else mPattern = pattern.toUpperCase (); } /** * Accept string nodes that contain the string. * @param node The node to check. */ public boolean accept (Node node) { String string; boolean ret; ret = false; if (node instanceof StringNode) { string = ((StringNode)node).getText (); if (!mCaseSensitive) string = string.toUpperCase (); ret = -1 != string.indexOf (mPattern); } return (ret); } } --- NEW FILE: TagNameFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/TagNameFilter.java,v $ // $Author: derrickoswald $ // $Date: 2003/11/08 21:30:58 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.lexer.nodes.TagNode; /** * This class accepts all tags matching the tag name. 
*/ public class TagNameFilter implements NodeFilter { /** * The tag name to match. */ protected String mName; /** * Creates a new instance of TagNameFilter that accepts tags with the given name. * @param name The tag name to match. */ public TagNameFilter (String name) { mName = name.toUpperCase (); } /** * Accept nodes that are tags and have a matching tag name. * This discards non-tag nodes and end tags. * The end tags are available on the enclosing non-end tag. * @param node The node to check. */ public boolean accept (Node node) { return ((node instanceof TagNode) && !((TagNode)node).isEndTag () && ((TagNode)node).getTagName ().equals (mName)); } } --- NEW FILE: package.html --- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <!-- @(#)package.html 1.60 98/01/27 HTMLParser Library v1_4_20031026 - A java-based parser for HTML Copyright (C) Dec 31, 2000 Somik Raha This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA For any questions or suggestions, you can write to me at : Email :so...@in... Postal Address : Somik Raha Extreme Programmer & Coach Industrial Logic Corporation 2583 Cedar Street, Berkeley, CA 94708, USA Website : http://www.industriallogic.com --> <TITLE>Filters Package</TITLE> </HEAD> <BODY> The filters package contains example filters to select only desired nodes. 
For example, to display tags having the "id" attribute, you could use:
<pre>
Parser parser = new Parser ("http://yadda");
parser.parse (new HasAttributeFilter ("id"));
</pre>
These filters can be combined to yield powerful extraction capabilities.
For example, to get a list of links whose content is an image, you could use:
<pre>
NodeList list = new NodeList ();
NodeFilter filter =
    new AndFilter (
        new TagNameFilter ("A"),
        new HasChildFilter (
            new TagNameFilter ("IMG")));
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
    e.nextNode ().collectInto (list, filter);
</pre>
</BODY>
</HTML>
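The filter composition described in package.html can be tried without the library. What follows is only a minimal sketch under stated assumptions: FilterSketch, its nested Node class, the helper methods and(), tagName(), hasChild() and demo(), and this collectInto() are hypothetical stand-ins written for illustration, not the org.htmlparser classes from this commit. The sketch uses lambdas from newer Java for brevity, whereas the committed filters are ordinary classes. It assumes that collecting visits a node and then its descendants, which may differ from what the real collectInto does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only: none of these types come from org.htmlparser.
public class FilterSketch {

    /** Simplified node: an upper-cased tag name, a label for printing, and children. */
    public static final class Node {
        final String name;
        final String label;
        final List<Node> children;

        public Node(String name, String label, Node... children) {
            this.name = name.toUpperCase();
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    /** Same shape as the NodeFilter idea in the commit: a single accept method. */
    public interface NodeFilter {
        boolean accept(Node node);
    }

    /** Accepts nodes acceptable to both filters (compare AndFilter). */
    public static NodeFilter and(NodeFilter left, NodeFilter right) {
        return node -> left.accept(node) && right.accept(node);
    }

    /** Accepts nodes whose tag name matches, ignoring case (compare TagNameFilter). */
    public static NodeFilter tagName(String name) {
        final String upper = name.toUpperCase();
        return node -> node.name.equals(upper);
    }

    /** Accepts nodes with at least one direct child acceptable to the filter (compare HasChildFilter). */
    public static NodeFilter hasChild(NodeFilter filter) {
        return node -> node.children.stream().anyMatch(filter::accept);
    }

    /** Assumed traversal: test the node itself, then recurse into its children. */
    public static void collectInto(Node node, List<Node> out, NodeFilter filter) {
        if (filter.accept(node)) {
            out.add(node);
        }
        for (Node child : node.children) {
            collectInto(child, out, filter);
        }
    }

    /** Builds a small tree and returns the labels of links that contain an image. */
    public static List<String> demo() {
        Node root = new Node("body", "body",
            new Node("a", "link1", new Node("img", "img1")),
            new Node("a", "link2", new Node("span", "span1")),
            new Node("img", "img2"));

        NodeFilter filter = and(tagName("A"), hasChild(tagName("IMG")));

        List<Node> matches = new ArrayList<>();
        collectInto(root, matches, filter);

        List<String> labels = new ArrayList<>();
        for (Node match : matches) {
            labels.add(match.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [link1]
    }
}
```

Only link1 is reported: link2 is an A tag without an IMG child, and img2 is an IMG that is not inside a link. Because each combinator returns another filter, the same pieces nest to any depth, which is the property the AndFilter, OrFilter and NotFilter classes in this commit rely on.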