htmlparser-cvs Mailing List for HTML Parser (Page 30)

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters
In directory sc8-pr-cvs1:/tmp/cvs-serv18855/src/org/htmlparser/filters

Added Files:
	AndFilter.java HasAttributeFilter.java HasChildFilter.java 
	NodeClassFilter.java NotFilter.java OrFilter.java 
	StringFilter.java TagNameFilter.java package.html 
Log Message:
Implement generic node filtering.
Added the NodeFilter interface and the filter package.
Sideline tag-specific scanners; tags now use only one scanner of each type,
TagScanner or CompositeTagScanner (except for ScriptScanner).
Obviated PeekingIterator by moving the META tag semantics to doSemanticAction,
which is much simpler. The old IteratorImpl is now PeekingIteratorImpl, but it is deprecated.

--- NEW FILE: AndFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/AndFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

/**
 * This class accepts all nodes matching both filters (AND operation).
 */
public class AndFilter implements NodeFilter
{
    /**
     * The left hand side.
     */
    protected NodeFilter mLeft;

    /**
     * The right hand side.
     */
    protected NodeFilter mRight;

    /**
     * Creates a new instance of AndFilter that accepts nodes acceptable to both filters.
     * @param left One filter.
     * @param right The other filter.
     */
    public AndFilter (NodeFilter left, NodeFilter right)
    {
        mLeft = left;
        mRight = right;
    }

    /**
     * Accept nodes that are acceptable to both filters.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        return (mLeft.accept (node) && mRight.accept (node));
    }
}

--- NEW FILE: HasAttributeFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasAttributeFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.lexer.nodes.TagNode;

/**
 * This class accepts all tags that have a certain attribute.
 */
public class HasAttributeFilter implements NodeFilter
{
    /**
     * The attribute to check for.
     */
    protected String mAttribute;

    /**
     * Creates a new instance of HasAttributeFilter that accepts tags with the given attribute.
     * @param attribute The attribute to search for.
     */
    public HasAttributeFilter (String attribute)
    {
        mAttribute = attribute.toUpperCase ();
    }

    /**
     * Accept tags with a certain attribute.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        TagNode tag;
        boolean ret;

        ret = false;
        if (node instanceof TagNode)
        {
            tag = (TagNode)node;
            ret = null != tag.getAttributeEx (mAttribute);
        }

        return (ret);
    }
}

--- NEW FILE: HasChildFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasChildFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;

/**
 * This class accepts all tags that have a child acceptable to the filter.
 */
public class HasChildFilter implements NodeFilter
{
    /**
     * The filter to apply to children.
     */
    protected NodeFilter mFilter;

    /**
     * Creates a new instance of HasChildFilter that accepts tags with children acceptable to the filter.
     * Similar to asking for the parent of a node returned by the given
     * filter, but where multiple children may be acceptable, this class
     * will only accept the parent once.
     * @param filter The filter to apply to children.
     */
    public HasChildFilter (NodeFilter filter)
    {
        mFilter = filter;
    }

    /**
     * Accept tags with children acceptable to the filter.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        CompositeTag tag;
        NodeList children;
        boolean ret;

        ret = false;
        if (node instanceof CompositeTag)
        {
            tag = (CompositeTag)node;
            children = tag.getChildren ();
            // guard against a null child list (e.g. a tag with no children)
            if (null != children)
                for (int i = 0; i < children.size (); i++)
                    if (mFilter.accept (children.elementAt (i)))
                    {
                        ret = true;
                        break;
                    }
        }

        return (ret);
    }
}

--- NEW FILE: NodeClassFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NodeClassFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

/**
 * This class accepts all nodes of a given class (including subclasses).
 */
public class NodeClassFilter implements NodeFilter
{
    /**
     * The class to match.
     */
    protected Class mClass;

    /**
     * Creates a new instance of NodeClassFilter that accepts nodes of the given class.
     * @param cls The class to match.
     */
    public NodeClassFilter (Class cls)
    {
        mClass = cls;
    }

    /**
     * Accept nodes that are instances of the class provided in the constructor (including subclasses).
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        return (mClass.isAssignableFrom (node.getClass ()));
    }
}
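
Editorial note on the check above: `mClass.isAssignableFrom (node.getClass ())` is true when the node's class is `mClass` itself or a subclass of it, so the filter also matches subclasses. The following is a minimal sketch of that direction, using JDK classes as stand-ins for node types. The `ClassMatchSketch` class and its `matches` helper are hypothetical names for illustration and are not part of the library.

```java
// Stand-alone illustration; ClassMatchSketch and matches() are hypothetical names.
class ClassMatchSketch
{
    // Mirrors the test in NodeClassFilter.accept: true when obj's class
    // is cls itself or a subclass (or implementation) of cls.
    static boolean matches (Class<?> cls, Object obj)
    {
        return (cls.isAssignableFrom (obj.getClass ()));
    }

    public static void main (String[] args)
    {
        Object value = Integer.valueOf (42);
        System.out.println (matches (Number.class, value));  // true: Integer extends Number
        System.out.println (matches (Integer.class, value)); // true: exact class
        System.out.println (matches (String.class, value));  // false: unrelated class
    }
}
```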

--- NEW FILE: NotFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/NotFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

/**
 * This class accepts all nodes not acceptable to the filter.
 */
public class NotFilter implements NodeFilter
{
    /**
     * The filter to gainsay.
     */
    protected NodeFilter mFilter;

    /**
     * Creates a new instance of NotFilter that accepts nodes not acceptable to the filter.
     * @param filter The filter to consult.
     */
    public NotFilter (NodeFilter filter)
    {
        mFilter = filter;
    }

    /**
     * Accept nodes that are not acceptable to the filter.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        return (!mFilter.accept (node));
    }
}

--- NEW FILE: OrFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/OrFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

/**
 * This class accepts all nodes matching either filter (OR operation).
 */
public class OrFilter implements NodeFilter
{
    /**
     * The left hand side.
     */
    protected NodeFilter mLeft;

    /**
     * The right hand side.
     */
    protected NodeFilter mRight;

    /**
     * Creates a new instance of OrFilter that accepts nodes acceptable to either filter.
     * @param left One filter.
     * @param right The other filter.
     */
    public OrFilter (NodeFilter left, NodeFilter right)
    {
        mLeft = left;
        mRight = right;
    }

    /**
     * Accept nodes that are acceptable to either filter.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        return (mLeft.accept (node) || mRight.accept (node));
    }
}

--- NEW FILE: StringFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/StringFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.lexer.nodes.StringNode;

/**
 * This class accepts all string nodes containing the given string.
 */
public class StringFilter implements NodeFilter
{
    /**
     * The string to search for.
     */
    protected String mPattern;

    /**
     * Case sensitive toggle.
     */
    protected boolean mCaseSensitive;

    /**
     * Creates a new instance of StringFilter that accepts string nodes containing a certain string.
     * The comparison is case insensitive.
     * @param pattern The pattern to search for.
     */
    public StringFilter (String pattern)
    {
        this (pattern, false);
    }

    /**
     * Creates a new instance of StringFilter that accepts string nodes containing a certain string.
     * @param pattern The pattern to search for.
     * @param case_sensitive If <code>true</code>, comparisons are performed
     * respecting case.
     */
    public StringFilter (String pattern, boolean case_sensitive)
    {
        mCaseSensitive = case_sensitive;
        if (mCaseSensitive)
            mPattern = pattern;
        else
            mPattern = pattern.toUpperCase ();
    }

    /**
     * Accept string nodes that contain the string.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        String string;
        boolean ret;

        ret = false;
        if (node instanceof StringNode)
        {
            string = ((StringNode)node).getText ();
            if (!mCaseSensitive)
                string = string.toUpperCase ();
            ret = -1 != string.indexOf (mPattern);
        }

        return (ret);
    }
}
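
Editorial note: the matching in StringFilter comes down to upper-casing both sides when case is ignored, then calling `indexOf`. The sketch below shows that idea in isolation on plain strings. `ContainsSketch` and `contains` are hypothetical names, and unlike StringFilter (which upper-cases the pattern once in the constructor) the sketch converts the pattern on every call.

```java
// Stand-alone illustration; ContainsSketch and contains() are hypothetical names.
class ContainsSketch
{
    // Same approach as StringFilter: when case is ignored, upper-case both
    // the text and the pattern, then look for the pattern with indexOf.
    static boolean contains (String text, String pattern, boolean caseSensitive)
    {
        if (!caseSensitive)
        {
            text = text.toUpperCase ();
            pattern = pattern.toUpperCase ();
        }
        return (-1 != text.indexOf (pattern));
    }

    public static void main (String[] args)
    {
        System.out.println (contains ("Hello World", "world", false)); // true
        System.out.println (contains ("Hello World", "world", true));  // false
    }
}
```

Note that `toUpperCase ()` without arguments uses the default locale, so results for some characters can vary between environments.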

--- NEW FILE: TagNameFilter.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/TagNameFilter.java,v $
// $Author: derrickoswald $
// $Date: 2003/11/08 21:30:58 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.lexer.nodes.TagNode;

/**
 * This class accepts all tags matching the tag name.
 */
public class TagNameFilter
    implements
        NodeFilter
{
    /**
     * The tag name to match.
     */
    protected String mName;

    /**
     * Creates a new instance of TagNameFilter that accepts tags with the given name.
     * @param name The tag name to match.
     */
    public TagNameFilter (String name)
    {
        mName = name.toUpperCase ();
    }

    /**
     * Accept nodes that are tags and have a matching tag name.
     * This discards non-tag nodes and end tags.
     * The end tags are available on the enclosing non-end tag.
     * @param node The node to check.
     */
    public boolean accept (Node node)
    {
        return ((node instanceof TagNode) &&
                !((TagNode)node).isEndTag () &&
                ((TagNode)node).getTagName ().equals (mName));
    }
}

--- NEW FILE: package.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>
<!--

  @(#)package.html  1.60 98/01/27

 HTMLParser Library v1_4_20031026 - A java-based parser for HTML
 Copyright (C) Dec 31, 2000 Somik Raha

 This library is free software; you can redistribute it and/or
 modify it under the terms of the GNU Lesser General Public
 License as published by the Free Software Foundation; either
 version 2.1 of the License, or (at your option) any later version.

 This library is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 Lesser General Public License for more details.

 You should have received a copy of the GNU Lesser General Public
 License along with this library; if not, write to the Free Software
 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

 For any questions or suggestions, you can write to me at :
 Email :so...@in...

 Postal Address :
 Somik Raha
 Extreme Programmer & Coach
 Industrial Logic Corporation
 2583 Cedar Street, Berkeley,
 CA 94708, USA
 Website : http://www.industriallogic.com

-->
<TITLE>Filters Package</TITLE>
</HEAD>
<BODY>
The filters package contains example filters to select only desired nodes.
For example, to display tags having the "id" attribute, you could use:
<pre>
Parser parser = new Parser ("http://yadda");
parser.parse (new HasAttributeFilter ("id"));
</pre>
These filters can be combined to yield powerful extraction capabilities.
For example, to get a list of links whose content is an image, you could use:
<pre>
NodeList list = new NodeList ();
NodeFilter filter =
    new AndFilter (
        new TagNameFilter ("A"),
        new HasChildFilter (
            new TagNameFilter ("IMG")));
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
    e.nextNode ().collectInto (list, filter);
</pre>
</BODY>
</HTML>
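
Editorial note: the link-with-image example in package.html depends on the parser classes, which are not part of this message. To show how the AND and has-child combinators compose, here is a self-contained sketch under stated assumptions: `SketchNode`, `SketchFilter` and `FilterCompositionSketch` are hypothetical stand-ins for the library's Node, NodeFilter and filter classes, and they model only a tag name and a list of direct children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed tag: a name plus direct children.
class SketchNode
{
    final String name;
    final List<SketchNode> children = new ArrayList<SketchNode> ();

    SketchNode (String name, SketchNode... kids)
    {
        this.name = name;
        for (SketchNode kid : kids)
            children.add (kid);
    }
}

// Hypothetical stand-in for NodeFilter: one accept method.
interface SketchFilter
{
    boolean accept (SketchNode node);
}

class FilterCompositionSketch
{
    // Accepts nodes with the given name (compare TagNameFilter).
    static SketchFilter named (String name)
    {
        return node -> node.name.equals (name);
    }

    // Accepts nodes acceptable to both filters (compare AndFilter).
    static SketchFilter and (SketchFilter left, SketchFilter right)
    {
        return node -> left.accept (node) && right.accept (node);
    }

    // Accepts nodes with at least one direct child acceptable
    // to the filter (compare HasChildFilter).
    static SketchFilter hasChild (SketchFilter filter)
    {
        return node -> node.children.stream ().anyMatch (filter::accept);
    }

    public static void main (String[] args)
    {
        SketchFilter imageLink = and (named ("A"), hasChild (named ("IMG")));
        SketchNode linkWithImage = new SketchNode ("A", new SketchNode ("IMG"));
        SketchNode linkWithText = new SketchNode ("A", new SketchNode ("B"));
        System.out.println (imageLink.accept (linkWithImage)); // true
        System.out.println (imageLink.accept (linkWithText));  // false
    }
}
```

Because each combinator returns another filter, the same pieces can be nested to any depth, which is the property the package description relies on.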

htmlparser-cvs — syncmail email notification of CVS commits