Thread: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners Scanner.java,NONE,1.1 CompositeTagScanner.ja

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv12747/org/htmlparser/scanners

Modified Files:
	CompositeTagScanner.java JspScanner.java ScriptScanner.java 
	TagScanner.java package.html 
Added Files:
	Scanner.java 
Log Message:
Reduce recursion on the JVM stack in CompositeTagScanner.
Pass a stack of open tags to the scanner.
Add smarter tag closing by walking up the stack on encountering an unopened end tag.
Avoids a problem with bad HTML such as that found at
http://scores.nba.com/games/20031029/scoreboard.html by Shaun Roach.
Added testInvalidNesting to CompositeTagScanner Test based on the above.
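
The stack-based closing described in the log message can be illustrated with a small standalone sketch. This is not the code from this commit and does not use the htmlparser classes. StackNestingSketch, Node and parse are hypothetical names, and the string tokens stand in for lexer nodes. The sketch assumes every start tag is composite. It keeps open tags on an explicit stack instead of recursing, and when an end tag does not match the innermost open tag it walks up the stack to find a matching opener and closes every tag it walked over.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackNestingSketch {
    /** Hypothetical stand-in for a parsed node: either text or a tag with children. */
    static final class Node {
        final String name;
        final boolean text;
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean text) {
            this.name = name;
            this.text = text;
        }

        String render() {
            if (text)
                return name;
            List<String> parts = new ArrayList<>();
            for (Node child : children)
                parts.add(child.render());
            return name + "(" + String.join(",", parts) + ")";
        }
    }

    /** Build a tree from tokens such as "<td>", "</td>" or plain text, without recursion. */
    static String parse(String... tokens) {
        Node root = new Node("root", false);
        Deque<Node> open = new ArrayDeque<>(); // explicit parse stack, innermost tag first
        open.push(root);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                // walk up the stack looking for an open tag this end tag can close
                boolean found = false;
                for (Node candidate : open) {
                    if (candidate == root)
                        break;
                    if (candidate.name.equals(name)) {
                        found = true;
                        break;
                    }
                }
                if (found) {
                    // close every tag walked over, then the matching one
                    while (!open.pop().name.equals(name)) {
                        // tags popped here never saw their own end tag
                    }
                } else {
                    // no opener anywhere: keep the stray end tag as a simple child
                    open.peek().children.add(new Node(token, true));
                }
            } else if (token.startsWith("<")) {
                Node node = new Node(token.substring(1, token.length() - 1), false);
                open.peek().children.add(node);
                open.push(node); // "fake recursion": descend by pushing, not by calling
            } else {
                open.peek().children.add(new Node(token, true));
            }
        }
        return root.render(); // anything still open is treated as implicitly closed
    }

    public static void main(String[] args) {
        // <td> is never closed; </tr> closes it by walking up the stack
        System.out.println(parse("<table>", "<tr>", "<td>", "x", "</tr>", "y", "</table>"));
        // prints root(table(tr(td(x)),y))
    }
}
```

In the sketch, the unclosed td is finished off when the tr end tag arrives, so the later text lands in the table rather than inside the cell. An end tag with no opener anywhere on the stack is simply kept as a child.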

--- NEW FILE: Scanner.java ---
// HTMLParser Library $Name:  $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/Scanner.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/20 23:47:55 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.scanners;

import org.htmlparser.lexer.Lexer;
import org.htmlparser.tags.Tag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Generic interface for scanning.
 * Tags needing specialized operations can provide an object that implements
 * this interface via getThisScanner().
 * By default non-composite tags simply perform the semantic action and
 * return while composite tags will gather their children.
 */
public interface Scanner
{
    /**
     * Scan the tag.
     * The Lexer is provided in order to do a lookahead operation.
     * @param tag HTML tag to be scanned for identification.
     * @param lexer Provides html page access.
     * @param stack The parse stack. May contain pending tags that enclose
     * this tag. Nodes on the stack should be considered incomplete.
     * @return The resultant tag (may be unchanged).
     * @exception ParserException if an unrecoverable problem occurs.
     */
    public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException;
}

Index: CompositeTagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
retrieving revision 1.83
retrieving revision 1.84
diff -C2 -d -r1.83 -r1.84
*** CompositeTagScanner.java	8 Dec 2003 13:13:59 -0000	1.83
--- CompositeTagScanner.java	20 Dec 2003 23:47:55 -0000	1.84
***************
*** 1,4 ****
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
! // Copyright (C) Dec 31, 2000 Somik Raha
  //
  // This library is free software; you can redistribute it and/or
--- 1,12 ----
! // HTMLParser Library $Name$ - A java-based parser for HTML
! // http://sourceforge.org/projects/htmlparser
! // Copyright (C) 2003 Somik Raha
! //
! // Revision Control Information
! //
! // $Source$
! // $Author$
! // $Date$
! // $Revision$
  //
  // This library is free software; you can redistribute it and/or
***************
*** 9,29 ****
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
! //
! // For any questions or suggestions, you can write to me at :
! // Email :so...@in...
  //
- // Postal Address :
- // Somik Raha
- // Extreme Programmer & Coach
- // Industrial Logic Corporation
- // 2583 Cedar Street, Berkeley,
- // CA 94708, USA
- // Website : http://www.industriallogic.com

  package org.htmlparser.scanners;
--- 17,27 ----
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  //

  package org.htmlparser.scanners;
***************
*** 37,40 ****
--- 35,39 ----
  import org.htmlparser.lexer.Page;
  import org.htmlparser.lexer.nodes.Attribute;
+ import org.htmlparser.scanners.Scanner;
  import org.htmlparser.tags.CompositeTag;
  import org.htmlparser.tags.Tag;
***************
*** 43,167 ****

  /**
!  * To create your own scanner that can create tags tht hold children, create a subclass of this class.
!  * The composite tag scanner can be configured with:<br>
!  * <ul>
!  * <li>Tags which will trigger a match</li>
!  * <li>Tags which when encountered before a legal end tag, should force a correction</li>
!  * </ul>
!  * Here are examples of each:<BR>
!  * <B>Tags which will trigger a match</B>
!  * If we wish to recognize &lt;mytag&gt;,
!  * <pre>
!  * MyScanner extends CompositeTagScanner {
!  *   private static final String [] MATCH_IDS = { "MYTAG" };
!  *   MyScanner() {
!  *      super(MATCH_IDS);
!  *   }
!  *   ...
!  * }
!  * </pre>
!  * <B>Tags which force correction</B>
!  * If we wish to insert end tags if we get a </BODY> or </HTML> without recieving
!  * &lt;/mytag&gt;
!  * <pre>
!  * MyScanner extends CompositeTagScanner {
!  *   private static final String [] MATCH_IDS = { "MYTAG" };
!  *   private static final String [] ENDERS = {};
!  *   private static final String [] END_TAG_ENDERS = { "BODY", "HTML" };
!  *   MyScanner() {
!  *      super(MATCH_IDS, ENDERS, END_TAG_ENDERS, true);
!  *   }
!  *   ...
!  * }
!  * </pre>
!  * <B>Preventing children of same type</B>
!  * This is useful when you know that a certain tag can never hold children of its own type.
!  * e.g. &lt;FORM&gt; can never have more form tags within it. If it does, it is an error and should
!  * be corrected. Specify the tagEnders set to contain (at least) the match ids.
!  * <pre>
!  * MyScanner extends CompositeTagScanner {
!  *   private static final String [] MATCH_IDS = { "FORM" };
!  *   private static final String [] END_TAG_ENDERS = { "BODY", "HTML" };
!  *   MyScanner() {
!  *      super(MATCH_IDS, MATCH_IDS, END_TAG_ENDERS, false);
!  *   }
!  *   ...
!  * }
!  * </pre>
!  * Inside the scanner, use createTag() to specify what tag needs to be created.
   */
  public class CompositeTagScanner extends TagScanner
  {
!     protected Set tagEnderSet;
!     private Set endTagEnderSet;
!     private boolean balance_quotes;
! 
!     public CompositeTagScanner()
!     {
!         this(new String[] {});
!     }
! 
!     public CompositeTagScanner(String [] tagEnders)
!     {
!         this("",tagEnders);
!     }
! 
!     public CompositeTagScanner(String filter)
!     {
!         this(filter,new String [] {});
!     }
! 
!     public CompositeTagScanner(
!         String filter,
!         String [] tagEnders) 
!     {
!         this(filter,tagEnders,new String[] {});
!     }

!     public CompositeTagScanner(
!         String filter,
!         String [] tagEnders,
!         String [] endTagEnders)
!     {
!         this(filter,tagEnders,endTagEnders, false);
!     }

!    /**
!     * Constructor specifying all member fields.
!     * @param filter A string that is used to match which tags are to be allowed
!     * to pass through. This can be useful when one wishes to dynamically filter
!     * out all tags except one type which may be programmed later than the parser.
!     * @param tagEnders The non-endtag tag names which signal that no closing
!     * end tag was found. For example, encountering &lt;FORM&gt; while
!     * scanning a &lt;A&gt; link tag would mean that no &lt;/A&gt; was found
!     * and needs to be corrected.
!     * @param endTagEnders The endtag names which signal that no closing end
!     * tag was found. For example, encountering &lt;/HTML&gt; while
!     * scanning a &lt;BODY&gt; tag would mean that no &lt;/BODY&gt; was found
!     * and needs to be corrected. These items are not prefixed by a '/'.
!     * @param balance_quotes <code>true</code> if scanning string nodes needs to
!     * honour quotes. For example, ScriptScanner defines this <code>true</code>
!     * so that text within &lt;SCRIPT&gt;&lt;/SCRIPT&gt; ignores tag-like text
!     * within quotes.
!     */
!     public CompositeTagScanner(
!         String filter,
!         String [] tagEnders,
!         String [] endTagEnders,
!         boolean balance_quotes) 
      {
-         super(filter);
-         this.balance_quotes = balance_quotes;
-         this.tagEnderSet = new HashSet();
-         for (int i=0;i<tagEnders.length;i++)
-             tagEnderSet.add(tagEnders[i]);
-         this.endTagEnderSet = new HashSet();
-         for (int i=0;i<endTagEnders.length;i++)
-             endTagEnderSet.add(endTagEnders[i]);
      }

      /**
       * Collect the children.
!      * An initial test is performed for an empty XML tag, in which case
       * the start tag and end tag of the returned tag are the same and it has
       * no children.<p>
--- 42,78 ----

  /**
!  * The main scanning logic for nested tags.
!  * When asked to scan, this class gathers nodes into a heirarchy of tags.
   */
  public class CompositeTagScanner extends TagScanner
  {
!     /**
!      * Determine whether to use JVM or NodeList stack.
!      * This can be set to true to get the original behaviour of
!      * recursion into composite tags on the JVM stack.
!      * This may lead to StackOverFlowException problems in some cases
!      * i.e. Windows.
!      */
!     private static final boolean mUseJVMStack = false;

!     /**
!      * Determine whether unexpected end tags should cause stack roll-up.
!      * This can be set to true to get the original behaviour of gathering
!      * end tags into whatever tag is open.
!      * This can be expensive, but should only be needed in the presence of
!      * bad HTML.
!      */
!     private static final boolean mLeaveEnds = false;

!     /**
!      * Create a composite tag scanner.
!      */
!     public CompositeTagScanner ()
      {
      }

      /**
       * Collect the children.
!      * <p>An initial test is performed for an empty XML tag, in which case
       * the start tag and end tag of the returned tag are the same and it has
       * no children.<p>
***************
*** 171,221 ****
       * In the latter case, a virtual end tag is created.
       * Each node found that is not the end tag is added to
!      * the list of children.<p>
!      * The scanner's {@link #createTag} method is called with details about
!      * the start tag, end tag and children. The attributes from the start tag
!      * will wind up duplicated in the newly created tag, so the start tag is
!      * kind of redundant (and may be removed in subsequent refactoring).
!      * @param tag The tag this scanner is responsible for. This will be the
!      * start (and possibly end) tag passed to {@link #createTag}.
!      * @param url The url for the page the tag is discovered on.
       * @param lexer The source of subsequent nodes.
!      * @return The scanner specific tag from the call to {@link #createTag}.
       */
!     public Tag scan (Tag tag, String url, Lexer lexer) throws ParserException
      {
          Node node;
!         NodeList nodeList;
!         Tag endTag;
!         String match;
          String name;
!         TagScanner scanner;
          CompositeTag ret;

!         nodeList = new NodeList ();
!         endTag = null;
!         match = tag.getTagName ();

!         if (tag.isEmptyXmlTag ())
!             endTag = tag;
          else
              do
              {
!                 node = lexer.nextNode (balance_quotes);
                  if (null != node)
                  {
                      if (node instanceof Tag)
                      {
!                         Tag next = (Tag)node;
                          name = next.getTagName ();
                          // check for normal end tag
!                         if (next.isEndTag () && name.equals (match))
                          {
!                             endTag = next;
                              node = null;
                          }
!                         else if (isTagToBeEndedFor (tag, next)) // check DTD
                          {
!                             // insert a virtual end tag and backup one node
!                             endTag = createVirtualEndTag (tag, lexer.getPage (), next.getStartPosition ());
                              lexer.setPosition (next.getStartPosition ());
                              node = null;
--- 82,131 ----
       * In the latter case, a virtual end tag is created.
       * Each node found that is not the end tag is added to
!      * the list of children. The end tag is special and not a child.<p>
!      * Nodes that also have a CompositeTagScanner as their scanner are
!      * recursed into, which provides the nested structure of an HTML page.
!      * This method operates in two possible modes, depending on a private boolean.
!      * It can recurse on the JVM stack, which has caused some overflow problems
!      * in the past, or it can use the supplied stack argument to nest scanning
!      * of child tags within itself. The former is left as an option in the code,
!      * mostly to help subsequent modifiers visualize what the internal nesting
!      * is doing.
!      * @param tag The tag this scanner is responsible for.
       * @param lexer The source of subsequent nodes.
!      * @param stack The parse stack. May contain pending tags that enclose
!      * this tag.
!      * @return The resultant tag (may be unchanged).
       */
!     public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException
      {
          Node node;
!         Tag next;
          String name;
!         Scanner scanner;
          CompositeTag ret;

!         ret = (CompositeTag)tag;

!         if (ret.isEmptyXmlTag ())
!             ret.setEndTag (ret);
          else
              do
              {
!                 node = lexer.nextNode (false);
                  if (null != node)
                  {
                      if (node instanceof Tag)
                      {
!                         next = (Tag)node;
                          name = next.getTagName ();
                          // check for normal end tag
!                         if (next.isEndTag () && name.equals (ret.getTagName ()))
                          {
!                             ret.setEndTag (next);
                              node = null;
                          }
!                         else if (isTagToBeEndedFor (ret, next)) // check DTD
                          {
!                             // backup one node. insert a virtual end tag later
                              lexer.setPosition (next.getStartPosition ());
                              node = null;
***************
*** 225,249 ****
                              // now recurse if there is a scanner for this type of tag
                              scanner = next.getThisScanner ();
!                             if ((null != scanner) && scanner.evaluate (next, null))
!                                 node = scanner.scan (next, lexer.getPage ().getUrl (), lexer);
                          }
                      }

!                     if (null != node)
!                         nodeList.add (node);
                  }
              }
              while (null != node);

!         if (null == endTag)
!             endTag = createVirtualEndTag (tag, lexer.getPage (), lexer.getCursor ().getPosition ());
! 
!         ret = (CompositeTag)tag;
!         ret.setEndTag (endTag);
!         ret.setChildren (nodeList);
!         for (int i = 0; i < ret.getChildCount (); i++)
!             ret.childAt (i).setParent (ret);
!         endTag.setParent (ret);
!         ret.doSemanticAction ();

          return (ret);
--- 135,275 ----
                              // now recurse if there is a scanner for this type of tag
                              scanner = next.getThisScanner ();
!                             if (null != scanner)
!                             {
!                                 if (mUseJVMStack)
!                                 {   // JVM stack recursion
!                                     node = scanner.scan (next, lexer, stack);
!                                     addChild (ret, node);
!                                 }
!                                 else
!                                 {
!                                     // fake recursion:
!                                     if ((scanner == this) && (next instanceof CompositeTag))
!                                     {
!                                         CompositeTag ondeck = (CompositeTag)next;
!                                         if (ondeck.isEmptyXmlTag ())
!                                         {
!                                             ondeck.setEndTag (ondeck);
!                                             finishTag (ondeck, lexer);
!                                             addChild (ret, ondeck);
!                                         }
!                                         else
!                                         {
!                                             stack.add (ret);
!                                             ret = ondeck;
!                                         }
!                                     }
!                                     else
!                                     {   // normal recursion if switching scanners
!                                         node = scanner.scan (next, lexer, stack);
!                                         addChild (ret, node);
!                                     }
!                                 }
!                             }
!                             else
!                                 addChild (ret, next);
!                         }
!                         else
!                         {
!                             if (!mUseJVMStack && !mLeaveEnds)
!                             {
!                                 // Since all non-end tags are consumed by the
!                                 // previous clause, we're here because we have an
!                                 // end tag with no opening tag... this could be bad.
!                                 // There are two cases...
!                                 // 1) The tag hasn't been registered, in which case
!                                 // we just add it as a simple child, like it's
!                                 // opening tag
!                                 // 2) There may be an opening tag further up the
!                                 // parse stack that needs closing.
!                                 // So, we ask the factory for a node like this one
!                                 // (since end tags never have scanners) and see
!                                 // if it's scanner is a composite tag scanner.
!                                 // If it is we walk up the parse stack looking for
!                                 // something that needs this end tag to finish it.
!                                 // If there is something, we close off all the tags
!                                 // walked over and continue on as if nothing
!                                 // happened.
!                                 Vector attributes = new Vector ();
!                                 attributes.addElement (new Attribute (name, null));
!                                 Tag opener = (Tag)lexer.getNodeFactory ().createTagNode (
!                                     next.getPage (), next.getStartPosition (), next.getEndPosition (),
!                                     attributes);
! 
!                                 scanner = opener.getThisScanner ();
!                                 if ((null != scanner) && (scanner == this))
!                                 {
!                                     // uh-oh
!                                     int index = -1;
!                                     for (int i = stack.size () - 1; (-1 == index) && (i >= 0); i--)
!                                     {
!                                         // short circuit here... assume everything on the stack is a CompositeTag and has this as it's scanner
!                                         // we'll need to stop if either of those conditions isn't met
!                                         CompositeTag boffo = (CompositeTag)stack.elementAt (i);
!                                         if (name.equals (boffo.getTagName ()))
!                                             index = i;
!                                         else if (isTagToBeEndedFor (boffo, next)) // check DTD
!                                             index = i;
!                                     }
!                                     if (-1 != index)
!                                     {
!                                         // finish off the current one first
!                                         finishTag (ret, lexer);
!                                         addChild ((CompositeTag)stack.elementAt (stack.size () - 1), ret);
!                                         for (int i = stack.size () - 1; i > index; i--)
!                                         {
!                                             CompositeTag fred = (CompositeTag)stack.remove (i);
!                                             finishTag (fred, lexer);
!                                             addChild ((CompositeTag)stack.elementAt (i - 1), fred);
!                                         }
!                                         ret = (CompositeTag)stack.remove (index);
!                                         node = null;
!                                     }
!                                     else
!                                         addChild (ret, next); // default behaviour
!                                 }
!                                 else
!                                     addChild (ret, next); // default behaviour
!                             }
!                             else
!                                 addChild (ret, next);
                          }
                      }
+                     else
+                         addChild (ret, node);
+                 }

!                 if (!mUseJVMStack)
!                 {
!                     // handle coming out of fake recursion
!                     if (null == node)
!                     {
!                         int depth = stack.size ();
!                         if (0 != depth)
!                         {
!                             node = stack.elementAt (depth - 1);
!                             if (node instanceof CompositeTag)
!                             {
!                                 CompositeTag precursor = (CompositeTag)node;
!                                 scanner = precursor.getThisScanner ();
!                                 if (scanner == this)
!                                 {
!                                     stack.remove (depth - 1);
!                                     finishTag (ret, lexer);
!                                     addChild (precursor, ret);
!                                     ret = precursor;
!                                 }
!                                 else
!                                     node = null; // normal recursion
!                             }
!                             else
!                                 node = null; // normal recursion
!                         }
!                     }
                  }
              }
              while (null != node);

!         finishTag (ret, lexer);

          return (ret);
***************
*** 251,264 ****

      /**
       * Creates an end tag with the same name as the given tag.
-      * NOTE: This does not call the {@link #createTag} method, but may in the
-      * future after refactoring.
       * @param tag The tag to end.
       * @param page The page the tag is on (virtually).
       * @param position The offset into the page at which the tag is to
       * be anchored.
!      * @return An end tag with the name "/" + tag.getTagName() and a start
!      * and end position at the given position. The fact these are equal may
!      * be used to distinguish it as a virtual tag.
       */
      protected Tag createVirtualEndTag (Tag tag, Page page, int position)
--- 277,319 ----

      /**
+      * Add a child to the given tag.
+      * @param parent The parent tag.
+      * @param child The child node.
+      */
+     protected void addChild (Tag parent, Node child)
+     {
+         if (null == parent.getChildren ())
+             parent.setChildren (new NodeList ());
+         child.setParent (parent);
+         parent.getChildren ().add (child);
+     }
+ 
+     /**
+      * Finish off a tag.
+      * Perhap add a virtual end tag.
+      * Set the end tag parent as this tag.
+      * Perform the semantic acton.
+      * @param tag The tag to finish off.
+      * @param lexer A lexer positioned at the end of the tag.
+      */
+     protected void finishTag (CompositeTag tag, Lexer lexer)
+         throws
+             ParserException
+     {
+         if (null == tag.getEndTag ())
+             tag.setEndTag (createVirtualEndTag (tag, lexer.getPage (), lexer.getCursor ().getPosition ()));
+         tag.getEndTag ().setParent (tag);
+         tag.doSemanticAction ();
+     }
+     
+     /**
       * Creates an end tag with the same name as the given tag.
       * @param tag The tag to end.
       * @param page The page the tag is on (virtually).
       * @param position The offset into the page at which the tag is to
       * be anchored.
!      * @return An end tag with the name '"/" + tag.getTagName()' and a start
!      * and end position at the given position. The fact these positions are
!      * equal may be used to distinguish it as a virtual tag later on.
       */
      protected Tag createVirtualEndTag (Tag tag, Page page, int position)
***************
*** 277,287 ****

      /**
!      * For composite tags this shouldn't be used and hence throws an exception.
       */
-     public Tag createTag (Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException
-     {
-         throw new ParserException ("composite tags shouldn't be using this");
-     }
- 
      public final boolean isTagToBeEndedFor (Tag current, Tag tag)
      {
--- 332,344 ----

      /**
!      * Determine if the current tag should be terminated by the given tag.
!      * Examines the 'enders' or 'end tag enders' lists of the current tag
!      * for a match with the given tag. Which list is chosen depends on whether
!      * tag is an end tag ('end tag enders') or not ('enders').
!      * @param current The tag that might need to be ended.
!      * @param tag The candidate tag that might end the current one.
!      * @return <code>true</code> if the name of the given tag is a member of
!      * the appropriate list.
       */
      public final boolean isTagToBeEndedFor (Tag current, Tag tag)
      {

Index: JspScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/JspScanner.java,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** JspScanner.java	8 Dec 2003 01:31:52 -0000	1.33
--- JspScanner.java	20 Dec 2003 23:47:55 -0000	1.34
***************
*** 1,4 ****
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
! // Copyright (C) Dec 31, 2000 Somik Raha
  //
  // This library is free software; you can redistribute it and/or
--- 1,12 ----
! // HTMLParser Library $Name$ - A java-based parser for HTML
! // http://sourceforge.org/projects/htmlparser
! // Copyright (C) 2003 Somik Raha
! //
! // Revision Control Information
! //
! // $Source$
! // $Author$
! // $Date$
! // $Revision$
  //
  // This library is free software; you can redistribute it and/or
***************
*** 9,71 ****
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
! //
! // For any questions or suggestions, you can write to me at :
! // Email :so...@in...
  //
- // Postal Address :
- // Somik Raha
- // Extreme Programmer & Coach
- // Industrial Logic Corporation
- // 2583 Cedar Street, Berkeley,
- // CA 94708, USA
- // Website : http://www.industriallogic.com
- 

  package org.htmlparser.scanners;

! import java.util.Vector;
! import org.htmlparser.lexer.Page;
! /////////////////////////
! // HTML Parser Imports //
! /////////////////////////
! import org.htmlparser.tags.JspTag;
! import org.htmlparser.tags.Tag;
! import org.htmlparser.util.ParserException;
! 
! public class JspScanner extends TagScanner {
! 
!     public JspScanner() {
!         super();
!     }
! 
!     public JspScanner(String filter) {
!         super(filter);
!     }
! 
!     public String [] getID() {
!         String [] ids = new String[3];
!         ids[0] = "%";
!         ids[1] = "%=";
!         ids[2] = "%@";
!         return ids;
!     }
! 
!     public Tag createTag (Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException
      {
-         JspTag ret;
-         
-         ret = new JspTag ();
-         ret.setPage (page);
-         ret.setStartPosition (start);
-         ret.setEndPosition (end);
-         ret.setAttributesEx (attributes);
-         
-         return (ret);
      }
  }
--- 17,41 ----
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  //

  package org.htmlparser.scanners;

! /**
!  * Placeholder for <em>yet to be written</em> scanner for JSP tags.
!  * This vacuous class does nothing special at the moment.
!  */
! public class JspScanner extends TagScanner
! {
!     /**
!      * Create a new JspScanner.
!      */
!     public JspScanner ()
      {
      }
  }

Index: ScriptScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** ScriptScanner.java	8 Dec 2003 01:31:52 -0000	1.53
--- ScriptScanner.java	20 Dec 2003 23:47:55 -0000	1.54
***************
*** 1,4 ****
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
! // Copyright (C) Dec 31, 2000 Somik Raha
  //
  // This library is free software; you can redistribute it and/or
--- 1,12 ----
! // HTMLParser Library $Name$ - A java-based parser for HTML
! // http://sourceforge.org/projects/htmlparser
! // Copyright (C) 2003 Somik Raha
! //
! // Revision Control Information
! //
! // $Source$
! // $Author$
! // $Date$
! // $Revision$
  //
  // This library is free software; you can redistribute it and/or
***************
*** 9,29 ****
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
! //
! // For any questions or suggestions, you can write to me at :
! // Email :so...@in...
  //
- // Postal Address :
- // Somik Raha
- // Extreme Programmer & Coach
- // Industrial Logic Corporation
- // 2583 Cedar Street, Berkeley,
- // CA 94708, USA
- // Website : http://www.industriallogic.com

  package org.htmlparser.scanners;
--- 17,27 ----
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  //

  package org.htmlparser.scanners;
***************
*** 53,85 ****
          CompositeTagScanner
  {
!     private static final String SCRIPT_END_TAG = "</SCRIPT>";
!     private static final String MATCH_NAME [] = {"SCRIPT"};
!     private static final String ENDERS [] = {"BODY", "HTML"};
! 
!     public ScriptScanner() {
!         super("",ENDERS);
!     }
! 
!     public ScriptScanner(String filter) {
!         super(filter,ENDERS);
!     }
! 
!     public String [] getID() {
!         return MATCH_NAME;
!     }
! 
!     public Tag createTag(Page page, int start, int end, Vector attributes, Tag startTag, Tag endTag, NodeList children) throws ParserException
      {
-         ScriptTag ret;
- 
-         ret = new ScriptTag ();
-         ret.setPage (page);
-         ret.setStartPosition (start);
-         ret.setEndPosition (end);
-         ret.setAttributesEx (attributes);
-         ret.setEndTag (endTag);
-         ret.setChildren (children);
- 
-         return (ret);
      }

--- 51,59 ----
          CompositeTagScanner
  {
!     /**
!      * Create a script scanner.
!      */
!     public ScriptScanner()
      {
      }

***************
*** 88,95 ****
       * Accumulates nodes returned from the lexer, until &lt;/SCRIPT&gt;,
       * &lt;BODY&gt; or &lt;HTML&gt; is encountered. Replaces the node factory
!      * in the lexer with a new Parser to avoid other scanners missing their 
!      * end tags and accumulating even the &lt;/SCRIPT&gt;.
       */
!     public Tag scan (Tag tag, String url, Lexer lexer)
          throws ParserException
      {
--- 62,72 ----
       * Accumulates nodes returned from the lexer, until &lt;/SCRIPT&gt;,
       * &lt;BODY&gt; or &lt;HTML&gt; is encountered. Replaces the node factory
!      * in the lexer with a new (empty) one to avoid other scanners missing their 
!      * end tags and accumulating even the &lt;/SCRIPT&gt; tag.
!      * @param tag The tag this scanner is responsible for.
!      * @param lexer The source of subsequent nodes.
!      * @param stack The parse stack, <em>not used</em>.
       */
!     public Tag scan (Tag tag, Lexer lexer, NodeList stack)
          throws ParserException
      {
***************
*** 118,122 ****
                      if (node instanceof Tag)
                          if (   ((Tag)node).isEndTag ()
!                             && ((Tag)node).getTagName ().equals (MATCH_NAME[0]))
                          {
                              end = (Tag)node;
--- 95,99 ----
                      if (node instanceof Tag)
                          if (   ((Tag)node).isEndTag ()
!                             && ((Tag)node).getTagName ().equals (tag.getIds ()[0]))
                          {
                              end = (Tag)node;
***************
*** 181,194 ****

          return (ret);
-     }
- 
-     /**
-      * Gets the end tag that the scanner uses to stop scanning. Subclasses of
-      * <code>ScriptScanner</code> you should override this method.
-      * @return String containing the end tag to search for, i.e. &lt;/SCRIPT&gt;
-      */
-     public String getEndTag()
-     {
-         return SCRIPT_END_TAG;
      }
  }
--- 158,161 ----

Index: TagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/TagScanner.java,v
retrieving revision 1.52
retrieving revision 1.53
diff -C2 -d -r1.52 -r1.53
*** TagScanner.java	8 Dec 2003 13:13:59 -0000	1.52
--- TagScanner.java	20 Dec 2003 23:47:55 -0000	1.53
***************
*** 1,4 ****
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
! // Copyright (C) Dec 31, 2000 Somik Raha
  //
  // This library is free software; you can redistribute it and/or
--- 1,12 ----
! // HTMLParser Library $Name$ - A java-based parser for HTML
! // http://sourceforge.org/projects/htmlparser
! // Copyright (C) 2003 Somik Raha
! //
! // Revision Control Information
! //
! // $Source$
! // $Author$
! // $Date$
! // $Revision$
  //
  // This library is free software; you can redistribute it and/or
***************
*** 9,147 ****
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
! //
! // For any questions or suggestions, you can write to me at :
! // Email :so...@in...
  //
- // Postal Address :
- // Somik Raha
- // Extreme Programmer & Coach
- // Industrial Logic Corporation
- // 2583 Cedar Street, Berkeley,
- // CA 94708, USA
- // Website : http://www.industriallogic.com

  package org.htmlparser.scanners;
! //////////////////
! // Java Imports //
! //////////////////
  import java.io.Serializable;
- import java.util.Hashtable;
- import java.util.Map;
- import java.util.Vector;

- import org.htmlparser.AbstractNode;
- import org.htmlparser.Node;
- import org.htmlparser.Parser;
- import org.htmlparser.StringNode;
  import org.htmlparser.lexer.Lexer;
- import org.htmlparser.lexer.Page;
- import org.htmlparser.lexer.nodes.Attribute;
  import org.htmlparser.tags.Tag;
! import org.htmlparser.util.NodeIterator;
  import org.htmlparser.util.ParserException;
- import org.htmlparser.util.ParserFeedback;

  /**
!  * TagScanner is an abstract superclass which is subclassed to create specific
!  * scanners.
!  * This isn't much use other than creating a specific tag type since scanning
!  * is mostly done by the lexer level. If you want to match end tags and 
!  * handle special syntax between tags, then you'll probably want to subclass
!  * {@link CompositeTagScanner} instead. Use TagScanner when you have meta task
!  * to do like setting the BASE url for the page when a BASE tag is encountered.
!  * <br>
!  * If you wish to write your own scanner, then you must implement scan().
!  * You MAY implement evaluate() as well, if your evaluation logic is not based
!  * on a match of the tag name.
!  * You MUST implement getID() - which identifies your scanner uniquely in the hashtable of scanners.
!  *
!  * <br>
!  * Also, you have a feedback object provided to you, should you want to send log messages. This object is
!  * instantiated by Parser when a scanner is added to its collection.
!  *
   */
  public class TagScanner
      implements
          Serializable
  {
      /**
!      * A filter which is used to associate this tag. The filter contains a string
!      * that is used to match which tags are to be allowed to pass through. This can
!      * be useful when one wishes to dynamically filter out all tags except one type
!      * which may be programmed later than the parser. Is also useful for command line
!      * implementations of the parser.
!      */
!     protected String filter;
!     
!     /**
!      * Default Constructor, automatically registers the scanner into a static array of
!      * scanners inside Tag
       */
      public TagScanner ()
      {
-         this ("");
      }

      /**
!      * This constructor automatically registers the scanner, and sets the filter for this
!      * tag.
!      * @param filter The filter which will allow this tag to pass through.
!      */
!     public TagScanner (String filter)
!     {
!         this.filter=filter;
!     }
! 
!     /**
!      * This method is used to decide if this scanner can handle this tag type. If the
!      * evaluation returns true, the calling side makes a call to scan().
!      * <strong>This method has to be implemented meaningfully only if a first-word match with
!      * the scanner id does not imply a match (or extra processing needs to be done).
!      * Default returns true</strong>
!      * @param tag The tag with a name that matches a value from {@link #getID}.
!      * @param previousOpenScanner Indicates any previous scanner which hasn't
!      * completed, before the current scan has begun, and hence allows us to
!      * write scanners that can work with dirty html.
!      */
!     public boolean evaluate (Tag tag, TagScanner previousOpenScanner)
!     {
!         return (true);
!     }
!     
!     public String getFilter()
!     {
!         return filter;
!     }
! 
!     /**
!      * Scan the tag and extract the information related to this type. The url of the
!      * initiating scan has to be provided in case relative links are found. The initial
!      * url is then prepended to it to give an absolute link.
!      * The Lexer is provided in order to do a lookahead operation. We assume that
!      * the identification has already been performed using the evaluate() method.
!      * @param tag HTML Tag to be scanned for identification.
!      * @param url The initiating url of the scan (Where the html page lies).
       * @param lexer Provides html page access.
       * @return The resultant tag (may be unchanged).
       */
!     public Tag scan (Tag tag, String url, Lexer lexer) throws ParserException
      {
!         Tag ret;
!         
!         ret = tag;
!         ret.doSemanticAction ();
! 
!         return (ret);
!     }

!     public String [] getID ()
!     {
!         return (new String[0]);
      }
  }
--- 17,73 ----
  // This library is distributed in the hope that it will be useful,
  // but WITHOUT ANY WARRANTY; without even the implied warranty of
! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
  //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  //

  package org.htmlparser.scanners;
! 
  import java.io.Serializable;

  import org.htmlparser.lexer.Lexer;
  import org.htmlparser.tags.Tag;
! import org.htmlparser.util.NodeList;
  import org.htmlparser.util.ParserException;

  /**
!  * TagScanner is an abstract superclass, subclassed to create specific scanners.
!  * When asked to scan the tag, this class does nothing other than perform the
!  * tag's semantic action.
!  * Use TagScanner when you have a meta task to do like setting the BASE url for
!  * the page when a BASE tag is encountered.
!  * If you want to match end tags and handle special syntax between tags,
!  * then you'll probably want to subclass {@link CompositeTagScanner} instead.
   */
  public class TagScanner
      implements
+         Scanner,
          Serializable
  {
      /**
!      * Create a (non-composite) tag scanner.
       */
      public TagScanner ()
      {
      }

      /**
!      * Scan the tag.
!      * For this implementation, the only operation is to perform the tag's
!      * semantic action.
!      * @param tag The tag to scan.
       * @param lexer Provides html page access.
+      * @param stack The parse stack. May contain pending tags that enclose
+      * this tag.
       * @return The resultant tag (may be unchanged).
       */
!     public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException
      {
!         tag.doSemanticAction ();

!         return (tag);
      }
  }
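
The revised TagScanner above changes scan() to take the tag, the lexer, and a stack of pending tags, and does nothing beyond performing the tag's semantic action. A minimal sketch of that shape follows. It assumes hypothetical stand-in types (SketchTag, SketchScanner, SimpleScanner) rather than the library's own Tag, Lexer, NodeList, and Scanner, and it omits the lexer argument entirely.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanSignatureSketch {
    /** Hypothetical stand-in for org.htmlparser.tags.Tag; only what the sketch needs. */
    static class SketchTag {
        final String name;
        boolean actionDone = false;

        SketchTag(String name) {
            this.name = name;
        }

        /** Records that the semantic action ran; the real tag would do actual work here. */
        void doSemanticAction() {
            actionDone = true;
        }
    }

    /** Hypothetical stand-in for the Scanner interface; the lexer parameter is left out. */
    interface SketchScanner {
        SketchTag scan(SketchTag tag, List<SketchTag> stack);
    }

    /** Mirrors the non-composite behavior in the diff: run the action, return the same tag. */
    static class SimpleScanner implements SketchScanner {
        public SketchTag scan(SketchTag tag, List<SketchTag> stack) {
            tag.doSemanticAction();
            return tag;
        }
    }

    public static void main(String[] args) {
        SketchTag base = new SketchTag("BASE");
        SketchTag result = new SimpleScanner().scan(base, new ArrayList<>());
        System.out.println(result == base);    // prints "true"
        System.out.println(result.actionDone); // prints "true"
    }
}
```

The stack is unused here, as the ScriptScanner javadoc in this commit also notes for its own scan().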

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/package.html,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** package.html	8 Dec 2003 01:31:52 -0000	1.18
--- package.html	20 Dec 2003 23:47:55 -0000	1.19
***************
*** 3,54 ****
  <head>
  <!--

!   @(#)package.html  1.60 98/01/27
! 
!  HTMLParser Library v1_4_20031207 - A java-based parser for HTML
!  Copyright (C) Dec 31, 2000 Somik Raha
! 
!  This library is free software; you can redistribute it and/or
!  modify it under the terms of the GNU Lesser General Public
!  License as published by the Free Software Foundation; either
!  version 2.1 of the License, or (at your option) any later version.
! 
!  This library is distributed in the hope that it will be useful,
!  but WITHOUT ANY WARRANTY; without even the implied warranty of
!  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
!  Lesser General Public License for more details.
! 
!  You should have received a copy of the GNU Lesser General Public
!  License along with this library; if not, write to the Free Software
!  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
! 
!  For any questions or suggestions, you can write to me at :
!  Email :so...@in...
! 
!  Postal Address :
!  Somik Raha
!  Extreme Programmer & Coach
!  Industrial Logic Corporation
!  2583 Cedar Street, Berkeley,
!  CA 94708, USA
!  Website : http://www.industriallogic.com

  -->
  </head>
  <body bgcolor="white">
! The scanners package contains scanners that can be fired automatically upon the identification of tags.
! Developers should familiarize themselves with this package, as extension to this framework will be mostly in the form of
! addition of custom scanners.
! 
! 
! <h2>Related Documentation</h2>
! 
! For overviews, tutorials, examples, guides, and tool documentation, please see:
! <ul>
!   <li><a href="http://htmlparser.sourceforge.net">HTML Parser Home Page</a>
! </ul>
! 
! <!-- Put @see and @since tags down here. -->
! 
  </body>
  </html>
--- 3,51 ----
  <head>
  <!--
+ HTMLParser Library $Name$ - A java-based parser for HTML
+ http://sourceforge.org/projects/htmlparser
+ Copyright (C) 2003 Somik Raha

! Revision Control Information

+ $Source$
+ $Author$
+ $Date$
+ $Revision$
+ //
+ This library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+ //
+ This library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+ //
+ You should have received a copy of the GNU Lesser General Public
+ License along with this library; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ //
  -->
  </head>
  <body bgcolor="white">
! The scanners package contains classes responsible for the tertiary
! identification of tags. The lower level classes in the {@link
! org.htmlparser.lexer.Lexer lexer} package convert
! byte streams to characters and characters to nodes (via the {@link
! org.htmlparser.lexer.nodes.NodeFactory NodeFactory}). In the case of tags, the
! scanners in this package can then complete the tag or override the current tag
! and return an augmented tag. The existing implementation of the {@link
! org.htmlparser.scanners.CompositeTagScanner composite tag
! scanner}, for example, gathers the children of composite tags, identifying the
! nested structure of HTML documents. The {@link
! org.htmlparser.scanners.ScriptScanner script scanner} overrides the nodes
! returned by the lexer and creates a tag containing a single string that is the
! script code.<br>
! You might need to create a scanner (that implements the {@link Scanner Scanner} interface) if
! the text you are trying to parse doesn't look like HTML, as is the case for the
! script scanner, or the normal processing of tags by nesting their structure is
! inadequate.
  </body>
  </html>
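
The log message for this commit describes closing tags by walking up the stack when an end tag does not match the innermost open tag. The sketch below is one possible reading of that idea, not the CompositeTagScanner code itself. It assumes a plain deque of tag names in place of the library's NodeList of pending tags, and the class and method names (StackWalkSketch, closeUpTo) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StackWalkSketch {
    /**
     * Resolves an end tag against the stack of open tags.
     * If some enclosing tag has the given name, every tag from the top of the
     * stack down to and including that tag is popped and returned, innermost first.
     * If no open tag matches, the stack is left alone and the result is empty.
     */
    static List<String> closeUpTo(Deque<String> open, String name) {
        List<String> closed = new ArrayList<>();
        if (!open.contains(name)) {
            return closed; // stray end tag: nothing on the stack to close
        }
        while (!open.isEmpty()) {
            String top = open.pop();
            closed.add(top);
            if (top.equals(name)) {
                break; // reached the tag the end tag actually belongs to
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        Deque<String> open = new ArrayDeque<>();
        open.push("TABLE");
        open.push("TR");
        open.push("TD");
        // A </TR> arrives while TD is still open: close TD implicitly, then TR.
        System.out.println(closeUpTo(open, "TR")); // prints "[TD, TR]"
        System.out.println(open);                  // prints "[TABLE]"
    }
}
```

Checking for a match before popping keeps a stray end tag from emptying the stack, which is the failure mode one would want to avoid on badly nested pages.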
