Thread: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners Scanner.java,NONE,1.1 CompositeTagScanner.ja
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv12747/org/htmlparser/scanners

Modified Files:
    CompositeTagScanner.java JspScanner.java ScriptScanner.java
    TagScanner.java package.html
Added Files:
    Scanner.java
Log Message:
Reduce recursion on the JVM stack in CompositeTagScanner.
Pass a stack of open tags to the scanner.
Add smarter tag closing by walking up the stack on encountering an unopened end tag.
Avoids a problem with bad HTML such as that found at http://scores.nba.com/games/20031029/scoreboard.html by Shaun Roach.
Added testInvalidNesting to CompositeTagScanner Test based on the above.

--- NEW FILE: Scanner.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/Scanner.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/20 23:47:55 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
// // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.scanners; import org.htmlparser.lexer.Lexer; import org.htmlparser.tags.Tag; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; /** * Generic interface for scanning. * Tags needing specialized operations can provide an object that implements * this interface via getThisScanner(). * By default non-composite tags simply perform the semantic action and * return while composite tags will gather their children. */ public interface Scanner { /** * Scan the tag. * The Lexer is provided in order to do a lookahead operation. * @param tag HTML tag to be scanned for identification. * @param lexer Provides html page access. * @param stack The parse stack. May contain pending tags that enclose * this tag. Nodes on the stack should be considered incomplete. * @return The resultant tag (may be unchanged). * @exception ParserException if an unrecoverable problem occurs. */ public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException; } Index: CompositeTagScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v retrieving revision 1.83 retrieving revision 1.84 diff -C2 -d -r1.83 -r1.84 *** CompositeTagScanner.java 8 Dec 2003 13:13:59 -0000 1.83 --- CompositeTagScanner.java 20 Dec 2003 23:47:55 -0000 1.84 *************** *** 1,4 **** ! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML ! // Copyright (C) Dec 31, 2000 Somik Raha // // This library is free software; you can redistribute it and/or --- 1,12 ---- ! // HTMLParser Library $Name$ - A java-based parser for HTML ! // http://sourceforge.org/projects/htmlparser ! // Copyright (C) 2003 Somik Raha ! // ! 
// Revision Control Information ! // ! // $Source$ ! // $Author$ ! // $Date$ ! // $Revision$ // // This library is free software; you can redistribute it and/or *************** *** 9,29 **** // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ! // ! // For any questions or suggestions, you can write to me at : ! // Email :so...@in... // - // Postal Address : - // Somik Raha - // Extreme Programmer & Coach - // Industrial Logic Corporation - // 2583 Cedar Street, Berkeley, - // CA 94708, USA - // Website : http://www.industriallogic.com package org.htmlparser.scanners; --- 17,27 ---- // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.scanners; *************** *** 37,40 **** --- 35,39 ---- import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.Attribute; + import org.htmlparser.scanners.Scanner; import org.htmlparser.tags.CompositeTag; import org.htmlparser.tags.Tag; *************** *** 43,167 **** /** ! * To create your own scanner that can create tags tht hold children, create a subclass of this class. ! * The composite tag scanner can be configured with:<br> ! * <ul> ! 
* <li>Tags which will trigger a match</li> ! * <li>Tags which when encountered before a legal end tag, should force a correction</li> ! * </ul> ! * Here are examples of each:<BR> ! * <B>Tags which will trigger a match</B> ! * If we wish to recognize <mytag>, ! * <pre> ! * MyScanner extends CompositeTagScanner { ! * private static final String [] MATCH_IDS = { "MYTAG" }; ! * MyScanner() { ! * super(MATCH_IDS); ! * } ! * ... ! * } ! * </pre> ! * <B>Tags which force correction</B> ! * If we wish to insert end tags if we get a </BODY> or </HTML> without recieving ! * </mytag> ! * <pre> ! * MyScanner extends CompositeTagScanner { ! * private static final String [] MATCH_IDS = { "MYTAG" }; ! * private static final String [] ENDERS = {}; ! * private static final String [] END_TAG_ENDERS = { "BODY", "HTML" }; ! * MyScanner() { ! * super(MATCH_IDS, ENDERS, END_TAG_ENDERS, true); ! * } ! * ... ! * } ! * </pre> ! * <B>Preventing children of same type</B> ! * This is useful when you know that a certain tag can never hold children of its own type. ! * e.g. <FORM> can never have more form tags within it. If it does, it is an error and should ! * be corrected. Specify the tagEnders set to contain (at least) the match ids. ! * <pre> ! * MyScanner extends CompositeTagScanner { ! * private static final String [] MATCH_IDS = { "FORM" }; ! * private static final String [] END_TAG_ENDERS = { "BODY", "HTML" }; ! * MyScanner() { ! * super(MATCH_IDS, MATCH_IDS, END_TAG_ENDERS, false); ! * } ! * ... ! * } ! * </pre> ! * Inside the scanner, use createTag() to specify what tag needs to be created. */ public class CompositeTagScanner extends TagScanner { ! protected Set tagEnderSet; ! private Set endTagEnderSet; ! private boolean balance_quotes; ! ! public CompositeTagScanner() ! { ! this(new String[] {}); ! } ! ! public CompositeTagScanner(String [] tagEnders) ! { ! this("",tagEnders); ! } ! ! public CompositeTagScanner(String filter) ! { ! this(filter,new String [] {}); ! } ! ! 
public CompositeTagScanner( ! String filter, ! String [] tagEnders) ! { ! this(filter,tagEnders,new String[] {}); ! } ! public CompositeTagScanner( ! String filter, ! String [] tagEnders, ! String [] endTagEnders) ! { ! this(filter,tagEnders,endTagEnders, false); ! } ! /** ! * Constructor specifying all member fields. ! * @param filter A string that is used to match which tags are to be allowed ! * to pass through. This can be useful when one wishes to dynamically filter ! * out all tags except one type which may be programmed later than the parser. ! * @param tagEnders The non-endtag tag names which signal that no closing ! * end tag was found. For example, encountering <FORM> while ! * scanning a <A> link tag would mean that no </A> was found ! * and needs to be corrected. ! * @param endTagEnders The endtag names which signal that no closing end ! * tag was found. For example, encountering </HTML> while ! * scanning a <BODY> tag would mean that no </BODY> was found ! * and needs to be corrected. These items are not prefixed by a '/'. ! * @param balance_quotes <code>true</code> if scanning string nodes needs to ! * honour quotes. For example, ScriptScanner defines this <code>true</code> ! * so that text within <SCRIPT></SCRIPT> ignores tag-like text ! * within quotes. ! */ ! public CompositeTagScanner( ! String filter, ! String [] tagEnders, ! String [] endTagEnders, ! boolean balance_quotes) { - super(filter); - this.balance_quotes = balance_quotes; - this.tagEnderSet = new HashSet(); - for (int i=0;i<tagEnders.length;i++) - tagEnderSet.add(tagEnders[i]); - this.endTagEnderSet = new HashSet(); - for (int i=0;i<endTagEnders.length;i++) - endTagEnderSet.add(endTagEnders[i]); } /** * Collect the children. ! * An initial test is performed for an empty XML tag, in which case * the start tag and end tag of the returned tag are the same and it has * no children.<p> --- 42,78 ---- /** ! * The main scanning logic for nested tags. ! 
* When asked to scan, this class gathers nodes into a heirarchy of tags. */ public class CompositeTagScanner extends TagScanner { ! /** ! * Determine whether to use JVM or NodeList stack. ! * This can be set to true to get the original behaviour of ! * recursion into composite tags on the JVM stack. ! * This may lead to StackOverFlowException problems in some cases ! * i.e. Windows. ! */ ! private static final boolean mUseJVMStack = false; ! /** ! * Determine whether unexpected end tags should cause stack roll-up. ! * This can be set to true to get the original behaviour of gathering ! * end tags into whatever tag is open. ! * This can be expensive, but should only be needed in the presence of ! * bad HTML. ! */ ! private static final boolean mLeaveEnds = false; ! /** ! * Create a composite tag scanner. ! */ ! public CompositeTagScanner () { } /** * Collect the children. ! * <p>An initial test is performed for an empty XML tag, in which case * the start tag and end tag of the returned tag are the same and it has * no children.<p> *************** *** 171,221 **** * In the latter case, a virtual end tag is created. * Each node found that is not the end tag is added to ! * the list of children.<p> ! * The scanner's {@link #createTag} method is called with details about ! * the start tag, end tag and children. The attributes from the start tag ! * will wind up duplicated in the newly created tag, so the start tag is ! * kind of redundant (and may be removed in subsequent refactoring). ! * @param tag The tag this scanner is responsible for. This will be the ! * start (and possibly end) tag passed to {@link #createTag}. ! * @param url The url for the page the tag is discovered on. * @param lexer The source of subsequent nodes. ! * @return The scanner specific tag from the call to {@link #createTag}. */ ! public Tag scan (Tag tag, String url, Lexer lexer) throws ParserException { Node node; ! NodeList nodeList; ! Tag endTag; ! String match; String name; ! 
TagScanner scanner; CompositeTag ret; ! nodeList = new NodeList (); ! endTag = null; ! match = tag.getTagName (); ! if (tag.isEmptyXmlTag ()) ! endTag = tag; else do { ! node = lexer.nextNode (balance_quotes); if (null != node) { if (node instanceof Tag) { ! Tag next = (Tag)node; name = next.getTagName (); // check for normal end tag ! if (next.isEndTag () && name.equals (match)) { ! endTag = next; node = null; } ! else if (isTagToBeEndedFor (tag, next)) // check DTD { ! // insert a virtual end tag and backup one node ! endTag = createVirtualEndTag (tag, lexer.getPage (), next.getStartPosition ()); lexer.setPosition (next.getStartPosition ()); node = null; --- 82,131 ---- * In the latter case, a virtual end tag is created. * Each node found that is not the end tag is added to ! * the list of children. The end tag is special and not a child.<p> ! * Nodes that also have a CompositeTagScanner as their scanner are ! * recursed into, which provides the nested structure of an HTML page. ! * This method operates in two possible modes, depending on a private boolean. ! * It can recurse on the JVM stack, which has caused some overflow problems ! * in the past, or it can use the supplied stack argument to nest scanning ! * of child tags within itself. The former is left as an option in the code, ! * mostly to help subsequent modifiers visualize what the internal nesting ! * is doing. ! * @param tag The tag this scanner is responsible for. * @param lexer The source of subsequent nodes. ! * @param stack The parse stack. May contain pending tags that enclose ! * this tag. ! * @return The resultant tag (may be unchanged). */ ! public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException { Node node; ! Tag next; String name; ! Scanner scanner; CompositeTag ret; ! ret = (CompositeTag)tag; ! if (ret.isEmptyXmlTag ()) ! ret.setEndTag (ret); else do { ! node = lexer.nextNode (false); if (null != node) { if (node instanceof Tag) { ! 
next = (Tag)node; name = next.getTagName (); // check for normal end tag ! if (next.isEndTag () && name.equals (ret.getTagName ())) { ! ret.setEndTag (next); node = null; } ! else if (isTagToBeEndedFor (ret, next)) // check DTD { ! // backup one node. insert a virtual end tag later lexer.setPosition (next.getStartPosition ()); node = null; *************** *** 225,249 **** // now recurse if there is a scanner for this type of tag scanner = next.getThisScanner (); ! if ((null != scanner) && scanner.evaluate (next, null)) ! node = scanner.scan (next, lexer.getPage ().getUrl (), lexer); } } ! if (null != node) ! nodeList.add (node); } } while (null != node); ! if (null == endTag) ! endTag = createVirtualEndTag (tag, lexer.getPage (), lexer.getCursor ().getPosition ()); ! ! ret = (CompositeTag)tag; ! ret.setEndTag (endTag); ! ret.setChildren (nodeList); ! for (int i = 0; i < ret.getChildCount (); i++) ! ret.childAt (i).setParent (ret); ! endTag.setParent (ret); ! ret.doSemanticAction (); return (ret); --- 135,275 ---- // now recurse if there is a scanner for this type of tag scanner = next.getThisScanner (); ! if (null != scanner) ! { ! if (mUseJVMStack) ! { // JVM stack recursion ! node = scanner.scan (next, lexer, stack); ! addChild (ret, node); ! } ! else ! { ! // fake recursion: ! if ((scanner == this) && (next instanceof CompositeTag)) ! { ! CompositeTag ondeck = (CompositeTag)next; ! if (ondeck.isEmptyXmlTag ()) ! { ! ondeck.setEndTag (ondeck); ! finishTag (ondeck, lexer); ! addChild (ret, ondeck); ! } ! else ! { ! stack.add (ret); ! ret = ondeck; ! } ! } ! else ! { // normal recursion if switching scanners ! node = scanner.scan (next, lexer, stack); ! addChild (ret, node); ! } ! } ! } ! else ! addChild (ret, next); ! } ! else ! { ! if (!mUseJVMStack && !mLeaveEnds) ! { ! // Since all non-end tags are consumed by the ! // previous clause, we're here because we have an ! // end tag with no opening tag... this could be bad. ! // There are two cases... ! 
// 1) The tag hasn't been registered, in which case ! // we just add it as a simple child, like it's ! // opening tag ! // 2) There may be an opening tag further up the ! // parse stack that needs closing. ! // So, we ask the factory for a node like this one ! // (since end tags never have scanners) and see ! // if it's scanner is a composite tag scanner. ! // If it is we walk up the parse stack looking for ! // something that needs this end tag to finish it. ! // If there is something, we close off all the tags ! // walked over and continue on as if nothing ! // happened. ! Vector attributes = new Vector (); ! attributes.addElement (new Attribute (name, null)); ! Tag opener = (Tag)lexer.getNodeFactory ().createTagNode ( ! next.getPage (), next.getStartPosition (), next.getEndPosition (), ! attributes); ! ! scanner = opener.getThisScanner (); ! if ((null != scanner) && (scanner == this)) ! { ! // uh-oh ! int index = -1; ! for (int i = stack.size () - 1; (-1 == index) && (i >= 0); i--) ! { ! // short circuit here... assume everything on the stack is a CompositeTag and has this as it's scanner ! // we'll need to stop if either of those conditions isn't met ! CompositeTag boffo = (CompositeTag)stack.elementAt (i); ! if (name.equals (boffo.getTagName ())) ! index = i; ! else if (isTagToBeEndedFor (boffo, next)) // check DTD ! index = i; ! } ! if (-1 != index) ! { ! // finish off the current one first ! finishTag (ret, lexer); ! addChild ((CompositeTag)stack.elementAt (stack.size () - 1), ret); ! for (int i = stack.size () - 1; i > index; i--) ! { ! CompositeTag fred = (CompositeTag)stack.remove (i); ! finishTag (fred, lexer); ! addChild ((CompositeTag)stack.elementAt (i - 1), fred); ! } ! ret = (CompositeTag)stack.remove (index); ! node = null; ! } ! else ! addChild (ret, next); // default behaviour ! } ! else ! addChild (ret, next); // default behaviour ! } ! else ! addChild (ret, next); } } + else + addChild (ret, node); + } ! if (!mUseJVMStack) ! { ! 
// handle coming out of fake recursion ! if (null == node) ! { ! int depth = stack.size (); ! if (0 != depth) ! { ! node = stack.elementAt (depth - 1); ! if (node instanceof CompositeTag) ! { ! CompositeTag precursor = (CompositeTag)node; ! scanner = precursor.getThisScanner (); ! if (scanner == this) ! { ! stack.remove (depth - 1); ! finishTag (ret, lexer); ! addChild (precursor, ret); ! ret = precursor; ! } ! else ! node = null; // normal recursion ! } ! else ! node = null; // normal recursion ! } ! } } } while (null != node); ! finishTag (ret, lexer); return (ret); *************** *** 251,264 **** /** * Creates an end tag with the same name as the given tag. - * NOTE: This does not call the {@link #createTag} method, but may in the - * future after refactoring. * @param tag The tag to end. * @param page The page the tag is on (virtually). * @param position The offset into the page at which the tag is to * be anchored. ! * @return An end tag with the name "/" + tag.getTagName() and a start ! * and end position at the given position. The fact these are equal may ! * be used to distinguish it as a virtual tag. */ protected Tag createVirtualEndTag (Tag tag, Page page, int position) --- 277,319 ---- /** + * Add a child to the given tag. + * @param parent The parent tag. + * @param child The child node. + */ + protected void addChild (Tag parent, Node child) + { + if (null == parent.getChildren ()) + parent.setChildren (new NodeList ()); + child.setParent (parent); + parent.getChildren ().add (child); + } + + /** + * Finish off a tag. + * Perhap add a virtual end tag. + * Set the end tag parent as this tag. + * Perform the semantic acton. + * @param tag The tag to finish off. + * @param lexer A lexer positioned at the end of the tag. 
+ */ + protected void finishTag (CompositeTag tag, Lexer lexer) + throws + ParserException + { + if (null == tag.getEndTag ()) + tag.setEndTag (createVirtualEndTag (tag, lexer.getPage (), lexer.getCursor ().getPosition ())); + tag.getEndTag ().setParent (tag); + tag.doSemanticAction (); + } + + /** * Creates an end tag with the same name as the given tag. * @param tag The tag to end. * @param page The page the tag is on (virtually). * @param position The offset into the page at which the tag is to * be anchored. ! * @return An end tag with the name '"/" + tag.getTagName()' and a start ! * and end position at the given position. The fact these positions are ! * equal may be used to distinguish it as a virtual tag later on. */ protected Tag createVirtualEndTag (Tag tag, Page page, int position) *************** *** 277,287 **** /** ! * For composite tags this shouldn't be used and hence throws an exception. */ - public Tag createTag (Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException - { - throw new ParserException ("composite tags shouldn't be using this"); - } - public final boolean isTagToBeEndedFor (Tag current, Tag tag) { --- 332,344 ---- /** ! * Determine if the current tag should be terminated by the given tag. ! * Examines the 'enders' or 'end tag enders' lists of the current tag ! * for a match with the given tag. Which list is chosen depends on whether ! * tag is an end tag ('end tag enders') or not ('enders'). ! * @param current The tag that might need to be ended. ! * @param tag The candidate tag that might end the current one. ! * @return <code>true</code> if the name of the given tag is a member of ! * the appropriate list. 
*/ public final boolean isTagToBeEndedFor (Tag current, Tag tag) { Index: JspScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/JspScanner.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** JspScanner.java 8 Dec 2003 01:31:52 -0000 1.33 --- JspScanner.java 20 Dec 2003 23:47:55 -0000 1.34 *************** *** 1,4 **** ! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML ! // Copyright (C) Dec 31, 2000 Somik Raha // // This library is free software; you can redistribute it and/or --- 1,12 ---- ! // HTMLParser Library $Name$ - A java-based parser for HTML ! // http://sourceforge.org/projects/htmlparser ! // Copyright (C) 2003 Somik Raha ! // ! // Revision Control Information ! // ! // $Source$ ! // $Author$ ! // $Date$ ! // $Revision$ // // This library is free software; you can redistribute it and/or *************** *** 9,71 **** // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ! // ! // For any questions or suggestions, you can write to me at : ! // Email :so...@in... // - // Postal Address : - // Somik Raha - // Extreme Programmer & Coach - // Industrial Logic Corporation - // 2583 Cedar Street, Berkeley, - // CA 94708, USA - // Website : http://www.industriallogic.com - package org.htmlparser.scanners; ! import java.util.Vector; ! import org.htmlparser.lexer.Page; ! ///////////////////////// ! // HTML Parser Imports // ! ///////////////////////// ! import org.htmlparser.tags.JspTag; ! 
import org.htmlparser.tags.Tag; ! import org.htmlparser.util.ParserException; ! ! public class JspScanner extends TagScanner { ! ! public JspScanner() { ! super(); ! } ! ! public JspScanner(String filter) { ! super(filter); ! } ! ! public String [] getID() { ! String [] ids = new String[3]; ! ids[0] = "%"; ! ids[1] = "%="; ! ids[2] = "%@"; ! return ids; ! } ! ! public Tag createTag (Page page, int start, int end, Vector attributes, Tag tag, String url) throws ParserException { - JspTag ret; - - ret = new JspTag (); - ret.setPage (page); - ret.setStartPosition (start); - ret.setEndPosition (end); - ret.setAttributesEx (attributes); - - return (ret); } } --- 17,41 ---- // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.scanners; ! /** ! * Placeholder for <em>yet to be written</em> scanner for JSP tags. ! * This vacuous class does nothing special at the moment. ! */ ! public class JspScanner extends TagScanner ! { ! /** ! * Create a new JspScanner. ! */ ! public JspScanner () { } } Index: ScriptScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v retrieving revision 1.53 retrieving revision 1.54 diff -C2 -d -r1.53 -r1.54 *** ScriptScanner.java 8 Dec 2003 01:31:52 -0000 1.53 --- ScriptScanner.java 20 Dec 2003 23:47:55 -0000 1.54 *************** *** 1,4 **** ! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML ! 
// Copyright (C) Dec 31, 2000 Somik Raha // // This library is free software; you can redistribute it and/or --- 1,12 ---- ! // HTMLParser Library $Name$ - A java-based parser for HTML ! // http://sourceforge.org/projects/htmlparser ! // Copyright (C) 2003 Somik Raha ! // ! // Revision Control Information ! // ! // $Source$ ! // $Author$ ! // $Date$ ! // $Revision$ // // This library is free software; you can redistribute it and/or *************** *** 9,29 **** // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ! // ! // For any questions or suggestions, you can write to me at : ! // Email :so...@in... // - // Postal Address : - // Somik Raha - // Extreme Programmer & Coach - // Industrial Logic Corporation - // 2583 Cedar Street, Berkeley, - // CA 94708, USA - // Website : http://www.industriallogic.com package org.htmlparser.scanners; --- 17,27 ---- // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.scanners; *************** *** 53,85 **** CompositeTagScanner { ! private static final String SCRIPT_END_TAG = "</SCRIPT>"; ! private static final String MATCH_NAME [] = {"SCRIPT"}; ! 
private static final String ENDERS [] = {"BODY", "HTML"}; ! ! public ScriptScanner() { ! super("",ENDERS); ! } ! ! public ScriptScanner(String filter) { ! super(filter,ENDERS); ! } ! ! public String [] getID() { ! return MATCH_NAME; ! } ! ! public Tag createTag(Page page, int start, int end, Vector attributes, Tag startTag, Tag endTag, NodeList children) throws ParserException { - ScriptTag ret; - - ret = new ScriptTag (); - ret.setPage (page); - ret.setStartPosition (start); - ret.setEndPosition (end); - ret.setAttributesEx (attributes); - ret.setEndTag (endTag); - ret.setChildren (children); - - return (ret); } --- 51,59 ---- CompositeTagScanner { ! /** ! * Create a script scanner. ! */ ! public ScriptScanner() { } *************** *** 88,95 **** * Accumulates nodes returned from the lexer, until </SCRIPT>, * <BODY> or <HTML> is encountered. Replaces the node factory ! * in the lexer with a new Parser to avoid other scanners missing their ! * end tags and accumulating even the </SCRIPT>. */ ! public Tag scan (Tag tag, String url, Lexer lexer) throws ParserException { --- 62,72 ---- * Accumulates nodes returned from the lexer, until </SCRIPT>, * <BODY> or <HTML> is encountered. Replaces the node factory ! * in the lexer with a new (empty) one to avoid other scanners missing their ! * end tags and accumulating even the </SCRIPT> tag. ! * @param tag The tag this scanner is responsible for. ! * @param lexer The source of subsequent nodes. ! * @param stack The parse stack, <em>not used</em>. */ ! public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException { *************** *** 118,122 **** if (node instanceof Tag) if ( ((Tag)node).isEndTag () ! && ((Tag)node).getTagName ().equals (MATCH_NAME[0])) { end = (Tag)node; --- 95,99 ---- if (node instanceof Tag) if ( ((Tag)node).isEndTag () ! 
&& ((Tag)node).getTagName ().equals (tag.getIds ()[0])) { end = (Tag)node; *************** *** 181,194 **** return (ret); - } - - /** - * Gets the end tag that the scanner uses to stop scanning. Subclasses of - * <code>ScriptScanner</code> you should override this method. - * @return String containing the end tag to search for, i.e. </SCRIPT> - */ - public String getEndTag() - { - return SCRIPT_END_TAG; } } --- 158,161 ---- Index: TagScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/TagScanner.java,v retrieving revision 1.52 retrieving revision 1.53 diff -C2 -d -r1.52 -r1.53 *** TagScanner.java 8 Dec 2003 13:13:59 -0000 1.52 --- TagScanner.java 20 Dec 2003 23:47:55 -0000 1.53 *************** *** 1,4 **** ! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML ! // Copyright (C) Dec 31, 2000 Somik Raha // // This library is free software; you can redistribute it and/or --- 1,12 ---- ! // HTMLParser Library $Name$ - A java-based parser for HTML ! // http://sourceforge.org/projects/htmlparser ! // Copyright (C) 2003 Somik Raha ! // ! // Revision Control Information ! // ! // $Source$ ! // $Author$ ! // $Date$ ! // $Revision$ // // This library is free software; you can redistribute it and/or *************** *** 9,147 **** // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ! // ! // For any questions or suggestions, you can write to me at : ! // Email :so...@in... 
// - // Postal Address : - // Somik Raha - // Extreme Programmer & Coach - // Industrial Logic Corporation - // 2583 Cedar Street, Berkeley, - // CA 94708, USA - // Website : http://www.industriallogic.com package org.htmlparser.scanners; ! ////////////////// ! // Java Imports // ! ////////////////// import java.io.Serializable; - import java.util.Hashtable; - import java.util.Map; - import java.util.Vector; - import org.htmlparser.AbstractNode; - import org.htmlparser.Node; - import org.htmlparser.Parser; - import org.htmlparser.StringNode; import org.htmlparser.lexer.Lexer; - import org.htmlparser.lexer.Page; - import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.tags.Tag; ! import org.htmlparser.util.NodeIterator; import org.htmlparser.util.ParserException; - import org.htmlparser.util.ParserFeedback; /** ! * TagScanner is an abstract superclass which is subclassed to create specific ! * scanners. ! * This isn't much use other than creating a specific tag type since scanning ! * is mostly done by the lexer level. If you want to match end tags and ! * handle special syntax between tags, then you'll probably want to subclass ! * {@link CompositeTagScanner} instead. Use TagScanner when you have meta task ! * to do like setting the BASE url for the page when a BASE tag is encountered. ! * <br> ! * If you wish to write your own scanner, then you must implement scan(). ! * You MAY implement evaluate() as well, if your evaluation logic is not based ! * on a match of the tag name. ! * You MUST implement getID() - which identifies your scanner uniquely in the hashtable of scanners. ! * ! * <br> ! * Also, you have a feedback object provided to you, should you want to send log messages. This object is ! * instantiated by Parser when a scanner is added to its collection. ! * */ public class TagScanner implements Serializable { /** ! * A filter which is used to associate this tag. The filter contains a string ! 
* that is used to match which tags are to be allowed to pass through. This can ! * be useful when one wishes to dynamically filter out all tags except one type ! * which may be programmed later than the parser. Is also useful for command line ! * implementations of the parser. ! */ ! protected String filter; ! ! /** ! * Default Constructor, automatically registers the scanner into a static array of ! * scanners inside Tag */ public TagScanner () { - this (""); } /** ! * This constructor automatically registers the scanner, and sets the filter for this ! * tag. ! * @param filter The filter which will allow this tag to pass through. ! */ ! public TagScanner (String filter) ! { ! this.filter=filter; ! } ! ! /** ! * This method is used to decide if this scanner can handle this tag type. If the ! * evaluation returns true, the calling side makes a call to scan(). ! * <strong>This method has to be implemented meaningfully only if a first-word match with ! * the scanner id does not imply a match (or extra processing needs to be done). ! * Default returns true</strong> ! * @param tag The tag with a name that matches a value from {@link #getID}. ! * @param previousOpenScanner Indicates any previous scanner which hasn't ! * completed, before the current scan has begun, and hence allows us to ! * write scanners that can work with dirty html. ! */ ! public boolean evaluate (Tag tag, TagScanner previousOpenScanner) ! { ! return (true); ! } ! ! public String getFilter() ! { ! return filter; ! } ! ! /** ! * Scan the tag and extract the information related to this type. The url of the ! * initiating scan has to be provided in case relative links are found. The initial ! * url is then prepended to it to give an absolute link. ! * The Lexer is provided in order to do a lookahead operation. We assume that ! * the identification has already been performed using the evaluate() method. ! * @param tag HTML Tag to be scanned for identification. ! 
* @param url The initiating url of the scan (Where the html page lies). * @param lexer Provides html page access. * @return The resultant tag (may be unchanged). */ ! public Tag scan (Tag tag, String url, Lexer lexer) throws ParserException { ! Tag ret; ! ! ret = tag; ! ret.doSemanticAction (); ! ! return (ret); ! } ! public String [] getID () ! { ! return (new String[0]); } } --- 17,73 ---- // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of ! // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software ! // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.scanners; ! import java.io.Serializable; import org.htmlparser.lexer.Lexer; import org.htmlparser.tags.Tag; ! import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; /** ! * TagScanner is an abstract superclass, subclassed to create specific scanners. ! * When asked to scan the tag, this class does nothing other than perform the ! * tag's semantic action. ! * Use TagScanner when you have a meta task to do like setting the BASE url for ! * the page when a BASE tag is encountered. ! * If you want to match end tags and handle special syntax between tags, ! * then you'll probably want to subclass {@link CompositeTagScanner} instead. */ public class TagScanner implements + Scanner, Serializable { /** ! * Create a (non-composite) tag scanner. */ public TagScanner () { } /** ! * Scan the tag. ! * For this implementation, the only operation is to perform the tag's ! * semantic action. ! * @param tag The tag to scan. * @param lexer Provides html page access. + * @param stack The parse stack. May contain pending tags that enclose + * this tag. 
* @return The resultant tag (may be unchanged). */ ! public Tag scan (Tag tag, Lexer lexer, NodeList stack) throws ParserException { ! tag.doSemanticAction (); ! return (tag); } } Index: package.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/package.html,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** package.html 8 Dec 2003 01:31:52 -0000 1.18 --- package.html 20 Dec 2003 23:47:55 -0000 1.19 *************** *** 3,54 **** <head> <!-- ! @(#)package.html 1.60 98/01/27 ! ! HTMLParser Library v1_4_20031207 - A java-based parser for HTML ! Copyright (C) Dec 31, 2000 Somik Raha ! ! This library is free software; you can redistribute it and/or ! modify it under the terms of the GNU Lesser General Public ! License as published by the Free Software Foundation; either ! version 2.1 of the License, or (at your option) any later version. ! ! This library is distributed in the hope that it will be useful, ! but WITHOUT ANY WARRANTY; without even the implied warranty of ! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ! Lesser General Public License for more details. ! ! You should have received a copy of the GNU Lesser General Public ! License along with this library; if not, write to the Free Software ! Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ! ! For any questions or suggestions, you can write to me at : ! Email :so...@in... ! ! Postal Address : ! Somik Raha ! Extreme Programmer & Coach ! Industrial Logic Corporation ! 2583 Cedar Street, Berkeley, ! CA 94708, USA ! Website : http://www.industriallogic.com --> </head> <body bgcolor="white"> ! The scanners package contains scanners that can be fired automatically upon the identification of tags. ! Developers should familiarize themselves with this package, as extension to this framework will be mostly in the form of ! addition of custom scanners. ! ! ! 
<h2>Related Documentation</h2> ! ! For overviews, tutorials, examples, guides, and tool documentation, please see: ! <ul> ! <li><a href="http://htmlparser.sourceforge.net">HTML Parser Home Page</a> ! </ul> ! ! <!-- Put @see and @since tags down here. --> ! </body> </html> --- 3,51 ---- <head> <!-- + HTMLParser Library $Name$ - A java-based parser for HTML + http://sourceforge.org/projects/htmlparser + Copyright (C) 2003 Somik Raha ! Revision Control Information + $Source$ + $Author$ + $Date$ + $Revision$ + // + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + // + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + // + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + // --> </head> <body bgcolor="white"> ! The scanners package contains classes responsible for the tertiary ! identification of tags. The lower level classes in the {@link ! org.htmlparser.lexer.Lexer lexer} package convert ! byte streams to characters and characters to nodes (via the {@link ! org.htmlparser.lexer.nodes.NodeFactory NodeFactory}). In the case of tags, the ! scanners in this package can then complete the tag or override the current tag ! and return an augmented tag. The existing implementation of the {@link ! org.htmlparser.scanners.CompositeTagScanner composite tag ! scanner}, for example, gathers the children of composite tags, identifying the ! nested structure of HTML documents. The {@link ! 
org.htmlparser.scanners.ScriptScanner script scanner} overrides the nodes ! returned by the lexer and creates a tag containing a single string that is the ! script code.<br> ! You might need to create a scanner (that implements the {@link Scanner Scanner} interface) if ! the text you are trying to parse doesn't look like HTML, as is the case for the ! script scanner, or the normal processing of tags by nesting their structure is ! inadequate. </body> </html>
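
Editorial sketch (not part of the commit): the new scan() signature passes a parse stack of pending tags, and the log message describes closing tags by walking up that stack when an unopened end tag is met. The CompositeTagScanner code that does this is not shown in this part of the message, so the following is only a rough illustration of the idea under stated assumptions. It uses plain strings and a java.util.List in place of the library's Tag and NodeList types, and the names StackWalkSketch, findOpener and close are hypothetical, not htmlparser API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Stand-in illustration only. Tag names are plain strings and the parse
 * stack is a List; none of these names exist in the htmlparser library.
 */
public class StackWalkSketch {

    /**
     * Search the stack of open tags from the innermost (last) entry outward.
     * Returns the index of the tag the end tag would close, or -1 if no
     * enclosing tag with that name is open.
     */
    static int findOpener(List<String> stack, String endTagName) {
        for (int i = stack.size() - 1; i >= 0; i--) {
            if (stack.get(i).equalsIgnoreCase(endTagName)) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Pop the matching opener and every tag nested inside it, returning the
     * popped names innermost first. An end tag with no matching opener
     * leaves the stack untouched and returns an empty list.
     */
    static List<String> close(List<String> stack, String endTagName) {
        List<String> closed = new ArrayList<>();
        int opener = findOpener(stack, endTagName);
        if (opener >= 0) {
            while (stack.size() > opener) {
                closed.add(stack.remove(stack.size() - 1));
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        // Assumed input: <TABLE><TR><TD> followed by </TR> with no </TD>.
        List<String> stack = new ArrayList<>(Arrays.asList("TABLE", "TR", "TD"));
        System.out.println(close(stack, "TR")); // prints [TD, TR]
        System.out.println(stack);              // prints [TABLE]
    }
}
```

In this sketch an end tag that matches nothing on the stack is simply ignored; whether the real scanner drops such a tag or keeps it as an ordinary node is not stated here.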