[Htmlparser-cvs] htmlparser/src/org/htmlparser AbstractNode.java,1.12,1.13 Node.java,1.38,1.39 NodeR
From: <der...@us...> - 2003-09-10 03:38:59
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1:/tmp/cvs-serv24483/src/org/htmlparser

Modified Files:
    AbstractNode.java Node.java NodeReader.java Parser.java RemarkNode.java
    RemarkNodeParser.java StringNode.java StringNodeFactory.java package.html

Log Message:
Add style checking target to ant build script:
    ant checkstyle
It uses a jar from http://checkstyle.sourceforge.net which is dropped in the lib directory.
The rules are in the file htmlparser_checks.xml in the src directory.

Added lexerapplications package with Tabby as the first app.
It performs whitespace manipulation on source files to follow the style rules.
This reduced the number of style violations to roughly 14,000.

There are a few issues with the style checker that need to be resolved before it should be taken too seriously. For example:
    It thinks all method arguments should be final, even if they are modified by the code (which the compiler frowns on).
    It complains about long lines, even when there is no possibility of wrapping the line, e.g. a URL in a comment that's more than 80 characters long.
    It considers all naked integers as 'magic numbers', even when they are obvious, e.g. the 4 corners of a box.
    It complains about whitespace following braces, even in array initializers, e.g. X[][] = { {a, b} { } }
But it points out some really interesting things, even if you don't agree with the style guidelines, so it's worth a look.

Index: AbstractNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/AbstractNode.java,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** AbstractNode.java 8 Sep 2003 02:26:28 -0000 1.12 --- AbstractNode.java 10 Sep 2003 03:38:17 -0000 1.13 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. !
// // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 37,41 **** */ public abstract class AbstractNode implements Node, Serializable { ! /** * The beginning position of the tag in the line */ --- 37,41 ---- */ public abstract class AbstractNode implements Node, Serializable { ! /** * The beginning position of the tag in the line */ *************** *** 55,59 **** * The children of this node. */ ! protected NodeList children; /** --- 55,59 ---- * The children of this node. */ ! protected NodeList children; /** *************** *** 85,89 **** /** * This method will make it easier when using html parser to reproduce html pages (with or without modifications) ! * Applications reproducing html can use this method on nodes which are to be used or transferred as they were * recieved, with the original html */ --- 85,89 ---- /** * This method will make it easier when using html parser to reproduce html pages (with or without modifications) ! 
* Applications reproducing html can use this method on nodes which are to be used or transferred as they were * recieved, with the original html */ *************** *** 101,105 **** * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it --- 101,105 ---- * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it *************** *** 107,120 **** * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; ! * String filter = LinkTag.LINK_TAG_FILTER; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); ! * node.collectInto (collectionVector, filter); * } * </pre> --- 107,120 ---- * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; ! 
* String filter = LinkTag.LINK_TAG_FILTER; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); ! * node.collectInto (collectionVector, filter); * } * </pre> *************** *** 122,132 **** * deep the links are embedded. This of course implies that tags must * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * * To find out if your desired tag has filtering support, check the API of the tag. */ --- 122,132 ---- * deep the links are embedded. This of course implies that tags must * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * * To find out if your desired tag has filtering support, check the API of the tag. */ *************** *** 136,140 **** * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. 
when we try to get all the links on a page, it is not possible to get it --- 136,140 ---- * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it *************** *** 142,151 **** * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); --- 142,151 ---- * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); *************** *** 154,158 **** * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ public void collectInto(NodeList collectionList, Class nodeType) { --- 154,158 ---- * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ public void collectInto(NodeList collectionList, Class nodeType) { *************** *** 184,188 **** return toHtml(); } ! /** * Get the parent of this node. --- 184,188 ---- return toHtml(); } ! /** * Get the parent of this node. 
*************** *** 205,209 **** parent = node; } ! /** * Get the children of this node. --- 205,209 ---- parent = node; } ! /** * Get the children of this node. *************** *** 230,234 **** return null; } ! /** * Sets the string contents of the node. --- 230,234 ---- return null; } ! /** * Sets the string contents of the node. Index: Node.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Node.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** Node.java 8 Sep 2003 02:26:28 -0000 1.38 --- Node.java 10 Sep 2003 03:38:17 -0000 1.39 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 49,53 **** /** * This method will make it easier when using html parser to reproduce html pages (with or without modifications) ! 
* Applications reproducing html can use this method on nodes which are to be used or transferred as they were * recieved, with the original html */ --- 49,53 ---- /** * This method will make it easier when using html parser to reproduce html pages (with or without modifications) ! * Applications reproducing html can use this method on nodes which are to be used or transferred as they were * recieved, with the original html */ *************** *** 63,67 **** * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it --- 63,67 ---- * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it *************** *** 69,82 **** * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; ! * String filter = LinkTag.LINK_TAG_FILTER; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); ! * node.collectInto (collectionVector, filter); * } * </pre> --- 69,82 ---- * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! 
* Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; ! * String filter = LinkTag.LINK_TAG_FILTER; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); ! * node.collectInto (collectionVector, filter); * } * </pre> *************** *** 84,94 **** * deep the links are embedded. This of course implies that tags must * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * * To find out if your desired tag has filtering support, check the API of the tag. */ --- 84,94 ---- * deep the links are embedded. This of course implies that tags must * fulfill their responsibilities toward honouring certain filters. ! * ! * <B>Important:</B> In order to keep performance optimal, <B>do not create</B> you own filter strings, as * the internal matching occurs with the pre-existing filter string object (in the relevant class). i.e. do not ! * make calls like : * <I>collectInto(collectionList,"-l")</I>, instead, make calls only like : * <I>collectInto(collectionList,LinkTag.LINK_TAG_FILTER)</I>.<P/> ! * * To find out if your desired tag has filtering support, check the API of the tag. */ *************** *** 97,101 **** * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. 
when we try to get all the links on a page, it is not possible to get it --- 97,101 ---- * Collect this node and its child nodes (if-applicable) into the collection parameter, provided the node * satisfies the filtering criteria. <P/> ! * * This mechanism allows powerful filtering code to be written very easily, without bothering about collection * of embedded tags separately. e.g. when we try to get all the links on a page, it is not possible to get it *************** *** 103,112 **** * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); --- 103,112 ---- * out by checking if the current node is a form tag, and going through its contents. However, this ties us down * to specific tags, and is not a very clean approach. <P/> ! * ! * Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look * like : * <pre> ! * NodeList collectionList = new NodeList(); ! * Node node; * for (NodeIterator e = parser.elements(); e.hasMoreNodes();) { * node = e.nextNode(); *************** *** 115,119 **** * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ public abstract void collectInto(NodeList collectionList, Class nodeType); --- 115,119 ---- * </pre> * Thus, collectionList will hold all the link nodes, irrespective of how ! * deep the links are embedded. */ public abstract void collectInto(NodeList collectionList, Class nodeType); *************** *** 126,130 **** */ public abstract int elementEnd(); ! 
public abstract void accept(Object visitor); --- 126,130 ---- */ public abstract int elementEnd(); ! public abstract void accept(Object visitor); *************** *** 159,168 **** * Returns the text of the string line */ ! public String getText(); ! /** * Sets the string contents of the node. * @param text The new text for the node. */ ! public void setText(String text); } --- 159,168 ---- * Returns the text of the string line */ ! public String getText(); ! /** * Sets the string contents of the node. * @param text The new text for the node. */ ! public void setText(String text); } Index: NodeReader.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/NodeReader.java,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** NodeReader.java 8 Sep 2003 02:26:28 -0000 1.42 --- NodeReader.java 10 Sep 2003 03:38:17 -0000 1.43 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! 
// Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 72,76 **** super(in, len); this.url = url; ! this.parser = null; this.lineCount = 1; } --- 72,76 ---- super(in, len); this.url = url; ! this.parser = null; this.lineCount = 1; } *************** *** 95,99 **** this(in, 8192, url); } ! /** * Get the url for this reader. --- 95,99 ---- this(in, 8192, url); } ! /** * Get the url for this reader. *************** *** 106,110 **** /** ! * This method is intended to be called only by scanners, when a situation of dirty html has arisen, * and action has been taken to correct the parsed tags. For e.g. if we have html of the form : * <pre> --- 106,110 ---- /** ! * This method is intended to be called only by scanners, when a situation of dirty html has arisen, * and action has been taken to correct the parsed tags. For e.g. if we have html of the form : * <pre> *************** *** 113,117 **** * Now to salvage the first link, we'd probably like to insert an end tag somewhere (typically before the * second begin link tag). So that the parsing continues uninterrupted, we will need to change the existing ! * line being parsed, to contain the end tag in it. */ public void changeLine(String line) { --- 113,117 ---- * Now to salvage the first link, we'd probably like to insert an end tag somewhere (typically before the * second begin link tag). So that the parsing continues uninterrupted, we will need to change the existing ! * line being parsed, to contain the end tag in it. */ public void changeLine(String line) { *************** *** 124,128 **** * Get the last line number that the reader has read * @return int last line number read by the reader ! */ public int getLastLineNumber() { return lineCount-1; --- 124,128 ---- * Get the last line number that the reader has read * @return int last line number read by the reader ! 
*/ public int getLastLineNumber() { return lineCount-1; *************** *** 186,192 **** char ch; boolean ret; ! ret = false; ! if (pos + 2 <= line.length ()) if ('<' == line.charAt (pos)) --- 186,192 ---- char ch; boolean ret; ! ret = false; ! if (pos + 2 <= line.length ()) if ('<' == line.charAt (pos)) *************** *** 223,227 **** node = nextParsedNode.elementAt(0); nextParsedNode.remove(0); ! return node; } if (readNextLine()) { --- 223,227 ---- node = nextParsedNode.elementAt(0); nextParsedNode.remove(0); ! return node; } if (readNextLine()) { *************** *** 231,235 **** } while (line!=null && line.length()==0); ! } else if (dontReadNextLine) { --- 231,235 ---- } while (line!=null && line.length()==0); ! } else if (dontReadNextLine) { *************** *** 239,243 **** if (line==null) return null; ! if (beginTag (line, posInLine)) { --- 239,243 ---- if (line==null) return null; ! if (beginTag (line, posInLine)) { *************** *** 255,264 **** } catch (Exception e) ! { StringBuffer msgBuffer = new StringBuffer(); msgBuffer.append(DECIPHER_ERROR+"\n" + " Tag being processed : "+tag.getTagName()+"\n" + " Current Tag Line : "+tag.getTagLine() ! ); appendLineDetails(msgBuffer); ParserException ex = new ParserException(msgBuffer.toString(),e); --- 255,264 ---- } catch (Exception e) ! { StringBuffer msgBuffer = new StringBuffer(); msgBuffer.append(DECIPHER_ERROR+"\n" + " Tag being processed : "+tag.getTagName()+"\n" + " Current Tag Line : "+tag.getTagLine() ! ); appendLineDetails(msgBuffer); ParserException ex = new ParserException(msgBuffer.toString(),e); *************** *** 277,281 **** if (node!=null) return node; } ! return null; } --- 277,281 ---- if (node!=null) return node; } ! return null; } *************** *** 292,296 **** ParserException ex = new ParserException(msgBuffer.toString(),e); parser.getFeedback().error(msgBuffer.toString(),ex); ! 
throw ex; } } --- 292,296 ---- ParserException ex = new ParserException(msgBuffer.toString(),e); parser.getFeedback().error(msgBuffer.toString(),ex); ! throw ex; } } *************** *** 330,334 **** this.previousOpenScanner = previousOpenScanner; } ! /** * @param lineSeparator New Line separator to be used --- 330,334 ---- this.previousOpenScanner = previousOpenScanner; } ! /** * @param lineSeparator New Line separator to be used *************** *** 336,346 **** public static void setLineSeparator(String lineSeparator) { ! Parser.setLineSeparator(lineSeparator); } ! /** * Gets the line seperator that is being used * @return String ! */ public static String getLineSeparator() { --- 336,346 ---- public static void setLineSeparator(String lineSeparator) { ! Parser.setLineSeparator(lineSeparator); } ! /** * Gets the line seperator that is being used * @return String ! */ public static String getLineSeparator() { *************** *** 405,411 **** */ public void addNextParsedNode(Node nextParsedNode) { ! this.nextParsedNode.prepend(nextParsedNode); } ! public boolean isDontReadNextLine() { return dontReadNextLine; --- 405,411 ---- */ public void addNextParsedNode(Node nextParsedNode) { ! this.nextParsedNode.prepend(nextParsedNode); } ! public boolean isDontReadNextLine() { return dontReadNextLine; Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.59 retrieving revision 1.60 diff -C2 -d -r1.59 -r1.60 *** Parser.java 8 Sep 2003 02:26:28 -0000 1.59 --- Parser.java 10 Sep 2003 03:38:17 -0000 1.60 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! 
// // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 78,82 **** /** ! * This is the class that the user will use, either to get an iterator into * the html page or to directly parse the page and print the results * <BR> --- 78,82 ---- /** ! * This is the class that the user will use, either to get an iterator into * the html page or to directly parse the page and print the results * <BR> *************** *** 84,93 **** * [1] Create a parser object - passing the URL and a feedback object to the parser<BR> * [2] Register the common scanners. See {@link #registerScanners()} <BR> ! * You wouldnt do this if you want to configure a custom lightweight parser. In that case, * you would add the scanners of your choice using {@link #addScanner(TagScanner)}<BR> * [3] Enumerate through the elements from the parser object <BR> ! * It is important to note that the parsing occurs when you enumerate, ON DEMAND. 
This is a thread-safe way, * and you only get the control back after a particular element is parsed and returned. ! * * <BR> * Below is some sample code to parse Yahoo.com and print all the tags. --- 84,93 ---- * [1] Create a parser object - passing the URL and a feedback object to the parser<BR> * [2] Register the common scanners. See {@link #registerScanners()} <BR> ! * You wouldnt do this if you want to configure a custom lightweight parser. In that case, * you would add the scanners of your choice using {@link #addScanner(TagScanner)}<BR> * [3] Enumerate through the elements from the parser object <BR> ! * It is important to note that the parsing occurs when you enumerate, ON DEMAND. This is a thread-safe way, * and you only get the control back after a particular element is parsed and returned. ! * * <BR> * Below is some sample code to parse Yahoo.com and print all the tags. *************** *** 95,99 **** * Parser parser = new Parser("http://www.yahoo.com",new DefaultHTMLParserFeedback()); * // In this example, we are registering all the common scanners ! * parser.registerScanners(); * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { * Node node = i.nextNode(); --- 95,99 ---- * Parser parser = new Parser("http://www.yahoo.com",new DefaultHTMLParserFeedback()); * // In this example, we are registering all the common scanners ! * parser.registerScanners(); * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { * Node node = i.nextNode(); *************** *** 109,121 **** * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { * Node node = i.nextNode(); ! * if (node instanceof StringNode) { * StringNode stringNode = ! * (StringNode)node; ! * System.out.println(stringNode.getText()); ! * } * } * </pre> * The above snippet will print out only the text contents in the html document.<br> ! * Here's another snippet that will only print out the link urls in a document. * This is an example of adding a link scanner. 
* <pre> --- 109,121 ---- * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { * Node node = i.nextNode(); ! * if (node instanceof StringNode) { * StringNode stringNode = ! * (StringNode)node; ! * System.out.println(stringNode.getText()); ! * } * } * </pre> * The above snippet will print out only the text contents in the html document.<br> ! * Here's another snippet that will only print out the link urls in a document. * This is an example of adding a link scanner. * <pre> *************** *** 123,134 **** * parser.addScanner(new LinkScanner("-l")); * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! * Node node = i.nextNode(); * if (node instanceof LinkTag) { ! * LinkTag linkTag = (LinkTag)node; ! * System.out.println(linkTag.getLink()); ! * } * } * </pre> ! * @see Parser#elements() */ public class Parser --- 123,134 ---- * parser.addScanner(new LinkScanner("-l")); * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! * Node node = i.nextNode(); * if (node instanceof LinkTag) { ! * LinkTag linkTag = (LinkTag)node; ! * System.out.println(linkTag.getLink()); ! * } * } * </pre> ! * @see Parser#elements() */ public class Parser *************** *** 163,167 **** * The display version. */ ! public final static String VERSION_STRING = "" + VERSION_NUMBER + " (" + VERSION_TYPE + " " + VERSION_DATE + ")" ; --- 163,167 ---- * The display version. */ ! public final static String VERSION_STRING = "" + VERSION_NUMBER + " (" + VERSION_TYPE + " " + VERSION_DATE + ")" ; *************** *** 184,188 **** /** ! * This object is used by the StringParser to create new StringNodes at runtime, based on * use configurations of the factory */ --- 184,188 ---- /** ! * This object is used by the StringParser to create new StringNodes at runtime, based on * use configurations of the factory */ *************** *** 193,203 **** */ protected ParserFeedback feedback; ! /** * The URL or filename to be parsed. */ protected String resourceLocn; ! ! 
/** * The html reader associated with this parser. */ --- 193,203 ---- */ protected ParserFeedback feedback; ! /** * The URL or filename to be parsed. */ protected String resourceLocn; ! ! /** * The html reader associated with this parser. */ *************** *** 237,241 **** */ public static ParserFeedback noFeedback = new DefaultParserFeedback (DefaultParserFeedback.QUIET); ! /** * A verbose message sink. --- 237,241 ---- */ public static ParserFeedback noFeedback = new DefaultParserFeedback (DefaultParserFeedback.QUIET); ! /** * A verbose message sink. *************** *** 253,259 **** public static void setLineSeparator(String lineSeparatorString) { ! lineSeparator = lineSeparatorString; } ! /** * Return the version string of this parser. --- 253,259 ---- public static void setLineSeparator(String lineSeparatorString) { ! lineSeparator = lineSeparatorString; } ! /** * Return the version string of this parser. *************** *** 320,324 **** * is provided. */ ! public Parser(NodeReader rd, ParserFeedback fb) { setFeedback (fb); --- 320,324 ---- * is provided. */ ! public Parser(NodeReader rd, ParserFeedback fb) { setFeedback (fb); *************** *** 332,336 **** Tag.setTagParser(new TagParser(feedback)); } ! /** * Constructor for custom HTTP access. --- 332,336 ---- Tag.setTagParser(new TagParser(feedback)); } ! /** * Constructor for custom HTTP access. *************** *** 378,384 **** this (resourceLocn, stdout); } ! /** ! * This constructor is present to enable users to plugin their own readers. * A DefaultHTMLParserFeedback object is used for feedback. It can also be used with readers of the user's choice * streaming data into the parser.<p/> --- 378,384 ---- this (resourceLocn, stdout); } ! /** ! * This constructor is present to enable users to plugin their own readers. * A DefaultHTMLParserFeedback object is used for feedback. 
It can also be used with readers of the user's choice * streaming data into the parser.<p/> *************** *** 394,401 **** * @param reader The source for HTML to be parsed. */ ! public Parser(NodeReader reader) { ! this (reader, stdout); ! } /** --- 394,401 ---- * @param reader The source for HTML to be parsed. */ ! public Parser(NodeReader reader) { ! this (reader, stdout); ! } /** *************** *** 602,606 **** * and <code>reader</code>. It does not adjust the <code>scanners</code> list * or <code>feedback</code> object. The <code>url_conn</code> is set to ! * null since this cannot be determined from the reader. The * <code>character_set</code> is set to the default character set since * this cannot be determined from the reader. --- 602,606 ---- * and <code>reader</code>. It does not adjust the <code>scanners</code> list * or <code>feedback</code> object. The <code>url_conn</code> is set to ! * null since this cannot be determined from the reader. The * <code>character_set</code> is set to the default character set since * this cannot be determined from the reader. *************** *** 634,640 **** */ public int getNumScanners() { ! return scanners.size(); } ! /** * This method is to be used to change the set of scanners in the current parser. --- 634,640 ---- */ public int getNumScanners() { ! return scanners.size(); } ! /** * This method is to be used to change the set of scanners in the current parser. *************** *** 645,649 **** scanners = (null == newScanners) ? new HashMap() : newScanners; } ! /** * Get an enumeration of scanners registered currently in the parser --- 645,649 ---- scanners = (null == newScanners) ? new HashMap() : newScanners; } ! /** * Get an enumeration of scanners registered currently in the parser *************** *** 696,700 **** StringBuffer msg; String message; ! msg = new StringBuffer (1024); msg.append (url_conn.getURL ().toExternalForm ()); --- 696,700 ---- StringBuffer msg; String message; ! 
msg = new StringBuffer (1024); msg.append (url_conn.getURL ().toExternalForm ()); *************** *** 708,715 **** ret = new InputStreamReader (input, character_set); } ! return (ret); } ! /** * Create a new reader for the URLConnection object. --- 708,715 ---- ret = new InputStreamReader (input, character_set); } ! return (ret); } ! /** * Create a new reader for the URLConnection object. *************** *** 762,766 **** } } ! /** * Try and extract the character set from the HTTP header. --- 762,766 ---- } } ! /** * Try and extract the character set from the HTTP header. *************** *** 774,778 **** String string; String ret; ! ret = DEFAULT_CHARSET; string = connection.getHeaderField (field); --- 774,778 ---- String string; String ret; ! ret = DEFAULT_CHARSET; string = connection.getHeaderField (field); *************** *** 816,820 **** { index = content.indexOf(CHARSET_STRING); ! if (index != -1) { --- 816,820 ---- { index = content.indexOf(CHARSET_STRING); ! if (index != -1) { *************** *** 862,866 **** * In typical situations where you require a no-frills parser, use the registerScanners() method to add the most * common parsers. But when you wish to either compose a parser with only certain scanners registered, use this method. ! * It is advantageous to register only the scanners you want, in order to achieve faster parsing speed. This method * would also be of use when you have developed custom scanners, and need to register them into the parser. * @param scanner TagScanner object (or derivative) to be added to the list of registered scanners --- 862,866 ---- * In typical situations where you require a no-frills parser, use the registerScanners() method to add the most * common parsers. But when you wish to either compose a parser with only certain scanners registered, use this method. ! * It is advantageous to register only the scanners you want, in order to achieve faster parsing speed. 
This method * would also be of use when you have developed custom scanners, and need to register them into the parser. * @param scanner TagScanner object (or derivative) to be added to the list of registered scanners *************** *** 873,877 **** scanner.setFeedback(feedback); } ! /** * Returns an iterator (enumeration) to the html nodes. Each node can be a tag/endtag/ --- 873,877 ---- scanner.setFeedback(feedback); } ! /** * Returns an iterator (enumeration) to the html nodes. Each node can be a tag/endtag/ *************** *** 925,929 **** remove_scanner = true; } ! /* pre-read up to </HEAD> looking for charset directive */ while (null != (node = ret.peek ())) --- 925,929 ---- remove_scanner = true; } ! /* pre-read up to </HEAD> looking for charset directive */ while (null != (node = ret.peek ())) *************** *** 976,987 **** return ret; } ! /** * Flush the current scanners registered. The registered scanners list becomes empty with this call. */ public void flushScanners() { ! scanners = new Hashtable(); } ! /** * Return the scanner registered in the parser having the --- 976,987 ---- return ret; } ! /** * Flush the current scanners registered. The registered scanners list becomes empty with this call. */ public void flushScanners() { ! scanners = new Hashtable(); } ! /** * Return the scanner registered in the parser having the *************** *** 1006,1010 **** { if (filter==null) ! System.out.println(node.toString()); else { --- 1006,1010 ---- { if (filter==null) ! System.out.println(node.toString()); else { *************** *** 1014,1025 **** Tag tag=(Tag)node; TagScanner scanner = tag.getThisScanner(); ! if (scanner==null) continue; ! String tagFilter = scanner.getFilter(); if (tagFilter==null) continue; if (tagFilter.equals(filter)) ! System.out.println(node.toString()); ! } } else System.out.println("Node is null"); --- 1014,1025 ---- Tag tag=(Tag)node; TagScanner scanner = tag.getThisScanner(); ! if (scanner==null) continue; ! 
String tagFilter = scanner.getFilter(); if (tagFilter==null) continue; if (tagFilter.equals(filter)) ! System.out.println(node.toString()); ! } } else System.out.println("Node is null"); *************** *** 1027,1031 **** } ! /** * This method should be invoked in order to register some common scanners. The scanners that get added are : <br> --- 1027,1031 ---- } ! /** * This method should be invoked in order to register some common scanners. The scanners that get added are : <br> *************** *** 1048,1052 **** * parser.registerScanners(); * </pre> ! */ public void registerScanners() { if (scanners.size()>0) { --- 1048,1052 ---- * parser.registerScanners(); * </pre> ! */ public void registerScanners() { if (scanners.size()>0) { *************** *** 1069,1073 **** addScanner(new DoctypeScanner("-d")); addScanner(new FormScanner("-f",this)); ! addScanner(new FrameSetScanner("-r")); addScanner(linkScanner.createBaseHREFScanner("-b")); addScanner(new BulletListScanner("-bulletList",this)); --- 1069,1073 ---- addScanner(new DoctypeScanner("-d")); addScanner(new FormScanner("-f",this)); ! addScanner(new FrameSetScanner("-r")); addScanner(linkScanner.createBaseHREFScanner("-b")); addScanner(new BulletListScanner("-bulletList",this)); *************** *** 1076,1086 **** addScanner(new TableScanner(this)); } ! /** * Make a call to registerDomScanners(), instead of registerScanners(), * when you are interested in retrieving a Dom representation of the html * page. Upon parsing, you will receive an Html object - which will contain ! * children, one of which would be the body. This is still evolving, and in ! * future releases, you might see consolidation of Html - to provide you * with methods to access the body and the head. */ --- 1076,1086 ---- addScanner(new TableScanner(this)); } ! /** * Make a call to registerDomScanners(), instead of registerScanners(), * when you are interested in retrieving a Dom representation of the html * page. 
Upon parsing, you will receive an Html object - which will contain ! * children, one of which would be the body. This is still evolving, and in ! * future releases, you might see consolidation of Html - to provide you * with methods to access the body and the head. */ *************** *** 1091,1099 **** addScanner(new HeadScanner()); } ! /** * Removes a specified scanner object. You can create * an anonymous object as a parameter. This method ! * will use the scanner's key and remove it from the * registry of scanners. * e.g. --- 1091,1099 ---- addScanner(new HeadScanner()); } ! /** * Removes a specified scanner object. You can create * an anonymous object as a parameter. This method ! * will use the scanner's key and remove it from the * registry of scanners. * e.g. *************** *** 1123,1128 **** System.out.println(" -t Show only the Style code extracted from the document"); System.out.println(" -a Show only the Applet tag extracted from the document"); ! System.out.println(" -j Parse JSP tags"); ! System.out.println(" -m Parse Meta tags"); System.out.println(" -T Extract the Title"); System.out.println(" -f Extract forms"); --- 1123,1128 ---- System.out.println(" -t Show only the Style code extracted from the document"); System.out.println(" -a Show only the Applet tag extracted from the document"); ! System.out.println(" -j Parse JSP tags"); ! System.out.println(" -m Parse Meta tags"); System.out.println(" -T Extract the Title"); System.out.println(" -f Extract forms"); *************** *** 1156,1160 **** } } ! public void visitAllNodesWith(NodeVisitor visitor) throws ParserException { Node node; --- 1156,1160 ---- } } ! public void visitAllNodesWith(NodeVisitor visitor) throws ParserException { Node node; *************** *** 1165,1169 **** visitor.finishedParsing(); } ! /** * Initializes the parser with the given input HTML String. --- 1165,1169 ---- visitor.finishedParsing(); } ! /** * Initializes the parser with the given input HTML String. 
*************** *** 1173,1179 **** { if (!"".equals (inputHTML)) ! reader = new NodeReader (new StringReader (inputHTML), ""); ! } ! public Node [] extractAllNodesThatAre(Class nodeType) throws ParserException { NodeList nodeList = new NodeList(); --- 1173,1179 ---- { if (!"".equals (inputHTML)) ! reader = new NodeReader (new StringReader (inputHTML), ""); ! } ! public Node [] extractAllNodesThatAre(Class nodeType) throws ParserException { NodeList nodeList = new NodeList(); *************** *** 1183,1187 **** return nodeList.toNodeArray(); } ! /** * Creates the parser on an input string. --- 1183,1187 ---- return nodeList.toNodeArray(); } ! /** * Creates the parser on an input string. *************** *** 1190,1198 **** */ public static Parser createParser(String inputHTML) { ! NodeReader reader = new NodeReader(new StringReader(inputHTML),""); return new Parser(reader); } ! public static Parser createLinkRecognizingParser(String inputHTML) { Parser parser = createParser(inputHTML); --- 1190,1198 ---- */ public static Parser createParser(String inputHTML) { ! NodeReader reader = new NodeReader(new StringReader(inputHTML),""); return new Parser(reader); } ! public static Parser createLinkRecognizingParser(String inputHTML) { Parser parser = createParser(inputHTML); *************** *** 1213,1219 **** return stringNodeFactory; } ! public void setStringNodeFactory(StringNodeFactory stringNodeFactory) { ! this.stringNodeFactory = stringNodeFactory; ! } } --- 1213,1219 ---- return stringNodeFactory; } ! public void setStringNodeFactory(StringNodeFactory stringNodeFactory) { ! this.stringNodeFactory = stringNodeFactory; ! 
} } Index: RemarkNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/RemarkNode.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** RemarkNode.java 8 Sep 2003 02:26:28 -0000 1.29 --- RemarkNode.java 10 Sep 2003 03:38:17 -0000 1.30 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 39,43 **** { public final static String REMARK_NODE_FILTER="-r"; ! /** * Tag contents will have the contents of the comment tag. --- 39,43 ---- { public final static String REMARK_NODE_FILTER="-r"; ! /** * Tag contents will have the contents of the comment tag. *************** *** 57,61 **** } ! /** * Returns the text contents of the comment tag. */ --- 57,61 ---- } ! /** * Returns the text contents of the comment tag. 
*/ Index: RemarkNodeParser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/RemarkNodeParser.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** RemarkNodeParser.java 8 Sep 2003 02:26:28 -0000 1.29 --- RemarkNodeParser.java 10 Sep 2003 03:38:17 -0000 1.30 *************** *** 11,15 **** // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software --- 11,15 ---- // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. ! // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software *************** *** 18,27 **** // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com --- 18,27 ---- // For any questions or suggestions, you can write to me at : // Email :so...@in... ! // ! // Postal Address : // Somik Raha // Extreme Programmer & Coach // Industrial Logic Corporation ! // 2583 Cedar Street, Berkeley, // CA 94708, USA // Website : http://www.industriallogic.com *************** *** 34,44 **** public final static int REMARK_NODE_EXCLAMATION_RECEIVED_STATE=2; public final static int REMARK_NODE_FIRST_DASH_RECEIVED_STATE=3; ! public final static int REMARK_NODE_ACCEPTING_STATE=4; ! public final static int REMARK_NODE_CLOSING_FIRST_DASH_RECEIVED_STATE=5; ! public final static int REMARK_NODE_CLOSING_SECOND_DASH_RECEIVED_STATE=6; ! 
public final static int REMARK_NODE_ACCEPTED_STATE=7; public final static int REMARK_NODE_ILLEGAL_STATE=8; ! public final static int REMARK_NODE_FINISHED_PARSING_STATE=2; ! /** * Locate the remark tag withing the input string, by parsing from the given position --- 34,44 ---- public final static int REMARK_NODE_EXCLAMATION_RECEIVED_STATE=2; public final static int REMARK_NODE_FIRST_DASH_RECEIVED_STATE=3; ! public final static int REMARK_NODE_ACCEPTING_STATE=4; ! public final static int REMARK_NODE_CLOSING_FIRST_DASH_RECEIVED_STATE=5; ! public final static int REMARK_NODE_CLOSING_SECOND_DASH_RECEIVED_STATE=6; ! public final static int REMARK_NODE_ACCEPTED_STATE=7; public final static int REMARK_NODE_ILLEGAL_STATE=8; ! public final static int REMARK_NODE_FINISHED_PARSING_STATE=2; ! /** * Locate the remark tag withing the input string, by parsing from the given position *************** *** 46,50 **** * @param input Input String * @param position Position to start parsing from ! */ public RemarkNode find(NodeReader reader,String input,int position) { --- 46,50 ---- * @param input Input String * @param position Position to start parsing from ! */ public RemarkNode find(NodeReader reader,String input,int position) { *************** *** 91,95 **** tagContents.append(prevChar); } ! } if (state==REMARK_NODE_ACCEPTING_STATE) { if (ch == '-') { --- 91,95 ---- tagContents.append(prevChar); } ! } if (state==REMARK_NODE_ACCEPTING_STATE) { if (ch == '-') { *************** *** 103,111 **** if (state==REMARK_NODE_ACCEPTING_STATE) { ! // We can append contents now tagContents.append(ch); ! } - if (state==REMARK_NODE_FIRST_DASH_RECEIVED_STATE) { --- 103,111 ---- if (state==REMARK_NODE_ACCEPTING_STATE) { ! // We can append contents now tagContents.append(ch); ! } ! if (state==REMARK_NODE_FIRST_DASH_RECEIVED_STATE) { *************** *** 118,122 **** } else state=REMARK_NODE_ILLEGAL_STATE; ! 
} if (state==REMARK_NODE_EXCLAMATION_RECEIVED_STATE) { --- 118,122 ---- } else state=REMARK_NODE_ILLEGAL_STATE; ! } if (state==REMARK_NODE_EXCLAMATION_RECEIVED_STATE) { *************** *** 129,133 **** } else state=REMARK_NODE_ILLEGAL_STATE; ! } if (state==REMARK_NODE_OPENING_ANGLE_BRACKET_STATE) { --- 129,133 ---- } else state=REMARK_NODE_ILLEGAL_STATE; ! } if (state==REMARK_NODE_OPENING_ANGLE_BRACKET_STATE) { *************** *** 135,139 **** state=REMARK_NODE_EXCLAMATION_RECEIVED_STATE; else state = REMARK_NODE_ILLEGAL_STATE; // This is not a remark tag ! } if (state == REMARK_NODE_BEFORE_PARSING_STATE) { --- 135,139 ---- state=REMARK_NODE_EXCLAMATION_RECEIVED_STATE; else state = REMARK_NODE_ILLEGAL_STATE; // This is not a remark tag ! } if (state == REMARK_NODE_BEFORE_PARSING_STATE) { *************** *** 147,153 **** state = REMARK_NODE_ILLEGAL_STATE; } ! } // if... [truncated message content]