[Htmlparser-cvs] htmlparser/src/org/htmlparser/parserapplications LinkExtractor.java,1.43,1.44 MailR
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications
In directory sc8-pr-cvs1:/tmp/cvs-serv24483/src/org/htmlparser/parserapplications

Modified Files:
    LinkExtractor.java MailRipper.java Robot.java StringExtractor.java package.html

Log Message:
Add style checking target to ant build script:
    ant checkstyle
It uses a jar from http://checkstyle.sourceforge.net which is dropped in the lib directory.
The rules are in the file htmlparser_checks.xml in the src directory.

Added lexerapplications package with Tabby as the first app.
It performs whitespace manipulation on source files to follow the style rules.
This reduced the number of style violations to roughly 14,000.

There are a few issues with the style checker that need to be resolved before it should be taken too seriously.
For example:
    It thinks all method arguments should be final, even if they are modified by the code (which the compiler frowns on).
    It complains about long lines, even when there is no possibility of wrapping the line, i.e. a URL in a comment that's more than 80 characters long.
    It considers all naked integers as 'magic numbers', even when they are obvious, i.e. the 4 corners of a box.
    It complains about whitespace following braces, even in array initializers, i.e. X[][] = { {a, b} { } }

But it points out some really interesting things, even if you don't agree with the style guidelines, so it's worth a look.

Index: LinkExtractor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/LinkExtractor.java,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** LinkExtractor.java 8 Sep 2003 02:26:29 -0000 1.43
--- LinkExtractor.java 10 Sep 2003 03:38:18 -0000 1.44
***************
*** 11,15 ****
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
--- 11,15 ----
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
***************
*** 18,27 ****
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
--- 18,27 ----
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
***************
*** 50,54 ****
  e.printStackTrace();
  }
! }
  public void extractLinks() throws ParserException {
--- 50,54 ----
  e.printStackTrace();
  }
! }
  public void extractLinks() throws ParserException {

Index: MailRipper.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/MailRipper.java,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** MailRipper.java 8 Sep 2003 02:26:29 -0000 1.44
--- MailRipper.java 10 Sep 2003 03:38:18 -0000 1.45
***************
*** 11,15 ****
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
--- 11,15 ----
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
***************
*** 18,27 ****
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
--- 18,27 ----
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
***************
*** 77,85 ****
  System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. ");
  System.exit(-1);
! }
  String resourceLocation = "http://htmlparser.sourceforge.net";
  if (args.length!=0) resourceLocation = args[0];
!
! MailRipper ripper = new MailRipper(resourceLocation);
  System.out.println("Ripping Site "+resourceLocation);
  try {
--- 77,85 ----
  System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. ");
  System.exit(-1);
! }
  String resourceLocation = "http://htmlparser.sourceforge.net";
  if (args.length!=0) resourceLocation = args[0];
!
! MailRipper ripper = new MailRipper(resourceLocation);
  System.out.println("Ripping Site "+resourceLocation);
  try {
***************
*** 109,113 ****
  }
  }
! return mailAddresses.elements();
  }
  }
--- 109,113 ----
  }
  }
! return mailAddresses.elements();
  }
  }

Index: Robot.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/Robot.java,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** Robot.java 8 Sep 2003 02:26:29 -0000 1.46
--- Robot.java 10 Sep 2003 03:38:18 -0000 1.47
***************
*** 11,15 ****
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
--- 11,15 ----
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
***************
*** 18,27 ****
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
--- 18,27 ----
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
***************
*** 41,45 ****
  private org.htmlparser.Parser parser;
  /**
! * Robot crawler - Provide the starting url
  */
  public Robot(String resourceLocation) {
--- 41,45 ----
  private org.htmlparser.Parser parser;
  /**
! * Robot crawler - Provide the starting url
  */
  public Robot(String resourceLocation) {
***************
*** 82,86 ****
  if (!linkTag.isMailLink())
  {
! if (linkTag.getLink().toUpperCase().indexOf("HTM")!=-1 ||
  linkTag.getLink().toUpperCase().indexOf("COM")!=-1 ||
  linkTag.getLink().toUpperCase().indexOf("ORG")!=-1)
--- 82,86 ----
  if (!linkTag.isMailLink())
  {
! if (linkTag.getLink().toUpperCase().indexOf("HTM")!=-1 ||
  linkTag.getLink().toUpperCase().indexOf("COM")!=-1 ||
  linkTag.getLink().toUpperCase().indexOf("ORG")!=-1)
***************
*** 101,105 ****
  }
! public static void main(String[] args)
  {
  System.out.println("Robot Crawler v" + Parser.getVersion ());
--- 101,105 ----
  }
! public static void main(String[] args)
  {
  System.out.println("Robot Crawler v" + Parser.getVersion ());
***************
*** 120,131 ****
  System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. ");
  System.exit(-1);
! }
  String resourceLocation="";
  int crawlDepth = 1;
  if (args.length!=0) resourceLocation = args[0];
  if (args.length==2) crawlDepth=Integer.valueOf(args[1]).intValue();
!
!
! Robot robot = new Robot(resourceLocation);
  System.out.println("Crawling Site "+resourceLocation);
  try {
--- 120,131 ----
  System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. ");
  System.exit(-1);
! }
  String resourceLocation="";
  int crawlDepth = 1;
  if (args.length!=0) resourceLocation = args[0];
  if (args.length==2) crawlDepth=Integer.valueOf(args[1]).intValue();
!
!
! Robot robot = new Robot(resourceLocation);
  System.out.println("Crawling Site "+resourceLocation);
  try {

Index: StringExtractor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/StringExtractor.java,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** StringExtractor.java 8 Sep 2003 02:26:29 -0000 1.40
--- StringExtractor.java 10 Sep 2003 03:38:19 -0000 1.41
***************
*** 11,15 ****
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
--- 11,15 ----
  // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  // Lesser General Public License for more details.
! //
  // You should have received a copy of the GNU Lesser General Public
  // License along with this library; if not, write to the Free Software
***************
*** 18,27 ****
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
--- 18,27 ----
  // For any questions or suggestions, you can write to me at :
  // Email :so...@in...
! //
! // Postal Address :
  // Somik Raha
  // Extreme Programmer & Coach
  // Industrial Logic Corporation
! // 2583 Cedar Street, Berkeley,
  // CA 94708, USA
  // Website : http://www.industriallogic.com
***************
*** 39,48 ****
  * Construct a StringExtractor to read from the given resource.
  * @param resource Either a URL or a file name.
! */
  public StringExtractor (String resource)
  {
  this.resource = resource;
  }
!
  /**
  * Extract the text from a page.
--- 39,48 ----
  * Construct a StringExtractor to read from the given resource.
  * @param resource Either a URL or a file name.
! */
  public StringExtractor (String resource)
  {
  this.resource = resource;
  }
!
  /**
  * Extract the text from a page.
***************
*** 55,59 ****
  {
  StringBean sb;
!
  sb = new StringBean ();
  sb.setLinks (links);
--- 55,59 ----
  {
  StringBean sb;
!
  sb = new StringBean ();
  sb.setLinks (links);
***************
*** 72,76 ****
  String url;
  StringExtractor se;
!
  links = false;
  url = null;
--- 72,76 ----
  String url;
  StringExtractor se;
!
  links = false;
  url = null;

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/package.html,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** package.html 8 Sep 2003 02:26:29 -0000 1.13
--- package.html 10 Sep 2003 03:38:19 -0000 1.14
***************
*** 17,21 ****
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  Lesser General Public License for more details.
!
  You should have received a copy of the GNU Lesser General Public
  License along with this library; if not, write to the Free Software
--- 17,21 ----
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  Lesser General Public License for more details.
!
  You should have received a copy of the GNU Lesser General Public
  License along with this library; if not, write to the Free Software
***************
*** 24,33 ****
  For any questions or suggestions, you can write to me at :
  Email :so...@in...
!
! Postal Address :
  Somik Raha
  Extreme Programmer & Coach
  Industrial Logic Corporation
! 2583 Cedar Street, Berkeley,
  CA 94708, USA
  Website : http://www.industriallogic.com
--- 24,33 ----
  For any questions or suggestions, you can write to me at :
  Email :so...@in...
!
! Postal Address :
  Somik Raha
  Extreme Programmer & Coach
  Industrial Logic Corporation
! 2583 Cedar Street, Berkeley,
  CA 94708, USA
  Website : http://www.industriallogic.com
***************
*** 36,40 ****
  <body bgcolor="white">
  Developers and users alike should try out the applications in this package. The code of these applications will give
! a good idea about the capabilities of the HTML Parser, and its intended usage.
  The binary releases of html parser would typically contain these applications in runnable form.
--- 36,40 ----
  <body bgcolor="white">
  Developers and users alike should try out the applications in this package. The code of these applications will give
! a good idea about the capabilities of the HTML Parser, and its intended usage.
  The binary releases of html parser would typically contain these applications in runnable form.
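
For illustration only: the changed lines in the diffs of this commit appear to differ solely in whitespace, which is the kind of cleanup the log message attributes to Tabby. One such cleanup, stripping trailing blanks and tabs from every line, might be sketched as below. This is not the Tabby source. The class name TrailingWhitespace and its strip method are hypothetical, and the real tool may do more (tab handling, for instance).

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the Tabby class from this commit. */
final class TrailingWhitespace {

    // One or more spaces or tabs that sit directly before a line end
    // (LF or CRLF) or before the end of the input.
    private static final Pattern TRAILING =
        Pattern.compile("[ \\t]+(?=\\r?\\n|$)");

    private TrailingWhitespace() {
    }

    /** Returns the source text with trailing spaces and tabs removed from each line. */
    static String strip(String source) {
        return TRAILING.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Lines modelled on the license header touched by this commit,
        // with trailing blanks added for the demonstration.
        String before = "// \n// Postal Address : \n// 2583 Cedar Street, Berkeley, \n";
        System.out.print(strip(before));
    }
}
```

Line terminators are left untouched, so a file cleaned this way would show up in a diff exactly as above: changed lines whose visible text is identical before and after.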