Thread: [Htmlparser-user] Help with Filters
From: Claude D. <CD...@ar...> - 2002-06-11 16:05:38
Greetings.

I have started developing a solution with the HTMLParser and wanted to ask about a few specifics. The application extracts text, the title and some metadata (author, description, keywords - if present) from HTML documents for indexing purposes. I have successfully written code to access the content, title, and meta information but now need to put it in context. To do this, I would like to recognize the BODY tag's start and end.

If I understand the architecture correctly, HTMLParser should allow me to register a simple HTMLTagScanner, but since this is an abstract class and the existing scanners don't suit my purpose, I presume I need to implement a subclass. Can someone show me how to subclass HTMLTagScanner to watch for a specific tag?

PS: I've found the design and implementation to be quite nice as I use it, very simple to apply in practice. If the download bundle had included source I would probably have just taken a look. I'm not averse to using CVS but the setup time is sometimes prohibitive. Having a source bundle for download might be useful in future distributions.

Thanks.
From: Somik R. <so...@ya...> - 2002-06-12 00:04:16
Hi Claude,

> PS: I've found the design and implementation to be quite nice as I use it,
> very simple to apply in practice. If the download bundle had included source
> I would probably have just taken a look. I'm not averse to using CVS but the
> setup time is sometimes prohibitive. Having a source bundle for download
> might be useful in future distributions.

There is a source bundle in the distribution. When you unzip the downloaded file, in the main htmlparser directory, you should be able to see src.zip.

> The application extracts text, the title and some metadata (author,
> description, keywords - if present) from HTML documents for indexing
> purposes. I have successfully written code to access the content, title,
> and meta information but now need to put it in context. To do this, I would
> like to recognize the BODY tag's start and end. If I understand the
> architecture correctly, HTMLParser should allow me to register a simple
> HTMLTagScanner, but since this is an abstract class and the existing
> scanners don't suit my purpose, I presume I need to implement a subclass.

Yes, you can write your own scanner as a subclass of HTMLTagScanner. Check the scanners package for all the existing scanner code. They usually follow the same pattern. Also check the docs at http://htmlparser.sourceforge.net/design/scanner.html (it's also in the download bundle, in the docs directory).

You would typically want to create a tag specific to your needs, which is created by the scan() Factory Method / Template Method. Your tag will derive from HTMLTag, or implement HTMLNode if it is not a tag. For your purposes, I'd imagine a body tag class holding a vector of HTMLNode elements.

> Can someone show me how to subclass HTMLTagScanner to watch for a specific
> tag?

It's very easy.

    public class MyScanner extends HTMLTagScanner {
        // This method is called to check if your scanner should be used.
        // Here's where you have to check if the scanner should start.
        public boolean evaluate(String s) {
            // Check if s starts with the word BODY (case-insensitive).
            return s.toUpperCase().indexOf("BODY") == 0;
        }

        // This method is automatically called to ask your scanner to do the
        // creation. Remember, the onus to do the scanning and take the
        // scanner to the next correct location for scanning is on you.
        public HTMLNode scan(...) {
            // .... your logic to create the return object (perhaps HTMLBodyTag)
            return bodyTag;
        }
    }

To register the scanner, when you create HTMLParser, you will need to do this:

    HTMLParser parser = new HTMLParser("...");
    parser.registerScanners(); // To register the standard scanners
    parser.addScanner(new MyScanner());

That's all - it gets registered and used.

Since you are tapping into low-level parsing, it is imperative that you write test cases. The parserTests.scannersTests package contains sample test code, which you can copy as a template to set up your test cases. It's very easy: you can create dummy HTML code like <BODY><STRONG>HELLO WORLD</STRONG></BODY>, and register your body scanner to see if it is extracting data as you would expect.

Also, it is very important that you run parserTests.AllTests, which will run the 100+ test cases in the existing parser to check if you broke anything. These tests are what ensure this parser is bug-free and usable, and make programming it manageable.

One tip - when you are writing the scanner, although you are tapping into low-level parsing, you don't have to write low-level code - you can reuse code that might be in the other scanners. For an example of this, see HTMLTitleScanner. I'd expect all scanners to be written like this, but the other scanners are currently a bit archaic. Maybe I will get around to refactoring all the scanners to be as elegant as the HTMLTitleScanner.

Feel free to post any further questions. Good luck with your coding!

Regards,
Somik

****************************************
Somik Raha
System Architect
Kizna Corporation
Hiroo ON Bldg. 2F, 5-19-9 Hiroo,
Shibuya-ku, Tokyo, 150-0012, JAPAN
Tel : +81-3-54752646
Fax : +81-3-5449-4870
Website : www.kizna.com
Mail : so...@ki...
****************************************
C makes it easy to shoot yourself in the foot. C++ makes it harder, but
when you do, it blows away your whole leg.
- Bjarne Stroustrup
****************************************
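
As a rough illustration of the evaluate() check and the dummy-HTML test idea described in the reply, here is a minimal stand-alone sketch. It deliberately does not use HTMLParser: the class BodyTagSketch and its methods are hypothetical names made up for this example, and the real HTMLTagScanner signatures may well differ. One assumption beyond the reply is the extra check after the tag name, added so that a tag merely beginning with the letters BODY is not accepted.

```java
// Hypothetical stand-alone helper; it does not use or reproduce the HTMLParser API.
final class BodyTagSketch {

    // Mirrors the evaluate() idea: accept tag contents that begin with the
    // tag name BODY, ignoring case.
    static boolean evaluate(String tagContents) {
        String s = tagContents.trim();
        if (!s.regionMatches(true, 0, "BODY", 0, 4)) {
            return false;
        }
        // Assumption: require end of string or whitespace after the name, so a
        // made-up tag such as BODYGUARD is not mistaken for BODY.
        return s.length() == 4 || Character.isWhitespace(s.charAt(4));
    }

    // Case-insensitive indexOf; returns -1 when needle is not found.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text between the first BODY start tag and the next </BODY>,
    // or null when either one is missing.
    static String bodyContents(String html) {
        int open = indexOfIgnoreCase(html, "<BODY", 0);
        while (open >= 0) {
            int close = html.indexOf('>', open);
            if (close < 0) {
                return null;
            }
            if (evaluate(html.substring(open + 1, close))) {
                int end = indexOfIgnoreCase(html, "</BODY>", close + 1);
                return end < 0 ? null : html.substring(close + 1, end);
            }
            open = indexOfIgnoreCase(html, "<BODY", open + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        // Dummy HTML taken from the reply above.
        String html = "<BODY><STRONG>HELLO WORLD</STRONG></BODY>";
        System.out.println(bodyContents(html)); // prints <STRONG>HELLO WORLD</STRONG>
    }
}
```

In a real scanner the equivalent of bodyContents() would live in scan() and would build a tag object rather than return a string; this sketch only shows the matching logic in a form that can be tested in isolation.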