Thread: [Htmlparser-user] help in htmlparser(I need to retrieve Snippets)
From: anumodh n. k. <anu...@ho...> - 2003-02-17 17:28:18
Hello there,

I am working on a project in "Intelligent Document Clustering" and I need to get only the title and snippet (summary) of each result from a Google search page. I used StringExtractor, but it returns all the contents of the page, which I then have to clean up to get each link and its corresponding snippet. Is there a way to do this with your code?

Waiting for your reply.

With best regards,
ANUMODH
From: Somik R. <so...@ya...> - 2003-02-17 19:06:46
> I need to get only "Title and snippets(summary)"
> from Google search page.

Write your own visitor. The visitor should override visitTag(). Check whether the tag is an HTMLTitleTag; if it is, you have the title contents. To get snippets, override visitStringNode() and collect all the string data. To clean them up, use HTMLParserUtils.removeEscapeCharacters().

See http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction (point 3) for an example of writing your own visitor.

Regards,
Somik
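For readers who have not written a visitor before, here is a minimal sketch of the shape described above. It does not use the real htmlparser classes: SketchNode, SketchTag, SketchText and SketchVisitor are hypothetical stand-ins, and the whitespace cleanup is an assumed substitute for HTMLParserUtils.removeEscapeCharacters(), whose exact behaviour is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the parser's node types. These are NOT the
// real htmlparser classes; they only carry enough to show the visitor idea.
interface SketchNode {
    void accept(SketchVisitor visitor);
}

final class SketchTag implements SketchNode {
    final String name;
    final String text;
    SketchTag(String name, String text) { this.name = name; this.text = text; }
    public void accept(SketchVisitor visitor) { visitor.visitTag(this); }
}

final class SketchText implements SketchNode {
    final String text;
    SketchText(String text) { this.text = text; }
    public void accept(SketchVisitor visitor) { visitor.visitStringNode(this); }
}

// Base visitor with no-op callbacks, so a subclass overrides only what it needs.
class SketchVisitor {
    void visitTag(SketchTag tag) {}
    void visitStringNode(SketchText node) {}
}

// Remembers the title and collects every non-empty run of text.
class TitleAndSnippetVisitor extends SketchVisitor {
    String title;
    final List<String> snippets = new ArrayList<>();

    @Override
    void visitTag(SketchTag tag) {
        if ("title".equalsIgnoreCase(tag.name)) {
            title = tag.text.trim();
        }
    }

    @Override
    void visitStringNode(SketchText node) {
        // Assumed cleanup: collapse whitespace runs (including \n and \t) to one space.
        String cleaned = node.text.replaceAll("\\s+", " ").trim();
        if (!cleaned.isEmpty()) {
            snippets.add(cleaned);
        }
    }
}

class VisitorSketch {
    public static void main(String[] args) {
        TitleAndSnippetVisitor visitor = new TitleAndSnippetVisitor();
        new SketchTag("title", "Search results").accept(visitor);
        new SketchText("First\n   snippet").accept(visitor);
        System.out.println(visitor.title + " / " + visitor.snippets);
    }
}
```

With the real library, the same two overrides would go into a subclass of the library's own visitor class, and the parser would drive the callbacks instead of the manual accept() calls in main.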
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-22 18:45:38
Hi,

I would like to write a program that can harvest certain information (mostly text) from web pages. Some of the pages require input from the user (they contain a <form> tag) before they show more information. Some are just plain text, and some use frames. How can I write a single harvester that handles all three types of page?

Below are sample pages that I want to harvest (harvest the questions and get the correct answers):

i) with a form: http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
ii) plain text: http://www.jchq.net/mockexams/exam3.htm
iii) with frames: http://www.angelfire.com/or/abhilash/Main.html

I hope you can give me some advice on how to do this. Thank you.
From: Somik R. <so...@ya...> - 2003-02-23 05:22:19
You could go through the docs at http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction

Forms and frames are represented by HTMLFormTag and HTMLFrameTag. You could write your own visitor that collects form tags and string nodes and, on encountering a frame tag, opens a new parser object for the frame URL and visits it with the same visitor (probably a different object).

Try out the programs on that page, and it should be easy. Feel free to post here if you face any problems.

Regards,
Somik
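The recursion into frames can be sketched as follows. This is only an illustration under stated assumptions: ParsedPage and FrameHarvester are hypothetical names, the in-memory map stands in for "open a new parser on the frame URL", and the visited set is an added guard against frames that point back at each other, which the advice above does not mention.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical summary of what one parsed page yields; NOT an htmlparser type.
final class ParsedPage {
    final List<String> texts;        // string nodes found on the page
    final List<String> formActions;  // action URLs of the forms found
    final List<String> frameUrls;    // src URLs of the frames found

    ParsedPage(List<String> texts, List<String> formActions, List<String> frameUrls) {
        this.texts = texts;
        this.formActions = formActions;
        this.frameUrls = frameUrls;
    }
}

// One harvester for plain, form and frame pages: it collects text and forms,
// and follows each frame URL as if it were another page to harvest.
final class FrameHarvester {
    // Stand-in for fetching and parsing a URL (assumption for this sketch).
    private final Map<String, ParsedPage> pages;
    // Added guard so that frame cycles cannot recurse forever.
    private final Set<String> visited = new HashSet<>();

    final List<String> texts = new ArrayList<>();
    final List<String> formActions = new ArrayList<>();

    FrameHarvester(Map<String, ParsedPage> pages) {
        this.pages = pages;
    }

    void harvest(String url) {
        if (!visited.add(url)) {
            return; // already harvested
        }
        ParsedPage page = pages.get(url);
        if (page == null) {
            return; // unknown or unreachable page
        }
        texts.addAll(page.texts);
        formActions.addAll(page.formActions);
        for (String frameUrl : page.frameUrls) {
            harvest(frameUrl); // same logic, one level down
        }
    }
}
```

This sketch accumulates everything in one object. Using a fresh visitor per frame, as suggested above, would work the same way if each child's results are merged into the parent's after the nested visit.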
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-23 14:47:17
Hi,

Sorry to bother you. I know that the input tag is inside the HTMLFormTag. However, when I try to parse this page with HTMLFormScanner

http://developer.java.sun.com/developer/Quizzes/jbasics1-1/

it reports an error and the process terminates. Below is my test code (just to see whether an HTMLFormTag exists in the page):

    public String extractStrings() throws HTMLParserException {
        HTMLParser parser = new HTMLParser(resource);
        parser.addScanner(new HTMLFormScanner(""));
        HTMLNode node;
        String check;
        StringBuffer results = new StringBuffer();
        for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
            node = e.nextHTMLNode();
            if (node instanceof HTMLFormTag) { // check the existence of HTMLFormTag
                System.out.print(node.toString());
            }
            check = node.toPlainTextString();
            results.append(check);
        }
        return results.toString();
    }

It compiles, but it fails at runtime and the following error is printed to the console:

    ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    org.htmlparser.util.HTMLParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLTag.scan() : Error while scanning tag, tag contents = form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/", tagLine = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
    org.htmlparser.util.HTMLParserException: HTMLFormScanner.scan() : Error while scanning the form tag, current line = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
    java.lang.NullPointerException
        at org.htmlparser.HTMLParser.addScanner(HTMLParser.java:863)
        at org.htmlparser.scanners.HTMLFormScanner.scan(HTMLFormScanner.java:164)
        at org.htmlparser.scanners.HTMLTagScanner.createScannedNode(HTMLTagScanner.java:193)
        at org.htmlparser.tags.HTMLTag.scan(HTMLTag.java:266)
        at org.htmlparser.HTMLReader.readElement(HTMLReader.java:193)
        at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
        at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
        at StringExtractor.extractStrings(StringExtractor.java:27)
        at StringExtractor.main(StringExtractor.java:49)

There are two forms on the page: one is for the site's search, and the other, the one I'm interested in, is the form with the questions.

Please help me with this. Is this a bug? Thank you.