Thread: [Htmlparser-user] help in htmlparser(I need to retrieve Snippets)

Brought to you by: derrickoswald

[Htmlparser-user] help in htmlparser(I need to retrieve Snippets)

From: anumodh n. k. <anu...@ho...> - 2003-02-17 17:28:18

Hello there,

I am working on a project in "Intelligent Document Clustering" and
I need to get only the title and snippet (summary) of each result from a
Google search page. I used StringExtractor, but it returns the entire
contents of the page, which I would then have to clean up to get each link
and its corresponding snippet. Is there a way to do this with your code?

Waiting for the reply

With best regards


ANUMODH

Re: [Htmlparser-user] help in htmlparser(I need to retrieve Snippets)

From: Somik R. <so...@ya...> - 2003-02-17 19:06:46

> I need to get only "Title and snippets(summary)"
> from Google search page.

Write your own visitor. The visitor should override
visitTag(). Check if the tag is a HTMLTitleTag, and if
it is - you have the title contents.

To get snippets, override visitStringNode(), and
collect all the string data.

To clean them up, use
HTMLParserUtils.removeEscapeCharacters().

Check
http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction
(point 3), for an example of writing your own visitor.
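
As a rough illustration of the visitor idea described above, here is a minimal, self-contained sketch. Note the assumptions: Node, Tag, Text, Visitor and TitleAndTextVisitor are hypothetical stand-in types written only for this example, not htmlparser's own classes, and the whitespace clean-up merely stands in for a call such as HTMLParserUtils.removeEscapeCharacters(). With the real library you would extend its visitor class instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in node types for this sketch (not htmlparser's classes).
interface Node {
    void accept(Visitor v);
}

final class Tag implements Node {
    final String name;
    final String contents;
    Tag(String name, String contents) { this.name = name; this.contents = contents; }
    public void accept(Visitor v) { v.visitTag(this); }
}

final class Text implements Node {
    final String text;
    Text(String text) { this.text = text; }
    public void accept(Visitor v) { v.visitStringNode(this); }
}

// Base visitor with no-op callbacks; subclasses override what they need.
class Visitor {
    void visitTag(Tag tag) {}
    void visitStringNode(Text text) {}
}

// Remembers the title and collects every non-empty string node.
class TitleAndTextVisitor extends Visitor {
    String title;
    final List<String> strings = new ArrayList<>();

    @Override
    void visitTag(Tag tag) {
        if ("title".equalsIgnoreCase(tag.name)) {
            title = tag.contents;
        }
    }

    @Override
    void visitStringNode(Text text) {
        // Assumed clean-up step: collapse whitespace runs and trim.
        String cleaned = text.text.replaceAll("\\s+", " ").trim();
        if (!cleaned.isEmpty()) {
            strings.add(cleaned);
        }
    }
}

class VisitorDemo {
    // Walks the nodes in document order and returns the filled visitor.
    static TitleAndTextVisitor collect(List<Node> nodes) {
        TitleAndTextVisitor visitor = new TitleAndTextVisitor();
        for (Node node : nodes) {
            node.accept(visitor);
        }
        return visitor;
    }

    public static void main(String[] args) {
        List<Node> page = Arrays.<Node>asList(
                new Tag("TITLE", "Search results"),
                new Text("  first \n snippet "),
                new Text("   "));
        TitleAndTextVisitor visitor = collect(page);
        System.out.println(visitor.title);
        System.out.println(visitor.strings);
    }
}
```

The collected strings still contain everything on the page, so picking out the snippet that belongs to each link would need extra logic on top of this.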

Regards,
Somik

[Htmlparser-user] Harvester

From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-22 18:45:38

Hi,

I would like to write a program that can harvest certain information (mostly
text) from web pages. Some of the pages require input from the user (they
contain a <form> tag) before they show more information. Some pages are just
plain text, and some use frames. How can I write a single harvester that
handles all three types of page?

Below are the sample pages that I want to harvest (harvest the questions and
get the correct answers):

i) with a form: http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
ii) plain text: http://www.jchq.net/mockexams/exam3.htm
iii) with frames: http://www.angelfire.com/or/abhilash/Main.html

I hope you can give me some advice on how to do this. Thank you.

Re: [Htmlparser-user] Harvester

From: Somik R. <so...@ya...> - 2003-02-23 05:22:19

You could go through the docs at
http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction
Forms and frames are represented by HTMLFormTag and HTMLFrameTag. You could
write your own visitor that collects form tags and string nodes and, on
encountering a frame tag, opens a new parser object for the frame URL
and visits it with the same visitor (probably a different object).

Try out the programs on this page, and it should be easy. Feel free to post
here if you face any problems.
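
The approach above (collect forms and strings, follow frames) could be sketched roughly as follows. This is only an illustration under stated assumptions: PageNode and Harvester are hypothetical types made up for the example, and the in-memory map stands in for opening a new parser on each frame URL. For brevity the sketch reuses one collector object for all pages and adds a visited set so that frames pointing back at each other do not loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parsed node (not an htmlparser class).
final class PageNode {
    enum Kind { TEXT, FORM, FRAME }
    final Kind kind;
    final String value; // the text, the form action, or the frame URL
    PageNode(Kind kind, String value) { this.kind = kind; this.value = value; }
}

final class Harvester {
    // Stands in for "fetch and parse this URL"; a real harvester would
    // open a new parser for each URL instead.
    private final Map<String, List<PageNode>> pages;
    private final Set<String> seen = new HashSet<>();
    final List<String> texts = new ArrayList<>();
    final List<String> formActions = new ArrayList<>();

    Harvester(Map<String, List<PageNode>> pages) { this.pages = pages; }

    void harvest(String url) {
        if (!seen.add(url)) {
            return; // already visited: avoids loops between frames
        }
        List<PageNode> nodes = pages.getOrDefault(url, Collections.<PageNode>emptyList());
        for (PageNode node : nodes) {
            switch (node.kind) {
                case TEXT:
                    texts.add(node.value);
                    break;
                case FORM:
                    formActions.add(node.value);
                    break;
                case FRAME:
                    harvest(node.value); // recurse into the framed page
                    break;
            }
        }
    }
}

class HarvesterDemo {
    public static void main(String[] args) {
        Map<String, List<PageNode>> pages = new HashMap<>();
        pages.put("main", Arrays.asList(
                new PageNode(PageNode.Kind.TEXT, "intro"),
                new PageNode(PageNode.Kind.FRAME, "left"),
                new PageNode(PageNode.Kind.FORM, "/submit")));
        pages.put("left", Arrays.asList(
                new PageNode(PageNode.Kind.TEXT, "question 1"),
                new PageNode(PageNode.Kind.FRAME, "main")));
        Harvester harvester = new Harvester(pages);
        harvester.harvest("main");
        System.out.println(harvester.texts);
        System.out.println(harvester.formActions);
    }
}
```

The same harvest() call then handles plain pages, pages with forms, and framed pages; only the handling of each node kind differs.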

Regards,
Somik

Re: [Htmlparser-user] Harvester

From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-23 14:47:17

hi,

Sorry to bother you. I know that the input tags are inside the HTMLFormTag.
However, when I try to parse this page with HTMLFormScanner:
http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
it reports an error and the process terminates. Below is my test
code (just to see whether an HTMLFormTag exists in the page):

public String extractStrings() throws HTMLParserException {
    HTMLParser parser = new HTMLParser(resource);
    parser.addScanner(new HTMLFormScanner(""));

    HTMLNode node;
    String check;
    StringBuffer results = new StringBuffer();
    for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
        node = e.nextHTMLNode();
        if (node instanceof HTMLFormTag) { // check the existence of HTMLFormTag
            System.out.print(node.toString());
        }

        check = node.toPlainTextString();
        results.append(check);
    }
    return results.toString();
}

However, the following error is printed to the console. The code compiles but
fails at runtime. Below is the error:

ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scannersat Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
org.htmlparser.util.HTMLParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to read the next element,
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLTag.scan() : Error while scanning tag, tag contents = form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/", tagLine = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
org.htmlparser.util.HTMLParserException: HTMLFormScanner.scan() : Error while scanning the form tag, current line = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;

java.lang.NullPointerException
        at org.htmlparser.HTMLParser.addScanner(HTMLParser.java:863)
        at org.htmlparser.scanners.HTMLFormScanner.scan(HTMLFormScanner.java:164)
        at org.htmlparser.scanners.HTMLTagScanner.createScannedNode(HTMLTagScanner.java:193)
        at org.htmlparser.tags.HTMLTag.scan(HTMLTag.java:266)
        at org.htmlparser.HTMLReader.readElement(HTMLReader.java:193)
        at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
        at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
        at StringExtractor.extractStrings(StringExtractor.java:27)
        at StringExtractor.main(StringExtractor.java:49)

There are two forms in the page: one is for the site's search box, and the
other, the form with the questions, is the one I'm interested in. Please
help me with this. Is this a bug? Thank you.