[Htmlparser-user] Bug found
From: Cheng J. <c....@sm...> - 2002-07-01 01:51:22
First, I would like to thank Somik Raha. You have done a really good job in giving us the new integration.
I am writing a program that parses web pages and retrieves the links in them.
I have tried the latest version (6/30) and found what may be a bug.
The following is part of the code and its output.
System.out.println("Starting parsing...... ");
com.kizna.html.HTMLParser parser = new com.kizna.html.HTMLParser("E://My paper/EdCrawler/page.htm");
parser.registerScanners();
//parser.parse(null);
// Parse the HTML file by tag type
Enumeration e = parser.elements();
boolean hasMore = true;
while (hasMore)
{
    try
    {
        hasMore = e.hasMoreElements();
    }
    catch (Exception e2)
    {
        // have to stop parsing this HTML file
        System.out.println(e2.toString());
        hasMore = false;
    }
    if (hasMore)
    {
        com.kizna.html.HTMLNode node = (com.kizna.html.HTMLNode) e.nextElement();
        // Doctype
        if (node instanceof com.kizna.html.tags.HTMLDoctypeTag)
        {
            com.kizna.html.tags.HTMLDoctypeTag doctypeNode = (com.kizna.html.tags.HTMLDoctypeTag) node;
            System.out.println("Doctype: " + doctypeNode.toPlainTextString());
        }
        // Title
        if (node instanceof com.kizna.html.tags.HTMLTitleTag)
        {
            com.kizna.html.tags.HTMLTitleTag titleNode = (com.kizna.html.tags.HTMLTitleTag) node;
            System.out.println("Title: " + titleNode.toPlainTextString());
        }
        // Meta tags
        if (node instanceof com.kizna.html.tags.HTMLMetaTag)
        {
            com.kizna.html.tags.HTMLMetaTag metaNode = (com.kizna.html.tags.HTMLMetaTag) node;
            System.out.println("MATA HTTP-EQUIV: " + metaNode.getHttpEquiv()
                + " MATA name: " + metaNode.getMetaTagName()
                + " CONTENT :" + metaNode.getMetaTagContents());
        }
        // Links
        if (node instanceof HTMLLinkTag)
        {
            HTMLLinkTag linkNode = (HTMLLinkTag) node;
            // Retrieve the data from the object and print it
            System.out.println("LINK: " + linkNode.toPlainTextString() + " " + " toHTML " + linkNode.toHTML());
        }
    } // if (hasMore)
} // while
System.out.println("Parsing END.");
========================================================================
Part of the output:
Doctype:
Title: The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: Description CONTENT :The University of Edinburgh, promoting excellence in teaching and research.
MATA HTTP-EQUIV: null MATA name: keywords CONTENT :edinburgh ,university ,degree ,study, studying, research, Scotland, uk, alumni, graduate, postgraduate, PhD, masters, grad ,post ,edinboro, college, school
MATA HTTP-EQUIV: null MATA name: publisher CONTENT :The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: author CONTENT :University Web Editor
LINK: Prospective Students toHTML <a href="studying/">Prospective Students</A>
LINK: News & Events toHTML <a href="news/">News & Events</A>
LINK: Faculties & Departments toHTML <a href="/misc/depts.html">Faculties & Departments</A>
LINK: Present Students toHTML <a href="/presentstudents/">Present Students</A>
LINK: Research toHTML <a href="research/">Research</A>
LINK: Support Services toHTML <a href="/misc/support.html">Support Services</A>
LINK: Staff toHTML <a href="staff/">Staff</A>
LINK: Lifelong Learning toHTML <a href="http://www.lifelong.ed.ac.uk/">Lifelong Learning</A>
LINK: The Library toHTML <a href="http://www.lib.ed.ac.uk/">The Library</A>
As the output shows, links within the same domain are displayed only as the partial (relative) link, for example href="studying/", whereas links to other hosts are shown with their full URL.
So please check the toHTML() method.
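In the meantime, one possible workaround on the caller's side would be to resolve each href against the URL of the page it came from. The following is only a sketch using java.net.URI from the standard library. The LinkResolver class, its resolve method, and the base URL http://www.ed.ac.uk/ are assumptions made for illustration; none of them is part of the parser.

```java
import java.net.URI;

// Hypothetical helper, not part of com.kizna.html.
class LinkResolver {

    // Resolve an href (relative or absolute) against the URL of the page
    // on which it was found. Absolute hrefs are returned unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed base URL; the page in the report was read from a local file.
        String base = "http://www.ed.ac.uk/";
        String[] hrefs = { "studying/", "/misc/depts.html", "http://www.lib.ed.ac.uk/" };
        for (int i = 0; i < hrefs.length; i++) {
            System.out.println(hrefs[i] + " -> " + resolve(base, hrefs[i]));
        }
    }
}
```

The href values would be taken from each link node; the sketch deliberately leaves out how they are obtained, since that depends on the parser's API.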
Cheng Jun
c....@sm...
2002-07-01 02:38:51