[Htmlparser-developer] Re: [Htmlparser-user] Bug found
From: Somik R. <so...@ya...> - 2002-07-01 11:44:58
Hi Cheng
Thanks for the kind words.
Regarding the bug, I would call it a feature :)
When you parse a link in a page, a relative link gets processed
appropriately. If you want the absolute link, you should call
linkTag.getLink(). The toHTML() method, however, tries to reconstruct the
HTML as it appeared (so relative links show up as relative, and absolute
links show up as absolute). The purpose of toHTML() itself may be
debatable: do you think toHTML() should not give an accurate rendition in
the case of the HTMLTag? I am open to opinions from everyone on this.
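To illustrate what resolving a relative link means, here is a minimal, self-contained sketch. It uses only java.net.URI from the standard library and is not the parser's own code. The class name LinkResolutionSketch, the resolve helper, and the base URL http://www.ed.ac.uk/ are assumptions made purely for illustration.

```java
import java.net.URI;

// Illustrative sketch only: this is NOT htmlparser code. It shows the general
// idea of turning a relative href into an absolute link.
public class LinkResolutionSketch {

    // Resolve an href (relative or absolute) against the URL of the page
    // it was found on. An href that is already absolute comes back unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ed.ac.uk/"; // assumed base URL, for illustration
        System.out.println(resolve(base, "studying/"));
        System.out.println(resolve(base, "/misc/depts.html"));
        System.out.println(resolve(base, "http://www.lib.ed.ac.uk/"));
    }
}
```

In these terms, getLink() plays the role of the resolved value, while toHTML() keeps the href exactly as it was written in the page.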
For your purposes, you will need to modify the code of toHTML() in
HTMLLinkTag.java.
Original Code :
public String toHTML() {
    StringBuffer sb = new StringBuffer();
    sb.append("<");
    sb.append(tagContents.toString());
    sb.append(">");
    HTMLNode node;
    for (Enumeration e = linkData(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        sb.append(node.toHTML());
    }
    sb.append("</A>");
    return sb.toString();
}
Modified Code :
public String toHTML() {
    StringBuffer sb = new StringBuffer();
    // Modification occurs here: emit the absolute link from getLink() as
    // the href. Other attributes of the original tag are not kept.
    sb.append("<A HREF=\"");
    sb.append(getLink());
    sb.append("\">");
    HTMLNode node;
    for (Enumeration e = linkData(); e.hasMoreElements();) {
        node = (HTMLNode) e.nextElement();
        sb.append(node.toHTML());
    }
    sb.append("</A>");
    return sb.toString();
}
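A possible variation on the modification above, sketched as standalone code: rewrite only the href value inside the tag contents, so the rest of the tag is preserved. This is an illustration and not code from HTMLLinkTag.java. The class HrefRewriteSketch and the method withAbsoluteHref are hypothetical names, and the sketch assumes the href value is quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical helper, not part of htmlparser.
public class HrefRewriteSketch {

    // Matches href="..." or href='...' (case-insensitive).
    // Group 1 is the quote character, group 2 is the href value.
    private static final Pattern HREF =
            Pattern.compile("(?i)href\\s*=\\s*([\"'])(.*?)\\1");

    // Replace the first quoted href value in the tag contents with the given
    // absolute link, leaving every other attribute as it was.
    static String withAbsoluteHref(String tagContents, String absoluteLink) {
        Matcher m = HREF.matcher(tagContents);
        if (!m.find()) {
            return tagContents; // no quoted href found; leave untouched
        }
        String replacement = "href=\"" + absoluteLink + "\"";
        return m.replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String contents = "a href=\"studying/\" class=\"nav\"";
        System.out.println("<" + withAbsoluteHref(contents,
                "http://www.ed.ac.uk/studying/") + ">");
    }
}
```

Inside toHTML(), the equivalent step would be to pass tagContents.toString() and getLink() to such a helper before appending the result between "<" and ">".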
Let me know if I have misunderstood the problem, or if this does not
fix it.
Cheers,
Somik
(Note: If you check out the code from CVS, you will get the ant build
script. This will make it really simple for you to just get the
htmlparser.jar and use it in your app.)
----- Original Message -----
From: Cheng Jun
To: htm...@li... ; htm...@li...
Sent: Monday, July 01, 2002 3:51 AM
Subject: [Htmlparser-user] Bug found
Firstly, I have to say thank you to Somik Raha. You have really done a
good job in giving us a new integration.
I am writing a program to parse web pages and retrieve the links in the
pages.
I have tried the latest version (6/30) and found what may be a bug.
The following is part of the code and its output.
System.out.println("Starting parsing...... ");
com.kizna.html.HTMLParser Parser =
    new com.kizna.html.HTMLParser("E://My paper/EdCrawler/page.htm");
Parser.registerScanners();
//Parser.parse(null);
// Parse the HTML file by tag types
Enumeration e = Parser.elements();
while (HasMore)
{
    try
    {
        HasMore = e.hasMoreElements(); // HasMore is a boolean var
    } catch (Exception e2) {
        System.out.println(e2.toString());
        HasMore = false; // have to stop parsing this HTML file
    }
    if (HasMore)
    {
        com.kizna.html.HTMLNode node = (com.kizna.html.HTMLNode) e.nextElement();
        // HTML Doctype tag
        if (node instanceof com.kizna.html.tags.HTMLDoctypeTag)
        {
            com.kizna.html.tags.HTMLDoctypeTag DoctypeNode =
                (com.kizna.html.tags.HTMLDoctypeTag) node;
            System.out.println("Doctype: " + DoctypeNode.toPlainTextString());
        }
        // Title
        if (node instanceof com.kizna.html.tags.HTMLTitleTag)
        {
            com.kizna.html.tags.HTMLTitleTag TitleNode =
                (com.kizna.html.tags.HTMLTitleTag) node;
            System.out.println("Title: " + TitleNode.toPlainTextString());
        }
        // Meta
        if (node instanceof com.kizna.html.tags.HTMLMetaTag)
        {
            com.kizna.html.tags.HTMLMetaTag MataNode =
                (com.kizna.html.tags.HTMLMetaTag) node;
            System.out.println("MATA HTTP-EQUIV: " + MataNode.getHttpEquiv()
                + " MATA name: " + MataNode.getMetaTagName()
                + " CONTENT :" + MataNode.getMetaTagContents());
        }
        // Links
        if (node instanceof HTMLLinkTag)
        {
            HTMLLinkTag LinkNode = (HTMLLinkTag) node;
            // Retrieve the data from the object and print it
            System.out.println("LINK: " + LinkNode.toPlainTextString()
                + " " + " toHTML " + LinkNode.toHTML());
        }
        // Parser end
    } // if (HasMore)
} // while
System.out.println("Parsing END.");
======================================================================
Part of the output:
Doctype:
Title: The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: Description CONTENT :The University of Edinburgh, promoting excellence in teaching and research.
MATA HTTP-EQUIV: null MATA name: keywords CONTENT :edinburgh ,university ,degree ,study, studying, research, Scotland, uk, alumni, graduate, postgraduate, PhD, masters, grad ,post ,edinboro, college, school
MATA HTTP-EQUIV: null MATA name: publisher CONTENT :The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: author CONTENT :University Web Editor
LINK: Prospective Students toHTML <a href="studying/">Prospective Students</A>
LINK: News & Events toHTML <a href="news/">News & Events</A>
LINK: Faculties & Departments toHTML <a href="/misc/depts.html">Faculties & Departments</A>
LINK: Present Students toHTML <a href="/presentstudents/">Present Students</A>
LINK: Research toHTML <a href="research/">Research</A>
LINK: Support Services toHTML <a href="/misc/support.html">Support Services</A>
LINK: Staff toHTML <a href="staff/">Staff</A>
LINK: Lifelong Learning toHTML <a href="http://www.lifelong.ed.ac.uk/">Lifelong Learning</A>
LINK: The Library toHTML <a href="http://www.lib.ed.ac.uk/">The Library</A>
As the output shows, links within the same domain are displayed with only
the relative part of the link rather than the full URL.
So please check the toHTML() method.
Cheng Jun
c....@sm...
2002-07-01 02:38:51