Thread: [Htmlparser-developer] Re: [Htmlparser-user] Bug found
From: Somik R. <so...@ya...> - 2002-07-01 11:44:58
Hi Cheng

Thanks for the kind words. Regarding the bug, I would call it a feature :)

When you parse a link within a url - if the link is relative, it gets processed appropriately. If you want to get the absolute link, you should call:

    linkTag.getLink()

The toHTML() method however tries to reconstruct the html as it appeared (so relative links show up as relative, and absolute links show up as absolute). There might be a controversy regarding the purpose of toHTML() itself - do you think toHTML() should not do an accurate rendition in the case of the HTMLTag ? I am open to opinions from everyone on this..

For your purposes, you will need to modify the code of toHTML() in HTMLLinkTag.java.

Original Code :

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        sb.append("<");
        sb.append(tagContents.toString());
        sb.append(">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Modified Code :

    public String toHTML() {
        StringBuffer sb = new StringBuffer();
        sb.append("<");
        // Modification occurs here: write the absolute link from getLink() as the HREF
        // (other attributes of the original tag are not preserved)
        sb.append("A HREF=\"" + getLink() + "\"");
        sb.append(">");
        HTMLNode node;
        for (Enumeration e = linkData(); e.hasMoreElements();) {
            node = (HTMLNode) e.nextElement();
            sb.append(node.toHTML());
        }
        sb.append("</A>");
        return sb.toString();
    }

Let me know if I might have misunderstood the problem, or this does not fix it.

Cheers,
Somik

(Note : If you check out the code from CVS, you will get the ant build script - this will make it really simple for you to just get the htmlparser.jar and use it in your app.)

----- Original Message -----
From: Cheng Jun
To: htm...@li... ; htm...@li...
Sent: Monday, July 01, 2002 3:51 AM
Subject: [Htmlparser-user] Bug found

Firstly I have to say thank you to Somik Raha.
You really do a good job to give us a new integration.

I am writing a program to parse web pages and retrieve the links in the pages. I have tried the latest version (6/30) and found there may be a bug.

The following is part of the code and output.

    System.out.println("Starting parsing...... ");
    com.kizna.html.HTMLParser Parser = new com.kizna.html.HTMLParser("E://My paper/EdCrawler/page.htm");
    Parser.registerScanners();
    //Parser.parse(null);
    // Parse the HTML file by Tag types
    Enumeration e = Parser.elements();
    while (HasMore) {
        try {
            HasMore = e.hasMoreElements(); // HasMore is a boolean var
        } catch (Exception e2) {
            System.out.println(e2.toString());
            HasMore = false; // have to stop parsing this HTML file
        }
        if (HasMore) {
            com.kizna.html.HTMLNode node = (com.kizna.html.HTMLNode) e.nextElement();
            // HTML DoctypeTag
            if (node instanceof com.kizna.html.tags.HTMLDoctypeTag) {
                com.kizna.html.tags.HTMLDoctypeTag DoctypeNode = (com.kizna.html.tags.HTMLDoctypeTag) node;
                System.out.println("Doctype: " + DoctypeNode.toPlainTextString());
            }//if
            // Title
            if (node instanceof com.kizna.html.tags.HTMLTitleTag) {
                com.kizna.html.tags.HTMLTitleTag TitleNode = (com.kizna.html.tags.HTMLTitleTag) node;
                System.out.println("Title: " + TitleNode.toPlainTextString());
            }
            // Meta tags
            if (node instanceof com.kizna.html.tags.HTMLMetaTag) {
                com.kizna.html.tags.HTMLMetaTag MataNode = (com.kizna.html.tags.HTMLMetaTag) node;
                System.out.println("MATA HTTP-EQUIV: " + MataNode.getHttpEquiv() + " MATA name: " + MataNode.getMetaTagName() + " CONTENT :" + MataNode.getMetaTagContents());
            }//if
            // Links
            if (node instanceof HTMLLinkTag) {
                HTMLLinkTag LinkNode = (HTMLLinkTag) node;
                // Retrieve the data from the object and print it
                System.out.println("LINK: " + LinkNode.toPlainTextString() + " " + " toHTML " + LinkNode.toHTML());
            }//if
            // Parser end
        } // if (HasMore)
    }//while
    System.out.println("Parsing END.");
========================================================================
part of the output

Doctype:
Title: The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: Description CONTENT :The University of Edinburgh, promoting excellence in teaching and research.
MATA HTTP-EQUIV: null MATA name: keywords CONTENT :edinburgh ,university ,degree ,study, studying, research, Scotland, uk, alumni, graduate, postgraduate, PhD, masters, grad ,post ,edinboro, college, school
MATA HTTP-EQUIV: null MATA name: publisher CONTENT :The University of Edinburgh
MATA HTTP-EQUIV: null MATA name: author CONTENT :University Web Editor
LINK: Prospective Students toHTML <a href="studying/">Prospective Students</A>
LINK: News & Events toHTML <a href="news/">News & Events</A>
LINK: Faculties & Departments toHTML <a href="/misc/depts.html">Faculties & Departments</A>
LINK: Present Students toHTML <a href="/presentstudents/">Present Students</A>
LINK: Research toHTML <a href="research/">Research</A>
LINK: Support Services toHTML <a href="/misc/support.html">Support Services</A>
LINK: Staff toHTML <a href="staff/">Staff</A>
LINK: Lifelong Learning toHTML <a href="http://www.lifelong.ed.ac.uk/">Lifelong Learning</A>
LINK: The Library toHTML <a href="http://www.lib.ed.ac.uk/">The Library</A>

Now we can see that links within the same domain are displayed only as the relative part of the link itself.
So please check the toHTML() method.

Cheng Jun
c....@sm...
2002-07-01 02:38:51
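
For anyone following the thread, the relative-versus-absolute distinction discussed above can be sketched outside the parser. This is only an illustration, not htmlparser code: the class AbsoluteLinkSketch and its two methods are hypothetical names, the base URL http://www.ed.ac.uk/ is an assumption (the report parsed a local copy of the page, so the real base is not shown), and the resolution is done with the JDK's java.net.URI rather than with getLink().

```java
import java.net.URI;

/** Hypothetical helper, not part of htmlparser: resolves hrefs against a base URL. */
class AbsoluteLinkSketch {

    /** Resolves href against baseUrl; an href that is already absolute comes back unchanged. */
    static String toAbsolute(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    /** Rebuilds an anchor with an absolute HREF. No escaping is done and other attributes are dropped. */
    static String anchorWithAbsoluteLink(String baseUrl, String href, String linkText) {
        return "<A HREF=\"" + toAbsolute(baseUrl, href) + "\">" + linkText + "</A>";
    }

    public static void main(String[] args) {
        // Assumed base URL of the page; it does not appear in the original report.
        String base = "http://www.ed.ac.uk/";
        // Three hrefs taken from the output in the report: relative, root-relative, absolute.
        String[] hrefs = {"studying/", "/misc/depts.html", "http://www.lib.ed.ac.uk/"};
        for (String href : hrefs) {
            System.out.println(href + " -> " + toAbsolute(base, href));
        }
        System.out.println(anchorWithAbsoluteLink(base, "studying/", "Prospective Students"));
    }
}
```

Note that URI.create() rejects hrefs containing characters such as unescaped spaces, so real-world pages would need some cleanup before resolution.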