Thread: Re: [Htmlparser-user] Using Breadth first Search (BFS) to visit WebPages
From: J A. <ja...@ho...> - 2003-08-07 00:20:51
Hi Derrick,

It works, but it does not do this recursively by following each child's list while preserving breadth-first search order, like a crawler crawling breadth first. How can this be done?

Thanks
JAK

>From: Derrick Oswald <Der...@ro...>
>Reply-To: htm...@li...
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Using Breadth first Search (BFS) to visit WebPages
>Date: Tue, 05 Aug 2003 21:02:12 -0400
>
>JAK,
>
>It works for me with version 1.4, perhaps the method you use to create the spreadsheet is in error:
>
>Visiting..http://www.redgreen.com/
>3 links found on this page
>y=0 Root List=3
>y=1 Root List=3
>y=2 Root List=3
>==================================================================
>==========================43================
>Link=http://www.ezpost.com
>Link=web...@re... web...@re...
>Link=http://www.northstarwebsites.com
>Link=http://www.redgreen.com/menu.asp
>Link=;
>Link=http://www.redgreen.com/menu.asp
>Link=http://www.redgreen.com/cast.htm
>Link=http://www.redgreen.com/episodes/season1to3.htm
>Link=http://www.redgreen.com/news.asp
>Link=http://www.redgreen.com/merchandise.asp
>Link=http://www.redgreen.com/chat.htm
>Link=http://www.3m.com/intl/CA/english/centres/home_leisure/duct_tape/
>Link=http://www.redgreen.com/links.htm
>Link=http://www.redgreen.com/tickets.htm
>Link=re...@re...
>Link=http://www.redgreen.com/scrapbook.htm
>Link=http://www.redgreen.com/on_air.asp
>Link=http://www.ssp.ca/index.htm
>Link=http://www.ssp.ca/contact.htm
>Link=http://www.ssp.ca/private/index.htm
>Link=http://www.ssp.ca/news.htm
>Link=http://www.ssp.ca/profile.htm
>Link=http://www.ssp.ca/programs.htm
>Link=http://www.ssp.ca/distribution.htm
>Link=http://www.ssp.ca/proposals.htm
>Link=http://www.ssp.ca/production.htm
>Link=http://www.ezpost.com
>Link=web...@re... web...@re...
>Link=http://www.northstarwebsites.com
>Link=http://www.redgreen.com/menu.asp
>Link=;
>Link=http://www.redgreen.com/menu.asp
>Link=http://www.redgreen.com/cast.htm
>Link=http://www.redgreen.com/episodes/season1to3.htm
>Link=http://www.redgreen.com/news.asp
>Link=http://www.redgreen.com/merchandise.asp
>Link=http://www.redgreen.com/chat.htm
>Link=http://www.3m.com/intl/CA/english/centres/home_leisure/duct_tape/
>Link=http://www.redgreen.com/links.htm
>Link=http://www.redgreen.com/tickets.htm
>Link=re...@re...
>Link=http://www.redgreen.com/scrapbook.htm
>Link=http://www.redgreen.com/on_air.asp
>
>Is there a reason you want to stick with 1.3?
>Looking at the change log for NodeList, some fixes went in fairly recently, like:
>
>revision 1.34
>date: 2003/07/12 00:33:59; author: jkerievsky; state: Exp; lines: +8 -5
>added more support for string node factory, fixed an error in the NodeArray class
>
>----------------------------
>revision 1.24
>date: 2003/05/11 04:48:11; author: derrickoswald; state: Exp; lines: +13 -0
>Fixed bug #735183 Problem in Label Scanning
>A NodeReader now prepends the pre-read tags onto the internal list, maintaining the correct order in recursively analysing unclosed tags.
>
>Derrick
>
>JAK wrote:
>
>>Hi Derrick,
>>I am trying to analyse the Java Documentation via the Parser.
>>
>>I want to do the following:
>>
>>1- Visit the root URL (i.e. http://htmlparser.sourceforge.net/javadoc_1_3/)
>>2- Find the links on the root URL and add them to the root List.
>>3- Visit the first/next URL on the root List and collect the corresponding child links in an object Array (NodeList).
>>4- Visit the child Links List and repeat the process from Step 1, in a breadth-first search manner.
>>
>>I have managed to do steps 1, 2 and 3, but I am having difficulty with step 4. When I run the program the parser identifies the links, but the order in which the links are visited and output is wrong (please view the attached Excel file - the red-highlighted entries are the right links, but they appear in the incorrect sequence).
>>
>>When I create a wholeCollectionList (to store all the child collection lists), the order of appearance is not preserved, i.e. the links on the root page end up deep down in the collection list.
>>
>>My code is below:
>>===================================================
>>public class CaptureHyper13 {
>>
>>    public static Node rootCollectionList[]; // to test collectionList.toNodeArray();
>>    public NodeList collectionList;
>>    public static String nextURL, url;
>>    public static int numOfLinksOnRoot;
>>    public static int a;
>>    public static NodeList wholeCollectionList[];
>>    public static String rootLinksList[];
>>
>>    public CaptureHyper13(String website) { // constructor
>>        collectionList = new NodeList();
>>        //numOfLinksOnRoot = 4;
>>    }
>>
>>    public static void main(String[] args) throws Exception {
>>        url = "http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html"; // non-frame
>>        CaptureHyper13 captureWiki = new CaptureHyper13(url);
>>        String tempList[] = captureWiki.extractLinks(url);
>>        captureWiki.visitChild();
>>        captureWiki.showCollectionList();
>>        //captureWiki.showRootNodes(rootCollectionList);
>>        //captureWiki.showRootLinks();
>>    }
>>
>>    // Extract the links from the root URL
>>    public String[] extractLinks(String url) throws Exception {
>>        System.out.println("Visiting.." + url);
>>        Parser parser = new Parser(url); //, new DefaultParserFeedback(2));
>>        parser.registerScanners();
>>        Node node;
>>        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
>>            node = e.nextNode();
>>            node.collectInto(collectionList, LinkTag.class); // collect the link tags into the list
>>        }
>>        System.out.println(collectionList.size() + " links found on this page");
>>        numOfLinksOnRoot = collectionList.size();
>>        wholeCollectionList = new NodeList[numOfLinksOnRoot];
>>        rootCollectionList = new Node[numOfLinksOnRoot];
>>        rootLinksList = new String[numOfLinksOnRoot];
>>        Node nodeStored;
>>        int i = 0;
>>        for (SimpleNodeIterator itr = collectionList.elements(); itr.hasMoreNodes();) {
>>            nodeStored = itr.nextNode();
>>            if (nodeStored instanceof LinkTag) {
>>                LinkTag linkTag = (LinkTag) nodeStored;
>>                rootLinksList[i] = linkTag.getLink();
>>                // System.out.println("i=" + i + " Link: " + rootLinksList[i]);
>>                i++;
>>            }
>>        }
>>        return rootLinksList;
>>    }
>>
>>    //////////////////////////////////////////////////////////////////////
>>    //////////////// THIS METHOD (BELOW) IS NOT WORKING //////////////////
>>    //////////////////////////////////////////////////////////////////////
>>
>>    public void visitChild() throws Exception {
>>        NodeList childCollectionList = new NodeList(); // NOTE: allocated once, never cleared between children
>>        for (int y = 0; y < rootLinksList.length; y++) {
>>            String childURL = rootLinksList[y];
>>            Parser cParser = new Parser(childURL);
>>            cParser.registerScanners();
>>            Node cNode;
>>            for (NodeIterator itr = cParser.elements(); itr.hasMoreNodes();) {
>>                cNode = itr.nextNode();
>>                cNode.collectInto(childCollectionList, LinkTag.class); // collect link tags into the (shared) list
>>                int numOfLinks = childCollectionList.size();
>>                //Node childList[] = new Node[numOfLinks];
>>            }
>>            wholeCollectionList[y] = childCollectionList; // adds the child collection list to the wholeCollectionList
>>            System.out.println("y=" + y + " Root List=" + wholeCollectionList.length);
>>        }
>>    }
>>
>>    public void showCollectionList() {
>>        // for (int z = 0; z < wholeCollectionList.length; z++) // commented out: kept repeating the collection list
>>        //{
>>        NodeList tempChildList = (NodeList) wholeCollectionList[0];
>>        Node cNodeStored;
>>        System.out.println("==================================================================");
>>        System.out.println("==========================" + tempChildList.size() + "================");
>>        for (SimpleNodeIterator childItr = tempChildList.elements(); childItr.hasMoreNodes();) {
>>            cNodeStored = childItr.nextNode();
>>            if (cNodeStored instanceof LinkTag) {
>>                LinkTag tempLinkTag = (LinkTag) cNodeStored;
>>                System.out.println(" Link=" + tempLinkTag.getLink() + "\t" + tempLinkTag.getLinkText());
>>            } else {
>>                System.out.println("not link Tag");
>>            }
>>        }
>>        //}
>>    }
>>}
>>===================================================
>>HTMLParser 1.3
>>
>>Can you please alter the code so the parser visits the root URL, finds the links (children) and the child nodes' lists, and then visits their collection lists, in a breadth-first search manner?
>>
>>If anything is not clear or I did not explain it well, please let me know.
>>
>>Thank you very much
>>JAK
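A likely cause of the repeated and mis-ordered links: visitChild() allocates childCollectionList once, outside the loop, and collectInto() appends to whatever list it is given, so every child page's links pile into that same NodeList, and every slot of wholeCollectionList ends up referencing the one accumulating list. That also matches the output above, where the list printed for wholeCollectionList[0] holds 43 links spanning several pages instead of only the first child's links. A minimal corrected sketch, using only the HTMLParser calls already present in the code; the one change is allocating a fresh list per child:

===================================================
public void visitChild() throws Exception {
    for (int y = 0; y < rootLinksList.length; y++) {
        NodeList childCollectionList = new NodeList(); // fresh list for each child page
        Parser cParser = new Parser(rootLinksList[y]);
        cParser.registerScanners();
        for (NodeIterator itr = cParser.elements(); itr.hasMoreNodes();) {
            itr.nextNode().collectInto(childCollectionList, LinkTag.class);
        }
        wholeCollectionList[y] = childCollectionList; // each slot now holds its own list
        System.out.println("y=" + y + " links=" + childCollectionList.size());
    }
}
===================================================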
From: Derrick O. <Der...@ro...> - 2003-08-07 11:31:04
JAK,

Sorry, I don't have time to write your code for you. If you have the basic operation working, it should be a simple matter to extend it with a FIFO queue to get the breadth-first recursion you desire.

Derrick

J AK wrote:

> Hi Derrick,
> It works, but it does not do this recursively by following each child's list while preserving breadth-first search order, like a crawler crawling breadth first.
> How can this be done?
>
> Thanks
> JAK
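For reference, a minimal sketch of the FIFO-queue crawler Derrick describes. It reuses only the HTMLParser calls already shown in this thread (Parser, registerScanners(), elements(), collectInto(), LinkTag.getLink()); the org.htmlparser import paths match the 1.4-era packages but may need adjusting for other versions, and the BfsCrawler class name, the maxPages limit, the visited set, and the startsWith("http") filter are illustrative assumptions, not anything from the thread:

===================================================
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

public class BfsCrawler {

    public static void main(String[] args) throws Exception {
        LinkedList queue = new LinkedList(); // FIFO: remove from the front, add to the back
        Set visited = new HashSet();         // URLs already fetched, to avoid cycles
        int maxPages = 20;                   // stop condition for the sketch

        queue.addLast("http://htmlparser.sourceforge.net/javadoc_1_3/");
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = (String) queue.removeFirst(); // oldest entry first = breadth first
            if (!visited.add(url))
                continue; // already seen this page, skip it

            System.out.println("Visiting.." + url);
            NodeList links = new NodeList(); // fresh list for this page's links
            Parser parser = new Parser(url);
            parser.registerScanners();
            for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
                e.nextNode().collectInto(links, LinkTag.class);
            }

            // enqueue children at the back; they are visited only after
            // every page at the current depth has left the front of the queue
            for (SimpleNodeIterator i = links.elements(); i.hasMoreNodes();) {
                Node node = i.nextNode();
                if (node instanceof LinkTag) {
                    String child = ((LinkTag) node).getLink();
                    if (child.startsWith("http")) // skip mailto: links and the like
                        queue.addLast(child);
                }
            }
        }
    }
}
===================================================

Because the queue is first-in first-out, all of the root page's links are fetched before any grandchild link, which is exactly the level-by-level ordering that steps 1-4 of the original post ask for.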