Thread: Re: [Htmlparser-user] Using Breadth first Search (BFS) to visit WebPages
From: J A. <ja...@ho...> - 2003-08-07 00:20:51
Hi Derrick,

It works, but it does not do this recursively by following each child's list while preserving breadth-first search order, like a crawler crawling breadth first. How can this be done?

Thanks
JAK

>From: Derrick Oswald <Der...@ro...>
>Reply-To: htm...@li...
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Using Breadth first Search (BFS) to visit WebPages
>Date: Tue, 05 Aug 2003 21:02:12 -0400
>
>JAK,
>
>It works for me with version 1.4, perhaps the method you use to create the spreadsheet is in error:
>
>Visiting..http://www.redgreen.com/
>3 links found on this page
>y=0 Root List=3
>y=1 Root List=3
>y=2 Root List=3
>==================================================================
>==========================43================
>Link=http://www.ezpost.com
>Link=web...@re... web...@re...
>Link=http://www.northstarwebsites.com
>Link=http://www.redgreen.com/menu.asp
>Link=;
>Link=http://www.redgreen.com/menu.asp
>Link=http://www.redgreen.com/cast.htm
>Link=http://www.redgreen.com/episodes/season1to3.htm
>Link=http://www.redgreen.com/news.asp
>Link=http://www.redgreen.com/merchandise.asp
>Link=http://www.redgreen.com/chat.htm
>Link=http://www.3m.com/intl/CA/english/centres/home_leisure/duct_tape/
>Link=http://www.redgreen.com/links.htm
>Link=http://www.redgreen.com/tickets.htm
>Link=re...@re...
>Link=http://www.redgreen.com/scrapbook.htm
>Link=http://www.redgreen.com/on_air.asp
>Link=http://www.ssp.ca/index.htm
>Link=http://www.ssp.ca/contact.htm
>Link=http://www.ssp.ca/private/index.htm
>Link=http://www.ssp.ca/news.htm
>Link=http://www.ssp.ca/profile.htm
>Link=http://www.ssp.ca/programs.htm
>Link=http://www.ssp.ca/distribution.htm
>Link=http://www.ssp.ca/proposals.htm
>Link=http://www.ssp.ca/production.htm
>Link=http://www.ezpost.com
>Link=web...@re... web...@re...
>Link=http://www.northstarwebsites.com
>Link=http://www.redgreen.com/menu.asp
>Link=;
>Link=http://www.redgreen.com/menu.asp
>Link=http://www.redgreen.com/cast.htm
>Link=http://www.redgreen.com/episodes/season1to3.htm
>Link=http://www.redgreen.com/news.asp
>Link=http://www.redgreen.com/merchandise.asp
>Link=http://www.redgreen.com/chat.htm
>Link=http://www.3m.com/intl/CA/english/centres/home_leisure/duct_tape/
>Link=http://www.redgreen.com/links.htm
>Link=http://www.redgreen.com/tickets.htm
>Link=re...@re...
>Link=http://www.redgreen.com/scrapbook.htm
>Link=http://www.redgreen.com/on_air.asp
>
>Is there a reason you want to stick with 1.3?
>Looking at the change log for NodeList, some fixes went in fairly recently, like:
>
>revision 1.34
>date: 2003/07/12 00:33:59; author: jkerievsky; state: Exp; lines: +8 -5
>added more support for string node factory, fixed an error in the NodeArray class
>
>----------------------------
>revision 1.24
>date: 2003/05/11 04:48:11; author: derrickoswald; state: Exp; lines: +13 -0
>Fixed bug #735183 Problem in Label Scanning
>A NodeReader now prepends the pre-read tags onto the internal list, maintaining the correct order in recursively analysing unclosed tags.
>
>Derrick
>
>JAK wrote:
>
>>Hi Derrick,
>>I am trying to analyse the Java Documentation via the Parser.
>>
>>I want to do the following:
>>
>>1- Visit the root URL (i.e. http://htmlparser.sourceforge.net/javadoc_1_3/)
>>2- Find the links on the root URL and add them to the root List.
>>3- Visit the first/next URL on the root List and collect the corresponding child links in an object Array (NodeList).
>>4- Visit the child Links List and repeat the process from Step 1, in a breadth-first search manner.
>>
>>I have managed to do steps 1, 2 and 3, but I am having difficulty with step 4. When I run the program the parser identifies the links, but the order in which the links are visited and output is wrong (please view the attached Excel file - the red-highlighted entries are the right links, but they appear in the incorrect sequence).
>>
>>When I create a wholeCollectionList (to store all the child collection lists), the order of appearance is not preserved, i.e. the links on the root page end up deep down in the collection list.
>>
>>My code is below:
>>===================================================
>>public class CaptureHyper13 {
>>
>>    public static Node rootCollectionList[]; // to test collectionList.toNodeArray();
>>    public NodeList collectionList;
>>    public static String nextURL, url;
>>    public static int numOfLinksOnRoot;
>>    public static int a;
>>    public static NodeList wholeCollectionList[];
>>    public static String rootLinksList[];
>>
>>    public CaptureHyper13(String website) { // constructor
>>        collectionList = new NodeList();
>>        //numOfLinksOnRoot = 4;
>>    }
>>
>>    public static void main(String[] args) throws Exception {
>>        url = "http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html"; // non-frame
>>        CaptureHyper13 captureWiki = new CaptureHyper13(url);
>>        String tempList[] = captureWiki.extractLinks(url);
>>        captureWiki.visitChild();
>>        captureWiki.showCollectionList();
>>        //captureWiki.showRootNodes(rootCollectionList);
>>        //captureWiki.showRootLinks();
>>    }
>>
>>    // Extract the links from the root URL
>>    public String[] extractLinks(String url) throws Exception {
>>        System.out.println("Visiting.." + url);
>>        Parser parser = new Parser(url); //, new DefaultParserFeedback(2));
>>        parser.registerScanners();
>>        Node node;
>>        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
>>            node = e.nextNode();
>>            node.collectInto(collectionList, LinkTag.class); // collect the link tags into the list
>>        }
>>        System.out.println(collectionList.size() + " links found on this page");
>>        numOfLinksOnRoot = collectionList.size();
>>        wholeCollectionList = new NodeList[numOfLinksOnRoot];
>>        rootCollectionList = new Node[numOfLinksOnRoot];
>>        rootLinksList = new String[numOfLinksOnRoot];
>>        Node nodeStored;
>>        int i = 0;
>>        for (SimpleNodeIterator itr = collectionList.elements(); itr.hasMoreNodes();) {
>>            nodeStored = itr.nextNode();
>>            if (nodeStored instanceof LinkTag) {
>>                LinkTag linkTag = (LinkTag) nodeStored;
>>                rootLinksList[i] = linkTag.getLink();
>>                // System.out.println("i=" + i + " Link: " + rootLinksList[i]);
>>                i++;
>>            }
>>        }
>>        return rootLinksList;
>>    }
>>
>>    //////////////////////////////////////////////////////////////////////
>>    //////////////// THIS METHOD (BELOW) IS NOT WORKING //////////////////
>>    //////////////////////////////////////////////////////////////////////
>>
>>    public void visitChild() throws Exception {
>>        NodeList childCollectionList = new NodeList(); // NOTE: allocated once, never cleared between children
>>        for (int y = 0; y < rootLinksList.length; y++) {
>>            String childURL = rootLinksList[y];
>>            Parser cParser = new Parser(childURL);
>>            cParser.registerScanners();
>>            Node cNode;
>>            for (NodeIterator itr = cParser.elements(); itr.hasMoreNodes();) {
>>                cNode = itr.nextNode();
>>                cNode.collectInto(childCollectionList, LinkTag.class); // collect link tags into the (shared) list
>>                int numOfLinks = childCollectionList.size();
>>                //Node childList[] = new Node[numOfLinks];
>>            }
>>            wholeCollectionList[y] = childCollectionList; // adds the child collection list to the wholeCollectionList
>>            System.out.println("y=" + y + " Root List=" + wholeCollectionList.length);
>>        }
>>    }
>>
>>    public void showCollectionList() {
>>        // for (int z = 0; z < wholeCollectionList.length; z++) // commented out: kept repeating the collection list
>>        //{
>>        NodeList tempChildList = (NodeList) wholeCollectionList[0];
>>        Node cNodeStored;
>>        System.out.println("==================================================================");
>>        System.out.println("==========================" + tempChildList.size() + "================");
>>        for (SimpleNodeIterator childItr = tempChildList.elements(); childItr.hasMoreNodes();) {
>>            cNodeStored = childItr.nextNode();
>>            if (cNodeStored instanceof LinkTag) {
>>                LinkTag tempLinkTag = (LinkTag) cNodeStored;
>>                System.out.println(" Link=" + tempLinkTag.getLink() + "\t" + tempLinkTag.getLinkText());
>>            } else {
>>                System.out.println("not link Tag");
>>            }
>>        }
>>        //}
>>    }
>>}
>>===================================================
>>HTMLParser 1.3
>>
>>Can you please alter the code so the parser visits the root URL, finds the links (children) and the child nodes' lists, and then visits their collection lists, in a breadth-first search manner?
>>
>>If anything is not clear or I did not explain it well, please let me know.
>>
>>Thank you very much
>>JAK
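A likely cause of the repeated and mis-ordered links: visitChild() allocates childCollectionList once, outside the loop, and collectInto() appends to whatever list it is given, so every child page's links pile into that same NodeList, and every slot of wholeCollectionList ends up referencing the one accumulating list. That also matches the output above, where the list printed for wholeCollectionList[0] holds 43 links spanning several pages instead of only the first child's links. A minimal corrected sketch, using only the HTMLParser calls already present in the code; the one change is allocating a fresh list per child:

===================================================
public void visitChild() throws Exception {
    for (int y = 0; y < rootLinksList.length; y++) {
        NodeList childCollectionList = new NodeList(); // fresh list for each child page
        Parser cParser = new Parser(rootLinksList[y]);
        cParser.registerScanners();
        for (NodeIterator itr = cParser.elements(); itr.hasMoreNodes();) {
            itr.nextNode().collectInto(childCollectionList, LinkTag.class);
        }
        wholeCollectionList[y] = childCollectionList; // each slot now holds its own list
        System.out.println("y=" + y + " links=" + childCollectionList.size());
    }
}
===================================================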
From: Derrick O. <Der...@ro...> - 2003-08-07 11:31:04
JAK,

Sorry, I don't have time to write your code for you. If you have the basic operation working, it should be a simple matter to extend it with a FIFO queue to get the breadth-first recursion you desire.

Derrick

J AK wrote:

> Hi Derrick,
> It works, but it does not do this recursively by following each child's list while preserving breadth-first search order, like a crawler crawling breadth first.
> How can this be done?
>
> Thanks
> JAK
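For reference, a minimal sketch of the FIFO-queue crawler Derrick describes. It reuses only the HTMLParser calls already shown in this thread (Parser, registerScanners(), elements(), collectInto(), LinkTag.getLink()); the org.htmlparser import paths match the 1.4-era packages but may need adjusting for other versions, and the BfsCrawler class name, the maxPages limit, the visited set, and the startsWith("http") filter are illustrative assumptions, not anything from the thread:

===================================================
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

public class BfsCrawler {

    public static void main(String[] args) throws Exception {
        LinkedList queue = new LinkedList(); // FIFO: remove from the front, add to the back
        Set visited = new HashSet();         // URLs already fetched, to avoid cycles
        int maxPages = 20;                   // stop condition for the sketch

        queue.addLast("http://htmlparser.sourceforge.net/javadoc_1_3/");
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = (String) queue.removeFirst(); // oldest entry first = breadth first
            if (!visited.add(url))
                continue; // already seen this page, skip it

            System.out.println("Visiting.." + url);
            NodeList links = new NodeList(); // fresh list for this page's links
            Parser parser = new Parser(url);
            parser.registerScanners();
            for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
                e.nextNode().collectInto(links, LinkTag.class);
            }

            // enqueue children at the back; they are visited only after
            // every page at the current depth has left the front of the queue
            for (SimpleNodeIterator i = links.elements(); i.hasMoreNodes();) {
                Node node = i.nextNode();
                if (node instanceof LinkTag) {
                    String child = ((LinkTag) node).getLink();
                    if (child.startsWith("http")) // skip mailto: links and the like
                        queue.addLast(child);
                }
            }
        }
    }
}
===================================================

Because the queue is first-in first-out, all of the root page's links are fetched before any grandchild link, which is exactly the level-by-level ordering that steps 1-4 of the original post ask for.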