[Htmlparser-user] Help with a link extraction program
From: <Sri...@ba...> - 2008-05-20 07:13:52
Hi everyone,

I am a new user of the HTMLParser API, and even in this short time I have found the link extraction features very useful. I would like some help with a program I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to take an HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required, of course).

I have an HTML document that contains a hierarchy of links in the form of nested lists. I would like the link information given by LinkExtractor to reflect this hierarchy in some way. For example, I have a list of items in a <ul> tag. Each of these items may or may not contain its own sub-items with their own links, so that the HTML looks something like:

    <ul>
      <li> <a href="...."> Item 1 </a>
        <ul>
          <li> <a href="...."> Sub-Item 1 </a> </li>
          <li> <a href="...."> Sub-Item 2 </a> </li>
        </ul>
      <li> Item 2 </li>
    </ul>

I would like to know how I can parse a document full of lists like these and extract the links while keeping some indication of the hierarchy. Either of the following would do:

1. Output the "tree path" of each link. For instance, if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before the actual link path.
2. Output a page identical to the one I am parsing, but with the full path of the link printed beside each list item.

Thanks for all your help in this regard.

Warm Regards,
Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m)
sri...@ba...
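
One possible way to get the "tree path" output described in the message is to walk the markup while keeping a stack of the list items that are currently open, and to print the labels on that stack whenever a link closes. The sketch below illustrates only that bookkeeping. It is an assumption-laden example, not the HTMLParser API: the class name LinkPaths and the method extract are hypothetical, the tokenizer is a simple regular expression from the Java standard library, and it only understands plain <ul>, <li> and <a href="..."> markup with double-quoted attributes (no comments, scripts, or entities). The same stack idea should carry over to a walk of the node tree produced by a real parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of HTMLParser, standard library only.
class LinkPaths {
    // One open <li>: the <ul> nesting depth it sits at and its label text.
    private static final class Item {
        final int depth;
        String label = "";
        Item(int depth) { this.depth = depth; }
    }

    // Either a tag (groups 1-3: slash, name, attributes) or a text run (group 4).
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>|([^<]+)");
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Pops every open item that belongs to a list at the given depth or deeper.
    private static void closeItems(Deque<Item> open, int depth) {
        while (!open.isEmpty() && open.peek().depth >= depth) {
            open.pop();
        }
    }

    // Returns one "Label > Sub-label -> href" line per link, in document order.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Deque<Item> open = new ArrayDeque<>(); // innermost open <li> first
        int depth = 0;                         // current <ul> nesting depth
        String href = null;                    // non-null while inside <a href>

        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(4) != null) {
                // The first non-blank text directly inside an item is its label.
                String text = m.group(4).trim();
                Item top = open.peek();
                if (!text.isEmpty() && top != null
                        && top.depth == depth && top.label.isEmpty()) {
                    top.label = text;
                }
                continue;
            }
            boolean closing = !m.group(1).isEmpty();
            switch (m.group(2).toLowerCase()) {
                case "ul":
                    if (!closing) {
                        depth++;
                    } else if (depth > 0) {
                        closeItems(open, depth);
                        depth--;
                    }
                    break;
                case "li":
                    // A new <li> implicitly closes an unclosed sibling.
                    closeItems(open, depth);
                    if (!closing) {
                        open.push(new Item(depth));
                    }
                    break;
                case "a":
                    if (!closing) {
                        Matcher h = HREF.matcher(m.group(3));
                        href = h.find() ? h.group(1) : null;
                    } else if (href != null) {
                        // Join labels from the outermost item to the innermost.
                        List<String> labels = new ArrayList<>();
                        Iterator<Item> it = open.descendingIterator();
                        while (it.hasNext()) {
                            String label = it.next().label;
                            if (!label.isEmpty()) {
                                labels.add(label);
                            }
                        }
                        out.add(String.join(" > ", labels) + " -> " + href);
                        href = null;
                    }
                    break;
                default:
                    break; // every other tag is ignored in this sketch
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The list from the message, with made-up hrefs in place of "....".
        String html = "<ul> <li> <a href=\"a.html\"> Item 1 </a> <ul>"
                + " <li> <a href=\"b.html\"> Sub-Item 1 </a> </li>"
                + " <li> <a href=\"c.html\"> Sub-Item 2 </a> </li> </ul>"
                + " <li> Item 2 </li> </ul>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

Run against the list from the message (with a.html, b.html and c.html standing in for the elided hrefs), this prints "Item 1 -> a.html", "Item 1 > Sub-Item 1 -> b.html" and "Item 1 > Sub-Item 2 -> c.html". Item 2 produces no line because it has no link. Recording the list depth on each stack entry is what lets the sketch cope with the unclosed first <li> in the example.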