Thread: [Htmlparser-user] Help with a link extraction program

Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction
features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It
involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to take an HTML document
as input and write the links in that document to either the command
prompt or a text file (with suitable modifications where required, of
course). I have an HTML document in which there is a hierarchy of links
in the form of nested lists. I would like the link information output
by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items
may or may not contain its own sub-items with their own links, so that
the HTML looks something like:

<ul>
<li> <a href="...."> Item 1 </a>
	<ul> 
	<li> <a href="....">  Sub-Item 1 </a>  </li>
	<li> <a href="....">  Sub-Item 2 </a>  </li> 
	</ul>

<li> Item 2 </li>
</ul>

I would like to know how I can parse a document full of lists like these
and extract the links with some indication of the hierarchy. This could
be either the "tree path" of the link (i.e. if I extract the link
underlying Sub-Item 1 in my example, my text file should contain
something along the lines of "Item 1 > Sub-Item 1" before printing the
actual link path), or a page identical to the one I am parsing but with
the full path of the link printed beside each list item.
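One way to get the "tree path" output described above is to keep a stack of the list items that are currently open while walking the document. The following is only a minimal sketch of that idea, written with Python's standard html.parser module rather than the Java HTMLParser library, so nothing in it is HTMLParser API. The names ListLinkPaths, extract_link_paths and SAMPLE, and the href values in the sample, are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ListLinkPaths(HTMLParser):
    """Record a "tree path" for every link found inside nested lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <ul>/<ol> elements currently open
        self.items = []       # open <li> items as [depth, label] pairs
        self.in_link = False  # True while inside an <a> element
        self.href = None      # href of the <a> being read, if it has one
        self.link_text = []   # text chunks collected inside that <a>
        self.results = []     # (path, href) tuples in document order

    def _close_items(self, depth):
        # Drop items at this depth or deeper. This also handles <li> tags
        # that are never closed explicitly, like "Item 1" in the example.
        while self.items and self.items[-1][0] >= depth:
            self.items.pop()

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            self._close_items(self.depth)
            self.items.append([self.depth, ""])
        elif tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self._close_items(self.depth)
            self.depth = max(0, self.depth - 1)
        elif tag == "li":
            self._close_items(self.depth)
        elif tag == "a" and self.in_link:
            text = " ".join("".join(self.link_text).split())
            # The path is the label of every enclosing item, outermost first.
            path = [label for _, label in self.items if label]
            if not path or path[-1] != text:
                path.append(text)
            if self.href is not None:
                self.results.append((" > ".join(path), self.href))
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_text.append(data)
        text = " ".join(data.split())
        # Assumption: the first non-blank text in an item is its label.
        if text and self.items and not self.items[-1][1]:
            self.items[-1][1] = text


def extract_link_paths(html):
    """Return a list of (tree path, href) pairs for the given HTML text."""
    parser = ListLinkPaths()
    parser.feed(html)
    parser.close()
    return parser.results


# Same shape as the example list; the href values are placeholders.
SAMPLE = """
<ul>
<li> <a href="item1.html"> Item 1 </a>
    <ul>
    <li> <a href="sub1.html">  Sub-Item 1 </a>  </li>
    <li> <a href="sub2.html">  Sub-Item 2 </a>  </li>
    </ul>

<li> Item 2 </li>
</ul>
"""

for path, href in extract_link_paths(SAMPLE):
    print(f"{path}: {href}")
```

The sketch takes the first text found in a list item as that item's label, which is an assumption that suits lists shaped like the example but may not suit every page. The same stack-of-open-items approach should carry over to any parser that reports start tags, end tags and text in document order.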

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard,
Singapore -  238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

