Re: [Htmlparser-user] Help with a link extraction program
From: abdullah <abd...@id...> - 2008-05-20 12:37:28
You don't need a LinkExtractor so much as a list extractor. If all the links
are inside lists, you should get each list and navigate to its children, which
hold the links. For this case I suggest you parse the page with a filter, as
follows:
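import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.BulletList;
import org.htmlparser.util.NodeList;

Parser parser = new Parser(url); // url: address or file path of your page (placeholder)
// parse() throws ParserException, so call it where that is handled
NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
for (int i = 0; i < lists.size(); i++) {
    BulletList list = (BulletList) lists.elementAt(i); // elementAt returns a Node
    NodeList children = list.getChildren(); // another NodeList with the child tags
    // Do whatever you want with the links. Note that you need to cast each
    // child from Node to LinkTag; check with instanceof first, since I believe
    // the direct children are the <li> items and the links sit inside them.
}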
I didn't test this code, but hopefully it will work.
If you give me a specific example of the HTML page you want to parse, I may
be able to help more.
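
For the "Item 1 > Sub-Item 1" tree path you asked about, the idea is a
depth-first walk that carries the ancestors' path down with it. Here is a
minimal, library-independent sketch of that walk. The Item class, the
ListPathSketch class, the " : " separator and the href values are all made up
for illustration; with HTMLParser you would read the text, link and sub-items
from the parsed list nodes instead of building them by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one <li>: its text, its link, and its sub-items.
final class Item {
    final String text;
    final String href; // null when the item has no link
    final List<Item> children = new ArrayList<>();

    Item(String text, String href) {
        this.text = text;
        this.href = href;
    }

    // Adds a sub-item and returns this item so calls can be chained.
    Item add(Item child) {
        children.add(child);
        return this;
    }
}

final class ListPathSketch {
    // Depth-first walk; 'prefix' is the path of the ancestors so far.
    static void collect(List<Item> items, String prefix, List<String> out) {
        for (Item item : items) {
            String path = prefix.isEmpty() ? item.text : prefix + " > " + item.text;
            if (item.href != null) {
                out.add(path + " : " + item.href); // only items with a link are reported
            }
            collect(item.children, path, out); // sub-items inherit this item's path
        }
    }

    static List<String> paths(List<Item> topLevel) {
        List<String> out = new ArrayList<>();
        collect(topLevel, "", out);
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the example list in the question; hrefs are placeholders.
        Item item1 = new Item("Item 1", "item1.html")
                .add(new Item("Sub-Item 1", "sub1.html"))
                .add(new Item("Sub-Item 2", "sub2.html"));
        Item item2 = new Item("Item 2", null);
        for (String line : paths(Arrays.asList(item1, item2))) {
            System.out.println(line);
        }
    }
}
```

Passing the path down as a parameter means each link gets its full path
without any lookups back up the tree. Item 2 has no link, so it produces no
line of its own here.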
Good luck :)
On Tue, May 20, 2008 at 10:13 AM, <Sri...@ba...> wrote:
>
> Hi everyone,
>
> I am a new user of the HTMLParser API. I have found the link extraction
> features to be very useful even in this short space of time.
>
> I would like to seek help with a program that I have to write. It
> involves link extraction, but the logic is slightly more convoluted.
>
> Currently, I know how to use the LinkExtractor to supply a HTML document
> as input and output the links in that document to either the command
> prompt or a text file (with suitable modifications where required of
> course). I have a HTML document in which there is a hierarchy of links
> in the form of lists. I would like the output of the link information
> given by LinkExtractor to reflect this hierarchy in some way.
>
> For example, I have a list of items in a <ul> tag. Each of these items
> may/may not contain their own sub-items with their own links, so that
> the HTML looks something like:
>
> <ul>
>   <li> <a href="...."> Item 1 </a>
>     <ul>
>       <li> <a href="...."> Sub-Item 1 </a> </li>
>       <li> <a href="...."> Sub-Item 2 </a> </li>
>     </ul>
>
>   <li> Item 2 </li>
> </ul>
>
> I would like to know how I can parse a document full of lists like these
> and extract the links while having some indication of the hierarchy,
> either the "tree path" of the link (i.e. if I extract the link
> underlying Sub-Item 1 in my example, my text file should contain
> something along the lines of "Item 1 > Sub-Item 1" before printing the
> actual link path) or outputting a page identical to the one I am parsing
> but with the full path of the link printed beside each of those list
> items.
>
> Thanks for all your help in this regard.
>
> Warm Regards,
>
> Sridhar Venkataraman
> Summer Analyst, Global Technology (Asia-Pacific)
> Barclays Capital Services Ltd
> 60B Orchard Road #10-00, TheAtrium@Orchard,
> Singapore - 238891
> + (65) 6828 4609 (O)
> + (65) 9871 0076 (m) | sri...@ba...