Thread: [Htmlparser-user] Help with a link extraction program
From: <Sri...@ba...> - 2008-05-20 07:13:52
Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply an HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required, of course). I have an HTML document in which there is a hierarchy of links in the form of lists. I would like the output of the link information given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items may or may not contain its own sub-items with their own links, so that the HTML looks something like:

    <ul>
      <li> <a href="...."> Item 1 </a>
        <ul>
          <li> <a href="...."> Sub-Item 1 </a> </li>
          <li> <a href="...."> Sub-Item 2 </a> </li>
        </ul>

      <li> Item 2 </li>
    </ul>

I would like to know how I can parse a document full of lists like these and extract the links while having some indication of the hierarchy: either the "tree path" of the link (i.e. if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before printing the actual link path), or a page identical to the one I am parsing but with the full path of the link printed beside each of those list items.

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard,
Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

_______________________________________________

This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure.
If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group.

_______________________________________________
From: abdullah <abd...@id...> - 2008-05-20 12:37:28
You don't need a LinkExtractor, you need a "list extractor". If all the links are inside lists, you should get each list and navigate to its children, which is where the links are. For this case I suggest you parse the page with a filter, as follows:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.BulletList;
    import org.htmlparser.util.NodeList;

    Parser parser = new Parser("page.html"); // placeholder: your own file or URL
    NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
    for (int i = 0; i < lists.size(); i++) {
        BulletList list = (BulletList) lists.elementAt(i);
        NodeList children = list.getChildren(); // another NodeList with the child nodes
        // Do whatever you want with the links. Note that you need to cast each
        // child from Node to LinkTag; check the type first, because children
        // can also be text nodes or <li> (Bullet) tags that contain the links.
    }

I didn't test this code, but hopefully it will work. If you give me a specific example of the HTML page you want to parse, I may be able to help more.

Good luck :)
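For the "tree path" part of the question ("Item 1 > Sub-Item 1"), the bookkeeping can be sketched independently of any parser: keep one label per open <ul>, reset it at each <li>, and let the first link in an item name that item. The sketch below is only an illustration under stated assumptions. It scans tags with a plain regular expression from the Java standard library rather than HTMLParser, the names LinkPaths and extract are made up for this example, and the href values in main are invented stand-ins for the "...." in the original HTML. With HTMLParser, the same stack of labels could presumably be maintained while walking the list nodes instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of HTMLParser.
class LinkPaths {
    // Matches <ul ...>, </ul>, <li ...>, or a whole <a href="...">text</a> element.
    // Groups 1 (href) and 2 (link body) are set only for the <a> alternative.
    private static final Pattern TOKEN = Pattern.compile(
        "<ul\\b[^>]*>|</ul\\s*>|<li\\b[^>]*>"
        + "|<a\\b[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one "path -> href" line per link, in document order.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        // One entry per currently open <ul>: the label of the current item
        // at that nesting level, or "" if the item has no link yet.
        List<String> labels = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                String href = m.group(1);
                // Drop any markup inside the link and surrounding whitespace.
                String text = m.group(2).replaceAll("<[^>]*>", "").trim();
                int top = labels.size() - 1;
                // The first link in an item becomes that item's label.
                boolean namesItem = top >= 0 && labels.get(top).isEmpty();
                if (namesItem) {
                    labels.set(top, text);
                }
                StringBuilder path = new StringBuilder();
                for (String label : labels) {
                    if (label.isEmpty()) continue; // ancestor item without a link
                    if (path.length() > 0) path.append(" > ");
                    path.append(label);
                }
                // Later links in the same item, and links outside any list,
                // are appended to the path under their own text.
                if (!namesItem) {
                    if (path.length() > 0) path.append(" > ");
                    path.append(text);
                }
                out.add(path + " -> " + href);
                continue;
            }
            String tag = m.group().toLowerCase(Locale.ROOT);
            if (tag.startsWith("<ul")) {
                labels.add("");                        // entering a nested list
            } else if (tag.startsWith("</ul")) {
                if (!labels.isEmpty()) labels.remove(labels.size() - 1);
            } else if (!labels.isEmpty()) {
                // <li>: a new item starts, which also covers a missing </li>.
                labels.set(labels.size() - 1, "");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same shape as the HTML in the question; the hrefs are invented.
        String html = "<ul> <li> <a href=\"/one\"> Item 1 </a> <ul>"
            + " <li> <a href=\"/one/a\"> Sub-Item 1 </a> </li>"
            + " <li> <a href=\"/one/b\"> Sub-Item 2 </a> </li> </ul>"
            + " <li> Item 2 </li> </ul>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

Run against the sample in main, this prints "Item 1 -> /one", "Item 1 > Sub-Item 1 -> /one/a" and "Item 1 > Sub-Item 2 -> /one/b", one per line. Items without a link (such as "Item 2") are skipped, and they contribute nothing to the path of anything nested beneath them. A regular expression is fragile on real-world HTML, so treat this as a demonstration of the path tracking only.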