Re: [Htmlparser-user] Htmlparser-user Digest, Vol 23, Issue 3
From: <Sri...@ba...> - 2008-05-26 07:08:41
Hi Abdullah and everyone else,

Thank you for looking into my request for help. I have attached an example of the HTML file I want to parse using HTMLParser.

Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of htm...@li...
Sent: 22 May 2008 21:14
To: htm...@li...
Subject: Htmlparser-user Digest, Vol 23, Issue 3

Send Htmlparser-user mailing list submissions to
    htm...@li...

To subscribe or unsubscribe via the World Wide Web, visit
    https://lists.sourceforge.net/lists/listinfo/htmlparser-user
or, via email, send a message with subject or body 'help' to
    htm...@li...

You can reach the person managing the list at
    htm...@li...

When replying, please edit your Subject line so it is more specific than "Re: Contents of Htmlparser-user digest..."

Today's Topics:

   1. Help with a link extraction program (Sri...@ba...)
   2. Replacing attributes of DOCTYPE tag (?? ??)
   3. Re: Help with a link extraction program (abdullah)
   4. How to extract table without a nested table in it (answers solutions)
   5. Re: How to extract table without a nested table in it (Derrick Oswald)

----------------------------------------------------------------------

Message: 1
Date: Tue, 20 May 2008 15:13:39 +0800
From: <Sri...@ba...>
Subject: [Htmlparser-user] Help with a link extraction program
To: <htm...@li...>
Message-ID: <B89...@SG...RCA PINT.COM>
Content-Type: text/plain; charset="us-ascii"

Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply an HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required, of course). I have an HTML document in which there is a hierarchy of links in the form of lists. I would like the output of the link information given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items may or may not contain its own sub-items with their own links, so that the HTML looks something like:

    <ul>
      <li> <a href="...."> Item 1 </a>
        <ul>
          <li> <a href="...."> Sub-Item 1 </a> </li>
          <li> <a href="...."> Sub-Item 2 </a> </li>
        </ul>

      <li> Item 2 </li>
    </ul>

I would like to know how I can parse a document full of lists like these and extract the links while having some indication of the hierarchy: either the "tree path" of the link (i.e. if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before printing the actual link path), or a page identical to the one I am parsing but with the full path of the link printed beside each of those list items.

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

_______________________________________________

This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure. If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group.

_______________________________________________

------------------------------

Message: 2
Date: Tue, 20 May 2008 17:34:15 +0900
From: ?? ?? <nag...@by...>
Subject: [Htmlparser-user] Replacing attributes of DOCTYPE tag
To: htm...@li...
Message-ID: <483...@by...>
Content-Type: text/plain; charset=ISO-2022-JP

Dear All,

I am new to HTML Parser, and I don't understand well how to handle the !DOCTYPE tag.

In short, I'd like to replace a tag like this:

    <!DOCTYPE html PUBLIC "XXXX" "AAAA">

with:

    <!DOCTYPE html PUBLIC "YYYY" "BBBB">

I sat in my chair and went through a lot of trial and error, but it didn't work. I'd appreciate it if you could give me advice.

(My e-mail address has changed.)

------------------------------

Message: 3
Date: Tue, 20 May 2008 15:37:18 +0300
From: abdullah <abd...@id...>
Subject: Re: [Htmlparser-user] Help with a link extraction program
To: "htmlparser user list" <htm...@li...>
Message-ID: <17d...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

You don't need a LinkExtractor, you need a listExtractor. If all the links are inside lists, you should get each list and navigate to its children, which are the links.
For this case I suggest you parse the page with a filter, as follows:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.BulletList;
    import org.htmlparser.util.NodeList;

    Parser parser = new Parser();
    NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
    for (int i = 0; i < lists.size(); i++) {
        // elementAt returns a Node, so cast it to the list tag
        BulletList list = (BulletList) lists.elementAt(i);
        // this will give you another NodeList with the children tags
        NodeList links = list.getChildren();
        // do whatever you want with the links; note that each child is a
        // plain Node, so check its type and cast it (e.g. to LinkTag)
    }

I didn't test this code, but hopefully it will work. If you give me a specific example of the HTML page you want to parse, I may be able to help more.

Good luck : )

------------------------------

Message: 4
Date: Thu, 22 May 2008 18:06:00 +0530
From: "answers solutions" <fas...@gm...>
Subject: [Htmlparser-user] How to extract table without a nested table in it
To: htm...@li...
Message-ID: <992...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I would like to extract a table such that it does not have a nested table inside it.

    NodeFilter filtertable = new AndFilter(
        new HasParentFilter(new TagNameFilter("table")),
        new NotFilter(new HasChildFilter(new TagNameFilter("table"))));

Still, in the output I see a table with a nested table in it.

------------------------------

Message: 5
Date: Thu, 22 May 2008 06:14:19 -0700 (PDT)
From: Derrick Oswald <der...@ro...>
Subject: Re: [Htmlparser-user] How to extract table without a nested table in it
To: htmlparser user list <htm...@li...>
Message-ID: <423...@we...>
Content-Type: text/plain; charset="us-ascii"

You probably catch these because the inner tables are not direct children of the outer table. You need the HasChildFilter (NodeFilter filter, boolean recursive) constructor with recursive set to true.

------------------------------

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user

End of Htmlparser-user Digest, Vol 23, Issue 3
**********************************************
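
For the "tree path" output asked about in Message 1, the bookkeeping can be sketched with a stack that holds one label per open <ul>. The sketch below is only an illustration under stated assumptions: it does not use HTMLParser at all (a small regular expression stands in for the parser so that it runs on the JDK alone, and it is not a robust way to read real HTML), and the names LinkPaths and linkPaths and the "path -> href" output format are made up for this example. With HTMLParser, the same stack would presumably be driven by the list and link nodes instead of regex matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
final class LinkPaths {
    // Matches <ul>, </ul>, <li>, </li>, or a whole <a href="...">text</a>.
    private static final Pattern TOKEN = Pattern.compile(
        "<(/?)(ul|li)\\b[^>]*>|<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one "Item > Sub-Item -> href" line per link found. */
    static List<String> linkPaths(String html) {
        List<String> labels = new ArrayList<>(); // current item label per open <ul>
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) != null) {                 // a <ul> or <li> tag
                boolean closing = !m.group(1).isEmpty();
                boolean isList = m.group(2).equalsIgnoreCase("ul");
                if (isList && !closing) {
                    labels.add("");                   // entering a deeper level
                } else if (isList) {
                    if (!labels.isEmpty()) labels.remove(labels.size() - 1);
                } else if (!closing && !labels.isEmpty()) {
                    // A new <li> starts: forget the previous item's label.
                    // </li> is ignored because it is often left out.
                    labels.set(labels.size() - 1, "");
                }
            } else {                                  // an <a href> element
                String label = m.group(4).trim();
                if (!labels.isEmpty()) labels.set(labels.size() - 1, label);
                List<String> path = new ArrayList<>();
                for (String s : labels) {
                    if (!s.isEmpty()) path.add(s);
                }
                if (labels.isEmpty()) path.add(label); // link outside any list
                out.add(String.join(" > ", path) + " -> " + m.group(3));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<ul> <li> <a href=\"a.html\"> Item 1 </a> <ul>"
            + " <li> <a href=\"b.html\"> Sub-Item 1 </a> </li>"
            + " <li> <a href=\"c.html\"> Sub-Item 2 </a> </li> </ul>"
            + " <li> Item 2 </li> </ul>";
        for (String line : linkPaths(html)) {
            System.out.println(line);
        }
    }
}
```

The label of an item is taken from the first link inside it, which matches the example in Message 1; items without a link (such as "Item 2") contribute nothing to the path.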
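
The point of Message 5 (a nested table is usually a descendant of the outer table, via rows and cells, rather than a direct child) can be illustrated with a toy tree. This is a minimal sketch, not HTMLParser code: the Node class and the hasChild method are hypothetical stand-ins that only mimic the direct-versus-recursive distinction described in the reply. In HTMLParser itself, going by that reply, the recursive check would be requested through the HasChildFilter (NodeFilter filter, boolean recursive) constructor.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration, not part of HTMLParser.
final class NestedTableCheck {
    /** Stand-in for a parsed tag: a tag name plus its child nodes. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * True if a direct child matches the filter or, when recursive is set,
     * if any descendant at any depth matches it.
     */
    static boolean hasChild(Node node, Predicate<Node> filter, boolean recursive) {
        for (Node child : node.children) {
            if (filter.test(child)) {
                return true;
            }
            if (recursive && hasChild(child, filter, true)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Predicate<Node> isTable = n -> n.name.equals("table");

        // <table><tr><td><table><tr><td/></tr></table></td></tr></table>
        Node inner = new Node("table", new Node("tr", new Node("td")));
        Node outer = new Node("table", new Node("tr", new Node("td", inner)));

        // The only direct child of the outer table is a row, so the
        // non-recursive check does not see the nested table.
        System.out.println(hasChild(outer, isTable, false)); // prints false
        // The recursive check walks row -> cell -> nested table.
        System.out.println(hasChild(outer, isTable, true));  // prints true
    }
}
```

A "table without a nested table" test would then be the negation of the recursive check, which is what wrapping the recursive filter in a NotFilter expresses in the filter from Message 4.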