Re: [Htmlparser-user] Htmlparser-user Digest, Vol 23, Issue 3


Hi Abdullah and everyone else,

Thank you for looking into my request for help. I have attached an
example of the HTML file I want to parse using HTMLParser.

Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard,
Singapore -  238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
htm...@li...
Sent: 22 May 2008 21:14
To: htm...@li...
Subject: Htmlparser-user Digest, Vol 23, Issue 3

Today's Topics:

   1. Help with a link extraction program
      (Sri...@ba...)
   2. Replacing attributes of DOCTYPE tag (?? ??)
   3. Re: Help with a link extraction program (abdullah)
   4. How to extract table without a nested table in it
      (answers solutions)
   5. Re: How to extract table without a nested table	in it
      (Derrick Oswald)

----------------------------------------------------------------------

Message: 1
Date: Tue, 20 May 2008 15:13:39 +0800
From: <Sri...@ba...>
Subject: [Htmlparser-user] Help with a link extraction program
To: <htm...@li...>
Message-ID: <B89...@SG...PINT.COM>
Content-Type: text/plain; charset="us-ascii"

Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction
features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It
involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply an HTML document
as input and output the links in that document to either the command
prompt or a text file (with suitable modifications where required, of
course). I have an HTML document in which there is a hierarchy of links
in the form of lists. I would like the output of the link information
given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items
may or may not contain its own sub-items with their own links, so that
the HTML looks something like:

<ul>
<li> <a href="...."> Item 1 </a>
	<ul>
	<li> <a href="....">  Sub-Item 1 </a>  </li>
	<li> <a href="....">  Sub-Item 2 </a>  </li>
	</ul>
</li>
<li> Item 2 </li>
</ul>

I would like to know how I can parse a document full of lists like these
and extract the links while having some indication of the hierarchy,
either the "tree path" of the link (i.e. if I extract the link
underlying Sub-Item 1 in my example, my text file should contain
something along the lines of "Item 1 > Sub-Item 1" before printing the
actual link path) or outputting a page identical to the one I am parsing
but with the full path of the link printed beside each of those list
items.

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific) Barclays Capital
Services Ltd 60B Orchard Road #10-00, TheAtrium@Orchard, Singapore -
238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

_______________________________________________

This e-mail may contain information that is confidential, privileged or
otherwise protected from disclosure. If you are not an intended
recipient of this e-mail, do not duplicate or redistribute it by any
means. Please delete it and any attachments and notify the sender that
you have received it in error. Unless specifically indicated, this
e-mail is not an offer to buy or sell or a solicitation to buy or sell
any securities, investment products or other financial product or
service, an official confirmation of any transaction, or an official
statement of Barclays. Any views or opinions presented are solely those
of the author and do not necessarily represent those of Barclays. This
e-mail is subject to terms available at the following link:
www.barcap.com/emaildisclaimer. By messaging with Barclays you consent
to the foregoing.  Barclays Capital is the investment banking division
of Barclays Bank PLC, a company registered in England (number 1026167)
with its registered office at 1 Churchill Place, London, E14 5HP.
This email may relate to or be sent from other members of the Barclays
Group.
_______________________________________________

------------------------------

Message: 2
Date: Tue, 20 May 2008 17:34:15 +0900
From: ?? ?? <nag...@by...>
Subject: [Htmlparser-user] Replacing attributes of DOCTYPE tag
To: htm...@li...
Message-ID: <483...@by...>
Content-Type: text/plain; charset=ISO-2022-JP

Dear All,

I am new to HTML Parser, and I don't understand well how to handle the
!DOCTYPE tag.

In short, I'd like to replace a tag like this:
<!DOCTYPE html PUBLIC "XXXX" "AAAA">

with:
<!DOCTYPE html PUBLIC "YYYY" "BBBB">

I went through a lot of trial and error, but it didn't work.
I'd appreciate it if you could give me advice.

(My e-mail address has changed.)
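
For illustration, the replacement asked about above can be sketched with plain string handling, independent of HTMLParser. This is only a minimal sketch under stated assumptions: the class and method names (DoctypeRewriter, replaceDoctype) are hypothetical, it uses java.util.regex rather than any HTMLParser API, and it only handles the PUBLIC form with two quoted identifiers shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
class DoctypeRewriter {
    // Matches <!DOCTYPE name PUBLIC "public-id" "system-id">, case-insensitively.
    private static final Pattern DOCTYPE = Pattern.compile(
        "<!DOCTYPE\\s+(\\S+)\\s+PUBLIC\\s+\"[^\"]*\"\\s+\"[^\"]*\"\\s*>",
        Pattern.CASE_INSENSITIVE);

    /** Returns html with the identifiers of the first PUBLIC DOCTYPE replaced. */
    public static String replaceDoctype(String html, String publicId, String systemId) {
        Matcher m = DOCTYPE.matcher(html);
        if (!m.find()) {
            return html; // no matching declaration: leave the input untouched
        }
        // Rebuild the declaration, keeping the original root element name.
        String replacement = "<!DOCTYPE " + m.group(1)
            + " PUBLIC \"" + publicId + "\" \"" + systemId + "\">";
        return html.substring(0, m.start()) + replacement + html.substring(m.end());
    }

    public static void main(String[] args) {
        String in = "<!DOCTYPE html PUBLIC \"XXXX\" \"AAAA\">\n<html></html>";
        System.out.println(replaceDoctype(in, "YYYY", "BBBB"));
    }
}
```

The rest of the page is copied through unchanged, so this could be applied to the raw text before or after any other processing.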

------------------------------

Message: 3
Date: Tue, 20 May 2008 15:37:18 +0300
From: abdullah <abd...@id...>
Subject: Re: [Htmlparser-user] Help with a link extraction program
To: "htmlparser user list" <htm...@li...>
Message-ID:
	<17d...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

You don't need a LinkExtractor, you need a list extractor. If all the
links are inside lists, you should get each list and navigate to its
children, which contain the links. For this case I suggest you parse the
page with a filter, as follows:

  // needs org.htmlparser.Parser, org.htmlparser.util.NodeList,
  // org.htmlparser.filters.NodeClassFilter, org.htmlparser.tags.BulletList
  // resource is a placeholder for the URL or file name of your page
  Parser parser = new Parser(resource);
  NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
  for (int i = 0; i < lists.size(); i++) {
    BulletList list = (BulletList) lists.elementAt(i);
    NodeList children = list.getChildren(); // another NodeList with the child tags
    // do whatever you want with the children; note that each child is a
    // Node, so check its type and cast it before use (the list items come
    // first, and the LinkTag sits inside each item)
  }

I didn't test this code, but hopefully it will work. If you give me a
specific example of the HTML page you want to parse, I may be able to
help more.
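
To show how the "Item 1 > Sub-Item 1" paths from the original question could be built once the nested lists have been read, here is a minimal sketch in plain Java. It deliberately does not use HTMLParser: the Item class is a hypothetical stand-in for a parsed list item, and the href values are made-up placeholders. The same depth-first walk could presumably be applied to the children of each list node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not an HTMLParser class.
class LinkPaths {
    /** Stand-in for one <li>: a label, an optional href, and nested sub-items. */
    static final class Item {
        final String label;
        final String href; // null when the item has no link
        final List<Item> children = new ArrayList<>();

        Item(String label, String href, Item... kids) {
            this.label = label;
            this.href = href;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Depth-first walk that emits "A > B: href" for every item that has a link. */
    static List<String> linkPaths(List<Item> items) {
        List<String> out = new ArrayList<>();
        for (Item item : items) {
            collect(item, "", out);
        }
        return out;
    }

    private static void collect(Item item, String prefix, List<String> out) {
        // Extend the breadcrumb with this item's label before descending.
        String path = prefix.isEmpty() ? item.label : prefix + " > " + item.label;
        if (item.href != null) {
            out.add(path + ": " + item.href);
        }
        for (Item child : item.children) {
            collect(child, path, out);
        }
    }

    public static void main(String[] args) {
        List<Item> top = Arrays.asList(
            new Item("Item 1", "one.html",
                new Item("Sub-Item 1", "sub1.html"),
                new Item("Sub-Item 2", "sub2.html")),
            new Item("Item 2", null));
        for (String line : linkPaths(top)) {
            System.out.println(line);
        }
    }
}
```

Passing the path down as a parameter means each recursive call sees only its own ancestors, so sibling items never leak into one another's breadcrumb.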

good luck : )


------------------------------

Message: 4
Date: Thu, 22 May 2008 18:06:00 +0530
From: "answers solutions" <fas...@gm...>
Subject: [Htmlparser-user] How to extract table without a nested table
	in it
To: htm...@li...
Message-ID:
	<992...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I would like to extract a table that does not have a nested table
inside it.

NodeFilter filtertable = new AndFilter(
    new HasParentFilter(new TagNameFilter("table")),
    new NotFilter(new HasChildFilter(new TagNameFilter("table"))));

In the output I still see a table with a nested table in it.

------------------------------

Message: 5
Date: Thu, 22 May 2008 06:14:19 -0700 (PDT)
From: Derrick Oswald <der...@ro...>
Subject: Re: [Htmlparser-user] How to extract table without a nested
	table	in it
To: htmlparser user list <htm...@li...>
Message-ID: <423...@we...>
Content-Type: text/plain; charset="us-ascii"

You probably catch these because the inner tables are not direct
children of the outer table.
You need the HasChildFilter (NodeFilter filter, boolean recursive)
constructor with recursive set to true.
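
The difference between a direct-child test and a recursive one can be illustrated with a small stand-alone sketch. It does not use HTMLParser: the Tag class and the hasChild method are hypothetical stand-ins that only mimic the behaviour described above, where the inner table sits under tr and td rather than directly under the outer table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not an HTMLParser class.
class NestedTableCheck {
    /** Minimal stand-in for a parsed tag with child tags. */
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, Tag... kids) {
            this.name = name;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** True if tag has a child called name; with recursive, any descendant counts. */
    static boolean hasChild(Tag tag, String name, boolean recursive) {
        for (Tag child : tag.children) {
            if (child.name.equals(name)) {
                return true;
            }
            if (recursive && hasChild(child, name, true)) {
                return true;
            }
        }
        return false;
    }

    /** Builds table > tr > td > table, the usual shape of a nested table. */
    static Tag sampleOuterTable() {
        return new Tag("table", new Tag("tr", new Tag("td", new Tag("table"))));
    }

    public static void main(String[] args) {
        Tag outer = sampleOuterTable();
        // Direct children only: the inner table is missed, so the outer table slips through.
        System.out.println(hasChild(outer, "table", false));
        // Recursive: the inner table is found, so the outer table can be excluded.
        System.out.println(hasChild(outer, "table", true));
    }
}
```

In the filter from the question, this corresponds to passing true as the second argument of the HasChildFilter constructor named above.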


------------------------------


_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user

End of Htmlparser-user Digest, Vol 23, Issue 3
**********************************************
