[Htmlparser-user] How to extract more than one tag with only one parse?
From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56
Hi all,
I have run into a difficulty while using the htmlparser library. An HTML
page contains many tags, such as title, div, input, span and so on. For
example:
<title>this is a test </title>
//...... any other tags
<div class="A">
<span class="B"><a href=" www.google.com ">google</a></span>
</div>
//...... any other tags
<div class="C">
<div class="D"><input type="text" id="E" value="msn" /></div>
</div>
//...... any other tags
<div class="C">
<div class="E"><span class="B"><input type="text" id="E" value="aol" />
<a href=" www.live.com ">live</a></span></div>
</div>
In this example the full HTML may contain many more tags. To get the
content 'this is a test', I can use a TagNameFilter, but that means parsing
the whole HTML. To get 'google' or 'www.google.com', I have to parse the
whole HTML a second time, and to get 'msn', 'aol' and 'live' I would have to
parse it several more times. This gives me the content I need, but I expect
it to hurt performance. Is there another way to do this? I could also use an
OrFilter to get all the nodes in one pass, but then how can I tell which tag
each piece of text came from? I want to store the values in a database, and
I don't know how to do that while parsing the HTML only once (for the best
performance). I would appreciate your help. :-)
Thanks and Best Regards
Jesse
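
The single-pass idea in the question above can be pictured with a rough
sketch. It does not use htmlparser at all: it substitutes the JDK's SAX
parser, and so it assumes the page is well-formed XHTML, which real pages
often are not. The names SinglePassExtractor, Collector and the
"tag.field=value" label format are hypothetical, chosen only for this
illustration. The point is that each value is labelled with the tag it came
from at the moment it is seen, so one traversal is enough. htmlparser's own
NodeVisitor appears to offer a comparable callback style, but that is not
shown or verified here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: JDK SAX stands in for htmlparser; input must be well-formed XHTML.
class SinglePassExtractor {

    /** Records "tag.field=value" entries while the document is read once. */
    static class Collector extends DefaultHandler {
        final List<String> found = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // begin collecting the text of this element
            if (qName.equals("a")) {
                addAttribute("a", "href", atts);
            } else if (qName.equals("input")) {
                addAttribute("input", "value", atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Only leaf elements whose text we care about are recorded.
            if (qName.equals("title") || qName.equals("a")) {
                found.add(qName + ".text=" + text.toString().trim());
            }
            text.setLength(0);
        }

        private void addAttribute(String tag, String name, Attributes atts) {
            String value = atts.getValue(name);
            if (value != null) {
                found.add(tag + "." + name + "=" + value.trim());
            }
        }
    }

    /** Parses the input once and returns every labelled value in document order. */
    static List<String> extract(String xhtml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return collector.found;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>this is a test </title></head><body>"
                + "<div class=\"A\"><span class=\"B\">"
                + "<a href=\" www.google.com \">google</a></span></div>"
                + "<div class=\"C\"><div class=\"D\">"
                + "<input type=\"text\" id=\"E\" value=\"msn\" /></div></div>"
                + "</body></html>";
        for (String entry : extract(page)) {
            System.out.println(entry);
        }
    }
}
```

Because every entry carries its label, the list could be split by prefix
(title.text, a.href, input.value) before being written to a database,
without re-reading the page.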