Thread: [Htmlparser-user] How to extract more than one tag by parsing only once?
From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56
Hi All,

I've run into a difficulty while using the htmlparser library. An HTML page contains many tags, such as title, div, input, span and so on. For example:

    <title>this is a test </title>

    //...... any other tags

    <div class="A">
      <span class="B"><a href=" www.google.com ">google</a></span>
    </div>

    //...... any other tags

    <div class="C">
      <div class="D"><input type="text" id="E" value="msn" /></div>
    </div>

    //...... any other tags

    <div class="C">
      <div class="E"><span class="B"><input type="text" id="E" value="aol" /><a href=" www.live.com ">live</a></span></div>
    </div>

In this example the whole page may include many tags. To get the content 'this is a test' I can use a TagNameFilter, but I have to parse the whole HTML. To get 'google' or 'www.google.com' I have to parse the whole HTML a second time, and to get 'msn', 'aol' and 'live' I have to parse it several more times. This gets me the content I need, but it may hurt performance.

Is there another way to do this? I could also use an OrFilter to get the nodes, but then how can I tell which tag each matched text came from? I want to store the values in a DB, and I have no idea how to do that while parsing the HTML only once (for the best performance).

I would appreciate your help. :-)

Thanks and Best Regards

Jesse
From: Ian M. <ian...@gm...> - 2006-08-04 10:42:24
As long as you keep the original reference to the NodeList created by Parser.parse, and you haven't modified that NodeList, you should be able to reuse it, I think.

Ian
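Ian's point, parse once, keep the result, and query it as often as needed, can be illustrated without the library. The following is only a sketch: `Node`, `findAll` and `parseOnce` are hypothetical stand-ins invented for this example, not htmlparser classes, and the tree is built by hand for a cut-down version of Jesse's page instead of coming from a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class ReuseParsedTree {
    // Hypothetical stand-in for a parsed HTML node; not an htmlparser class.
    static final class Node {
        final String tag;
        final Map<String, String> attrs;
        final String text;
        final List<Node> children;

        Node(String tag, Map<String, String> attrs, String text, Node... children) {
            this.tag = tag;
            this.attrs = attrs;
            this.text = text;
            this.children = List.of(children);
        }
    }

    // Returns a new list of matching nodes; the tree itself is never modified,
    // so it stays safe to query again.
    static List<Node> findAll(Node root, Predicate<Node> filter) {
        List<Node> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Node node, Predicate<Node> filter, List<Node> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (Node child : node.children) {
            collect(child, filter, out);
        }
    }

    // Stands in for the single, expensive call to the parser.
    static Node parseOnce() {
        Node title = new Node("title", Map.of(), "this is a test");
        Node link = new Node("a", Map.of("href", "www.google.com"), "google");
        Node span = new Node("span", Map.of("class", "B"), "", link);
        Node divA = new Node("div", Map.of("class", "A"), "", span);
        Node input = new Node("input", Map.of("type", "text", "id", "E", "value", "msn"), "");
        Node divD = new Node("div", Map.of("class", "D"), "", input);
        Node divC = new Node("div", Map.of("class", "C"), "", divD);
        return new Node("html", Map.of(), "", title, divA, divC);
    }

    public static void main(String[] args) {
        Node page = parseOnce(); // parse exactly once and keep the reference

        // Each query walks the in-memory tree; none of them parses the HTML again.
        String title = findAll(page, n -> n.tag.equals("title")).get(0).text;
        String href = findAll(page, n -> n.tag.equals("a")).get(0).attrs.get("href");
        String value = findAll(page, n -> n.tag.equals("input")).get(0).attrs.get("value");

        System.out.println(title + " | " + href + " | " + value);
    }
}
```

The queries still walk the tree once each, but that is cheap compared with fetching and re-parsing the page for every tag.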
From: Derrick O. <Der...@Ro...> - 2006-08-04 11:42:35
Jesse,

From your example, you can also get all the div tags at once and filter on class in a secondary pass:

    // true = recurse, so nested div tags are collected as well
    NodeList divs = nodelist.extractAllNodesThatMatch (new TagNameFilter ("DIV"), true);
    // each lookup presumes there is exactly one div with that class
    Div div_a = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "A")).elementAt (0);
    Div div_d = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "D")).elementAt (0);

and this may be faster than searching the entire page each time.

Derrick
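On Jesse's other question, how to tell which tag a match came from when several filters are combined, one option is to give every filter a name and file each hit under that name during a single walk of the tree. The sketch below is library-independent and makes several assumptions: `Node`, `Rule`, `extract`, `samplePage` and `sampleRules` are hypothetical names invented here, not part of htmlparser, and the tree is built by hand from the page in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

class SinglePassExtractor {
    // Hypothetical stand-in for a parsed HTML node; not an htmlparser class.
    static final class Node {
        final String tag;
        final Map<String, String> attrs;
        final String text;
        final List<Node> children;

        Node(String tag, Map<String, String> attrs, String text, Node... children) {
            this.tag = tag;
            this.attrs = attrs;
            this.text = text;
            this.children = List.of(children);
        }
    }

    // A named rule: which nodes it matches and which value to take from them.
    static final class Rule {
        final String name;
        final Predicate<Node> matches;
        final Function<Node, String> value;

        Rule(String name, Predicate<Node> matches, Function<Node, String> value) {
            this.name = name;
            this.matches = matches;
            this.value = value;
        }
    }

    // Walks the tree once, in document order, and groups hits by rule name.
    static Map<String, List<String>> extract(Node root, List<Rule> rules) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        for (Rule rule : rules) {
            found.put(rule.name, new ArrayList<>());
        }
        walk(root, rules, found);
        return found;
    }

    private static void walk(Node node, List<Rule> rules, Map<String, List<String>> found) {
        for (Rule rule : rules) {
            if (rule.matches.test(node)) {
                found.get(rule.name).add(rule.value.apply(node));
            }
        }
        for (Node child : node.children) {
            walk(child, rules, found);
        }
    }

    // Hand-built tree for the page in the question (href values trimmed).
    static Node samplePage() {
        Node google = new Node("a", Map.of("href", "www.google.com"), "google");
        Node divA = new Node("div", Map.of("class", "A"), "",
                new Node("span", Map.of("class", "B"), "", google));
        Node msn = new Node("input", Map.of("type", "text", "id", "E", "value", "msn"), "");
        Node firstC = new Node("div", Map.of("class", "C"), "",
                new Node("div", Map.of("class", "D"), "", msn));
        Node aol = new Node("input", Map.of("type", "text", "id", "E", "value", "aol"), "");
        Node live = new Node("a", Map.of("href", "www.live.com"), "live");
        Node secondC = new Node("div", Map.of("class", "C"), "",
                new Node("div", Map.of("class", "E"), "",
                        new Node("span", Map.of("class", "B"), "", aol, live)));
        Node title = new Node("title", Map.of(), "this is a test");
        return new Node("html", Map.of(), "", title, divA, firstC, secondC);
    }

    static List<Rule> sampleRules() {
        return List.of(
                new Rule("title", n -> n.tag.equals("title"), n -> n.text),
                new Rule("link_text", n -> n.tag.equals("a"), n -> n.text),
                new Rule("link_href", n -> n.tag.equals("a"), n -> n.attrs.get("href")),
                new Rule("input_value", n -> n.tag.equals("input"), n -> n.attrs.get("value")));
    }

    public static void main(String[] args) {
        // One traversal yields every value, already labelled by rule name.
        System.out.println(extract(samplePage(), sampleRules()));
    }
}
```

Because every value arrives under its rule name, the resulting map can be written to a database column by column without a second look at the page.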