Thread: [Htmlparser-user] Could you help me?
From: h p. <hp...@gm...> - 2006-07-31 03:35:57
Hi all,

I have a question about parsing HTML content. The content contains many tags. If I want the text of a single kind of tag, such as LinkTag or TableTag, it is easy to use LinkRegexFilter or TagNameFilter. But if I want the content of more than one kind of tag, is there a filter chain? Maybe the following example will explain what I mean:

<div id="video_infobox_con">
·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
·Label:
<a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1" class="lnk_04" target=_self><u>test_a</u></a>
<a href="search.do?q=%D7%B4%D4%AA%D0%E3" class="lnk_04" target=_self><u>test_b</u></a>
<a href="search.do?q=%C0%BA%C7%F2" class="lnk_04" target=_self><u>test_c</u></a>
<a href="search.do?q=%CC%E5%D3%FD" class="lnk_04" target=_self><u>test_d</u></a>
</div>
<input type="text" id="htmlurl" name="htmlurl" value='value_test' />

There are four kinds of tags here (div, span, a and input), and I need the content of all of them: 2006.07.27 - 01:22, test_a, test_b, test_c, test_d and value_test.

What should I do? I could parse the HTML four times to get the content of the four tags, but I think that would hurt performance. Could you help me?

Thank you very much.

Best Regards
Jesse
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:47:16
Jesse,

The job breaks down into two tasks:
 1) get the outermost tag (your <div id="video_infobox_con"> tag) using a filter you construct.
 2) use a StringBean as a visitor on that node and its children to extract the text, like so:

Parser parser = new Parser ("http://yadda.yadda");
NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
Div div = (Div)list.elementAt (0); // elementAt() returns a Node, so cast it
// now re-create the HTML and pass it into another Parser
// Note: for older versions you need to use setInputHtml()
Parser divParser = new Parser (div.toHtml ());
StringBean bean = new StringBean ();
divParser.visitAllNodesWith (bean);
System.out.println (bean.getStrings ());

Derrick
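On the "filter you construct" step and the original question about a filter chain: the idea is to combine simple tests with AND and OR so that one pass over the tags picks up everything of interest. As far as I know, HTMLParser's org.htmlparser.filters package provides AndFilter, OrFilter, TagNameFilter and HasAttributeFilter for this. The sketch below deliberately does not use the library, so that it runs with a plain JDK; the Tag class and the named(), hasAttribute(), select() and sampleTags() helpers are hypothetical stand-ins, not HTMLParser API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Library-free sketch of filter composition. Tag is a hypothetical
// stand-in for a parsed tag; it is NOT an HTMLParser class.
class FilterChainSketch {

    static final class Tag {
        final String name;
        final Map<String, String> attributes;

        Tag(String name, String... keyValuePairs) {
            this.name = name;
            this.attributes = new HashMap<>();
            for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
                attributes.put(keyValuePairs[i], keyValuePairs[i + 1]);
            }
        }
    }

    // Matches tags by name, ignoring case.
    static Predicate<Tag> named(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    // Matches tags that carry the given attribute with the given value.
    static Predicate<Tag> hasAttribute(String key, String value) {
        return tag -> value.equals(tag.attributes.get(key));
    }

    // One pass over the tags, keeping those the (possibly combined) filter accepts.
    static List<Tag> select(List<Tag> tags, Predicate<Tag> filter) {
        List<Tag> matches = new ArrayList<>();
        for (Tag tag : tags) {
            if (filter.test(tag)) {
                matches.add(tag);
            }
        }
        return matches;
    }

    // The tags from the question, flattened into a list for illustration.
    static List<Tag> sampleTags() {
        List<Tag> tags = new ArrayList<>();
        tags.add(new Tag("div", "id", "video_infobox_con"));
        tags.add(new Tag("span", "class", "fcolor_03"));
        for (int i = 0; i < 4; i++) {
            tags.add(new Tag("a", "class", "lnk_04"));
        }
        tags.add(new Tag("input", "id", "htmlurl", "value", "value_test"));
        return tags;
    }

    public static void main(String[] args) {
        List<Tag> tags = sampleTags();

        // AND: the div with a specific id (the "outermost tag" filter).
        Predicate<Tag> infoBox = named("div").and(hasAttribute("id", "video_infobox_con"));
        System.out.println(select(tags, infoBox).size());

        // OR: span, a and input tags in a single pass instead of three parses.
        Predicate<Tag> wanted = named("span").or(named("a")).or(named("input"));
        System.out.println(select(tags, wanted).size());
    }
}
```

With the sample list, the AND filter matches the one div and the OR filter matches six tags (one span, four a, one input), all in one traversal each.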
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:52:07
Sorry, replied without thinking. You can apply the StringBean directly to a node list, so there is no need to re-create the HTML and parse it a second time:

Parser parser = new Parser ("http://yadda.yadda");
NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
Div div = (Div)list.elementAt (0); // elementAt() returns a Node, so cast it
StringBean bean = new StringBean ();
div.getChildren ().visitAllNodesWith (bean);
System.out.println (bean.getStrings ());

Derrick
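The point of visiting the children directly is that the tree is already in memory, so a visitor can walk it and collect text without a second parse. Here is a minimal, library-free sketch of that pattern. The Node, Text, Element, Visitor and TextCollector types and the sampleDiv() helper are hypothetical and only mimic the roles of HTMLParser's nodes and StringBean; they are not the library's API, and the bullet characters from the sample markup are left out to keep the source ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free sketch of "visit the children of an already-parsed node".
// All types here are hypothetical stand-ins, NOT HTMLParser classes.
class ChildVisitorSketch {

    interface Visitor {
        void visitText(String text);
    }

    static abstract class Node {
        abstract void accept(Visitor visitor);
    }

    static final class Text extends Node {
        final String value;

        Text(String value) {
            this.value = value;
        }

        @Override
        void accept(Visitor visitor) {
            visitor.visitText(value);
        }
    }

    static final class Element extends Node {
        final String name;
        final List<Node> children;

        Element(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        @Override
        void accept(Visitor visitor) {
            // An element has no text of its own; pass the visitor down.
            for (Node child : children) {
                child.accept(visitor);
            }
        }
    }

    // Plays the role StringBean plays in the thread: gathers non-blank text.
    static final class TextCollector implements Visitor {
        final List<String> strings = new ArrayList<>();

        @Override
        public void visitText(String text) {
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                strings.add(trimmed);
            }
        }
    }

    // Walks the children of root in document order; nothing is re-parsed.
    static List<String> textOf(Element root) {
        TextCollector collector = new TextCollector();
        for (Node child : root.children) {
            child.accept(collector);
        }
        return collector.strings;
    }

    static Element link(String label) {
        return new Element("a", new Element("u", new Text(label)));
    }

    // Hand-built tree shaped like the div from the question.
    static Element sampleDiv() {
        return new Element("div",
                new Text("add by:"),
                new Element("span", new Text("2006.07.27 - 01:22")),
                new Element("br"),
                new Text(" Label: "),
                link("test_a"), link("test_b"), link("test_c"), link("test_d"));
    }

    public static void main(String[] args) {
        System.out.println(textOf(sampleDiv()));
    }
}
```

Note that a text-collecting visitor like this only sees text nodes. The value_test string in the question lives in an attribute of the input tag, outside the div, so it would still have to be read from that tag separately.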