Re: [Htmlparser-user] Could you help me?
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:52:07
Sorry, replied without thinking.
You can apply the StringBean directly to a node list:
Parser parser = new Parser ("http://yadda.yadda");
NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
Div div = (Div) list.elementAt (0); // elementAt() returns a Node, so a cast is needed
StringBean bean = new StringBean ();
div.getChildren ().visitAllNodesWith (bean);
System.out.println (bean.getStrings ());
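The filter in the snippet above is left as a placeholder. A minimal sketch of one way to build it, assuming the HTML Parser jar is on the classpath and the div can be identified by its id attribute, might look like this. The class name DivTextExample, the method extractDivText and the use of Parser.createParser() on an in-memory string are illustrative choices, not part of the original reply:

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, for illustration only.
class DivTextExample
{
    // Returns the text inside the first <div> whose id equals divId,
    // or an empty string if no such div (or no text) is found.
    static String extractDivText (String html, String divId) throws ParserException
    {
        // Accept only nodes that are <div> tags AND carry the wanted id.
        NodeFilter filter = new AndFilter (
            new TagNameFilter ("div"),
            new HasAttributeFilter ("id", divId));

        // Assumption: parse an in-memory string instead of a URL.
        Parser parser = Parser.createParser (html, "UTF-8");
        NodeList list = parser.parse (filter);
        if (0 == list.size ())
            return ("");

        Div div = (Div) list.elementAt (0);
        NodeList children = div.getChildren ();
        if (null == children)
            return (""); // the div has no child nodes

        // StringBean is a NodeVisitor, so it can walk the children directly.
        StringBean bean = new StringBean ();
        children.visitAllNodesWith (bean);
        String text = bean.getStrings ();
        return ((null == text) ? "" : text);
    }

    public static void main (String[] args) throws ParserException
    {
        String html = "<html><body><div id=\"video_infobox_con\">"
            + "add by:<span class=\"fcolor_03\">2006.07.27 - 01:22</span><br />"
            + "<a href=\"search.do?q=x\" class=\"lnk_04\"><u>test_a</u></a>"
            + "</div></body></html>";
        System.out.println (extractDivText (html, "video_infobox_con"));
    }
}
```

Note that StringBean collects text nodes only, so an attribute value such as the value of the <input> tag would need to be read separately from that tag.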
Derrick
Derrick Oswald wrote:
>Jesse,
>
>The job breaks down into two tasks:
> 1) get the outermost tag (your <div id="video_infobox_con"> tag) using
>a filter you construct.
> 2) use a StringBean as a visitor on that node and its children to
>extract the text, like so:
>
>Parser parser = new Parser ("http://yadda.yadda");
>NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
>Div div = (Div) list.elementAt (0);
>// now re-create the HTML and pass it into another Parser
>// Note: for older versions you need to use setInputHtml()
>Parser divParser = new Parser (div.toHtml ());
>StringBean bean = new StringBean ();
>divParser.visitAllNodesWith (bean);
>System.out.println (bean.getStrings ());
>
>Derrick
>
>h pq wrote:
>
>
>
>>Hi all, I have a question about parsing HTML content. The content
>>contains many tags. If I want the text of a single tag type, such as
>>LinkTag or TableTag, it's easy to use LinkRegexFilter or
>>TagNameFilter. But if I want the content of more than one tag, is
>>there a filter chain? Maybe the following example will explain what
>>I mean:
>>
>> <div id="video_infobox_con">
>> ·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
>> ·Label:
>> <a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1"
>>class="lnk_04" target=_self><u>test_a</u></a>
>>
>> <a href="search.do?q=%D7%B4%D4%AA%D0%E3"
>>class="lnk_04" target=_self><u>test_b</u></a>
>>
>> <a href="search.do?q=%C0%BA%C7%F2" class="lnk_04"
>>target=_self><u>test_c</u></a>
>>
>> <a href="search.do?q=%CC%E5%D3%FD" class="lnk_04"
>>target=_self><u>test_d</u></a>
>>
>> </div>
>><input type="text" id="htmlurl" name="htmlurl" value='value_test' />
>>
>>There are four tags (div, span, a, input), and I need all of the
>>content in them: 2006.07.27 - 01:22, test_a, test_b,
>>test_c, test_d and value_test.
>>What should I do? I could parse the HTML four times to get the
>>four tags' content, but I think that would hurt performance. Could
>>you help me? Thank you very much.
>>
>>Best Regards
>>Jesse
>>
>>
>>------------------------------------------------------------------------
>>
>>
>>_______________________________________________
>>Htmlparser-user mailing list
>>Htm...@li...
>>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>>
>>
>>
>>
>
>