Thread: [Htmlparser-user] StringBean: Removing unwanted links
From: Riaz u. <ru...@ya...> - 2006-05-07 18:59:21
Hi,

I have this code snippet from htmlparser.sourceforge.net for StringBean:

    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL ("http://news.yahoo.com/s/ap/20060507/ap_on_re_mi_ea/iraq;_ylt=AoeY5mkiWMfGQ8KbE6W5xxas0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--"); // the HTTP is performed here
    String s = sb.getStrings ();

How can I get rid of the other text and get only the news content from this URL? Unwanted text (links) such as 'Home', 'U.S.', etc. appears in the output.
From: Derrick O. <Der...@Ro...> - 2006-05-08 00:03:48
Riaz,

You will probably need to use a filter to pick out the content you want. Run the FilterBuilder tool (bin/filterbuilder) and create a filter that gets the content you want. It has a little help and a tutorial to get you going. Then use the filter code generated by the tool and pass it to a FilterBean, which has a convenience method, called getText() I think, that will apply a StringBean to the results of the filter.

Derrick
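The suggestion above can be sketched roughly as follows. This is only an illustration, not code generated by FilterBuilder: it assumes the htmlparser jar is on the classpath and that the story sits inside a <div id="storybody"> element. That id, the class name StoryText and the method names are hypothetical, so the real filter should come from inspecting the page (or from FilterBuilder). The FilterBean calls follow the setFilters / setURL / getText pattern mentioned in the reply.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.beans.FilterBean;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;

public class StoryText {

    /**
     * Filter for the element assumed to wrap the story text.
     * ASSUMPTION: the page marks the story with <div id="storybody">;
     * this id is hypothetical and must be adapted to the real page.
     */
    static NodeFilter storyFilter() {
        return new AndFilter(
                new TagNameFilter("div"),
                new HasAttributeFilter("id", "storybody"));
    }

    /** Fetches the page and returns the text of the nodes the filter accepts. */
    static String storyText(String url) {
        FilterBean bean = new FilterBean();
        bean.setFilters(new NodeFilter[] { storyFilter() });
        bean.setURL(url);      // the HTTP is performed here
        return bean.getText(); // applies a StringBean to the filtered nodes
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java StoryText <url>");
            return;
        }
        System.out.println(storyText(args[0]));
    }
}
```

Because only the nodes accepted by the filter reach the StringBean, navigation text such as 'Home' or 'U.S.' that lies outside the selected element should no longer appear. If the filter matches nothing, expect an empty result rather than the whole page.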
From: Subramanya S. <sa...@cs...> - 2006-05-08 03:21:58
Riaz,

For now, check
http://cvs.sourceforge.net/viewcvs.py/newsrack/newsrack/WEB-INF/classes/news_rack/archiver/HTMLFilter.java?view=markup

This is a CVS version of code written for precisely this task; it does a lot of what you want. A couple of samples of its output are at:
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=5.5.2006%2Frediff.business%2Fni9.05tata2.htm
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=2.5.2006%2Fsify.finance%2Fni3.fullstory.php_id%3D14195512

I originally wrote this code using the built-in JDK Swing parser, but someone else working on this project (newsrack) helped me migrate it over to HTMLParser. I will be checking in a newer version of this code in a couple of days' time.

If you plan to use this code, please credit 'Subramanya Sastry' and 'Jaikishan Jalan'. At this time, the code for the entire project is being released under the GPL. In future, other licences (Apache) will be incorporated. I would also appreciate any improvements you make to the code.

Thanks,
Subbu.
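For readers who want to try the JDK-only route mentioned above, here is a minimal sketch using the Swing HTML parser that ships with the JDK. It is not the newsrack HTMLFilter code; the class name LinkFreeText and its extract method are made up for illustration. It uses a deliberately crude rule: keep all text except text nested inside <a> elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkFreeText {

    /** Returns the text of the given HTML, skipping text nested inside <a> elements. */
    public static String extract(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int anchorDepth = 0; // greater than 0 while inside an <a> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    anchorDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && anchorDepth > 0) {
                    anchorDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (anchorDepth == 0) {
                    if (out.length() > 0) {
                        out.append(' '); // separate text chunks with a single space
                    }
                    out.append(data);
                }
            }
        };

        try {
            // Third argument true: ignore charset directives found in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/\">Home</a><a href=\"/us\">U.S.</a>"
                + "<p>The story text.</p></body></html>";
        System.out.println(extract(page));
    }
}
```

Dropping all anchor text also removes links that appear inside the article body, so a real filter would likely need something more selective, such as restricting extraction to the element that contains the story.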