Thread: [Htmlparser-user] StringBean: Removing unwanted links

[Htmlparser-user] StringBean: Removing unwanted links

From: Riaz u. <ru...@ya...> - 2006-05-07 18:59:21

Hi,

I have this code snippet from htmlparser.sourceforge.net for StringBean:

    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL ("http://news.yahoo.com/s/ap/20060507/ap_on_re_mi_ea/iraq;_ylt=AoeY5mkiWMfGQ8KbE6W5xxas0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--"); // the HTTP request is performed here
    String s = sb.getStrings ();

How can I get rid of the other text and get only the news content from this URL?
Unwanted text (link labels such as 'Home', 'U.S.', etc.) appears in the output.

Re: [Htmlparser-user] StringBean: Removing unwanted links

From: Derrick O. <Der...@Ro...> - 2006-05-08 00:03:48

Riaz,

You will probably need to use a filter to pick out the content you want.
Run the FilterBuilder tool (bin/filterbuilder) and create a filter that
selects that content. The tool has some built-in help and a tutorial to
get you going.
Then take the filter code generated by the tool and pass it to a
FilterBean, which has a convenience method, getText(), that applies a
StringBean to the results of the filter.

Derrick
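
The approach described above might look roughly like the following. This is a minimal sketch, not code generated by FilterBuilder: the filter assumes the story text sits in a <div id="storybody"> element, which is a hypothetical name, so inspect the page (or use FilterBuilder) to find the real container. It also assumes the HTMLParser jar is on the classpath.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.beans.FilterBean;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;

public final class NewsContentExtractor {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: NewsContentExtractor <url>");
            return;
        }

        // ASSUMPTION: the article body lives in <div id="storybody">.
        // The id is hypothetical; replace it with whatever the page uses.
        NodeFilter storyFilter = new AndFilter(
                new TagNameFilter("div"),
                new HasAttributeFilter("id", "storybody"));

        FilterBean fb = new FilterBean();
        fb.setFilters(new NodeFilter[] { storyFilter });
        fb.setURL(args[0]); // the page is fetched and filtered from this URL

        // getText() applies a StringBean to the nodes that matched the filter,
        // so navigation text outside the matched element is not included.
        System.out.println(fb.getText());
    }
}
```

Text from links inside the matched element can still appear, so the filter may need narrowing for a particular site.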

Re: [Htmlparser-user] StringBean: Removing unwanted links

From: Subramanya S. <sa...@cs...> - 2006-05-08 03:21:58

Riaz,

For now, check
http://cvs.sourceforge.net/viewcvs.py/newsrack/newsrack/WEB-INF/classes/news_rack/archiver/HTMLFilter.java?view=markup

This is the CVS version of code that does much of what you want.
A couple of sample outputs are at:
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=5.5.2006%2Frediff.business%2Fni9.05tata2.htm
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=2.5.2006%2Fsify.finance%2Fni3.fullstory.php_id%3D14195512

I originally wrote this code using the built-in JDK Swing parser. Someone
else working on this project (NewsRack) later helped me migrate it to
HTMLParser.

I will be checking in a newer version of this code in a couple of days. If
you plan to use this code, please credit 'Subramanya Sastry' and 'Jaikishan
Jalan'.  At this time, code for the entire project is being released under
the GPL.  In the future, other licenses (Apache) will be incorporated.

I would also appreciate any improvements you make to the code.

Thanks,
Subbu.
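
For readers who cannot reach the linked source, the general idea (keep text inside the article container, drop link labels) can be sketched with the built-in JDK Swing parser mentioned above. This is not the NewsRack code; the class name and the container id passed in are hypothetical, and a real news page would likely need site-specific rules.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collects text found inside one <div id="..."> container, skipping link text.
class ContentTextExtractor extends HTMLEditorKit.ParserCallback {
    private final String contentId;              // id of the article container (assumed)
    private final StringBuilder text = new StringBuilder();
    private int divDepth = 0;                    // > 0 while inside the container
    private int linkDepth = 0;                   // > 0 while inside an <a> element

    ContentTextExtractor(String contentId) {
        this.contentId = contentId;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.DIV) {
            Object id = attrs.getAttribute(HTML.Attribute.ID);
            if (divDepth > 0) {
                divDepth++;                      // nested div inside the container
            } else if (id != null && contentId.equalsIgnoreCase(id.toString())) {
                divDepth = 1;                    // entered the container
            }
        } else if (tag == HTML.Tag.A) {
            linkDepth++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.DIV && divDepth > 0) {
            divDepth--;
        } else if (tag == HTML.Tag.A && linkDepth > 0) {
            linkDepth--;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Keep text only when inside the container and outside any link.
        if (divDepth > 0 && linkDepth == 0) {
            String chunk = new String(data).trim();
            if (!chunk.isEmpty()) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(chunk);
            }
        }
    }

    // Parses the HTML from the reader and returns the collected article text.
    static String extract(Reader html, String contentId) throws IOException {
        ContentTextExtractor callback = new ContentTextExtractor(contentId);
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString();
    }
}
```

Calling ContentTextExtractor.extract(reader, "story") on a page whose article sits in <div id="story"> would return the story text without the navigation links or any link labels inside the story.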
