From: Žygimantas M. <zyg...@me...> - 2010-10-11 09:48:24
It should not be difficult to write a GATE plugin to wrap this project:
http://code.google.com/p/boilerpipe/

On Fri, Oct 8, 2010 at 7:10 PM, Andrew Martin <and...@al...> wrote:

> Hi there,
>
> Firstly, my apologies if this has been asked before, or if I'm missing
> something obvious in the documentation. Only been looking at GATE for a
> short time so far :)
>
> I'm just wondering, are there any GATE plugins that exist that will extract
> (or even just mark) the "main text body" from a web page?
>
> For example, on the web page:
> http://www.tomshardware.com/reviews/game-performance-bottleneck,2738.html
>
> There's a lot of noise on this page. The headers, footers, links at the
> side, user comments, etc, are all extra, but the main body of text (the
> article itself, starting with "We're back with part 2..." right up to "Intel
> 9.1.1" at the end) is just a small part. If I run this HTML file through
> GATE, the extra information confuses things significantly.
>
> I know it is possible for a system to isolate the main content - browser
> plugins/javascript tools such as Readability
> (http://lab.arc90.com/experiments/readability/) and Safari's Reader button
> are able to do it almost perfectly. Just wondering if GATE can do the same
> thing before ANNIE is run.
>
> Thanks for reading,
> -Andrew
> Alpoz Ireland