<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to TextExtraction</title><link>https://sourceforge.net/p/webtextanalysis/wiki/TextExtraction/</link><description>Recent changes to TextExtraction</description><atom:link href="https://sourceforge.net/p/webtextanalysis/wiki/TextExtraction/feed" rel="self"/><language>en</language><lastBuildDate>Tue, 20 May 2014 08:56:26 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/webtextanalysis/wiki/TextExtraction/feed" rel="self" type="application/rss+xml"/><item><title>TextExtraction modified by Kostia</title><link>https://sourceforge.net/p/webtextanalysis/wiki/TextExtraction/</link><description>&lt;div class="markdown_content"&gt;&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Web pages often contain clutter (unnecessary images, ads, extraneous links, and so on) around the body of an article, which distracts the reader from the actual content. &lt;/p&gt;
&lt;p&gt;Many approaches aim to make content more readable by extracting only the relevant content and removing the "noise" in web pages, such as images, extraneous links, navigation panels, copyright and privacy notices, advertisements and other redundant content blocks. They usually assign a score to each page block based on its link density and on other proxies for the relevance of a web page segment, such as node names, the number of child text nodes, etc. &lt;/p&gt;
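&lt;p&gt;The block scoring described above can be illustrated with a minimal Java sketch. It is not the code used by WebPageCleaner or Readability: the class name, the method names and the scoring formula (text length discounted by link density) are assumptions chosen only for illustration. &lt;/p&gt;

```java
public class LinkDensity {

    // Hypothetical helper: fraction of a block's text that sits inside links.
    // A block with no text at all is treated as having zero link density.
    public static double linkDensity(int linkTextLength, int totalTextLength) {
        if (totalTextLength == 0) {
            return 0.0;
        }
        return (double) linkTextLength / totalTextLength;
    }

    // Hypothetical score: the amount of text in the block, discounted by
    // its link density, so navigation-like blocks made mostly of links
    // score low and long runs of plain text score high.
    public static double score(int linkTextLength, int totalTextLength) {
        return totalTextLength * (1.0 - linkDensity(linkTextLength, totalTextLength));
    }

    public static void main(String[] args) {
        // A navigation panel: 200 characters of text, 150 of them in links.
        System.out.println("navigation block: " + score(150, 200));
        // An article paragraph: 200 characters of text, none in links.
        System.out.println("article block:    " + score(0, 200));
    }
}
```

&lt;p&gt;Under these assumptions the article paragraph scores higher than the navigation panel, so an extractor would keep the former and drop the latter. Real extractors combine several such signals rather than relying on link density alone. &lt;/p&gt;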
&lt;p&gt;The following papers describe some useful techniques for removing the clutter around web content (mainly news articles): &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="" href="http://citeseerx.ist.psu.edu/viewdoc/summary?cid=314861" rel="nofollow"&gt;Eliminating noisy information in Web pages for data mining&lt;/a&gt; (2003), by L Yi,B Liu,X Li - In Proc. of the Int. Conf. on Knowledge Discovery &amp;amp; Data Mining (KDD) &lt;/li&gt;
&lt;li&gt;&lt;a class="" href="http://citeseerx.ist.psu.edu/viewdoc/summary?cid=3607835" rel="nofollow"&gt;Web page cleaning for web mining through feature weighting&lt;/a&gt; (2003), by L Yi,B Liu - In Intl. Joint Conf. on Artificial Intelligence (IJCAI) &lt;/li&gt;
&lt;li&gt;&lt;a class="" href="http://citeseerx.ist.psu.edu/viewdoc/summary?cid=3684393" rel="nofollow"&gt;Automating Content Extraction of HTML Documents&lt;/a&gt; (2005), by Suhit Gupta,Gail E Kaiser,Peter Grimm,Michael F Chiang,Justin (WWW) &lt;/li&gt;
&lt;li&gt;&lt;a class="" href="http://citeseerx.ist.psu.edu/viewdoc/summary?cid=111410" rel="nofollow"&gt;Discovering informative content blocks from Web documents&lt;/a&gt; (2002), by S Lin,J Ho - In Proc. of the 8 th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD) &lt;/li&gt;
&lt;li&gt;&lt;a class="" href="http://citeseerx.ist.psu.edu/viewdoc/summary?cid=314731" rel="nofollow"&gt;Template detection via data mining and its applications&lt;/a&gt; (2002), by Z Bar-Yossef,S Rajagopalan - In Proc. of the Int. World Wide Web Conf. (WWW) &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A comprehensive work that examines several approaches to extracting the main content from web pages is &lt;a class="" href="http://www.win.tue.nl/~mpechen/projects/pdfs/Louvan2009.pdf" rel="nofollow"&gt;Samuel Louvan's thesis&lt;/a&gt; for the Eindhoven University of Technology. &lt;/p&gt;
&lt;p&gt;An excellent "live" content extractor written in javascript is &lt;a class="" href="http://lab.arc90.com/2009/03/02/readability/" rel="nofollow"&gt;Readability&lt;/a&gt;, an effort of the arc90 laboratory.&lt;br /&gt;
The &lt;a class="" href="http://text-analysis.svn.sourceforge.net/viewvc/text-analysis/text-analysis/trunk/Repository/src/eu/kostia/repository/extract/WebPageCleaner.java?revision=771&amp;amp;view=markup"&gt;WebPageCleaner&lt;/a&gt; provided by &lt;em&gt;text-analysis&lt;/em&gt; is based on this last one. &lt;/p&gt;
&lt;h2 id="api-usage-example"&gt;API Usage example&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;WebPageCleaner&lt;/code&gt; can extract the relevant content either as richly formatted HTML or as plain text: &lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toURI&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;toURL&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;WebPageCleaner&lt;/span&gt; &lt;span class="n"&gt;cleaner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;WebPageCleaner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Content as formatted html&lt;/span&gt;
&lt;span class="n"&gt;StringWriter&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StringWriter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;cleaner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;printHTMLDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Content as plain text&lt;/span&gt;
&lt;span class="n"&gt;StringWriter&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;StringWriter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;cleaner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;printText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="web-service-api"&gt;Web-Service API&lt;/h2&gt;
&lt;p&gt;Endpoint: /webservice/extractContent &lt;/p&gt;
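&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Parameter&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;url&lt;/td&gt;&lt;td&gt;The URL of the page from which to extract the relevant text (required)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;output&lt;/td&gt;&lt;td&gt;&lt;code&gt;text&lt;/code&gt; for plain text or &lt;code&gt;html&lt;/code&gt; for hypertext&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;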
&lt;p&gt;Example: &lt;a class="" href="http://localhost:8080/webservice/extractContent?url=http://www.cnn.com/2009/WORLD/asiapcf/10/07/solomon.islands.earthquake/index.html&amp;amp;output=text" rel="nofollow"&gt;http://localhost:8080/webservice/extractContent?url=http://www.cnn.com/2009/WORLD/asiapcf/10/07/solomon.islands.earthquake/index.html&amp;amp;output=text&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="" src="/userapps/trac/kostia76/raw-attachment/wiki/TextExtraction/WebPageCleaner.jpg" /&gt;&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Kostia</dc:creator><pubDate>Tue, 20 May 2014 08:56:26 -0000</pubDate><guid>https://sourceforge.nete77dabadb523dc9a2e7529b414ecbba5419225c5</guid></item></channel></rss>