Web pages often contain clutter (unnecessary images, ads, extraneous links, and so on) around the body of an article that distracts the user from the actual content.
Many approaches aim to make content more readable by extracting only the relevant content and removing the "noise" in web pages, such as images, extraneous links, navigation panels, copyright and privacy notices, advertisements, and other redundant content blocks. They usually assign a score to each page block based on its link density and other proxies for the relevance of a web page segment, such as node names, the number of text sub-nodes, and so on.
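The link-density heuristic mentioned above can be sketched as follows. This is a minimal illustration only: the class name, method names, and the 0.5 threshold are assumptions for the sake of the example, not part of any of the libraries discussed here.

```java
// Hypothetical sketch of the link-density heuristic: a block whose
// text is mostly anchor (link) text is likely navigation or clutter.
public class LinkDensityScorer {

    // Link density = characters of anchor text / characters of all text.
    public static double linkDensity(int anchorTextLength, int totalTextLength) {
        if (totalTextLength == 0) {
            return 1.0; // an empty block carries no content
        }
        return (double) anchorTextLength / totalTextLength;
    }

    // Blocks above the (illustrative) threshold are treated as clutter.
    public static boolean isClutter(int anchorTextLength, int totalTextLength) {
        return linkDensity(anchorTextLength, totalTextLength) > 0.5;
    }

    public static void main(String[] args) {
        // A navigation panel: almost all of its text is links.
        System.out.println(isClutter(190, 200)); // true
        // An article paragraph: only a few inline links.
        System.out.println(isClutter(15, 800)); // false
    }
}
```

In a real extractor, the character counts would come from walking the DOM of each candidate block and summing the text inside and outside of anchor elements.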
The following papers describe some useful techniques for removing the clutter around web content (mainly news articles):
A comprehensive work that examines several approaches to extracting the main content from web pages is Samuel Louvan's thesis for the Eindhoven University of Technology.
An excellent "live" content extractor written in JavaScript is Readability, an effort of the Arc90 lab.
The WebPageCleaner provided by text-analysis is based on the latter. WebPageCleaner can extract the relevant content either as richly formatted HTML or as plain text:
import java.io.File;
import java.io.StringWriter;
import java.net.URL;

File file = new File("article.html"); // a local copy of the page to clean
URL url = file.toURI().toURL();
WebPageCleaner cleaner = new WebPageCleaner(url);

// Content as formatted HTML
StringWriter html = new StringWriter();
cleaner.printHTMLDocument(html);
System.out.println(html);

// Content as plain text
StringWriter text = new StringWriter();
cleaner.printText(text);
System.out.println(text);
Endpoint: /webservice/extractContent
Parameter   Description
url         The URL from which to extract the relevant text (required).
output      "text" for plain text or "html" for hypertext.
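Assuming the service is reachable over HTTP, a request URL for the endpoint can be composed as below. Only the path and the two parameters come from the table above; the host name is a placeholder, and the helper class and method names are illustrative assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a GET request URL for the extractContent endpoint.
public class ExtractContentRequest {

    public static String buildRequestUrl(String host, String pageUrl, String output)
            throws UnsupportedEncodingException {
        // The page URL must be percent-encoded to travel as a query parameter.
        return host + "/webservice/extractContent"
                + "?url=" + URLEncoder.encode(pageUrl, "UTF-8")
                + "&output=" + output;
    }

    public static void main(String[] args) throws Exception {
        String request = buildRequestUrl(
                "http://example.com",         // placeholder host
                "http://example.org/article", // the page to clean
                "text");                      // plain-text output
        System.out.println(request);
    }
}
```

The resulting string can then be fetched with any HTTP client, e.g. `new URL(request).openStream()`.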