TextExtraction

Kostia

Introduction

Web pages often contain clutter (such as unnecessary images, ads, and extraneous links) around the body of an article that distracts the user from the actual content.

Many approaches aim to make content more readable by extracting only the relevant content and removing the "noise" in web pages, such as images, extraneous links, navigation panels, copyright and privacy notices, advertisements, and other redundant content blocks. They usually assign a score to each page block based on its link density and on other proxies for the relevance of a web page segment, such as node names and the number of child text nodes.
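The scoring idea described above can be sketched as follows. This is a minimal illustration, not the algorithm used by WebPageCleaner or Readability: the `BlockScorer` class, the `Block` model, and the weights (1.5 and 0.25) are all hypothetical and chosen only to show how link density and node names can be combined into one score.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of block scoring; not the WebPageCleaner implementation. */
final class BlockScorer {

    /** A page block reduced to the few features used for scoring (assumed model). */
    static final class Block {
        final String tagName;   // node name, e.g. "div", "p", "nav"
        final String text;      // all visible text inside the block
        final String linkText;  // the part of that text that sits inside <a> elements

        Block(String tagName, String text, String linkText) {
            this.tagName = tagName;
            this.text = text;
            this.linkText = linkText;
        }
    }

    /** Fraction of the block's characters that belong to links, in [0, 1]. */
    static double linkDensity(Block block) {
        if (block.text.isEmpty()) {
            return 1.0; // an empty block carries no content, so treat it as noise
        }
        return Math.min(1.0, (double) block.linkText.length() / block.text.length());
    }

    /** Longer text with fewer links scores higher; the node name adjusts the result. */
    static double score(Block block) {
        double base = block.text.length() * (1.0 - linkDensity(block));
        switch (block.tagName) {
            case "p":
            case "article":
                return base * 1.5;   // assumed bonus for typical content nodes
            case "nav":
            case "footer":
            case "aside":
                return base * 0.25;  // assumed penalty for typical clutter nodes
            default:
                return base;
        }
    }

    /** Returns the highest-scoring block, or null if the list is empty. */
    static Block best(List<Block> blocks) {
        Block winner = null;
        double winnerScore = Double.NEGATIVE_INFINITY;
        for (Block block : blocks) {
            double s = score(block);
            if (s > winnerScore) {
                winnerScore = s;
                winner = block;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block("nav", "Home News Sport", "Home News Sport"),
            new Block("p", "An earthquake struck the islands early on Wednesday.", ""));
        System.out.println(best(blocks).text);
    }
}
```

A real extractor would build the blocks from a parsed DOM and would typically use more features, but the shape of the computation stays the same: measure each block, score it, and keep the best candidates.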

The following papers describe some useful techniques for removing the clutter around web content (mainly news articles):

Samuel Louvan's thesis for the Eindhoven University of Technology is a comprehensive work that examines several approaches to extracting the main content from web pages.

An excellent "live" content extractor written in JavaScript is Readability, a project of the Arc90 laboratory.
The WebPageCleaner provided by text-analysis is based on Readability.

API Usage example

WebPageCleaner can extract the relevant content either as formatted HTML or as plain text:

// Requires java.io.File, java.io.StringWriter and java.net.URL.
// "file" is a java.io.File pointing to the web page to clean.
URL url = file.toURI().toURL();
WebPageCleaner cleaner = new WebPageCleaner(url);

// Content as formatted HTML
StringWriter html = new StringWriter();
cleaner.printHTMLDocument(html);
System.out.println(html);

// Content as plain text
StringWriter text = new StringWriter();
cleaner.printText(text);
System.out.println(text);

Web-Service API

Endpoint: /webservice/extractContent

Parameter    Description
url          The URL of the page from which to extract the relevant text (required)
output       The output format: text for plain text or html for HTML

Example: http://localhost:8080/webservice/extractContent?url=http://www.cnn.com/2009/WORLD/asiapcf/10/07/solomon.islands.earthquake/index.html&output=text
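A client for this endpoint might look like the sketch below. It is only an illustration under stated assumptions: the `ExtractContentClient` class and its method names are hypothetical, the service is assumed to accept a plain GET request and to decode percent-encoded query parameters, and the response is assumed to be UTF-8 text. Only the endpoint path and the `url` and `output` parameters come from the description above. Encoding the `url` value matters when the target page address itself contains characters such as `&` or `?`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch for the extractContent web service. */
final class ExtractContentClient {

    /** Builds the request URL, percent-encoding both query parameter values. */
    static String buildRequestUrl(String baseUrl, String pageUrl, String output) {
        try {
            return baseUrl + "/webservice/extractContent"
                + "?url=" + URLEncoder.encode(pageUrl, "UTF-8")
                + "&output=" + URLEncoder.encode(output, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen in practice: every Java platform supports UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Sends a GET request and returns the response body (assumed to be UTF-8). */
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(requestUrl).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String requestUrl = buildRequestUrl(
            "http://localhost:8080",
            "http://www.cnn.com/2009/WORLD/asiapcf/10/07/solomon.islands.earthquake/index.html",
            "text");
        System.out.println(requestUrl);
        try {
            System.out.println(fetch(requestUrl));
        } catch (IOException e) {
            // The service must be running locally for the request to succeed.
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```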


