Web pages often contain clutter (unnecessary images, ads, extraneous links, and so on) around the body of an article that distracts the user from the actual content.
Many approaches aim to make content more readable by extracting only the relevant content and removing the "noise" in web pages, such as images, extraneous links, navigation panels, copyright and privacy notices, advertisements, and other redundant content blocks. They usually assign a score to each page block based on its link density and other proxies for the relevance of a web page segment, such as node names, the number of text sub-nodes, and so on.
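The link-density heuristic mentioned above can be sketched as follows. This is a minimal illustration only: the class name, method names, and the 0.5 threshold are assumptions for the sake of the example, not part of any of the libraries discussed here.

```java
// Hypothetical sketch of the link-density heuristic: a block whose
// text is mostly anchor (link) text is likely navigation or clutter.
public class LinkDensityScorer {

    // Link density = characters of anchor text / characters of all text.
    public static double linkDensity(int anchorTextLength, int totalTextLength) {
        if (totalTextLength == 0) {
            return 1.0; // an empty block carries no content
        }
        return (double) anchorTextLength / totalTextLength;
    }

    // Blocks above the (illustrative) threshold are treated as clutter.
    public static boolean isClutter(int anchorTextLength, int totalTextLength) {
        return linkDensity(anchorTextLength, totalTextLength) > 0.5;
    }

    public static void main(String[] args) {
        // A navigation panel: almost all of its text is links.
        System.out.println(isClutter(190, 200)); // true
        // An article paragraph: only a few inline links.
        System.out.println(isClutter(15, 800)); // false
    }
}
```

In a real extractor, the character counts would come from walking the DOM of each candidate block and summing the text inside and outside of anchor elements.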
The following papers describe some useful techniques for removing the clutter around web content (mainly news articles):
A comprehensive work that examines several approaches to extracting the main content from web pages is Samuel Louvan's thesis for the Eindhoven University of Technology.
An excellent "live" content extractor written in JavaScript is Readability, an effort of the Arc90 lab.
The WebPageCleaner provided by text-analysis is based on the latter. WebPageCleaner can extract the relevant content either as richly formatted HTML or as plain text:
import java.io.File;
import java.io.StringWriter;
import java.net.URL;

File file = new File("article.html"); // a local copy of the page to clean
URL url = file.toURI().toURL();
WebPageCleaner cleaner = new WebPageCleaner(url);

// Content as formatted HTML
StringWriter html = new StringWriter();
cleaner.printHTMLDocument(html);
System.out.println(html);

// Content as plain text
StringWriter text = new StringWriter();
cleaner.printText(text);
System.out.println(text);
Endpoint: /webservice/extractContent
Parameter   Description
url         The URL from which to extract the relevant text (required).
output      "text" for plain text or "html" for hypertext.
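Assuming the service is reachable over HTTP, a request URL for the endpoint can be composed as below. Only the path and the two parameters come from the table above; the host name is a placeholder, and the helper class and method names are illustrative assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a GET request URL for the extractContent endpoint.
public class ExtractContentRequest {

    public static String buildRequestUrl(String host, String pageUrl, String output)
            throws UnsupportedEncodingException {
        // The page URL must be percent-encoded to travel as a query parameter.
        return host + "/webservice/extractContent"
                + "?url=" + URLEncoder.encode(pageUrl, "UTF-8")
                + "&output=" + output;
    }

    public static void main(String[] args) throws Exception {
        String request = buildRequestUrl(
                "http://example.com",         // placeholder host
                "http://example.org/article", // the page to clean
                "text");                      // plain-text output
        System.out.println(request);
    }
}
```

The resulting string can then be fetched with any HTTP client, e.g. `new URL(request).openStream()`.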