Simple-Scrape is a simple web-scraping library that allows for programmatic access to HTML code. No further techniques are needed and the library is very compact and thus easy to use.
Web-as-corpus tools in Java.
* Simple Crawler (and also integration with Nutch and Heritrix)
* HTML cleaner to remove boiler plate code
* Language recognition
* Corpus builder