WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.
- performs linguistic processing of text corpora of multiple GB or TB in size using Apache Hadoop
- extracts and counts sentences, word n-grams (with or without POS tags) and co-occurrences
- reads popular web crawl formats (ARC and WARC)
- filters input data by language, duplicate URL, duplicate content and encoding errors
- can be extended with further linguistic counts based on custom UIMA annotations
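
To make the kinds of counts listed above concrete, here is a minimal single-machine sketch in Python. It is not WebCorpus code and does not use its API: the function names `count_ngrams` and `count_cooccurrences` are hypothetical, the input is assumed to be already split into sentences and tokens, and co-occurrence is assumed to mean "two words appearing in the same sentence", which the description above does not specify.

```python
from collections import Counter
from itertools import combinations


def count_ngrams(sentences, n):
    """Count word n-grams; n-grams never cross sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def count_cooccurrences(sentences):
    """Count unordered word pairs that share a sentence (assumed definition)."""
    counts = Counter()
    for tokens in sentences:
        # Deduplicate and sort so each pair is counted once per sentence
        # and always appears in the same order.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


# Toy input: two pre-tokenized sentences.
sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
]
bigrams = count_ngrams(sentences, 2)
pairs = count_cooccurrences(sentences)
print(bigrams[("the", "cat")])  # 2
print(pairs[("cat", "the")])    # 2
```

In a MapReduce setting, counts of this kind are typically produced by having each mapper emit per-sentence items and a reducer sum them per key; how WebCorpus structures its own jobs is not stated here.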