WebCorpus

beta

Hadoop framework for scalable processing of large web corpora

Download webcorpus-1.0.1.jar
Linux

Description

WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.

WebCorpus Web Site

License

Apache License V2.0

Features

  • performs linguistic processing of text corpora of multiple gigabytes or terabytes using Apache Hadoop
  • extracts and counts sentences, word n-grams (with or without POS tags) and co-occurrences
  • reads popular web crawl formats (ARC and WARC)
  • filters input data by language, duplicate URL, duplicate content and encoding errors
  • can be extended with further linguistic counts based on custom UIMA annotations
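
To make the n-gram counting feature concrete, here is a minimal in-memory sketch of what counting word n-grams means. It is not WebCorpus code and does not use the WebCorpus or Hadoop APIs: the class name `NGramCountSketch`, the methods `tokenize` and `countNGrams`, and the naive whitespace tokenizer are all assumptions made for illustration. In a Hadoop job the same idea would typically be split into a map step that emits each n-gram and a reduce step that sums the counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of WebCorpus.
public class NGramCountSketch {

    // Splits a sentence into lowercase tokens on whitespace (a deliberately naive tokenizer).
    static List<String> tokenize(String sentence) {
        String trimmed = sentence.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Counts all contiguous word n-grams of length n across the given sentences.
    // N-grams never cross sentence boundaries.
    static Map<String, Integer> countNGrams(List<String> sentences, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            List<String> tokens = tokenize(sentence);
            for (int i = 0; i + n <= tokens.size(); i++) {
                String ngram = String.join(" ", tokens.subList(i, i + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "the web is large",
                "the web is noisy");
        // Prints: {is large=1, is noisy=1, the web=2, web is=2}
        System.out.println(countNGrams(sentences, 2));
    }
}
```

The sketch keeps all counts in a single sorted map, which only works for small inputs; distributing the counting across a cluster is what makes corpora of the size described above tractable.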

Additional Project Details

Programming Language

Java

Registered

2013-03-08