WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.

Features

  • linguistic processing of text corpora of multiple GB or TB in size using Apache Hadoop
  • extracts and counts sentences, word n-grams (with or without POS tags) and co-occurrences
  • reads popular web crawl formats (ARC and WARC)
  • filters input data by language, duplicate URL, duplicate content and encoding errors
  • can be extended with further linguistic counts based on custom UIMA annotations
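
To make the n-gram counting feature above more concrete, here is a minimal in-memory sketch in Java. It is not WebCorpus code and does not use Hadoop or UIMA; the class name NGramCounter, its methods, and the whitespace tokenizer are assumptions made for illustration only. In a Hadoop job the same idea would typically be split into a map step that emits each n-gram and a reduce step that sums the counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of WebCorpus: counts word n-grams in memory.
class NGramCounter {

    // Splits a sentence on whitespace and lowercases each token.
    // A real pipeline would use a proper tokenizer instead.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Returns a sorted map from each n-gram (tokens joined by one space)
    // to the number of times it occurs. N-grams never cross sentence boundaries.
    static Map<String, Integer> countNGrams(List<String> sentences, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            List<String> tokens = tokenize(sentence);
            for (int i = 0; i + n <= tokens.size(); i++) {
                String gram = String.join(" ", tokens.subList(i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("the cat sat", "the cat ran");
        // Prints the bigram counts for the two sample sentences.
        System.out.println(countNGrams(sentences, 2));
    }
}
```

Counting per sentence rather than over the whole text keeps n-grams from spanning sentence boundaries, which is one common choice; whether WebCorpus does the same is not stated here.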

License

Apache License V2.0

Additional Project Details

Operating Systems

Linux

Programming Language

Java

Registered

2013-03-08