WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.
Features
- processes text corpora of multiple gigabytes or terabytes using Apache Hadoop
- extracts and counts sentences, word n-grams (with or without POS tags), and co-occurrences
- reads popular web crawl formats (ARC and WARC)
- filters input data by language, duplicate URL, duplicate content, and encoding errors
- can be extended with further linguistic counts based on custom UIMA annotations
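
The counting features listed above follow the usual map/reduce pattern: a map step emits a key with a count of 1 for every n-gram or co-occurring word pair, and a reduce step sums the counts per key. The following is a minimal sketch of that pattern in plain Python, not WebCorpus code and not the Hadoop API. The function names (`tokenize`, `map_ngrams`, `map_cooccurrences`, `reduce_counts`), the whitespace tokenizer, and the choice of sentence-level co-occurrence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch;
    # a real pipeline would use a proper linguistic tokenizer).
    return sentence.lower().split()


def map_ngrams(sentence, n):
    # Map step: emit (n-gram, 1) for every window of n consecutive tokens.
    tokens = tokenize(sentence)
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), 1


def map_cooccurrences(sentence):
    # Map step: emit (word pair, 1) for every unordered pair of distinct
    # words that appear in the same sentence. Sorting makes the pair key
    # independent of word order.
    words = sorted(set(tokenize(sentence)))
    for pair in combinations(words, 2):
        yield pair, 1


def reduce_counts(pairs):
    # Reduce step: sum the emitted counts per key.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


sentences = ["the cat sat", "the cat ran"]
bigrams = reduce_counts(kv for s in sentences for kv in map_ngrams(s, 2))
cooc = reduce_counts(kv for s in sentences for kv in map_cooccurrences(s))

print(bigrams[("the", "cat")])  # count of the bigram "the cat"
print(cooc[("cat", "the")])     # sentences containing both "cat" and "the"
```

In a distributed setting the map functions would run on separate input splits and the framework would group the emitted keys before the reduce step; here a single `Counter` stands in for that grouping.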
License
Apache License V2.0