WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.

Features

  • performs linguistic processing of text corpora that are multiple GB or TB in size, using Apache Hadoop
  • extracts and counts sentences, word n-grams (with or without POS tags), and co-occurrences
  • reads popular web crawl formats (ARC and WARC)
  • filters input data by language, duplicate URL, duplicate content, and encoding errors
  • can be extended with further linguistic counts based on custom UIMA annotations
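
To make the n-gram counting feature above concrete, here is a minimal sketch of the idea in plain Java. It is not WebCorpus code and does not use the Hadoop API: the class name `NGramCountSketch`, both method names, and the whitespace tokenization are assumptions made for illustration. The first method plays the role of a map step (emit every n-gram of a sentence), the second the role of a reduce step (sum the occurrences per n-gram).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of WebCorpus or Hadoop.
final class NGramCountSketch {

    // "Map" step: emit every word n-gram found in one sentence.
    // Assumption: tokens are separated by whitespace.
    static List<String> extractNGrams(String sentence, int n) {
        List<String> ngrams = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty() || n < 1) {
            return ngrams;
        }
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of n tokens across the sentence.
        for (int i = 0; i + n <= tokens.length; i++) {
            ngrams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return ngrams;
    }

    // "Reduce" step: sum the occurrences of each n-gram over all sentences.
    static Map<String, Integer> countNGrams(List<String> sentences, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String ngram : extractNGrams(sentence, n)) {
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `NGramCountSketch.countNGrams(List.of("the cat sat", "the cat ran"), 2)` counts the bigram "the cat" twice and "cat sat" and "cat ran" once each. In a Hadoop setting the same two steps would typically be distributed across mapper and reducer tasks, so that the counts never have to fit in a single machine's memory.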

License

Apache License V2.0

Additional Project Details

Operating Systems

Linux

Programming Language

Java

Registered

2013-03-08