WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/ARC format and archives from the Leipzig corpora collection.
Out of the box, WebCorpus can count n-grams, co-occurrences, and POS n-grams. Custom statistics can be added easily; see the documentation for details.
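To give an idea of the kind of per-sentence statistic involved, here is a minimal, self-contained Java sketch of bigram counting. This is a hypothetical illustration only, not the actual WebCorpus API (which runs such logic as distributed Hadoop jobs):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the statistic "n-gram count" for n = 2:
// tokenize each sentence on whitespace and count adjacent token pairs.
// Plain Java; the real tool distributes this work via Hadoop MapReduce.
public class BigramCount {
    public static Map<String, Integer> countBigrams(Iterable<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```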
Don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (currently Hadoop 2.x), try the following to download the WebCorpus package and count bigrams on an example corpus:
$ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus && \
    export WEBCORPUS_HOME=`pwd`/webcorpus && cd $WEBCORPUS_HOME
$ mvn package -DskipTests
$ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
$ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
$ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams
where HDFS_DIR should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

- set up the working directory on HDFS using the webcorpus-setup script. The --with-examples option will download small example web crawls to HDFS_DIR/input.
- process the example archives, writing the results to HDFS_DIR/processed. This will e.g. deduplicate sentences by content and URL.

When everything has completed, you will find all extracted bigrams along with their counts in HDFS_DIR/bigrams.
WebCorpus builds on Hadoop: everything to be processed is submitted as separate jobs to the Hadoop cluster, and the results are written to HDFS. Jobs run in a pipeline fashion, where each pipeline step can filter, modify, or split its input. For an in-depth explanation of the Hadoop jobs involved and their pipeline structure, see the documentation wiki page.
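The filter/modify/split idea can be sketched in a few lines of plain Java. In the sketch below, each step maps one record to zero records (filter), one record (modify), or several records (split), mirroring the map phase of a MapReduce job. The names and interfaces here are hypothetical, not the WebCorpus job API:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the pipeline-step idea: every step turns one
// input record into zero, one, or many output records.
public class PipelineSketch {
    interface Step extends Function<String, Stream<String>> {}

    // filter: drop blank lines entirely
    static final Step dropEmpty  = line -> line.isBlank() ? Stream.empty() : Stream.of(line);
    // modify: one record in, one (changed) record out
    static final Step lowercase  = line -> Stream.of(line.toLowerCase());
    // split: one record in, several records out
    static final Step splitWords = line -> Stream.of(line.split("\\s+"));

    static List<String> run(List<String> input, List<Step> steps) {
        Stream<String> s = input.stream();
        for (Step step : steps) {
            s = s.flatMap(step);  // chain the steps, as Hadoop chains jobs
        }
        return s.collect(Collectors.toList());
    }
}
```

In the real tool, each such step is a separate Hadoop job reading from and writing to HDFS rather than an in-memory stream.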
If you use this software in scientific projects, please cite the following paper:
Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (http://www.jlcl.org/2013_Heft2/H2013-2.pdf)