WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/ARC format and archives from the Leipzig corpora collection.
Out of the box, WebCorpus can count n-grams, co-occurrences, and POS n-grams. Custom statistics can be added easily; see the documentation for details.
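To give an idea of the kind of per-sentence statistic involved, here is a minimal, self-contained Java sketch of bigram counting. This is a hypothetical illustration only, not the actual WebCorpus API (which runs such logic as distributed Hadoop jobs):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the statistic "n-gram count" for n = 2:
// tokenize each sentence on whitespace and count adjacent token pairs.
// Plain Java; the real tool distributes this work via Hadoop MapReduce.
public class BigramCount {
    public static Map<String, Integer> countBigrams(Iterable<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```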
Don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (currently Hadoop 2.x), try the following to download the WebCorpus package and count bigrams on an example corpus:
$ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus && \
    export WEBCORPUS_HOME=`pwd`/webcorpus && cd $WEBCORPUS_HOME
$ mvn package -DskipTests
$ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
$ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
$ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams
where HDFS_DIR should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

- set up the working directory on HDFS using the webcorpus-setup script. The --with-examples option will download small example web crawls to HDFS_DIR/input.
- process the example archives, writing the results to HDFS_DIR/processed. This will e.g. deduplicate sentences by content and URL.

When everything has completed, you will find all extracted bigrams along with their counts in HDFS_DIR/bigrams.
WebCorpus builds on Hadoop: everything to be processed is submitted as separate jobs to the Hadoop cluster, and the results are written to HDFS. Jobs run in a pipeline fashion, where each pipeline step can filter, modify, or split its input. For an in-depth explanation of the Hadoop jobs involved and their pipeline structure, see the documentation wiki page.
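The filter/modify/split idea can be sketched in a few lines of plain Java. In the sketch below, each step maps one record to zero records (filter), one record (modify), or several records (split), mirroring the map phase of a MapReduce job. The names and interfaces here are hypothetical, not the WebCorpus job API:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the pipeline-step idea: every step turns one
// input record into zero, one, or many output records.
public class PipelineSketch {
    interface Step extends Function<String, Stream<String>> {}

    // filter: drop blank lines entirely
    static final Step dropEmpty  = line -> line.isBlank() ? Stream.empty() : Stream.of(line);
    // modify: one record in, one (changed) record out
    static final Step lowercase  = line -> Stream.of(line.toLowerCase());
    // split: one record in, several records out
    static final Step splitWords = line -> Stream.of(line.split("\\s+"));

    static List<String> run(List<String> input, List<Step> steps) {
        Stream<String> s = input.stream();
        for (Step step : steps) {
            s = s.flatMap(step);  // chain the steps, as Hadoop chains jobs
        }
        return s.collect(Collectors.toList());
    }
}
```

In the real tool, each such step is a separate Hadoop job reading from and writing to HDFS rather than an in-memory stream.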
If you use this software in scientific projects, please cite the following paper:
Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (http://www.jlcl.org/2013_Heft2/H2013-2.pdf)