<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Home</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>Recent changes to Home</description><atom:link href="https://sourceforge.net/p/webcorpus/wiki/Home/feed" rel="self"/><language>en</language><lastBuildDate>Tue, 21 Jan 2014 12:47:47 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/webcorpus/wiki/Home/feed" rel="self" type="application/rss+xml"/><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v21
+++ v22
@@ -29,4 +29,4 @@
 # How to Cite
 If you use this software in scientific projects, please cite the following paper:

-Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:47:47 -0000</pubDate><guid>https://sourceforge.netf92d45fc42f40f60be71b726feb34773ab9093e9</guid></item><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v20
+++ v21
@@ -18,7 +18,7 @@
 1. Check out the webcorpus code from SVN
 2. Build the project using Maven
 3. Prepare the HDFS project structure with the &lt;code&gt;webcorpus-setup&lt;/code&gt; script. The &lt;code&gt;--with-examples&lt;/code&gt; option will download small example web crawls to &lt;code&gt;HDFS_DIR/input&lt;/code&gt;
-4. Extract, filter and annotate sentences from the english example corpus and put them into &lt;code&gt;HDFS_DIR/processed&lt;/code&gt;. This will, for example, deduplicate sentences by content and URL.
+4. Extract, filter and annotate sentences from the English example corpus and put them into &lt;code&gt;HDFS_DIR/processed&lt;/code&gt;. This will, for example, deduplicate sentences by content and URL.
 5. Count bigrams on the filtered sentences

 When everything has completed, you will find all extracted bigrams along with their counts in &lt;code&gt;HDFS_DIR/bigrams&lt;/code&gt;.
@@ -29,4 +29,4 @@
 # How to Cite
 If you use this software in scientific projects, please cite the following paper:

-Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (www.jlcl.org/2013_Heft2/H2013-2.pdf)
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:47:06 -0000</pubDate><guid>https://sourceforge.neta4196676f14caa9c3db3b56499e76dc77c6bda6d</guid></item><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v19
+++ v20
@@ -25,3 +25,8 @@

 # Documentation
 WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
+
+# How to Cite
+If you use this software in scientific projects, please cite the following paper:
+
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (www.jlcl.org/2013_Heft2/H2013-2.pdf)
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:45:48 -0000</pubDate><guid>https://sourceforge.net144795ee0e13ec3d08a3e8cceb94cb55d11aa370</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v18
+++ v19
@@ -4,7 +4,7 @@
 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.

 # Quickstart
-You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
+You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (currently Hadoop 2.x), try this to download the webcorpus package and count bigrams on an example corpus:

     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 08 Nov 2013 16:14:22 -0000</pubDate><guid>https://sourceforge.net107765ba9d314615f2c7d24eae620e3124b0dc6f</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -2,16 +2,13 @@
 WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.
-
-
-&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:

     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
-    $ mvn compile assembly:single
+    $ mvn package -DskipTests
     $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
     $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
     $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 06 Nov 2013 17:53:42 -0000</pubDate><guid>https://sourceforge.net3bf5d78ffc0d166522dcd4c5643e8090a96f4607</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -14,7 +14,7 @@
     $ mvn compile assembly:single
     $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
     $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
-    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/uima -o bigrams
+    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams

 where &lt;code&gt;HDFS_DIR&lt;/code&gt; should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Mon, 04 Nov 2013 17:33:12 -0000</pubDate><guid>https://sourceforge.netdae98dd24b0b88022eb6de264c44051e3b10d3f9</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v15
+++ v16
@@ -12,9 +12,9 @@
     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
     $ mvn compile assembly:single
-    $ bin/webcorpus-setup --hdfs-dir=HDFS_DIR --with-examples
-    $ bin/webcorpus-process-archives --hdfs-dir=HDFS_DIR -i input/en -o processed --lang en --format leipzig
-    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir=HDFS_DIR -i processed/uima -o bigrams
+    $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
+    $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
+    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/uima -o bigrams

 where &lt;code&gt;HDFS_DIR&lt;/code&gt; should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 19 Jun 2013 19:24:33 -0000</pubDate><guid>https://sourceforge.net39047ba6f55f52405d681e38b23d093303b9e88f</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v14
+++ v15
@@ -2,6 +2,9 @@
 WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.
+
+
+&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
@@ -23,8 +26,5 @@

 When everything has completed, you will find all extracted bigrams along with their counts in &lt;code&gt;HDFS_DIR/bigrams&lt;/code&gt;.

-
-&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.
-
 # Documentation
 WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 19 Jun 2013 18:25:15 -0000</pubDate><guid>https://sourceforge.neta2c18ae031ff8b137e59f4a4306994512b5e7050</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v13
+++ v14
@@ -1,7 +1,7 @@
 # About
-WebCorpus is a Hadoop-based Java tool that enables you to calculate statistics on large corpora extracted from web crawls. Currently supported are web crawls in [ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).
+WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

-At the moment, webcorpus can count n-grams, cooccurrences and POS-n-grams.
+Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
@@ -27,4 +27,4 @@
 &lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Documentation
-WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [Documentation]([Documentation]) wiki page.
+WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 14 Jun 2013 00:15:46 -0000</pubDate><guid>https://sourceforge.net4fd1fa910f51e6ea927bb637c7706cfc708140bf</guid></item><item><title>WikiPage Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v12
+++ v13
@@ -6,7 +6,7 @@
 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:

-    $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \\
+    $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
     $ mvn compile assembly:single
     $ bin/webcorpus-setup --hdfs-dir=HDFS_DIR --with-examples
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 15 Mar 2013 14:52:37 -0000</pubDate><guid>https://sourceforge.net5146215919d2d9f1b69a094fb8084b0b89f0b514</guid></item></channel></rss>