<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Home</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>Recent changes to Home</description><atom:link href="https://sourceforge.net/p/webcorpus/wiki/Home/feed" rel="self"/><language>en</language><lastBuildDate>Tue, 21 Jan 2014 12:47:47 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/webcorpus/wiki/Home/feed" rel="self" type="application/rss+xml"/><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v21
+++ v22
@@ -29,4 +29,4 @@
 # How to Cite
 If you use this software in scientific projects, please cite the following paper:

-Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:47:47 -0000</pubDate><guid>https://sourceforge.netf92d45fc42f40f60be71b726feb34773ab9093e9</guid></item><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v20
+++ v21
@@ -18,7 +18,7 @@
 1. Check out the webcorpus code from SVN
 2. Build the project using Maven
 3. Prepare the HDFS project structure with the &lt;code&gt;webcorpus-setup&lt;/code&gt; script. The &lt;code&gt;--with-examples&lt;/code&gt; option will download small example web crawls to &lt;code&gt;HDFS_DIR/input&lt;/code&gt;
-4. Extract, filter and annotate sentences from the english example corpus and put them into &lt;code&gt;HDFS_DIR/processed&lt;/code&gt;. This will, for example, deduplicate sentences by content and URL.
+4. Extract, filter and annotate sentences from the English example corpus and put them into &lt;code&gt;HDFS_DIR/processed&lt;/code&gt;. This will, for example, deduplicate sentences by content and URL.
 5. Count bigrams on the filtered sentences

 When everything has completed, you will find all extracted bigrams along with their counts in &lt;code&gt;HDFS_DIR/bigrams&lt;/code&gt;.
@@ -29,4 +29,4 @@
 # How to Cite
 If you use this software in scientific projects, please cite the following paper:

-Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (www.jlcl.org/2013_Heft2/H2013-2.pdf)
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 ()
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:47:06 -0000</pubDate><guid>https://sourceforge.neta4196676f14caa9c3db3b56499e76dc77c6bda6d</guid></item><item><title>Home modified by Chris Biemann</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v19
+++ v20
@@ -25,3 +25,8 @@

 # Documentation
 WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
+
+# How to Cite
+If you use this software in scientific projects, please cite the following paper:
+
+Biemann, C., Bildhauer, F., Evert, S., Goldhahn, D., Quasthoff, U., Schäfer, R., Simon, J., Swiezinski, L., Zesch, T. (2013): Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2):23-59 (www.jlcl.org/2013_Heft2/H2013-2.pdf)
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Chris Biemann</dc:creator><pubDate>Tue, 21 Jan 2014 12:45:48 -0000</pubDate><guid>https://sourceforge.net144795ee0e13ec3d08a3e8cceb94cb55d11aa370</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v18
+++ v19
@@ -4,7 +4,7 @@
 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.

 # Quickstart
-You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
+You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (currently Hadoop 2.x), try this to download the webcorpus package and count bigrams on an example corpus:

     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 08 Nov 2013 16:14:22 -0000</pubDate><guid>https://sourceforge.net107765ba9d314615f2c7d24eae620e3124b0dc6f</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -2,16 +2,13 @@
 WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.
-
-
-&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:

     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
-    $ mvn compile assembly:single
+    $ mvn package -DskipTests
     $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
     $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
     $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 06 Nov 2013 17:53:42 -0000</pubDate><guid>https://sourceforge.net3bf5d78ffc0d166522dcd4c5643e8090a96f4607</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -14,7 +14,7 @@
     $ mvn compile assembly:single
     $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
     $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
-    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/uima -o bigrams
+    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/sentAnnotate -o bigrams

 where &lt;code&gt;HDFS_DIR&lt;/code&gt; should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Mon, 04 Nov 2013 17:33:12 -0000</pubDate><guid>https://sourceforge.netdae98dd24b0b88022eb6de264c44051e3b10d3f9</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v15
+++ v16
@@ -12,9 +12,9 @@
     $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
     $ mvn compile assembly:single
-    $ bin/webcorpus-setup --hdfs-dir=HDFS_DIR --with-examples
-    $ bin/webcorpus-process-archives --hdfs-dir=HDFS_DIR -i input/en -o processed --lang en --format leipzig
-    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir=HDFS_DIR -i processed/uima -o bigrams
+    $ bin/webcorpus-setup --hdfs-dir HDFS_DIR --with-examples
+    $ bin/webcorpus-process-archives --hdfs-dir HDFS_DIR -i input/en -o processed --lang en --format leipzig
+    $ bin/webcorpus-count ngrams -n 2 --hdfs-dir HDFS_DIR -i processed/uima -o bigrams

 where &lt;code&gt;HDFS_DIR&lt;/code&gt; should be replaced by the HDFS directory in which to place the processing input and output (for example "/user/yourname/webcorpus"). This will:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 19 Jun 2013 19:24:33 -0000</pubDate><guid>https://sourceforge.net39047ba6f55f52405d681e38b23d093303b9e88f</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v14
+++ v15
@@ -2,6 +2,9 @@
 WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

 Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.
+
+
+&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
@@ -23,8 +26,5 @@

 When everything has completed, you will find all extracted bigrams along with their counts in &lt;code&gt;HDFS_DIR/bigrams&lt;/code&gt;.

-
-&lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.
-
 # Documentation
 WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 19 Jun 2013 18:25:15 -0000</pubDate><guid>https://sourceforge.neta2c18ae031ff8b137e59f4a4306994512b5e7050</guid></item><item><title>Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v13
+++ v14
@@ -1,7 +1,7 @@
 # About
-WebCorpus is a Hadoop-based Java tool that enables you to calculate statistics on large corpora extracted from web crawls. Currently supported are web crawls in [ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).
+WebCorpus is a Hadoop-based Java tool that allows computation of statistics on large corpora extracted from web crawls. Currently supported are web crawls in WARC/[ARC format](http://archive.org/web/researcher/ArcFileFormat.php) and archives from the [Leipzig corpora collection](http://corpora.uni-leipzig.de/).

-At the moment, webcorpus can count n-grams, cooccurrences and POS-n-grams.
+Out of the box, webcorpus can count n-grams, cooccurrences and POS-n-grams. However, custom statistics can be added easily. See the [documentation]([Documentation]) for more on this.

 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:
@@ -27,4 +27,4 @@
 &lt;i&gt;&lt;b&gt;Hadoop Compatibility Note:&lt;/b&gt; This project has only been tested with Cloudera's Hadoop version &lt;b&gt;hadoop-0.20.2-cdh3u5&lt;/b&gt; from CDH3. For download and installation instructions for CDH3, see [this page](https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDH3Update5)&lt;/i&gt;.

 # Documentation
-WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [Documentation]([Documentation]) wiki page.
+WebCorpus builds on Hadoop as its foundation. Everything to be processed is submitted as separate jobs to the Hadoop cluster and the results are written to its HDFS. Jobs are run in a pipeline fashion, where each pipeline step can either filter, modify, or split the input. For an in-depth explanation of the involved Hadoop jobs, and their pipeline structure, see the [documentation]([Documentation]) wiki page.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 14 Jun 2013 00:15:46 -0000</pubDate><guid>https://sourceforge.net4fd1fa910f51e6ea927bb637c7706cfc708140bf</guid></item><item><title>WikiPage Home modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Home/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v12
+++ v13
@@ -6,7 +6,7 @@
 # Quickstart
 You don't feel like reading through all the documentation and want to get started right away? After making sure you have a compatible Hadoop version (see note below), try this to download the webcorpus package and count bigrams on an example corpus:

-    $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \\
+    $ svn checkout svn://svn.code.sf.net/p/webcorpus/code/trunk webcorpus &amp;&amp; \
       export WEBCORPUS_HOME=`pwd`/webcorpus &amp;&amp; cd $WEBCORPUS_HOME
     $ mvn compile assembly:single
     $ bin/webcorpus-setup --hdfs-dir=HDFS_DIR --with-examples
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 15 Mar 2013 14:52:37 -0000</pubDate><guid>https://sourceforge.net5146215919d2d9f1b69a094fb8084b0b89f0b514</guid></item></channel></rss>