<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Documentation</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>Recent changes to Documentation</description><language>en</language><lastBuildDate>Thu, 17 Dec 2015 09:47:45 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/webcorpus/wiki/Documentation/feed" rel="self" type="application/rss+xml"/><item><title>Documentation modified by Steffen Remus</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v25
+++ v26
@@ -787,7 +787,7 @@
 -----

-# &lt;a name="Shellscript_to_run_the_jobs_40run.sh_41"&gt;&lt;/a&gt; Example Shellscript to run the jobs (run.sh) 
+# &lt;a name="Shellscript_to_run_the_jobs_40run.sh_41"&gt;&lt;/a&gt; Example Shellscript to run the jobs

 &lt;pre&gt;HDFS_DIRECTORY=webcorpus
 JAR=../webcorpus.jar
&lt;/pre&gt;


&lt;/pre&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Steffen Remus</dc:creator><pubDate>Thu, 17 Dec 2015 09:47:45 -0000</pubDate><guid isPermaLink="false">https://sourceforge.net04875f5b14892858aa9741852ac9f9fcbab65750</guid></item><item><title>Documentation modified by Steffen Remus</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Steffen Remus</dc:creator><pubDate>Thu, 17 Dec 2015 09:47:08 -0000</pubDate><guid isPermaLink="false">https://sourceforge.net9cd393739d75a20da8e971bf36f941c95ce5dbe3</guid></item><item><title>Documentation modified by Steffen Remus</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v23
+++ v24
@@ -744,7 +744,7 @@
 &lt;/pre&gt;

&lt;p&gt;Note that total count is higher than the number of URLs. This means, that a Sentence occurs multiple time in at least one page. &lt;/p&gt;
&lt;p&gt;-&lt;br/&gt;
+&amp;lt;!--&lt;br/&gt;
 # &lt;a name="How_to_run"&gt;&lt;/a&gt; How to run &lt;/p&gt;
&lt;p&gt;@@ -779,10 +779,15 @@&lt;br/&gt;
 *   Edit webcorpus/mycorpus/jobconf.txt according to your requirements. &lt;br/&gt;
 *   &lt;strong&gt;&amp;gt; cd mycorpus&lt;/strong&gt; &lt;br/&gt;
 *   &lt;strong&gt;&amp;gt; sh run.sh&lt;/strong&gt; &lt;br/&gt;
-&lt;br/&gt;
-&lt;br/&gt;
-&lt;br/&gt;
-## &lt;a name="Shellscript_to_run_the_jobs_40run.sh_41"&gt;&lt;/a&gt; Shellscript to run the jobs (run.sh) &lt;br/&gt;
+--&amp;gt;&lt;br/&gt;
+&lt;br/&gt;
+&lt;br/&gt;
+&lt;br/&gt;
+&lt;br/&gt;
+-----&lt;br/&gt;
+&lt;br/&gt;
+&lt;br/&gt;
+# &lt;a name="Shellscript_to_run_the_jobs_40run.sh_41"&gt;&lt;/a&gt; Example Shellscript to run the jobs (run.sh) &lt;/p&gt;
&lt;p&gt;&lt;/p&gt;&lt;pre&gt;HDFS_DIRECTORY=webcorpus&lt;br/&gt;
 JAR=../webcorpus.jar&lt;br/&gt;
&lt;/pre&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Steffen Remus</dc:creator><pubDate>Thu, 17 Dec 2015 09:46:38 -0000</pubDate><guid isPermaLink="false">https://sourceforge.net276090de20eb4ebf9ad216756a07d0bd10c5db55</guid></item><item><title>Documentation modified by Johannes Simon</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v22
+++ v23
@@ -491,7 +491,7 @@

 *   Takes optional parameter \[language_name\] (ISO 639-1) to set language to look for. 
-    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.LanguageJob webcorpus\_data/sentence webcorpus\_data/language [language_name] 
+    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.LanguageJob webcorpus\_data/sentence webcorpus\_data/language \[language_name\] 
 *   Sentences, where the detected language (lani) matches language\_name, will be labeled with: lang=language\_name, lani=language_name 
 *   Where lani does not match language_name for a sequence of sentences, following rules apply: 
     *   if sequence is at the beginning of a paragraph: lang=lani, lani=lani 
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes Simon</dc:creator><pubDate>Thu, 20 Mar 2014 21:09:54 -0000</pubDate><guid isPermaLink="false">https://sourceforge.net67b219f8964725cc3f36eb82f8fd26d540df165e</guid></item><item><title>Documentation modified by Johannes Simon</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v21
+++ v22
@@ -404,7 +404,7 @@
 ### &lt;a name="UTF8Job"&gt;&lt;/a&gt; UTF8Job
 Encoding errors can lead to undesirable behaviour in language technology. This job tries to filter at least the obvious appearances of encoding errors.

-*   Labels (*encoding_error:=[true|false]*) or removes documents with defective encoding. 
+*   Labels (*encoding_error:=\[true|false\]*) or removes documents with defective encoding. 
 *   Detects defective encoding just by looking for "�" (unknown glyph). 

@@ -699,9 +699,8 @@

 #### &lt;a name="Data_example_AN9"&gt;&lt;/a&gt; Data example 

-##### &lt;a name="Before_AN9"&gt;&lt;/a&gt; Before  Expects output of 
-
-&lt;a href="/bin/edit/Hiwi/LanguageJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;LanguageJob&lt;/a&gt; as input. 
+##### &lt;a name="Before_AN9"&gt;&lt;/a&gt; Before
+Expects output of &lt;a href="/bin/edit/Hiwi/LanguageJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;LanguageJob&lt;/a&gt; as input. 
 ##### &lt;a name="Call_AN9"&gt;&lt;/a&gt; Call 

 &lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.SentenceExtractJob webcorpus_data/language webcorpus_data/sentenceExtract
@@ -730,9 +729,8 @@

 #### &lt;a name="Data_example_AN10"&gt;&lt;/a&gt; Data example 

-##### &lt;a name="Before_AN10"&gt;&lt;/a&gt; Before  Expects output of 
-
-&lt;a href="/bin/edit/Hiwi/SentenceExtractJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;SentenceExtractJob&lt;/a&gt; as input. 
+##### &lt;a name="Before_AN10"&gt;&lt;/a&gt; Before
+Expects output of &lt;a href="/bin/edit/Hiwi/SentenceExtractJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;SentenceExtractJob&lt;/a&gt; as input. 
 ##### &lt;a name="Call_AN10"&gt;&lt;/a&gt; Call 

 &lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.SentenceExtractCompactJob webcorpus_data/sentenceExtract webcorpus_data/sentenceExtractCompact
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes Simon</dc:creator><pubDate>Thu, 20 Mar 2014 21:08:34 -0000</pubDate><guid isPermaLink="false">https://sourceforge.netf22c5b5253d1bd50708979a973d81b0cf114afa6</guid></item><item><title>Documentation modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v20
+++ v21
@@ -145,7 +145,7 @@

     &lt;tr&gt;
       &lt;td&gt;
-        &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; (n)
+        &lt;a href="#SentenceAnnotateJob"&gt;SentenceAnnotateJob&lt;/a&gt; (n)
       &lt;/td&gt;

       &lt;td&gt;
@@ -529,14 +529,14 @@
 &lt;/td&gt;&lt;/tr&gt;&lt;/pre&gt;
&lt;p&gt;The second sentence is considered to be an english sentence, as it is following an english sentence and is short enough. The last sentence is removed. &lt;/p&gt;
&lt;p&gt;-### &lt;a name="UIMAJob"&gt;&lt;/a&gt; UIMAJob&lt;br /&gt;
+### &lt;a name="SentenceAnnotateJob"&gt;&lt;/a&gt; SentenceAnnotateJob&lt;/p&gt;
&lt;p&gt;Runs arbitrary UIMA components on deduplified sentences.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Counts n-grams based on tokens. &lt;/li&gt;
&lt;li&gt;Expects output of SentenceExtractCompactJob as input&lt;/li&gt;
&lt;li&gt;Writes one XML-serialized CAS per line. Compresses each CAS with GZip.&lt;br /&gt;
-&lt;em&gt;   hadoop jar webcorpus.jar webcorpus.hadoopjobs.UIMAJob webcorpus_data/sentenceExtractCompact webcorpus_data/uima&lt;br /&gt;
+&lt;/em&gt;   hadoop jar webcorpus.jar webcorpus.hadoopjobs.SentenceAnnotateJob webcorpus_data/sentenceExtractCompact webcorpus_data/sentenceAnnotate&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;@@ -556,7 +556,7 @@&lt;/p&gt;
&lt;p&gt;##### &lt;a name="Before_AN7"&gt;&lt;/a&gt; Before &lt;/p&gt;
&lt;p&gt;-Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. &lt;br /&gt;
+Expects the output of &lt;a href="#SentenceAnnotateJob"&gt;SentenceAnnotateJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramCountJob webcorpus_data/token webcorpus_data/ngram -n 3&lt;br /&gt;
@@ -589,7 +589,7 @@&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt;##### &lt;a name="Before_AN7"&gt;&lt;/a&gt; Before &lt;/p&gt;
&lt;p&gt;-Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. &lt;br /&gt;
+Expects the output of &lt;a href="#SentenceAnnotateJob"&gt;SentenceAnnotateJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramWithPOSCountJob webcorpus_data/token webcorpus_data/ngram-with-pos -n 3&lt;br /&gt;
@@ -627,7 +627,7 @@&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt;##### &lt;a name="Before_AN7"&gt;&lt;/a&gt; Before &lt;/p&gt;
&lt;p&gt;-Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. &lt;br /&gt;
+Expects the output of &lt;a href="#SentenceAnnotateJob"&gt;SentenceAnnotateJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.POSNGramCountJob webcorpus_data/token webcorpus_data/pos-ngram -n 3&lt;br /&gt;
&lt;/pre&gt;&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 03 Jul 2013 14:17:06 -0000</pubDate><guid isPermaLink="false">https://sourceforge.net0635753d9b47aebb434901a7ae02c0d4f1c405c9</guid></item><item><title>Documentation modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v19
+++ v20
@@ -300,7 +300,12 @@
 *   Detects duplicates by same URL, same length and same content in first and last n characters. 
 *   Output one document per line. Format (replace #TAB# with tab): **URL#TAB#sm#TAB#pm#TAB#document**

-
+#### &lt;a name="Configuration"&gt;&lt;/a&gt; Configuration
+There's two configuration options that are of special relevance here:
+
+To only read documents from the archive that are of a specific mime type, you can turn on mime-type filtering using
+&lt;code&gt;conf.set("webcorpus.common.io.warcinputformat.filter-mimetypes", true)&lt;/code&gt;. The default value is &lt;code&gt;false&lt;/code&gt;.
+To specify a list of mime-types to be read, use &lt;code&gt;conf.set("webcorpus.documentjob.content-type-whitelist", "type1, type2, ...")&lt;/code&gt;, e.g. "text/html" to keep only HTML documents.

 #### &lt;a name="Data_example"&gt;&lt;/a&gt; Data example 

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Wed, 03 Jul 2013 09:02:19 -0000</pubDate><guid isPermaLink="false">https://sourceforge.netf4d2ae22471408ed3d603b693f0a5cef717682e4</guid></item><item><title>Documentation modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v18
+++ v19
@@ -109,7 +109,7 @@

     &lt;tr&gt;
       &lt;td&gt;
-        &lt;a href="#SentenceJob"&gt;SentenceJob&lt;/a&gt; (--lang=&amp;lt;language&amp;gt;)
+        &lt;a href="#SentenceJob"&gt;SentenceJob&lt;/a&gt; (language)
       &lt;/td&gt;

       &lt;td&gt;
@@ -127,7 +127,7 @@

     &lt;tr&gt;
       &lt;td&gt;
-        &lt;a href="#LanguageJob"&gt;LanguageJob&lt;/a&gt; (--lang=&amp;lt;language&amp;gt;)
+        &lt;a href="#LanguageJob"&gt;LanguageJob&lt;/a&gt; (language)
       &lt;/td&gt;

       &lt;td&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 14 Jun 2013 19:53:32 -0000</pubDate><guid isPermaLink="false">https://sourceforge.netd3a9e91476c4a84eb852be95a605c90de47a73de</guid></item><item><title>Documentation modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -318,7 +318,7 @@

 ##### &lt;a name="Call"&gt;&lt;/a&gt; Call 

-&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.DocumentJob webcorpus_data/raw webcorpus_data/document &amp;lt;format&amp;gt;
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.DocumentJob webcorpus_data/raw webcorpus_data/document --input-format &amp;lt;format&amp;gt;
 &lt;/pre&gt;

 Here, &lt;code&gt;format&lt;/code&gt; must be one of "warc", "arc" or "leipzig".
@@ -468,7 +468,7 @@

 ##### &lt;a name="Call_AN4"&gt;&lt;/a&gt; Call 

-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.SentenceJob webcorpus_data/utf8 webcorpus_data/sentence --lang=en
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.SentenceJob webcorpus_data/utf8 webcorpus_data/sentence --lang en
 &lt;/pre&gt;

 ##### &lt;a name="After_AN4"&gt;&lt;/a&gt; After 
@@ -514,7 +514,7 @@

 ##### &lt;a name="Call_AN5"&gt;&lt;/a&gt; Call 

-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.LanguageJob webcorpus_data/sentence webcorpus_data/language en
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.LanguageJob webcorpus_data/sentence webcorpus_data/language --lang en
 &lt;/pre&gt;

 ##### &lt;a name="After_AN5"&gt;&lt;/a&gt; After 
@@ -543,7 +543,7 @@

 *   Counts n-grams based on tokens. 
 *   Expects a parameter to set n: 
-    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramCountJob webcorpus\_data/token webcorpus\_data/ngram \[n\] 
+    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramCountJob webcorpus\_data/token webcorpus\_data/ngram -n &amp;lt;n&amp;gt; 

@@ -554,7 +554,7 @@
 Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. 
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call 

-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramCountJob webcorpus_data/token webcorpus_data/ngram 3
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramCountJob webcorpus_data/token webcorpus_data/ngram -n 3

 &lt;/pre&gt;
&lt;p&gt;Calculate some 3-grams. &lt;/p&gt;
&lt;p&gt;@@ -576,7 +576,7 @@&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Counts n-grams based on tokens. &lt;/li&gt;
&lt;li&gt;Expects a parameter to set n: &lt;/li&gt;
&lt;li&gt;
&lt;ul&gt;
&lt;li&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramWithPOSCountJob webcorpus_data/token webcorpus_data/ngram [n] &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;ul&gt;
&lt;li&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramWithPOSCountJob webcorpus_data/token webcorpus_data/ngram-with-pos -n &amp;lt;n&amp;lt; &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;@@ -587,7 +587,7 @@&lt;br /&gt;
 Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramWithPOSCountJob webcorpus_data/token webcorpus_data/ngram 3&lt;br /&gt;
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.NGramWithPOSCountJob webcorpus_data/token webcorpus_data/ngram-with-pos -n 3&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt; Calculate some 3-grams. &lt;/p&gt;
&lt;p&gt;@@ -614,7 +614,7 @@&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
 *   Expects a parameter to set n: &lt;br /&gt;
-    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.POSNGramCountJob webcorpus_data/token webcorpus_data/ngram [n] &lt;br /&gt;
+    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.POSNGramCountJob webcorpus_data/token webcorpus_data/pos-ngram -n &amp;lt;n&amp;lt; &lt;/p&gt;
&lt;p&gt;@@ -625,7 +625,7 @@&lt;br /&gt;
 Expects the output of &lt;a href="#UIMAJob"&gt;UIMAJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN7"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.POSNGramCountJob webcorpus_data/token webcorpus_data/ngram 3&lt;br /&gt;
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.POSNGramCountJob webcorpus_data/token webcorpus_data/pos-ngram -n 3&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt; Calculate some 3-grams. &lt;/p&gt;
&lt;p&gt;@@ -649,7 +649,7 @@&lt;br /&gt;
 *   Counts cooccurrences based on tokens with distance up to n. &lt;br /&gt;
 *   Outputs all cooccurrences for all distances (1, 2, ..., n) at once. &lt;br /&gt;
 *   Expects a parameter to set n: &lt;br /&gt;
-    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.CooccurrenceJob webcorpus_data/token webcorpus_data/cooccurrence [n] &lt;br /&gt;
+    *   hadoop jar webcorpus.jar webcorpus.hadoopjobs.CooccurrenceJob webcorpus_data/token webcorpus_data/cooccurrence -n &amp;lt;n&amp;lt;&lt;/p&gt;
&lt;p&gt;@@ -660,7 +660,7 @@&lt;br /&gt;
&lt;a href="/bin/edit/Hiwi/CooccurrenceJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;CooccurrenceJob&lt;/a&gt; expects the output of &lt;a href="/bin/edit/Hiwi/TokenJob?topicparent=Hiwi.UKPWebCorpus" rel="nofollow" title="Create this topic"&gt;TokenJob&lt;/a&gt; as input. &lt;br /&gt;
 ##### &lt;a name="Call_AN8"&gt;&lt;/a&gt; Call &lt;/p&gt;
&lt;p&gt;-&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.CooccurrenceJob webcorpus_data/token webcorpus_data/cooccurrence 5&lt;br /&gt;
+&lt;/pre&gt;&lt;pre&gt;hadoop jar webcorpus.jar webcorpus.hadoopjobs.CooccurrenceJob webcorpus_data/token webcorpus_data/cooccurrence -n 5&lt;br /&gt;
 &lt;/pre&gt; Calculate cooccurrences up to distance 5. &lt;/p&gt;
&lt;p&gt;##### &lt;a name="After_AN8"&gt;&lt;/a&gt; After &lt;br /&gt;
@@ -790,15 +790,15 @@&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.DeduplicationJob ${HDFS_DIRECTORY}/document ${HDFS_DIRECTORY}/deduplication&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.DeduplicationByHostJob ${HDFS_DIRECTORY}/deduplication ${HDFS_DIRECTORY}/deduplicationByHost&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.UTF8Job ${HDFS_DIRECTORY}/deduplicationByHost ${HDFS_DIRECTORY}/utf8&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.SentenceJob ${HDFS_DIRECTORY}/utf8 ${HDFS_DIRECTORY}/sentence&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.LanguageJob ${HDFS_DIRECTORY}/sentence ${HDFS_DIRECTORY}/language ${LANGUAGE}&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.SentenceJob ${HDFS_DIRECTORY}/utf8 ${HDFS_DIRECTORY}/sentence --lang en&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.LanguageJob ${HDFS_DIRECTORY}/sentence --lang en ${HDFS_DIRECTORY}/language ${LANGUAGE}&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.TokenJob ${HDFS_DIRECTORY}/language ${HDFS_DIRECTORY}/token&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/1gram 1&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/2gram 2&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/3gram 3&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/4gram 4&lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/5gram 5 &lt;br /&gt;
-hadoop jar ${JAR} webcorpus.hadoopjobs.CooccurrenceJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/cooccurrence 5&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/1gram -n 1&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/2gram -n 2&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/3gram -n 3&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/4gram -n 4&lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.NGramCountJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/5gram -n 5 &lt;br /&gt;
+hadoop jar ${JAR} webcorpus.hadoopjobs.CooccurrenceJob ${HDFS_DIRECTORY}/token ${HDFS_DIRECTORY}/cooccurrence -n 5&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.SentenceExtractJob ${HDFS_DIRECTORY}/language ${HDFS_DIRECTORY}/sentenceExtract&lt;br /&gt;
 hadoop jar ${JAR} webcorpus.hadoopjobs.SentenceExtractCompactJob ${HDFS_DIRECTORY}/sentenceExtract ${HDFS_DIRECTORY}/sentenceExtractCompact&lt;br /&gt;
&lt;br /&gt;
@@ -845,8 +845,8 @@&lt;br /&gt;
 dedupBloomVectorSize=1024&lt;/p&gt;
&lt;p&gt;# Bloom filter: number of hashes to consider&lt;br /&gt;
-# int - default: 1024&lt;br /&gt;
-dedupBloomNbHash=1024&lt;br /&gt;
+# int - default: 7&lt;br /&gt;
+dedupBloomNbHash=7&lt;/p&gt;
&lt;p&gt;# Deduplication Bloom Filter hash function&lt;br /&gt;
 # {murmur, jenkins} - default: jenkins&lt;br /&gt;
&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Fri, 14 Jun 2013 19:52:24 -0000</pubDate><guid isPermaLink="false">https://sourceforge.neta679bbe22ecedba32af58d31b450160aa47075e9</guid></item><item><title>Documentation modified by Johannes</title><link>https://sourceforge.net/p/webcorpus/wiki/Documentation/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -109,7 +109,7 @@

     &lt;tr&gt;
       &lt;td&gt;
-        &lt;a href="#SentenceJob"&gt;SentenceJob (--lang=en|de)&lt;/a&gt;
+        &lt;a href="#SentenceJob"&gt;SentenceJob&lt;/a&gt; (--lang=&amp;lt;language&amp;gt;)
       &lt;/td&gt;

       &lt;td&gt;
@@ -121,13 +121,13 @@
       &lt;/td&gt;

       &lt;td&gt;
-        Sentences are wrapped with XML-s-tags. If possible, a language-specific sentence segmentation model is used.
-      &lt;/td&gt;
-    &lt;/tr&gt;
-    
-    &lt;tr&gt;
-      &lt;td&gt;
-        &lt;a href="#LanguageJob"&gt;LanguageJob&lt;/a&gt; (lang)
+        Sentences are wrapped with XML-s-tags. If possible, a language-specific sentence segmentation model is used. For the language, use its two-letter ISO 639-2 code.
+      &lt;/td&gt;
+    &lt;/tr&gt;
+    
+    &lt;tr&gt;
+      &lt;td&gt;
+        &lt;a href="#LanguageJob"&gt;LanguageJob&lt;/a&gt; (--lang=&amp;lt;language&amp;gt;)
       &lt;/td&gt;

       &lt;td&gt;
@@ -235,15 +235,15 @@

     &lt;tr&gt;
       &lt;td&gt;
-        &lt;a href="#SentenceExtractJob"&gt;SentenceExtractJob&lt;/a&gt; (lang)
-      &lt;/td&gt;
-      
-      &lt;td&gt;
-        One document per line with sentence annotation (and language annotation). Parameter giving expected language.
-      &lt;/td&gt;
-      
-      &lt;td&gt;
-        Extract sentences with expected language and maximum lenght of 512 characters.
+        &lt;a href="#SentenceExtractJob"&gt;SentenceExtractJob&lt;/a&gt;
+      &lt;/td&gt;
+      
+      &lt;td&gt;
+        One document per line with sentence annotation (and language annotation).
+      &lt;/td&gt;
+      
+      &lt;td&gt;
+        Extract sentences with expected language (specified in LanguageJob run) and maximum length of 512 characters.
       &lt;/td&gt;

       &lt;td&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Johannes</dc:creator><pubDate>Thu, 13 Jun 2013 16:06:27 -0000</pubDate><guid isPermaLink="false">https://sourceforge.nete5f2dc9ea4b4709bf49e252cee06cb58835af0d4</guid></item></channel></rss>