
<body>

<h1>MIREX 0.4</h1>

"MIREX" stands for MapReduce Information Retrieval Experiments. 
MIREX provides a simple and flexible solution for doing large-scale 
information retrieval experiments. More info: http://mirex.sf.net 
Please cite the following paper if you use MIREX in your scientific 
work:

<blockquote>

    Djoerd Hiemstra and Claudia Hauff. 
    MapReduce for information retrieval evaluation: Let's quickly 
    test this on 12 TB of data. In <em> Proceedings of the CLEF 2010
    Conference on Multilingual and Multimodal Information Access 
    Evaluation </em>, September 2010, Padua, Italy.

</blockquote>


<h2> Install using Maven </h2>

In pom.xml, change the versions of <tt>hadoop-mapreduce-client-core</tt>
and <tt>hadoop-common</tt> such that they match your Hadoop distribution, 
then run <tt>mvn package</tt>. 
This produces the file <tt>target/mirex-0.4.jar</tt>. 
The code was tested on Cloudera's CDH5 Hadoop distribution,
Hadoop version 2.3.
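
For example, a typical build session looks like this (assuming <tt>hadoop</tt>
and <tt>mvn</tt> are on your path; <tt>hadoop version</tt> tells you which
version to put in pom.xml):

<pre>

% hadoop version        # note the version reported here
% vi pom.xml            # set hadoop-mapreduce-client-core and hadoop-common to that version
% mvn package
% ls target/mirex-0.4.jar

</pre>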


<h2> Copy files </h2>

First, the WARC input files must be uploaded to HDFS:

<pre>

% src=$dir_holding_CLUEWEB_dirs_in_vfs   # example: ClueWeb
% dst=$dir_which_will_hold_CLUEWEB_dirs_in_hdfs   # example: ClueWeb
% cd $src
% for dir in CLUEWEB09*/ClueWeb09_English_*; do echo $dir; \
    hadoop dfs -copyFromLocal $dir $dst/$dir; done

</pre>

For the TREC tasks, you need the topics of the TREC 2009 Web Track:

<pre>

% wget http://trec.nist.gov/data/web/09/wt09.topics.queries-only
% hadoop dfs -copyFromLocal wt09.topics.queries-only wt09.topics.queries-only

</pre>

or, for the 2010 topics:

<pre>

% wget http://trec.nist.gov/data/web/10/wt2010-topics.queries-only 
% hadoop fs -put wt2010-topics.queries-only wt2010-topics.queries-only

</pre>

In the following we will use the 2010 topic set.

<h2> Usage </h2>

Run anchor text extraction as follows (this will create the output directory 
<tt>ClueWeb09_Anchors</tt>, with one line of anchor text per document found in 
the input WARC files):

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.AnchorExtract \ 
  ClueWeb09_English/*/ ClueWeb09_Anchors

</pre>
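
As a quick, optional sanity check, you can inspect the first few lines of the
anchor text output; each line corresponds to one document:

<pre>

% hadoop fs -cat ClueWeb09_Anchors/part-* | head

</pre>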

Then, perform a TREC run on the extracted anchor texts:

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.TrecRun \ 
  KEYVAL ClueWeb09_Anchors/* TrecOut wt2010-topics.queries-only

</pre>

Alternatively, first extract global statistics for your query terms from 
the output of the anchor text extraction:

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.QueryTermCount \ 
  KEYVAL ClueWeb09_Anchors/*  wt2010-topics.queries-only  \
  wt2010-topics.stats

</pre>

This will create a new TREC query file <tt> wt2010-topics.stats </tt>
that includes global statistics for each query term, for instance: 
<tt> wt10-14:dinosaurs=1=29584=403875 </tt>.
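
You can inspect the statistics file the same way (an optional check, assuming
it was written as a single file as described above):

<pre>

% hadoop fs -cat wt2010-topics.stats | head

</pre>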

Then, run several baseline retrieval models using:

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.TrecRunBaselines \ 
  KEYVAL ClueWeb09_Anchors/* BaselineOut wt2010-topics.stats

</pre>

Use the Perl script <a href="#trec-out.pl"> trec-out.pl</a>
provided below to produce output for <tt> trec_eval </tt>
(see: http://trec.nist.gov/trec_eval/):

<pre>

% hadoop fs -cat TrecOut/part-* | ./trec-out.pl \
  | sort -k 1n -k 4n >mirex.run

</pre>

Sorting is needed because <tt> trec_eval </tt> expects the results
to be sorted on query identifier. 
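
The run file can then be scored with <tt>trec_eval</tt>; the qrels file name
below is only a placeholder for the official TREC 2010 Web Track judgments:

<pre>

% ./trec_eval wt2010.qrels mirex.run

</pre>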

The <tt> TrecRunBaselines </tt> program outputs the results for 
several retrieval models at once. Select the results you want (for 
instance the BM25 run) with <tt>grep</tt>, as follows:

<pre>

% hadoop fs -cat BaselineOut/part-* | grep ":BM25" \
  | ./trec-out.pl | sort -k 1n -k 4n >bm25.run

</pre>

<h2> Perform runs directly on text </h2>

You can also create baseline runs directly on the collection text. 
To do so, modify the commands for <tt>TrecRun</tt>, <tt>QueryTermCount</tt>, and
<tt>TrecRunBaselines</tt>: replace the format parameter 
<tt>KEYVAL</tt> with <tt>WARC</tt>, and point the input argument at the 
directory containing the .warc.gz files instead.
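
For example, a run directly on the WARC files could look like this (the output
directory name <tt>TrecOutWarc</tt> is just an example):

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.TrecRun \
  WARC ClueWeb09_English/*/ TrecOutWarc wt2010-topics.queries-only

</pre>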

<h2> Toy Data </h2>

Alternatively, we provide some toy data. Here is how to upload 
it to the Hadoop file system:

<pre>

% hadoop fs -mkdir warc-files
% hadoop fs -put test/test.warc.gz ./warc-files/
% hadoop fs -put test/wt2010-topics.queries-only ./

</pre>
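
The toy data can then be used with the <tt>WARC</tt> format described above,
for instance (the output directory name <tt>ToyTrecOut</tt> is just an example):

<pre>

% hadoop jar mirex-0.4.jar nl.utwente.mirex.TrecRun \
  WARC warc-files/* ToyTrecOut wt2010-topics.queries-only

</pre>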


<h2> Copyright Notice </h2>

The contents of this file are subject to the PfTijah Public License
Version 1.1 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://dbappl.cs.utwente.nl/Legal/PfTijah-1.1.html
<p>
Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific language governing rights and limitations
under the License.
<p>
The Original Code is the Mirex system.
<p>
The Initial Developer of the Original Code is the "University of Twente".
Portions created by the "University of Twente" are
Copyright (C) 2010-2014 "University of Twente".
All Rights Reserved.
<p>
Authors: Djoerd Hiemstra, 
         Robin Aly

<h2> Dependencies </h2>

This product includes/uses software developed by others:
<ul>
  <li> Hadoop - Apache Software Foundation
  <li> Lemur Warc File Reader - Carnegie Mellon University
</ul>


<a name="trec-out.pl"></a>

<h2> Script that produces output for <tt> trec_eval </tt> </h2>

<pre>
#!/usr/bin/perl -w
#
#   trec-out.pl
#
# Script that converts MIREX output to trec_eval input

use strict; 

my $oldqid = "";
my $oldtag = "";
my $rank = 0;
while (&lt;STDIN&gt;) {
  chomp;                              # remove the trailing newline
  # each MIREX output line contains: query key, document id, score
  my ($key, $docid, $score) =  split;
  $key =~ s/^wt\d+-//;                # strip the topic prefix (wt09-, wt10-, ...)
  my ($qid, $tag) = split /:/, $key;  # the key may carry a run tag, e.g. "14:BM25"
  if (!defined($tag)) { $tag = "mirex"; }
  if ($oldtag eq "") { $oldtag = $tag; } 
  if ($tag ne $oldtag) { 
    die "ERROR: Multiple run tags ($tag, $oldtag): first select using grep"; 
  }
  if ($qid ne $oldqid) {              # new query: restart the rank counter
    $rank = 1;
    $oldqid = $qid;
  }
  else {
    $rank++;
  }
  print "$qid\tQ0\t$docid\t$rank\t$score\t$tag\n";
}

</pre>

</body>