We released version 0.3 for TREC Web track participants who work on the new ClueWeb12 dataset.
We rewrote the code so that it uses the new Hadoop API. This code will be released later, but for people who cannot wait: it is already available from the SVN repository.
This draft report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page of 10 results, using a simple index of anchor texts and page titles, combined with spam removal.
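The "relevant documents on the first result page" figure corresponds to the standard precision-at-10 measure: 3 relevant documents among 10 results is a precision at 10 of 0.3. A minimal sketch of that arithmetic follows; the function name and the document identifiers are hypothetical and are not taken from the report or from MIREX.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant.

    Divides by k even if fewer than k results were returned, which is the
    usual convention for this measure.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical first result page with 3 relevant documents out of 10.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d5", "d9"}
print(precision_at_k(ranking, relevant))  # 0.3
```

Averaging this value over all topics gives the per-system number quoted above.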
MIREX was presented at the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation, which took place on 20-23 September 2010 in Padua, Italy; see: http://www.clef2010.org/
Djoerd Hiemstra and Claudia Hauff. MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data". In: Multilingual and Multimodal Information Access Evaluation. Lecture Notes in Computer Science 6360. Springer Verlag, pages 64-69, September 2010.
http://eprints.eemcs.utwente.nl/18469
We released version 0.2, which supports several standard information retrieval models, such as language models with linear interpolation smoothing, language models with Dirichlet smoothing, and Okapi's BM25.
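For readers unfamiliar with these three models, the sketch below gives the textbook per-term scoring formulas. It is not the MIREX implementation: the function names, parameter names, and default values (`lam`, `mu`, `k1`, `b`) are assumptions chosen for illustration, and the BM25 inverse document frequency uses the classic Robertson/Sparck Jones form, which other systems vary. A document's score for a query is the sum of the per-term scores.

```python
import math


def lm_linear_interpolation(tf, doc_len, cf, collection_len, lam=0.5):
    """Log-probability of a term under linear interpolation smoothing.

    lam is the weight of the collection model here (an assumed convention).
    Requires cf > 0, i.e. the term occurs somewhere in the collection.
    """
    p_doc = tf / doc_len if doc_len > 0 else 0.0
    p_col = cf / collection_len
    return math.log((1.0 - lam) * p_doc + lam * p_col)


def lm_dirichlet(tf, doc_len, cf, collection_len, mu=1000.0):
    """Log-probability of a term under Dirichlet smoothing (requires cf > 0)."""
    p_col = cf / collection_len
    return math.log((tf + mu * p_col) / (doc_len + mu))


def bm25(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of a term in a document."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

With the classic inverse document frequency, terms that occur in more than half of the documents receive a negative BM25 weight; some implementations clamp or shift the value to avoid this.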
We’ve put anchor text for the English Category A documents of the TREC ClueWeb09 collection online at:
* http://pathfinder.cs.utwente.nl/cgi-bin/opensearch/mirex-anchors.txt.gz
The file contains anchor text for about 87% of the pages in Category A. To keep the file manageable, the text for a page is cut once more than 10 MB of anchors have been collected for it. The file size is about 21 GB (gzipped). It is a tab-separated text file with the fields (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in (please cite the report if you use the data in your research): ...
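Because the file is large, it is best read as a stream rather than unpacked first. The sketch below shows one way to do that with Python's standard library. It assumes one record per line and UTF-8 text, neither of which is stated above; the function names and the local file name in the usage comment are hypothetical.

```python
import gzip


def parse_anchor_lines(lines):
    """Yield (trec_id, url, anchor_text) tuples from tab-separated lines.

    Splits on the first two tabs only, so any further tabs stay inside the
    anchor text. Lines with fewer than three fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3:
            yield tuple(fields)


def read_anchor_file(path):
    """Stream records from the gzipped file without decompressing it to disk."""
    # Assumption: UTF-8 text; undecodable bytes are replaced, not fatal.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        yield from parse_anchor_lines(f)


# Usage (hypothetical local copy of the download):
# for trec_id, url, anchor_text in read_anchor_file("mirex-anchors.txt.gz"):
#     ...
```

Since the anchors for a single page can be up to about 10 MB, individual lines may be very long; processing them one at a time keeps memory use bounded by the longest line.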
We created a MIREX project on SourceForge (yes, that's here). Watch us for a first official release of the MIREX software.