Retrieval speed

Retrieval
arion
2013-09-11
2013-09-13
  • arion

    arion - 2013-09-11

    Hi,

    I am using indri-5.0 with the WT10G collection. I indexed the collection using the following parameters:

    <parameters>
    <memory>2048M</memory>
    <storeDocs>true</storeDocs>
    <index>WT10G</index>
    <corpus><path>/projects/WT10G</path><class>trecweb</class></corpus>
    <field><name>title</name></field>
    <field><name>heading</name></field>
    <field><name>body</name></field>
    <field><name>url</name></field>
    <field><name>document</name></field>
    <metadata><forward>title</forward></metadata>
    <metadata><forward>heading</forward></metadata>
    <metadata><forward>url</forward></metadata>
    <metadata><forward>body</forward></metadata>
    <stemmer><name>krovetz</name></stemmer>
    </parameters>
    

    When I tested the PHP search site with some queries from TREC-9 (2000), I noticed that the retrieval speed varies greatly, depending on the number of words in the query and on whether the query was being submitted for the first time.

    I need to reduce the deviation of the speed from the average, so that it does not vary between 0.1 and 6 seconds but stays close to a sub-second value. Is there a way to achieve that by changing the indexing parameters? Or perhaps by changing something in the code of the PHP web interface, so that it returns 100 results instead of all possibly relevant results?

    Thanks

     
  • David Fisher

    David Fisher - 2013-09-12

    The wording of your interest sounds suspiciously like the performance requirement for a production commercial system.

    Generally, query time is improved by running on a machine with sufficient memory to enable the entire index to be loaded into the disk cache.

    You would probably also do well to upgrade to a more current version of indri.

    No amount of tweaking can change the fact that query execution time grows as a function of the size of the posting lists for the query terms. Add more terms and the query takes longer. Use terms that are more frequent and the query takes longer. Different operators also introduce additional complexity, or opportunities for improved performance. For example, ordered window expressions, e.g. #1(term1 term2), are almost always faster to evaluate than #combine(term1 term2), because the phrase's posting list will be shorter (although it is computed at query time).
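
To make the posting-list argument concrete, here is a minimal sketch using a toy positional index in Python. It is not Indri's implementation; the four documents and the helper names (build_index, combine_candidates, ordered_window_1) are made up for illustration. It only shows that the set of documents a bag-of-words query has to consider is the union of the term posting lists, while an exact ordered window matches far fewer documents.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional index)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def combine_candidates(index, terms):
    """Documents a bag-of-words query must consider: union of posting lists."""
    docs = set()
    for term in terms:
        docs |= set(index.get(term, {}))
    return docs


def ordered_window_1(index, first, second):
    """Documents where `second` immediately follows `first`."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can match the phrase.
    for doc_id in first_postings.keys() & second_postings.keys():
        following = set(second_postings[doc_id])
        if any(pos + 1 in following for pos in first_postings[doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical mini-collection, for illustration only.
docs = [
    "white house press release",
    "the house was painted white",
    "white paper on retrieval",
    "house prices rise",
]
index = build_index(docs)
print(len(combine_candidates(index, ["white", "house"])))  # 4 documents
print(len(ordered_window_1(index, "white", "house")))      # 1 document
```

In this toy collection the bag-of-words query touches all four documents, while the exact phrase matches only the first one.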

    There are individuals who provide indri optimization consulting for a fee, for specific commercial applications. Feel free to contact me directly.

     
  • arion

    arion - 2013-09-13

    Hi David,

    Actually I am using Indri for research purposes. Unfortunately, I cannot disclose more information since I haven't published any of this work yet.

    Initially, I tried indexing the ClueWeb09 collection, which turned out to be too large for my study's requirements, and searching became too slow due to the index size. I then switched to the WT10G collection, which works somewhat better. The machine I am using runs Fedora Linux with 8 GB of memory, and I cannot upgrade it further.

    Regarding what you mentioned, I have two questions:

    1. You wrote that I can enable the entire index to be loaded into the disk cache. How can I achieve that? Is this the <memory> option set in the indexing parameters or something I can tweak myself? (I couldn't find anything relevant in Indri's documentation)

    2. In the parameters file I posted earlier, is there something that I can omit to make the index file smaller (and thus speed up retrieval)? Perhaps something that is introducing duplicate information (e.g., using the <body> and <doc> tags)?

    I would greatly appreciate any insights on this matter. If you still have concerns about how I am using Indri, I can contact you directly and explain further.

    Thank you again for all the support.

     
  • David Fisher

    David Fisher - 2013-09-13

    1) The disk cache is provided by the operating system. You would need to have more physical RAM than the sum of the sizes of the index files to achieve that. It is not controlled by indri.

    2) Other than using a stopword list, which you may already have, there is no particular indexing parameter that you can modify.
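
For point 2, a stopword list shrinks the index by dropping the most frequent terms, which have the longest posting lists. The sketch below generates a stopword fragment that could be pasted into the build parameters. It assumes the <stopper><word>...</word></stopper> layout described in the IndriBuildIndex parameter documentation, so check it against the documentation for your Indri version. The helper name stopper_fragment and the sample words are illustrative only.

```python
from xml.sax.saxutils import escape


def stopper_fragment(words):
    """Build a <stopper> block from an iterable of stopwords.

    Words are lowercased, de-duplicated and sorted so the output is stable.
    The element names are an assumption based on Indri's parameter docs.
    """
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    lines = ["<stopper>"]
    for word in unique:
        lines.append("  <word>{}</word>".format(escape(word)))
    lines.append("</stopper>")
    return "\n".join(lines)


# Illustrative words only; a real list would be much longer.
print(stopper_fragment(["the", "of", "and", "The"]))
```

The printed block would go inside the <parameters> element, and the collection has to be re-indexed for it to take effect.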

    Your machine is insufficient for any but the most cursory of experiments, and is certainly not up to the scale of ClueWeb.
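
For point 1, a rough way to check whether the index could fit in the operating system's disk cache is to compare the total size of the index files with physical RAM. The sketch below assumes a Linux or other Unix system where os.sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES; the function names are illustrative. Treat the result as a necessary condition only, since the cache also competes with every other process for memory.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def physical_ram_bytes():
    """Total physical memory as reported by sysconf (Unix-like systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(index_bytes, ram_bytes):
    """True if the index is strictly smaller than physical RAM."""
    return index_bytes < ram_bytes
```

For example, fits_in_ram(directory_size_bytes("WT10G"), physical_ram_bytes()) checks the index directory named in the <index> parameter above, assuming the script is run from the directory that contains it.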

    See http://www.lemurproject.org/clueweb09/indri-howto.php for some discussion of how to use indri at the ClueWeb scale.

     
