Build index for ClueWeb12 Category A

Indri
Jiyun Luo
2013-06-12
2013-06-12
  • Jiyun Luo
    2013-06-12

    Hello,

    I ran into a problem while building an index for ClueWeb12 Category A. We have only about 8T of available disk space, and the original ClueWeb12 data takes about 6T. I originally thought that would be enough to build the index on this disk, but after about a week of indexing the program has finished only Disk1, Disk2 and half of Disk3, and it has already used more than 7T of space. By my current estimate, indexing Cat A may take about 10T.

    Am I configuring the parameters incorrectly? Can the indexing process finish within 8T of space? My parameter file is attached.

     
    Last edit: Jiyun Luo 2013-06-12
  • David Pane
    2013-06-12

    Since you have limited space, I recommend indexing each segment separately and suggest that you set <storeDocs> to false as specified in the parameter file below.

    You will need to create 20 XML parameter files, one per segment, each similar to the one below.

    Make sure that the directory you point the software at contains only the content that you want to index (i.e. /path/to/Data/ClueWeb12_00 should only have directories and documents that you want indexed).

    $ ls /path/to/Data/
    . .. ClueWeb12_00 ClueWeb12_01 ClueWeb12_02 ClueWeb12_03 ClueWeb12_04 ClueWeb12_05 ClueWeb12_06 ClueWeb12_07 ClueWeb12_08 ClueWeb12_09 ClueWeb12_10 ClueWeb12_11 ClueWeb12_12 ClueWeb12_13 ClueWeb12_14 ClueWeb12_15 ClueWeb12_16 ClueWeb12_17 ClueWeb12_18 ClueWeb12_19

    $ ls /path/to/Data/ClueWeb12_00/
    . .. 0000tw 0000wb 0000wt 0001wb 0002wb 0003wb 0004wb 0005wb 0006wb 0007wb 0008wb 0009wb 0010wb 0011wb 0012wb 0013wb

    <parameters>

    <memory>1G</memory>
    <storeDocs>false</storeDocs>
    <index>/path/to/segment/index/ClueWeb12_00_Index</index>
    <corpus>
    <path>/path/to/Data/ClueWeb12_00</path>
    <class>warc</class>
    </corpus>

    <field><name>title</name></field>
    <field><name>url</name></field>
    <field><name>body</name></field>
    <field><name>heading</name></field>

    <stemmer><name>krovetz</name></stemmer>
    <stopper>
    <word>a</word>
    <word>about</word>
    <word>above</word>
    .
    .
    .
    <word>yourself</word>
    <word>yourselves</word>
    </stopper>
    </parameters>
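
    Writing 20 nearly identical files by hand is error-prone, so here is a minimal sketch of generating them with a script. It only reproduces the template above with the segment name substituted. The function names, the output directory, and the short stopword list passed in the test are illustrative assumptions; supply your own full stopword list, since the one above is abbreviated.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# Fields taken from the parameter file shown in the post.
FIELDS = ["title", "url", "body", "heading"]


def segment_names(count=20):
    """Return ClueWeb12_00 .. ClueWeb12_19 (names as listed by `ls` above)."""
    return [f"ClueWeb12_{i:02d}" for i in range(count)]


def build_params(segment, data_root, index_root, stopwords, memory="1G"):
    """Render one parameter file for a single segment as a string."""
    lines = [
        "<parameters>",
        f"<memory>{escape(memory)}</memory>",
        "<storeDocs>false</storeDocs>",
        f"<index>{escape(index_root)}/{segment}_Index</index>",
        "<corpus>",
        f"<path>{escape(data_root)}/{segment}</path>",
        "<class>warc</class>",
        "</corpus>",
    ]
    lines += [f"<field><name>{name}</name></field>" for name in FIELDS]
    lines.append("<stemmer><name>krovetz</name></stemmer>")
    lines.append("<stopper>")
    # The caller supplies the stopword list; it is not filled in here.
    lines += [f"<word>{escape(word)}</word>" for word in stopwords]
    lines.append("</stopper>")
    lines.append("</parameters>")
    return "\n".join(lines) + "\n"


def write_all(out_dir, data_root, index_root, stopwords):
    """Write one <segment>.xml file per segment and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for segment in segment_names():
        target = out / f"{segment}.xml"
        target.write_text(build_params(segment, data_root, index_root, stopwords))
        written.append(target)
    return written
```

    Each generated file can then be passed to IndriBuildIndex as its parameter file, one run per segment.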

    Here are the expected index sizes when documents are not stored in the index. Indexing requires about 2x the final space to account for merging of the partial indexes. If you build all 20 in parallel, you will need approximately 5TB of free space, based on the sizes given below.

    Index sizes (storeDocs=false):
    $ du -hs new-indexes/*
    146G new-indexes/ClueWeb12_00
    122G new-indexes/ClueWeb12_01
    123G new-indexes/ClueWeb12_02
    130G new-indexes/ClueWeb12_03
    105G new-indexes/ClueWeb12_04
    65G new-indexes/ClueWeb12_05
    67G new-indexes/ClueWeb12_06
    91G new-indexes/ClueWeb12_07
    122G new-indexes/ClueWeb12_08
    118G new-indexes/ClueWeb12_09
    118G new-indexes/ClueWeb12_10
    126G new-indexes/ClueWeb12_11
    116G new-indexes/ClueWeb12_12
    90G new-indexes/ClueWeb12_13
    89G new-indexes/ClueWeb12_14
    109G new-indexes/ClueWeb12_15
    95G new-indexes/ClueWeb12_16
    98G new-indexes/ClueWeb12_17
    113G new-indexes/ClueWeb12_18
    100G new-indexes/ClueWeb12_19
    159G new-indexes/ClueWeb12-B13
    2.3T new-indexes
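
    As a rough worked example of the space arithmetic, the sketch below sums the 20 segment sizes listed above (the ClueWeb12-B13 index is left out) and applies the 2x rule of thumb. The sequential estimate rests on an assumption not stated in the post: that the segments are built one at a time in the listed order, that finished indexes stay on the same disk, and that only the segment being built needs the extra merge space. The `du -hs` figures are rounded, so treat the results as approximate.

```python
# Final index sizes in G, copied from the `du -hs` listing (segments 00..19).
SEGMENT_SIZES_G = [146, 122, 123, 130, 105, 65, 67, 91, 122, 118,
                   118, 126, 116, 90, 89, 109, 95, 98, 113, 100]

# Rule of thumb from the post: building needs about 2x the final index size.
MERGE_FACTOR = 2


def final_total(sizes):
    """Space occupied once every segment index is finished."""
    return sum(sizes)


def parallel_peak(sizes, factor=MERGE_FACTOR):
    """Peak space if all segments are built at the same time."""
    return factor * sum(sizes)


def sequential_peak(sizes, factor=MERGE_FACTOR):
    """Peak space if segments are built one at a time, in order.

    Assumes finished indexes remain on the same disk and only the
    segment currently being built needs the extra merge space.
    """
    peak = 0
    finished = 0
    for size in sizes:
        peak = max(peak, finished + factor * size)
        finished += size
    return peak


if __name__ == "__main__":
    print("final total (G):    ", final_total(SEGMENT_SIZES_G))
    print("parallel peak (G):  ", parallel_peak(SEGMENT_SIZES_G))
    print("sequential peak (G):", sequential_peak(SEGMENT_SIZES_G))
```

    With these figures the 20 segment indexes total 2143G, building all of them in parallel peaks at about 4286G, and building them one at a time peaks at about 2243G under the assumptions above.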

    I hope this helps.

    -David

     
    Last edit: David Pane 2013-06-12