
#284 New Indri Index differs from older Indri Index

v5.x
open
5
2016-05-31
2016-05-15
J Mac
No

Hello,

I have a pre-built index for the Gov2 collection, which was built using Indri development release 5.6 (according to the manifest).

After rebuilding the index from the same collection with the latest version of Indri (5.10), I have found that the two indexes no longer have the same statistics. See the relevant manifest entries below:

Indri 5.6

        <document-base>1</document-base>
        <frequent-terms>179186</frequent-terms>
        <maximum-document>25205180</maximum-document>
        <total-documents>25205179</total-documents>
        <total-terms>23813260341</total-terms>
        <unique-terms>39180780</unique-terms>

Indri 5.10

        <document-base>1</document-base>
        <frequent-terms>176799</frequent-terms>
        <maximum-document>25205180</maximum-document>
        <total-documents>25205179</total-documents>
        <total-terms>23451774775</total-terms>
        <unique-terms>39177861</unique-terms>

Is this expected behaviour? The same param file was used to create both indexes.
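
To put numbers on the gap, the two manifest fragments above can be compared field by field. The following is only a rough sketch: the values are copied from the manifests quoted in this post, while the `<stats>` wrapper element and the function names are made up for illustration and are not part of Indri.

```python
import xml.etree.ElementTree as ET

# Manifest entries as quoted above (Indri 5.6 build).
MANIFEST_5_6 = """
<document-base>1</document-base>
<frequent-terms>179186</frequent-terms>
<maximum-document>25205180</maximum-document>
<total-documents>25205179</total-documents>
<total-terms>23813260341</total-terms>
<unique-terms>39180780</unique-terms>
"""

# Manifest entries as quoted above (Indri 5.10 build).
MANIFEST_5_10 = """
<document-base>1</document-base>
<frequent-terms>176799</frequent-terms>
<maximum-document>25205180</maximum-document>
<total-documents>25205179</total-documents>
<total-terms>23451774775</total-terms>
<unique-terms>39177861</unique-terms>
"""


def parse_stats(fragment):
    """Turn a manifest fragment into {field name: integer value}."""
    # The fragment has no single root element, so wrap it in a
    # hypothetical <stats> element before parsing.
    root = ET.fromstring("<stats>" + fragment + "</stats>")
    return {child.tag: int(child.text) for child in root}


def compare(old, new):
    """Return {field: (absolute change, change in percent of old value)}."""
    rows = {}
    for key in old:
        delta = new[key] - old[key]
        rows[key] = (delta, 100.0 * delta / old[key])
    return rows


old = parse_stats(MANIFEST_5_6)
new = parse_stats(MANIFEST_5_10)

for key, (delta, pct) in compare(old, new).items():
    print(f"{key:18s} {delta:+15,d} {pct:+9.4f}%")
```

With these values, total-terms is lower by 361,485,566 (roughly 1.5%) in the 5.10 index, unique-terms is lower by 2,919, and the document counts are identical.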

Discussion

  • Lemur Project

    Lemur Project - 2016-05-17

    It is not unusual for indexes built by different versions of the indexer to differ
    from one another.

    I'm not current on any changes made to Indri's tokenization between versions 5.6
    and 5.10, but I would not be surprised if something changed. Any such change would give a slightly different definition of a term, and thus different term counts, frequencies, etc.

    At least the document counts remained the same; otherwise something would likely be seriously wrong.

    I'd have to look over the commit logs to confirm this, but it won't happen this week.

     
  • Lemur Project

    Lemur Project - 2016-05-17

    I'm informed that the index differences you report may be due to a stemming difference that came about from a change in libc/libc++ at some point between 5.6 and 5.10.

    I will attempt to confirm this.

     
  • David Fisher

    David Fisher - 2016-05-31
    • labels: indexing, index --> indexing, indri
    • assigned_to: David Fisher
     
  • David Fisher

    David Fisher - 2016-05-31

    Maybe.

    What are your build parameters?

    There are numerous changes between the revisions that might affect tokenization.
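
    One way to narrow down which terms are affected is to dump the vocabulary of each index and diff the two dumps. The following is a minimal sketch, not a confirmed diagnosis procedure. It assumes each dump is a text file with one `term term_count doc_count` line per term, optionally preceded by a `TOTAL` summary line, which is the layout `dumpindex <index> v` is expected to produce (check this against your build). The function names and the sample data are made up for illustration.

```python
def load_vocab(lines):
    """Parse 'term term_count doc_count' lines into {term: term_count}."""
    vocab = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines and the assumed TOTAL summary line.
        if len(parts) != 3 or parts[0] == "TOTAL":
            continue
        term, term_count, _doc_count = parts
        vocab[term] = int(term_count)
    return vocab


def diff_vocab(old, new):
    """Return (terms only in old, terms only in new, terms whose count changed)."""
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    changed = {
        term: (old[term], new[term])
        for term in old.keys() & new.keys()
        if old[term] != new[term]
    }
    return only_old, only_new, changed


# Hypothetical sample dumps, NOT real Gov2 data: they only mimic the kind of
# shift a stemmer change could cause (one surface form folded into another).
old_dump = ["TOTAL 10 3", "index 4 2", "indexes 2 1", "build 4 3"]
new_dump = ["TOTAL 9 3", "index 6 2", "build 3 3"]

only_old, only_new, changed = diff_vocab(load_vocab(old_dump), load_vocab(new_dump))
print("only in old index:", only_old)
print("only in new index:", only_new)
print("count changed:", changed)
```

    For real indexes, pass open file handles to `load_vocab` instead of the sample lists. Terms that appear in only one dump point at stemming or tokenization differences; terms whose counts changed may point at either.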

     
  • J Mac

    J Mac - 2016-05-31

    Hi,

    I've included below the param file I use when invoking IndriBuildIndex; I hope that's helpful.
    I used the same param file for both indexes.

    <parameters>
    <memory>64G</memory>
    <threads>1</threads>
    <index>gov2-indri-stopped</index>
    <storeDocs>true</storeDocs>
    <stemmer><name>krovetz</name></stemmer>
    <corpus>
       <path>path/to/my/gov2-corpus</path>
       <class>trecweb</class>
    </corpus>
    <stopper>  
      <word>a</word>
      <word>about</word>
      <word>above</word>
      <word>according</word>
      <word>across</word>
      <word>after</word>
      <word>afterwards</word>
      <word>again</word>
      <word>against</word>
      <word>albeit</word>
      <word>all</word>
      <word>almost</word>
      <word>alone</word>
      <word>along</word>
      <word>already</word>
      <word>also</word>
      <word>although</word>
      <word>always</word>
      <word>am</word>
      <word>among</word>
      <word>amongst</word>
      <word>an</word>
      <word>and</word>
      <word>another</word>
      <word>any</word>
      <word>anybody</word>
      <word>anyhow</word>
      <word>anyone</word>
      <word>anything</word>
      <word>anyway</word>
      <word>anywhere</word>
      <word>apart</word>
      <word>are</word>
      <word>around</word>
      <word>as</word>
      <word>at</word>
      <word>av</word>
      <word>be</word>
      <word>became</word>
      <word>because</word>
      <word>become</word>
      <word>becomes</word>
      <word>becoming</word>
      <word>been</word>
      <word>before</word>
      <word>beforehand</word>
      <word>behind</word>
      <word>being</word>
      <word>below</word>
      <word>beside</word>
      <word>besides</word>
      <word>between</word>
      <word>beyond</word>
      <word>both</word>
      <word>but</word>
      <word>by</word>
      <word>can</word>
      <word>cannot</word>
      <word>canst</word>
      <word>certain</word>
      <word>cf</word>
      <word>choose</word>
      <word>contrariwise</word>
      <word>cos</word>
      <word>could</word>
      <word>cu</word>
      <word>day</word>
      <word>do</word>
      <word>does</word>
      <word>doesn't</word>
      <word>doing</word>
      <word>dost</word>
      <word>doth</word>
      <word>double</word>
      <word>down</word>
      <word>dual</word>
      <word>during</word>
      <word>each</word>
      <word>either</word>
      <word>else</word>
      <word>elsewhere</word>
      <word>enough</word>
      <word>et</word>
      <word>etc</word>
      <word>even</word>
      <word>ever</word>
      <word>every</word>
      <word>everybody</word>
      <word>everyone</word>
      <word>everything</word>
      <word>everywhere</word>
      <word>except</word>
      <word>excepted</word>
      <word>excepting</word>
      <word>exception</word>
      <word>exclude</word>
      <word>excluding</word>
      <word>exclusive</word>
      <word>far</word>
      <word>farther</word>
      <word>farthest</word>
      <word>few</word>
      <word>ff</word>
      <word>first</word>
      <word>for</word>
      <word>formerly</word>
      <word>forth</word>
      <word>forward</word>
      <word>from</word>
      <word>front</word>
      <word>further</word>
      <word>furthermore</word>
      <word>furthest</word>
      <word>get</word>
      <word>go</word>
      <word>had</word>
      <word>halves</word>
      <word>hardly</word>
      <word>has</word>
      <word>hast</word>
      <word>hath</word>
      <word>have</word>
      <word>he</word>
      <word>hence</word>
      <word>henceforth</word>
      <word>her</word>
      <word>here</word>
      <word>hereabouts</word>
      <word>hereafter</word>
      <word>hereby</word>
      <word>herein</word>
      <word>hereto</word>
      <word>hereupon</word>
      <word>hers</word>
      <word>herself</word>
      <word>him</word>
      <word>himself</word>
      <word>hindmost</word>
      <word>his</word>
      <word>hither</word>
      <word>hitherto</word>
      <word>how</word>
      <word>however</word>
      <word>howsoever</word>
      <word>i</word>
      <word>ie</word>
      <word>if</word>
      <word>in</word>
      <word>inasmuch</word>
      <word>inc</word>
      <word>include</word>
      <word>included</word>
      <word>including</word>
      <word>indeed</word>
      <word>indoors</word>
      <word>inside</word>
      <word>insomuch</word>
      <word>instead</word>
      <word>into</word>
      <word>inward</word>
      <word>inwards</word>
      <word>is</word>
      <word>it</word>
      <word>its</word>
      <word>itself</word>
      <word>just</word>
      <word>kind</word>
      <word>kg</word>
      <word>km</word>
      <word>last</word>
      <word>latter</word>
      <word>latterly</word>
      <word>less</word>
      <word>lest</word>
      <word>let</word>
      <word>like</word>
      <word>little</word>
      <word>ltd</word>
      <word>many</word>
      <word>may</word>
      <word>maybe</word>
      <word>me</word>
      <word>meantime</word>
      <word>meanwhile</word>
      <word>might</word>
      <word>moreover</word>
      <word>most</word>
      <word>mostly</word>
      <word>more</word>
      <word>mr</word>
      <word>mrs</word>
      <word>ms</word>
      <word>much</word>
      <word>must</word>
      <word>my</word>
      <word>myself</word>
      <word>namely</word>
      <word>need</word>
      <word>neither</word>
      <word>never</word>
      <word>nevertheless</word>
      <word>next</word>
      <word>no</word>
      <word>nobody</word>
      <word>none</word>
      <word>nonetheless</word>
      <word>noone</word>
      <word>nope</word>
      <word>nor</word>
      <word>not</word>
      <word>nothing</word>
      <word>notwithstanding</word>
      <word>now</word>
      <word>nowadays</word>
      <word>nowhere</word>
      <word>of</word>
      <word>off</word>
      <word>often</word>
      <word>ok</word>
      <word>on</word>
      <word>once</word>
      <word>one</word>
      <word>only</word>
      <word>onto</word>
      <word>or</word>
      <word>other</word>
      <word>others</word>
      <word>otherwise</word>
      <word>ought</word>
      <word>our</word>
      <word>ours</word>
      <word>ourselves</word>
      <word>out</word>
      <word>outside</word>
      <word>over</word>
      <word>own</word>
      <word>per</word>
      <word>perhaps</word>
      <word>plenty</word>
      <word>provide</word>
      <word>quite</word>
      <word>rather</word>
      <word>really</word>
      <word>round</word>
      <word>said</word>
      <word>sake</word>
      <word>same</word>
      <word>sang</word>
      <word>save</word>
      <word>saw</word>
      <word>see</word>
      <word>seeing</word>
      <word>seem</word>
      <word>seemed</word>
      <word>seeming</word>
      <word>seems</word>
      <word>seen</word>
      <word>seldom</word>
      <word>selves</word>
      <word>sent</word>
      <word>several</word>
      <word>shalt</word>
      <word>she</word>
      <word>should</word>
      <word>shown</word>
      <word>sideways</word>
      <word>since</word>
      <word>slept</word>
      <word>slew</word>
      <word>slung</word>
      <word>slunk</word>
      <word>smote</word>
      <word>so</word>
      <word>some</word>
      <word>somebody</word>
      <word>somehow</word>
      <word>someone</word>
      <word>something</word>
      <word>sometime</word>
      <word>sometimes</word>
      <word>somewhat</word>
      <word>somewhere</word>
      <word>spake</word>
      <word>spat</word>
      <word>spoke</word>
      <word>spoken</word>
      <word>sprang</word>
      <word>sprung</word>
      <word>stave</word>
      <word>staves</word>
      <word>still</word>
      <word>such</word>
      <word>supposing</word>
      <word>than</word>
      <word>that</word>
      <word>the</word>
      <word>thee</word>
      <word>their</word>
      <word>them</word>
      <word>themselves</word>
      <word>then</word>
      <word>thence</word>
      <word>thenceforth</word>
      <word>there</word>
      <word>thereabout</word>
      <word>thereabouts</word>
      <word>thereafter</word>
      <word>thereby</word>
      <word>therefore</word>
      <word>therein</word>
      <word>thereof</word>
      <word>thereon</word>
      <word>thereto</word>
      <word>thereupon</word>
      <word>these</word>
      <word>they</word>
      <word>this</word>
      <word>those</word>
      <word>thou</word>
      <word>though</word>
      <word>thrice</word>
      <word>through</word>
      <word>throughout</word>
      <word>thru</word>
      <word>thus</word>
      <word>thy</word>
      <word>thyself</word>
      <word>till</word>
      <word>to</word>
      <word>together</word>
      <word>too</word>
      <word>toward</word>
      <word>towards</word>
      <word>ugh</word>
      <word>unable</word>
      <word>under</word>
      <word>underneath</word>
      <word>unless</word>
      <word>unlike</word>
      <word>until</word>
      <word>up</word>
      <word>upon</word>
      <word>upward</word>
      <word>upwards</word>
      <word>us</word>
      <word>use</word>
      <word>used</word>
      <word>using</word>
      <word>very</word>
      <word>via</word>
      <word>vs</word>
      <word>want</word>
      <word>was</word>
      <word>we</word>
      <word>week</word>
      <word>well</word>
      <word>were</word>
      <word>what</word>
      <word>whatever</word>
      <word>whatsoever</word>
      <word>when</word>
      <word>whence</word>
      <word>whenever</word>
      <word>whensoever</word>
      <word>where</word>
      <word>whereabouts</word>
      <word>whereafter</word>
      <word>whereas</word>
      <word>whereat</word>
      <word>whereby</word>
      <word>wherefore</word>
      <word>wherefrom</word>
      <word>wherein</word>
      <word>whereinto</word>
      <word>whereof</word>
      <word>whereon</word>
      <word>wheresoever</word>
      <word>whereto</word>
      <word>whereunto</word>
      <word>whereupon</word>
      <word>wherever</word>
      <word>wherewith</word>
      <word>whether</word>
      <word>whew</word>
      <word>which</word>
      <word>whichever</word>
      <word>whichsoever</word>
      <word>while</word>
      <word>whilst</word>
      <word>whither</word>
      <word>who</word>
      <word>whoa</word>
      <word>whoever</word>
      <word>whole</word>
      <word>whom</word>
      <word>whomever</word>
      <word>whomsoever</word>
      <word>whose</word>
      <word>whosoever</word>
      <word>why</word>
      <word>will</word>
      <word>wilt</word>
      <word>with</word>
      <word>within</word>
      <word>without</word>
      <word>worse</word>
      <word>worst</word>
      <word>would</word>
      <word>wow</word>
      <word>ye</word>
      <word>yet</word>
      <word>year</word>
      <word>yippee</word>
      <word>you</word>
      <word>your</word>
      <word>yours</word>
      <word>yourself</word>
      <word>yourselves</word>
    </stopper>
    </parameters>
    
     
