Indexing custom fields

Galago
2014-12-22
2015-01-07
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2014-12-22

    Another issue I have is with indexing fields. I have a collection of trectext-formatted documents (see the example below) and would like to treat custom fields as separate document fields (the multi-fielded document paradigm). While indexing, I tried command-line keys such as --fields+names as well as a JSON configuration (~/.galago.conf):

    "tokenizer" : {
    "class" : "org.lemurproject.galago.core.parse.TagTokenizer",
    "fields" : ["names", "titles", "attributes", "similarentitynames", "categories", "types", "outgoingentitynames", "predicatenames"]
    }

    but nothing helps: I get runtime exceptions when attempting to include field names in the query (e.g. "obama.names" or "#combine( #field:names( obama ) )")

    I tried both UPPER CASE and lower case for field names.

    An example document is:

    <DOC>
    <DOCNO>http://dbpedia.org/resource/Barack_Obama</DOCNO>
    <TEXT>

    <NAMES> Barack Obama Barack Obama Barack Obama Barack Obama Barack Obama Obama, Barack Barack Hussein Obama II Barack Obama </NAMES>

    <TITLES> from the 13th District Member of the Illinois Senate President of the United States from the 13th District Member of the Illinois Senate President of the United States American politician, 44th President of the United States American politician, 44th President of the United States </TITLES>

    <ATTRIBUTES> Barack Hussein Obama II is the 44th and current President of the United States, in office since 2009. He is the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. Barack Hussein Obama II is the 44th and current President of the United States, in office since 2009. He is the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney in Chicago and taught constitutional law at the University of Chicago Law School from 1992 to 2004. He served three terms representing the 13th District in the Illinois Senate from 1997 to 2004, running unsuccessfully for the United States House of Representatives in 2000.</ATTRIBUTES>

    <SIMILARENTITYNAMES> Burack obama Berack Obama Barrack Hussein Obama Barrack Hussain Obama Barack Obama Junior Barack H Obama, Jr BARACK OBAMA Brock Obama Bobama Berrak Obama Barack Obama II Barack Hussein Sen. Barack Obama President barack obama President Barack Obama Obama Senator Obama Obama barack Obama, Barack Hussein OBAMA! </SIMILARENTITYNAMES>

    <CATEGORIES> 21st-century scholars 21st-century scholars 20th-century scholars 20th-century scholars American political writers American political writers 21st-century American writers 21st-century American writers Presidents of the United States Presidents of the United States Columbia University alumni Columbia University alumni African-American United States Senators African-American United States Senators United States Senators from Illinois United States Senators from Illinois American people of English descent </CATEGORIES>

    <TYPES> person agent office holder </TYPES>

    <OUTGOINGENTITYNAMES> Illinois Land of Lincoln; The Prairie State Illinois State of Illinois State of Illinois State of Illinois Illinois Illinois Land of Lincoln; The Prairie State Illinois State of Illinois State of Illinois State of Illinois Illinois Illinois Land of Lincoln; The Prairie State Illinois State of Illinois State of Illinois State of Illinois Illinois George Walker Bush George Walker Bush George W. Bush George W. Bush George W. Bush Bush, George, Jr.; Bush Jr. Bush, George Walker George Walker Bush George W. Bush Michelle Obama Michelle Obama Michelle Obama Michelle LaVaughn Robinson Michelle Obama Obama, Michelle LaVaughn Robinson; Robinson, Michelle Obama, Michelle Michelle LaVaughn Robinson Michelle Obama Michelle LaVaughn Robinson Michelle Obama Michelle Obama Michelle Obama Michelle Obama Michelle LaVaughn Robinson Michelle Obama Obama, Michelle LaVaughn Robinson; </OUTGOINGENTITYNAMES>

    <PREDICATENAMES> predecessor Alice J. Palmer Alice J. Palmer Alice Palmer Palmer, Alice J. Alice Roberts Alice Palmer Alice Palmer Alice Palmer (politician) successor Burris, Roland Roland W. Burris Roland W. Burris Roland W. Burris Roland Burris Roland Burris Roland Burris successor Burris, Roland Roland W. Burris Roland W. Burris Roland W. Burris Roland Burris Roland Burris Roland Burris successor Burris, Roland Roland W. Burris Roland W. Burris Roland W. Burris Roland Burris Roland Burris Roland Burris birth place The City and County of Honolulu Honolulu The City and County of Honolulu Honolulu </PREDICATENAMES>
    </TEXT>
    </DOC>

     

    Last edit: Nikita Zhiltsov 2014-12-22
  • David Fisher

    David Fisher - 2014-12-22

    Using the following build parameters json file and your example document above:

    sydney:~/work3/projects/test-stem$ cat build.json 
    {
      "inputPath" : "test.trectext",
      "indexPath" : "test-fields",
      "tokenizer" : {
        "fields" : ["names", "titles", "attributes", "similarentitynames",
                    "categories", "types", "outgoingentitynames", "predicatenames"]
      }
    }
    sydney:~/work3/projects/test-stem$ cat queries.json 
    {
      "queries" : [
        {
          "number" : "query1",
          "text"   : "#combine(obama.names)"
        }
      ]
    }
    sydney:~/work3/projects/test-stem$ sh ~/work1/galago/core/target/appassembler/bin/galago batch-search --index=test-fields queries.json 
    query1 Q0 http://dbpedia.org/resource/Barack_Obama 1 -4.27840072 galago
    

    I have no problem running the retrieval with the query restricted to the names field.

    Note that the tokenizer class does not need to be specified.

    You will probably find it easier to prepare parameter files, rather than trying to put extensive parameters on the command line.
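As a minimal sketch of preparing such parameter files, the Python snippet below builds the same JSON structures as the build.json and queries.json shown above. The helper names (`build_params`, `query_params`) are made up for illustration; only the JSON keys and values come from this thread.

```python
import json

# Field names taken from the build.json shown in this thread.
FIELDS = ["names", "titles", "attributes", "similarentitynames",
          "categories", "types", "outgoingentitynames", "predicatenames"]


def build_params(input_path, index_path, fields):
    """Return a dict with the same keys as the build.json above."""
    return {
        "inputPath": input_path,
        "indexPath": index_path,
        "tokenizer": {"fields": list(fields)},
    }


def query_params(queries):
    """Return a dict shaped like queries.json; `queries` maps number -> text."""
    return {"queries": [{"number": n, "text": t} for n, t in queries.items()]}


build_json = json.dumps(build_params("test.trectext", "test-fields", FIELDS), indent=2)
queries_json = json.dumps(query_params({"query1": "#combine(obama.names)"}), indent=2)

# Print the two documents; redirect them to build.json / queries.json as needed.
print(build_json)
print(queries_json)
```

Generating the files from one field list keeps the build parameters and any per-field queries in sync when fields are added or renamed.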

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2014-12-22

    David, thanks. How can I pass build.json to the build command? BTW, I could achieve the same result with these command-line keys:

    --fieldIndex=true
    --fields+names
    --fields+titles
    --fields+attributes
    --fields+similarentitynames
    --fields+categories
    --fields+types
    --fields+outgoingentitynames
    --fields+predicatenames
    --tokenizer/fields+names
    --tokenizer/fields+titles
    --tokenizer/fields+attributes
    --tokenizer/fields+similarentitynames
    --tokenizer/fields+categories
    --tokenizer/fields+types
    --tokenizer/fields+outgoingentitynames
    --tokenizer/fields+predicatenames
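The repeated keys above follow a simple pattern, so they can be generated rather than typed by hand. The following is only a sketch: the function name `field_flags` is hypothetical, and it emits exactly the flag forms listed in this post, nothing more.

```python
# Field names as used in this thread.
FIELDS = ["names", "titles", "attributes", "similarentitynames",
          "categories", "types", "outgoingentitynames", "predicatenames"]


def field_flags(fields, field_index=True):
    """Build the command-line keys listed above for a list of field names."""
    flags = []
    if field_index:
        flags.append("--fieldIndex=true")
    # One --fields+<name> key per field ...
    flags += [f"--fields+{name}" for name in fields]
    # ... and one --tokenizer/fields+<name> key per field.
    flags += [f"--tokenizer/fields+{name}" for name in fields]
    return flags


flags = field_flags(FIELDS)
print(" ".join(flags))
```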

     
  • David Fisher

    David Fisher - 2014-12-23

    Give the parameters file name on the command line:

    sydney:~/work3/projects/test-stem$ sh ~/work1/galago/core/target/appassembler/bin/galago build build.json 
    
     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2014-12-23

    OK, I got it. However, #combine(obama.(names)) does not work for me. For 'obama.names', does Galago use the language model of the 'names' field or the document language model?

     
  • David Fisher

    David Fisher - 2014-12-29

    That is Indri query language syntax, not Galago.

    The document language model is used.

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2015-01-04

    All right. How can I achieve the same effect in Galago, i.e., matching in particular fields with per-field language models, either syntactically or programmatically? Is it possible to do that in a custom traversal?

     
  • David Fisher

    David Fisher - 2015-01-05

    There is the PRMS2Traversal (in core/src/main/java/org/lemurproject/galago/core/retrieval/traversal)

    The operator is #prms, see the top of the file for documentation. That model uses field language model smoothing.

    If it is not exactly what you want, it will provide a starting point for a custom traversal.
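For orientation, the idea behind PRMS is to weight each field by how likely a query term is to belong to it, estimated from collection statistics. The sketch below shows that estimate with a uniform field prior; it is not a transcription of PRMS2Traversal, and the function name is hypothetical. The counts are the nodeFrequency and collectionLength values quoted for "barack" later in this thread.

```python
def prms_field_weights(term_freq, field_len):
    """Estimate P(field | term) from per-field collection statistics.

    term_freq[f] is the collection frequency of the term in field f,
    field_len[f] is the total length of field f over the collection.
    A uniform prior over fields is assumed, so it cancels out.
    """
    p_term_given_field = {f: term_freq[f] / field_len[f] for f in term_freq}
    total = sum(p_term_given_field.values())
    return {f: p / total for f, p in p_term_given_field.items()}


# Statistics for the term "barack" as quoted later in this thread.
weights = prms_field_weights(
    term_freq={"names": 386, "attributes": 3367},
    field_len={"names": 35514056, "attributes": 415926924},
)
print(weights)
```

With these numbers the "names" field receives the larger weight, because "barack" is relatively more frequent there even though its absolute count is lower.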

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2015-01-05

    OK, thank you very much!

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2015-01-06

    How strange, I get very different results for Indri and Galago implementations of MLM on the same collection:

    1) Indri query:

    #combine(#wsum(0.8 barack.(names)
                   0.2 barack.(attributes))
             #wsum(0.8 obama.(names)
                   0.2 obama.(attributes)))
    

    Indri's presets include Dirichlet smoothing with average field lengths with per-field priors.

    2) A query generated by PRMS2Traversal in Galago, with the field weights forced to (0.8, 0.2):

    #combine:norm=false:w=1.0(#wsum:0=0.8:1=0.2:w=1.0( 
    
    #dirichlet:avgLength=4.124539192583878:collectionLength=35514056:documentCount=8610430:lengths=names:maximumCount=7:nodeFrequency=386:w=0.8( #lengths:names:part=lengths() #counts:barack:part=field.names() ) 
    
    #dirichlet:avgLength=120.36047466241781:collectionLength=415926924:documentCount=3455736:lengths=attributes:maximumCount=56:nodeFrequency=3367:w=0.2( #lengths:attributes:part=lengths() #counts:barack:part=field.attributes() ) ) 
    
    #wsum:0=0.8:1=0.2:w=1.0( 
    
    #dirichlet:avgLength=4.124539192583878:collectionLength=35514056:documentCount=8610430:lengths=names:maximumCount=8:nodeFrequency=832:w=0.8( #lengths:names:part=lengths() #counts:obama:part=field.names() ) 
    
    #dirichlet:avgLength=120.36047466241781:collectionLength=415926924:documentCount=3455736:lengths=attributes:maximumCount=77:nodeFrequency=5819:w=0.2( #lengths:attributes:part=lengths() #counts:obama:part=field.attributes() ) ) )
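To make the structure of this query concrete, here is a sketch of the textbook mixture-of-language-models score it appears to encode: for each query term, a weighted sum of Dirichlet-smoothed per-field probabilities, with the logs summed over terms. This is not Galago's or Indri's actual code. The collection statistics are copied from the #dirichlet nodes above; the smoothing parameter mu=1500 and both example documents are assumptions made up for illustration.

```python
import math

# Collection statistics copied from the #dirichlet nodes quoted above.
STATS = {
    "names": {"len": 35514056, "cf": {"barack": 386, "obama": 832}},
    "attributes": {"len": 415926924, "cf": {"barack": 3367, "obama": 5819}},
}
WEIGHTS = {"names": 0.8, "attributes": 0.2}


def dirichlet(tf, field_len, cf, coll_len, mu):
    """Dirichlet-smoothed P(term | field of one document)."""
    return (tf + mu * cf / coll_len) / (field_len + mu)


def mlm_score(terms, doc, stats=STATS, weights=WEIGHTS, mu=1500.0):
    """Sum over terms of log(sum over fields of weight * P(term | field))."""
    score = 0.0
    for term in terms:
        mixture = 0.0
        for field, weight in weights.items():
            tf = doc[field]["tf"].get(term, 0)
            mixture += weight * dirichlet(
                tf, doc[field]["len"],
                stats[field]["cf"][term], stats[field]["len"], mu)
        score += math.log(mixture)
    return score


# Hypothetical documents: one mentioning both terms, one mentioning neither.
matching = {"names": {"len": 20, "tf": {"barack": 7, "obama": 8}},
            "attributes": {"len": 180, "tf": {"barack": 2, "obama": 3}}}
empty = {"names": {"len": 20, "tf": {}},
         "attributes": {"len": 180, "tf": {}}}

print(mlm_score(["barack", "obama"], matching))
print(mlm_score(["barack", "obama"], empty))
```

A single mu is used for all fields here for simplicity; a per-field mu would be a small change to `dirichlet`'s caller.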
    

    I noticed that Galago computes the average length of a field over only the documents with a non-empty value for that field, i.e., collectionLength / documentCount, where documentCount varies from field to field. (In my collection, any field except "names", which is required, may be empty in a given document.) This does not seem reasonable and may affect the estimates of priors in the language models. I tried to run Galago's queries with correct priors by replacing the numbers in the query, but it did not work out.
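The effect of the denominator can be checked with the numbers in the query above. The sketch below assumes that the documentCount of the required "names" field equals the total number of documents in the collection; that is an inference from this post, not something Galago reports.

```python
# Statistics copied from the #dirichlet nodes quoted above.
names = {"collectionLength": 35514056, "documentCount": 8610430}
attributes = {"collectionLength": 415926924, "documentCount": 3455736}

# Assumption: "names" is required, so its documentCount is the collection size.
total_docs = names["documentCount"]

# Average over documents that have a non-empty value for the field.
names_avg = names["collectionLength"] / names["documentCount"]
attr_avg_nonempty = attributes["collectionLength"] / attributes["documentCount"]

# Average over all documents, counting empty fields as length 0.
attr_avg_all = attributes["collectionLength"] / total_docs

print(names_avg, attr_avg_nonempty, attr_avg_all)
```

Under that assumption the average "attributes" length drops from roughly 120 tokens to roughly 48 when empty fields are included, which is the kind of gap that could shift length-dependent smoothing.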

    Aside from that, I can't explain why the results are so different (namely, Galago's results are significantly worse than Indri's on my gold standard).

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2015-01-07

    It turns out PRMS2Traversal ignores stemming. What's the difference between field parts and extents?

    The relevant excerpt from PRMS2Traversal:

    // if we have access to the correct field-part:
    if (this.retrieval.getAvailableParts().containsKey("field." + field)) {
        NodeParameters par1 = new NodeParameters();
        par1.set("default", term);
        par1.set("part", "field." + field);
        termFieldCounts = new Node("counts", par1, new ArrayList());
    } else {
        // otherwise use an #inside op
        NodeParameters par1 = new NodeParameters();
        par1.set("default", term);
        termExtents = new Node("extents", par1, new ArrayList());
        termExtents = TextPartAssigner.assignPart(termExtents, globals, this.retrieval.getAvailableParts());

        termFieldCounts = new Node("inside");
        termFieldCounts.addChild(StructuredQuery.parse("#extents:part=extents:" + field + "()"));
        termFieldCounts.addChild(termExtents);
    }
    
     
  • David Fisher

    David Fisher - 2015-01-07

    The missing stemming appears to be a bug; please file it in the tracker. The offending code is most likely the implementation of the inside Node.

    If the field data was indexed separately as an extra part, it can be accessed through the field.<fieldname> part. If not, the field data is indexed in the extents part.
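The choice between the two parts can be mirrored in a few lines. This is a sketch that imitates the branch in the PRMS2Traversal excerpt quoted earlier in this thread, written in Python for brevity; the function name is hypothetical, and the `term_part` default of "postings" is an assumption, since in the excerpt the term's part is chosen by TextPartAssigner.

```python
def term_field_counts(term, field, available_parts, term_part="postings"):
    """Return a query-node string for counting `term` inside `field`.

    Mirrors the two branches of the excerpt: use the dedicated field part
    when the index has one, otherwise fall back to an #inside node over the
    extents part. `term_part` is an assumed part name for the fallback.
    """
    field_part = "field." + field
    if field_part in available_parts:
        # Field data was indexed as a separate part.
        return f"#counts:{term}:part={field_part}()"
    # Field data lives in the extents part; child order follows the excerpt.
    return (f"#inside( #extents:part=extents:{field}() "
            f"#extents:{term}:part={term_part}() )")


print(term_field_counts("barack", "names", {"field.names", "extents"}))
print(term_field_counts("barack", "names", {"postings", "extents"}))
```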

     
  • Nikita Zhiltsov

    Nikita Zhiltsov - 2015-01-07

    I notice that the first block is always called when stemmedPostings=true and nonStemmedPostings=false, and the second block is called when stemmedPostings=true and nonStemmedPostings=true.

    Done: https://sourceforge.net/p/lemur/bugs/249/

     

    Last edit: Nikita Zhiltsov 2015-01-07
