
Having a problem with offset annotations

Indri
Dan Jamrog
2013-07-23
2013-07-31
  • Dan Jamrog

    Dan Jamrog - 2013-07-23

    We are seeing a problem where sentences defined by offset annotations seem to lose the last token in the sentence when the last term is of length one. Here's a simple example:

    <doc>
    <docno>1022921261</docno>
    <text>
    Ms. APersonX is a 31-year old female. Her weight is 180 lbs and, at 5 2 in height, her BMI is 32 l. She complains of irritation and chafing.
    </text>
    </doc>

    Here are the sentence byte offsets, in the Indri offset annotation format, that can be input to the buildindex utility:
    1022921261 TAG 1 SENTENCE 13 10 0
    1022921261 TAG 2 SENTENCE 39 36 0
    1022921261 TAG 3 SENTENCE 77 61 0
    1022921261 TAG 4 SENTENCE 140 39 0

    The third sentence is indexed without the last token "l"

     
  • David Fisher

    David Fisher - 2013-07-23

    The byte offsets do not line up with the posted version (forum formatting plays hell with data). Please attach the example as a file and post it to the Bugs tracker (https://sourceforge.net/p/lemur/bugs/). Include all of the relevant information concerning version, OS, indexing parameters, etc.

     
  • Dan Jamrog

    Dan Jamrog - 2013-07-24

    Thank you for the quick reply. I've attached a (slightly modified) example including indexing parameters. We were able to work around this issue of tokenization of characters that appear at the end of a sentence with a change to the routine processASCIIToken(...) in TextTokenizer.l.

    Within the block

      if ( _tokenize_entire_words ) {

    change

      writeToken( toktext, token_len, byte_position - tokleng, byte_position );

    to

      writeToken( toktext, token_len, byte_position - tokleng, byte_position - tokleng + token_len );
    We're not sure if this is the best fix or whether the regular expressions used in tokenization should be changed instead.
    I'm working off the Indri 2.7 code base. The OS is Red Hat Enterprise Linux Server release 6.2 (Santiago).

     
  • David Fisher

    David Fisher - 2013-07-24

    Ok. 2.7 is rather ancient. I am using the current head, but it should match up for this problem.

    Changing the tokenizer is not a good idea.

    The fault lies in the OffsetAnnotationAnnotator, where it rounds down, rather than up, when encountering an annotation within a token. Any token of length 2 or less will be dropped.

    This change:

    harvey:~/Development/indri$ svn diff 
    Index: src/OffsetAnnotationAnnotator.cpp
    ===================================================================
    --- src/OffsetAnnotationAnnotator.cpp   (revision 2582)
    +++ src/OffsetAnnotationAnnotator.cpp   (working copy)
    @@ -291,19 +291,7 @@
             te->end = tok_pos;
    
           } else { 
    -
    
    -        // Current tag ends inside the current token.
    -
    -        if ( te->end <= (*token).begin + ( (*token).end - (*token).begin )/2 ) {
    -
    -          te->end = tok_pos; // Round down to previous token boundary.
    -
    -        } else {
    -          
    -          te->end = tok_pos + 1; // Round up to next token boundary.
    -          
    -        }
    -
    +        te->end = tok_pos + 1; // Round up to next token boundary.
           }
    
           // ensure tag boundaries are still within the document
    

    resolves the problem:

    Old:

    harvey:~/Development/test/test-oa$ ../../indri/runquery/IndriRunQuery \
    -index=test_oa/ -query="#combine[sentence](she is of)" \
    -printPassages=true
    -3.02944    1022921261  25  31
    She complains of irritation and chafing
    -3.03539    1022921261  0   1
    1022921261
    -3.03686    1022921261  1   9
    Ms. APersonX is a 31-year old female
    -3.03832    1022921261  9   24
    Her weight is 180 lbs and, at 5 2  in height, her BMI is 32
    

    New:

    harvey:~/Development/test/test-oa$ ../../indri/runquery/IndriRunQuery \
    -index=test_oa/ -query="#combine[sentence](she is of)" \
    -printPassages=true
    -3.06196    1022921261  25  31
    She complains of irritation and chafing
    -3.06818    1022921261  0   1
    1022921261
    -3.0696 1022921261  1   9
    Ms. APersonX is a 31-year old female
    -3.07142    1022921261  9   25
    Her weight is 180 lbs and, at 5 2  in height, her BMI is 32 l.
    
     
    • Dan Jamrog

      Dan Jamrog - 2013-07-24

      Thanks, David. Would a similar change for the start tag also be ok? Something like the following:

        // When the tag begins in the middle of the token, we need to
        // decide whether to round up (activate the tag at this token
        // position) or round down (activate the tag at tok_pos + 1).

        if ( (*curr_raw_tag)->begin <= (*token).begin + ( (*token).end - (*token).begin )/2 ) {

          // Tag either begins before the token, or is closer to begin
          // than to the end of the token, so we are rounding up.  Begin
          // value will be set to the current token position.
          te->begin = tok_pos;

        }

        // else {
        //
        //   // Tag begins closer to where the token ends, so we'll round down.
        //
        //   te->begin = tok_pos + 1;
        //
        // }

       
  • David Fisher

    David Fisher - 2013-07-24

    yes.

     
    • Dan Jamrog

      Dan Jamrog - 2013-07-30

      Thanks, this has been working well for us, with one exception. When the sentence begins with a single '>' (greater-than) character, the '>' is omitted from the sentence. I think the reason is that there isn't a token created for the '>'. I've tried replacing the '>' with various entities (&gt;, &lt;, etc.), but none result in a token for the '>' character. Is this possible?

       
  • David Fisher

    David Fisher - 2013-07-31

    If you want the > character to appear as a token, you will have to convert it to an indexable term (XML entities are discarded, just like punctuation). For example, you could substitute it with GREATERTHAN (or some other OOV term).

    Note that doing so changes your collection statistics (adding a term).

    If you are simply trying to capture that the sentence is quoted text, e.g. from an email or Usenet-style posting from back in the day, you would do better to add a second offset annotation for quote, at the same span as the sentence annotation.

     

