We are seeing a problem where sentences defined by offset annotations seem to lose the last token in the sentence when the last term is of length one. Here's a simple example:
<doc>
<docno>1022921261</docno>
<text>
Ms. APersonX is a 31-year old female. Her weight is 180 lbs and, at 5 2 in height, her BMI is 32 l. She complains of irritation and chafing.
</text>
</doc>
Here are the sentence byte offsets, in Indri offset annotation format, that can be input to the buildindex utility:
1022921261 TAG 1 SENTENCE 13 10 0
1022921261 TAG 2 SENTENCE 39 36 0
1022921261 TAG 3 SENTENCE 77 61 0
1022921261 TAG 4 SENTENCE 140 39 0
The third sentence is indexed without the last token "l".
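For reference, the relationship between the raw text and such start/length pairs can be sketched as follows (Python, illustrative only, not Indri code; real offsets depend on the exact bytes of the indexed file, markup included, so these values will not match the posted lines, and the field layout docno/TAG/id/name/start/length/value is assumed from them):

```python
# Illustrative sketch, not Indri code: compute sentence byte offsets for the
# example document and print them in an offset-annotation-style layout.
text = ("Ms. APersonX is a 31-year old female. "
        "Her weight is 180 lbs and, at 5 2 in height, her BMI is 32 l. "
        "She complains of irritation and chafing.")
docno = "1022921261"
sentences = [
    "Ms. APersonX is a 31-year old female.",
    "Her weight is 180 lbs and, at 5 2 in height, her BMI is 32 l.",
    "She complains of irritation and chafing.",
]
for n, s in enumerate(sentences, 1):
    begin = text.index(s)               # byte offset where the sentence starts
    print(docno, "TAG", n, "SENTENCE", begin, len(s), 0)
```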
The byte offsets do not line up with the posted version (forum formatting plays hell with data). Please attach the example as a file and post it to the Bugs tracker (https://sourceforge.net/p/lemur/bugs/). Include all of the relevant information concerning version, OS, indexing parameters, etc.
Thank you for the quick reply. I've attached a (slightly modified) example including indexing parameters. We were able to work around this issue with tokens at the end of a sentence by making the following change in TextTokenizer.l, in the routine processASCIIToken(...), within the block if ( _tokenize_entire_words ) {
change
writeToken( toktext, token_len, byte_position - tokleng, byte_position);
to
writeToken( toktext, token_len, byte_position - tokleng, byte_position - tokleng + token_len );
We're not sure if this is the best fix, or whether the regular expressions used in tokenization should be changed instead.
I'm working off the Indri 2.7 code base. The OS is Red Hat Enterprise Linux Server release 6.2 (Santiago).
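To illustrate the arithmetic in that change (a hedged sketch, not Indri code; the byte values are hypothetical): when the raw match length tokleng differs from the normalized token length token_len, the original call reports an end offset past the token text itself.

```python
# Hypothetical sketch of the writeToken end-offset arithmetic above.
# Suppose the lexer matched 2 raw bytes ("l.") ending at byte_position 100,
# but the normalized token is just "l" (token_len == 1).

def end_old(byte_position, tokleng, token_len):
    return byte_position                         # original: end of the raw match

def end_new(byte_position, tokleng, token_len):
    return byte_position - tokleng + token_len   # proposed: start + token length

byte_position, tokleng, token_len = 100, 2, 1
start = byte_position - tokleng                  # 98: start of the token
print(start, end_old(byte_position, tokleng, token_len))   # span [98, 100)
print(start, end_new(byte_position, tokleng, token_len))   # span [98, 99)
```

With the original call, the token's recorded span covers the whole raw match, trailing punctuation included, which can push its end past an annotation boundary that stops at the letter itself.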
OK, 2.7 is rather ancient. I am using the current head, but it should match up for this problem.
Changing the tokenizer is not a good idea.
The fault lies in the OffsetAnnotationAnnotator, where it rounds down, rather than up, when encountering an annotation within a token. Any token of length 2 or less will be dropped.
This change resolves the problem.
Old:
harvey:~/Development/test/test-oa$ ../../indri/runquery/IndriRunQuery -index=test_oa/ -query="#combine[sentence](she is of)" -printPassages=true
-3.02944 1022921261 25 31 She complains of irritation and chafing
-3.03539 1022921261 0 1 1022921261
-3.03686 1022921261 1 9 Ms. APersonX is a 31-year old female
-3.03832 1022921261 9 24 Her weight is 180 lbs and, at 5 2 in height, her BMI is 32
New:
harvey:~/Development/test/test-oa$ ../../indri/runquery/IndriRunQuery -index=test_oa/ -query="#combine[sentence](she is of)" -printPassages=true
-3.06196 1022921261 25 31 She complains of irritation and chafing
-3.06818 1022921261 0 1 1022921261
-3.0696 1022921261 1 9 Ms. APersonX is a 31-year old female
-3.07142 1022921261 9 25 Her weight is 180 lbs and, at 5 2 in height, her BMI is 32 l
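The down-rounding effect described above can be sketched standalone (Python, illustrative only, not Indri code; the byte spans are hypothetical, not taken from the actual index):

```python
# Illustrative sketch, not Indri code: an annotation that ends inside a
# token's recorded span keeps or drops that token depending on rounding.

def tokens_in_annotation(spans, ann_end, round_up):
    """spans: list of (begin, end) token byte spans; annotation covers [0, ann_end)."""
    kept = []
    for begin, end in spans:
        if end <= ann_end:
            kept.append((begin, end))        # token fully inside the annotation
        elif begin < ann_end and round_up:
            kept.append((begin, end))        # boundary inside token: round up
    return kept

# Hypothetical spans for "... is 32 l." where the final token "l" was written
# with an end offset past the letter (the tokenizer issue above), while the
# SENTENCE annotation ends at byte 100.
spans = [(93, 95), (96, 98), (99, 101)]      # "is", "32", "l"
print(len(tokens_in_annotation(spans, 100, round_up=False)))  # 2 -- "l" dropped
print(len(tokens_in_annotation(spans, 100, round_up=True)))   # 3 -- "l" kept
```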
Thanks, David. Would a similar change for the start tag also be ok? Something like the following:
// When the tag begins in the middle of the token, we need to
// decide whether to round up (activate the tag at this token
// position) or round down (activate the tag at tok_pos + 1).
if ( (*curr_raw_tag)->begin <= (*token).begin + ((*token).end - (*token).begin) / 2 ) {
  // Tag either begins before the token, or is closer to the begin
  // than to the end of the token, so we are rounding up. Begin
  // value will be set to the current token position.
  te->begin = tok_pos;
}
// else {
//   // Tag begins closer to where the token ends, so we'll round down.
//   te->begin = tok_pos + 1;
// }
yes.
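The midpoint rule in that snippet can be exercised standalone (Python sketch, illustrative only; offsets and token position are hypothetical):

```python
# Illustrative sketch, not Indri code: the begin-tag midpoint rule. A tag
# beginning inside a token activates at this token (round up) when its begin
# is at or before the token midpoint, otherwise at the next token (round down).

def begin_token_position(tag_begin, tok_begin, tok_end, tok_pos):
    if tag_begin <= tok_begin + (tok_end - tok_begin) // 2:
        return tok_pos          # round up: tag starts at this token
    return tok_pos + 1          # round down: tag starts at the next token

# Token spanning bytes [10, 14), at token position 7; the midpoint is byte 12.
print(begin_token_position(10, 10, 14, 7))  # 7 -- tag begins at token start
print(begin_token_position(12, 10, 14, 7))  # 7 -- at the midpoint, round up
print(begin_token_position(13, 10, 14, 7))  # 8 -- past midpoint, round down
```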
Thanks, this has been working well for us, with one exception. When the sentence begins with a single '>' (greater-than) character, the '>' is omitted from the sentence. I think the reason is that there isn't a token created for the '>'. I've tried replacing the '>' with various entities (&gt;, &lt;, etc.), but none result in a token for the '>' character. Is this possible?
If you want the '>' character to appear as a token, you will have to convert it to an indexable term (XML entities are discarded, just like punctuation). For example, you could substitute it with GREATERTHAN (or some other otherwise-OOV term).
Note that doing so changes your collection statistics (adding a term).
If you are simply trying to capture that the sentence is quoted text, e.g. from an email or Usenet-style posting from back in the day, you would do better to add a second offset annotation for QUOTE, at the same span as the SENTENCE annotation.
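Both options can be sketched as follows (Python, illustrative only, not Indri code; GREATERTHAN and the annotation field layout follow the suggestions and examples above, and the byte offsets apply to this standalone line only):

```python
# Illustrative sketch, not Indri code, of the two options above for a
# quoted line that begins with '>'.

docno = "1022921261"
line = "> She complains of irritation and chafing."

# Option 1: substitute '>' with an otherwise-OOV term so it survives
# tokenization (note this adds a term to the collection statistics).
indexed = line.replace(">", "GREATERTHAN", 1)
print(indexed)

# Option 2: leave the text alone and add a QUOTE annotation over the same
# span as the SENTENCE annotation.
begin, length = 0, len(line)
print(docno, "TAG", 1, "SENTENCE", begin, length, 0)
print(docno, "TAG", 2, "QUOTE", begin, length, 0)
```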