Tokenization Errors in Fields with TokenizeTagContent Set to False

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

#278 Tokenization Errors in Fields with TokenizeTagContent Set to False

Milestone: v1.x

Status: open

Owner: Lemur Project

Labels: indexing (6) tokenization (1) normalization (4)

Priority: 1

Updated: 2016-03-24

Created: 2016-03-24

Creator: Lemur Project

Private: No

When a document field has its tokenizetagcontent attribute set to false, any spaces between the begin tag and the first term, or between the last term and the end tag, are included in the normalized tokens that go into the index.

For example:

<AREA tokenizetagcontent="false"> 2003.145  </AREA>

produces two tokens (the single term split at the decimal point) of
" 2003" and "145 ".

The tokenizer needs to squeeze out such spaces.

This extends to other whitespace, such as newlines, as well:

2003.145 \n
123.0055

This produces the tokens:
" 2003"
"145 \n"
" 123"
"0055 "

This ONLY appears to happen for field tags with the tokenizeTagContent attribute defined as false.

QUESTION: What is the real purpose of the tokenizeTagContent attribute? If it means "accept the token exactly as is", then leading/trailing spaces might be acceptable, although it might also mean "take multiple terms in the field as a single term" (no tokenization).

Since the terms in the field are still processed as multiple terms, I believe any leading/trailing whitespace needs to be removed.
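
The expected result could be sketched roughly as below. This is only an illustration in Python, not the Indri tokenizer itself: the function name normalize_field_tokens is hypothetical, and the rule of splitting on '.' or runs of whitespace is an assumption inferred from the two examples in this ticket.

```python
import re
from typing import List


def normalize_field_tokens(field_text: str) -> List[str]:
    """Split field content into terms with no surrounding whitespace.

    Assumption (from the examples in this ticket): terms are separated
    by a literal '.' or by runs of whitespace (spaces, tabs, newlines).
    """
    # Split on any run of '.' and/or whitespace characters.
    pieces = re.split(r"[.\s]+", field_text)
    # Separators at the start or end of the field yield empty strings;
    # drop them so only real terms remain.
    return [piece for piece in pieces if piece]


print(normalize_field_tokens(" 2003.145  "))
print(normalize_field_tokens(" 2003.145 \n 123.0055 "))
```

With this rule the first example would yield "2003" and "145", and the second would yield "2003", "145", "123" and "0055", with no spaces or newlines attached to any token.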
