RegularExpressionAnalyser cannot be correctly applied on headers field

An open source search engine with RESTFul API and crawlers

Brought to you by: emmanuel_keller

#198 RegularExpressionAnalyser cannot be correctly applied on headers field

Milestone: v1.5.x

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-02-20

Created: 2015-02-19

Creator: Guido

Private: No

Dear Sir/Madam,

We serve documents from a database on our website. We would like to index and store the database ID of each document by supplying it as an "X-OSS-ItemKey" response header, but we ran into several issues. Either I have misunderstood how to configure this, or I have found one or more bugs. I would be grateful if you could look at my workflow and tell me whether I am doing anything wrong; if these are bugs, fixing them would improve your product.

The summary of the problem:
I create a field that stores all the response headers; this works correctly. I then create a second field to hold the database ID. The ID should be captured by copying the headers field and applying a regular expression analyser to it. I cannot get the regular expression analyser to be applied correctly to the headers field.

For step-by-step instructions on how to reproduce this problem, see below.

As I mentioned before, any help or instructions would be greatly appreciated.

Yours sincerely,
Guido Senff

1. Define a field that stores and indexes all headers
Schema
Fields
name: headerinformation
indexed: yes
stored: yes
termvector: no
analyser: none
copy of: none

2. Index all header information into this field. (Unfortunately, no regular expression filter can be applied to the headers at this stage. Feature request?)
Crawler
Web
Field mapping
URL field: headers
Index field: headerinformation

3. Create an analyser to extract the part of the headers we are actually interested in: the number after X-OSS-ItemKey:
Schema
Analyser
Header Analyser - ItemKey
+ Language undefined
+ index: KeywordTokenizer
+ query: KeywordTokenizer
+ Filter list:
RegularExpressionFilter, Query and indexation, Pattern: X-OSS-ItemKey: (\d+)

4. Test this analyser on some sample header data. The result looks good: we see the 12345 key that we are interested in.
Analyser test
Enter a text to analyze:
X-OSS-Company:test
X-OSS-DocumentType:Overig
X-OSS-ItemKey: 12345
X-OSS-Keywords:
X-OSS-SearchType:

Result:
KeywordTokenizer: X-OSS-Company:test X-OSS-DocumentType:Overig X-OSS-ItemKey: 12345 X-OSS-Keywords: X-OSS-SearchType: [0,99 - 1]
RegularExpressionFilter: 1009160 [0,104 - 1]
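
The extraction that steps 3 and 4 are meant to achieve can be sketched outside OpenSearchServer. The snippet below is only an illustration in Python, not OpenSearchServer code: it assumes the whole header block reaches the filter as a single string (as a keyword tokenizer would pass it on) and that only the first capture group should be kept. The helper name `extract_item_key` is hypothetical.

```python
import re

# Pattern taken from the "Header Analyser - ItemKey" analyser in step 3.
ITEM_KEY_PATTERN = re.compile(r"X-OSS-ItemKey: (\d+)")


def extract_item_key(headers):
    """Return the digits captured after 'X-OSS-ItemKey: ', or None if absent.

    Hypothetical helper for illustration; not part of OpenSearchServer.
    """
    match = ITEM_KEY_PATTERN.search(headers)
    return match.group(1) if match else None


# Sample header block from the analyser test, kept as one string.
sample_headers = (
    "X-OSS-Company:test\n"
    "X-OSS-DocumentType:Overig\n"
    "X-OSS-ItemKey: 12345\n"
    "X-OSS-Keywords:\n"
    "X-OSS-SearchType:\n"
)

print(extract_item_key(sample_headers))
```
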

5. Now create a field into which the "X-OSS-ItemKey" value is copied: a copy of the "headerinformation" field, with the regular expression from the "Header Analyser - ItemKey" analyser applied.
Schema
Fields
name: ItemKeyHeader
indexed: yes
stored: yes
termvector: no
analyser: Header Analyser - ItemKey
copy of: headerinformation

6. Do a manual crawl of a web page to test the newly added "X-OSS-ItemKey" value parser.

Crawler
Web
Manual crawl
URL to crawl: <our internal URL>

Crawl Result
Index document
ItemKeyHeader:
X-OSS-SearchType:
X-OSS-Company: test
X-OSS-ItemKey: 12345
X-OSS-DocumentType: Overig
X-OSS-Keywords:
.. all other headers ..

headerinformation:
X-OSS-SearchType:
X-OSS-Company: test
X-OSS-ItemKey: 12345
X-OSS-DocumentType: Overig
X-OSS-Keywords:
.. all other headers ..

BUG? This is not what I expected. The "headerinformation" field seems correct: it contains all headers. But "ItemKeyHeader" also contains all headers! I would expect it to contain only the "X-OSS-ItemKey" value, which is 12345.

7. Check the actual index with Luke, the Lucene index toolbox
Luke - Lucene Index Toolbox
Search
Enter search expression: *:*

Results:
Row with my document
ItemKeyHeader: X-OSS-ItemKey: 12345
headerinformation: .. all headers ..
.. other indexed fields ..

BUG? This appears incorrect; I would expect to see 12345 here. Did OSS ignore my capture group and store the entire match instead?
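
The difference between the entire match and the capture group can be shown with a small Python sketch. It is not a statement about how OpenSearchServer behaves internally; it only illustrates that, for this pattern, the value seen in Luke equals the full match (group 0), while the expected value is the first group. The shortened header string is an assumption made for the example.

```python
import re

# Shortened header block; an assumption for illustration only.
headers = "X-OSS-Company: test\nX-OSS-ItemKey: 12345\nX-OSS-Keywords:\n"

match = re.search(r"X-OSS-ItemKey: (\d+)", headers)

whole_match = match.group(0)  # everything the pattern matched
first_group = match.group(1)  # only the parenthesised digits

print(repr(whole_match))
print(repr(first_group))
```
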

I also tried the RegularExpressionReplace filter to replace the match with $1. This does not seem to affect the input string at all: all headers are returned as if no regular expression had been applied.
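
For comparison, this is how a regex replacement with a group reference behaves in plain Python (which writes the reference as `\1` rather than `$1`). It is a sketch under the assumption that the filter is meant to act like an ordinary regex substitution; whether RegularExpressionReplace works this way is not confirmed here. Either variant would change the input, unlike the unchanged output described above.

```python
import re

# Shortened header block; an assumption for illustration only.
headers = "X-OSS-Company: test\nX-OSS-ItemKey: 12345\nX-OSS-Keywords:\n"

# Replacing only the matched span leaves every other header in place.
partial = re.sub(r"X-OSS-ItemKey: (\d+)", r"\1", headers)

# Matching the whole input (DOTALL lets '.' cross newlines) reduces the
# string to the captured digits alone.
full = re.sub(r"\A.*X-OSS-ItemKey: (\d+).*\Z", r"\1", headers, flags=re.DOTALL)

print(repr(partial))
print(repr(full))
```
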

Side information:
For non-binary documents, we supply the database ID as an HTML meta element. We use the following configuration to capture the meta element matching "ItemKey". This regular expression also uses a group to capture the database ID, and in this configuration it works.

Schema
Parser list
Field mapping
HTML
htmlSource ItemKey <meta\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s*/?> false
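
The regular expression in this mapping can be tried in isolation. The Python sketch below applies it to two sample meta tags that are made up for the example; the helper name `meta_item_key` is hypothetical. It shows that the ID lands in group 1 or group 2 depending on the attribute order, so a caller has to check both.

```python
import re

# Regular expression copied from the HTML field mapping above.
META_ITEM_KEY = re.compile(
    r'<meta\s+(?:name="ItemKey"|content="(.[^"]*)"){1}'
    r'\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s*/?>'
)


def meta_item_key(html):
    """Return the content value of the ItemKey meta tag, or None.

    Hypothetical helper for illustration; not part of OpenSearchServer.
    """
    match = META_ITEM_KEY.search(html)
    if match is None:
        return None
    # Which group is filled depends on whether 'content' comes first or second.
    return match.group(1) or match.group(2)


# Made-up sample tags covering both attribute orders.
print(meta_item_key('<meta name="ItemKey" content="12345" />'))
print(meta_item_key('<meta content="12345" name="ItemKey">'))
```
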
