#198 RegularExpressionAnalyser cannot be correctly applied on headers field

Milestone: v1.5.x
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-02-20
Created: 2015-02-19
Creator: Guido
Private: No

Dear Sir/Madam,

We have documents in a database, which we serve on our website. We would like to index and store the database id of these documents by supplying it as an "X-OSS-ItemKey" response header, but we run into multiple issues. Either I don't understand how to configure this process, or I have run into some bugs. I would be grateful if you could look at my workflow and tell me whether I am doing anything incorrectly. If I have found bugs, perhaps you could fix them to improve your product.

The summary of the problem:
I create a field in which I store all the response headers; this works correctly. Then I create a second field, in which I want to store the database id. The id is captured by reading the headers field and applying a regular expression analyser to it. I cannot get the regular expression analyser to be applied correctly to the headers field.

For step-by-step instructions on how to reproduce this problem, see below.

As I mentioned before, any help or instructions would be greatly appreciated.

Yours sincerely,
Guido Senff

1. Define a field that stores and indexes all headers
Schema
    Fields
        name: headerinformation
        indexed: yes
        stored: yes
        termvector: no
        analyser: none
        copy of: none

2. Index all header information into this field. (Unfortunately, no regular expression filter can be applied to the headers at this stage. Feature request?)
Crawler
    Web
        Field mapping
            URL field: headers
            Index field: headerinformation

3. Create an analyser to extract the part of the headers in which we are actually interested, the number after X-OSS-ItemKey:
Schema
    Analyser
        Header Analyser - ItemKey
            + Language undefined
            + index: KeywordTokenizer
            + query: KeywordTokenizer
            + Filter list:
                RegularExpressionFilter, Query and indexation, Pattern: X-OSS-ItemKey: (\d+)

4. Test this analyser on some sample header data, the results seem good! We see the 12345 key that we are interested in!
Analyser test
    Enter a text to analyze:
        X-OSS-Company:test
        X-OSS-DocumentType:Overig
        X-OSS-ItemKey: 12345
        X-OSS-Keywords:
        X-OSS-SearchType:

    Result:
        KeywordTokenizer: X-OSS-Company:test X-OSS-DocumentType:Overig X-OSS-ItemKey: 12345 X-OSS-Keywords: X-OSS-SearchType: [0,99 - 1]
        RegularExpressionFilter:    1009160 [0,104 - 1]
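
For comparison outside OpenSearchServer, the pattern from step 3 can be checked against the sample headers with a plain regex engine. This is only a minimal sketch in Python (an assumption for illustration; OSS is Java-based and its filter may handle groups differently), and `extract_item_key` is a hypothetical helper name, not part of OSS.

```python
import re

# Pattern copied from step 3; group 1 is the numeric id after the header name.
ITEM_KEY_PATTERN = re.compile(r"X-OSS-ItemKey: (\d+)")

# Sample header data from the analyser test in step 4.
SAMPLE_HEADERS = "\n".join([
    "X-OSS-Company:test",
    "X-OSS-DocumentType:Overig",
    "X-OSS-ItemKey: 12345",
    "X-OSS-Keywords:",
    "X-OSS-SearchType:",
])


def extract_item_key(headers):
    """Return the captured id (group 1), or None when the header is absent."""
    match = ITEM_KEY_PATTERN.search(headers)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_item_key(SAMPLE_HEADERS))
```

With this sample the helper returns only the captured digits, which is the value the ItemKeyHeader field is expected to hold.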

5. Now create a field into which the "X-OSS-ItemKey" value is copied: a copy of field "headerinformation", with the regular expression from the "Header Analyser - ItemKey" analyser applied.
Schema
    Fields
        name: ItemKeyHeader
        indexed: yes
        stored: yes
        termvector: no
        analyser: Header Analyser - ItemKey
        copy of: headerinformation

6. Do a manual crawl of a web page to test the newly added "X-OSS-ItemKey" value parser.

Crawler
    Web
        Manual crawl
            URL to crawl: <our internal URL>

    Crawl Result
        Index document
            ItemKeyHeader:
                X-OSS-SearchType:
                X-OSS-Company: test
                X-OSS-ItemKey: 12345
                X-OSS-DocumentType: Overig
                X-OSS-Keywords:
                .. all other headers ..

            headerinformation:
                X-OSS-SearchType:
                X-OSS-Company: test
                X-OSS-ItemKey: 12345
                X-OSS-DocumentType: Overig
                X-OSS-Keywords:
                .. all other headers ..

BUG? This is not what I expected. Field "headerinformation" seems correct: it contains all headers. But "ItemKeyHeader" also contains all headers! I would expect to see only the "X-OSS-ItemKey" value here, which is 12345.

7. Check the actual index with Luke, the Lucene index toolbox
    Luke - Lucene index toolbox
        Search
            Enter search expression: *:*

        Results:
            Row with my document
                ItemKeyHeader: X-OSS-ItemKey: 12345
                headerinformation: .. all headers ..
                .. other indexed fields ..

BUG? This appears incorrect; I would expect to see 12345 here. Did OSS ignore my group and match the entire pattern?

I also tried the RegularExpressionReplace filter, to replace the match with $1. This does not seem to affect the input string at all: all the headers are returned, as if there were no regular expression.
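
The difference between the whole match and the capture group, and what a replacement with the group does, can be illustrated with general regex semantics. The sketch below uses Python's `re` module as a stand-in (an assumption; it does not reproduce OSS filter internals, and Python writes the back-reference as `\1` where Java-style patterns use `$1`). The header string is a shortened version of the sample data.

```python
import re

HEADERS = "X-OSS-Company: test\nX-OSS-ItemKey: 12345\nX-OSS-Keywords:"
PATTERN = r"X-OSS-ItemKey: (\d+)"

match = re.search(PATTERN, HEADERS)
whole_match = match.group(0)  # the entire matched text, header name included
captured_id = match.group(1)  # only the digits inside the parentheses

# Replacing just the matched part keeps every other header in the string.
partial = re.sub(PATTERN, r"\1", HEADERS)

# To reduce the whole input to the id, the pattern has to consume all of it.
# DOTALL lets '.' cross the newlines between headers.
full = re.sub(r"\A.*X-OSS-ItemKey: (\d+).*\Z", r"\1", HEADERS, flags=re.DOTALL)

if __name__ == "__main__":
    print(repr(whole_match))
    print(repr(captured_id))
    print(repr(partial))
    print(repr(full))
```

The value Luke shows for ItemKeyHeader has the same form as `whole_match` here, which is consistent with the whole match being indexed instead of group 1, but this sketch cannot confirm what OSS actually does.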

Side information:
For non-binary documents, we supply the database id as an HTML meta element. We use the following configuration to catch the
meta element named "ItemKey". This regular expression also uses a group to capture the database id, and
in this configuration it works.

Schema
    Parser list
            Field mapping
                HTML
                    htmlSource ItemKey <meta\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s*/?> false
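
As a cross-check of the meta pattern above, here is a minimal Python sketch that applies the same regular expression to two hypothetical meta tags, one with each attribute order. The helper name `item_key_from_html` and the sample tags are assumptions for illustration, and the trailing `false` in the mapping is assumed to be a separate mapping option, not part of the pattern.

```python
import re

# Pattern copied from the HTML field mapping, split over two lines for readability.
META_PATTERN = re.compile(
    r'<meta\s+(?:name="ItemKey"|content="(.[^"]*)"){1}\s+'
    r'(?:name="ItemKey"|content="(.[^"]*)"){1}\s*/?>'
)


def item_key_from_html(html):
    """Return the content value of the ItemKey meta tag, or None if not found."""
    match = META_PATTERN.search(html)
    if match is None:
        return None
    # Group 1 is set when content comes first, group 2 when name comes first.
    return match.group(1) or match.group(2)


if __name__ == "__main__":
    print(item_key_from_html('<meta name="ItemKey" content="12345" />'))
    print(item_key_from_html('<meta content="12345" name="ItemKey">'))
```

Because the id can land in either group depending on attribute order, the helper takes whichever group is set.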
