
Is the capture-by-XPath field in the XML parser supported in v1.5.3?

  • Dave Walton - 2014-05-25

    Hi. V1.5.3, build 390f883772.

    I am totally new to the product and have just started playing with the features to see how well it matches my requirements. I have experience configuring FAST ESP, so I am not new to search engines.

    I am trying to extract a specific element of a crawled XML file into an index field. See the process I tried below.
    I succeeded in filling a custom field with ALL the content and outputting it in a query as expected simply by NOT setting the capture regexp field at all, so I am confident that I am using the mapping itself correctly.

    Question: is XPath currently supported? I can't get it to work. Is there an example showing the expected format of the Capture Reg.Exp field?

    Am I missing a link to documentation on the content-collection side? I cannot see any.

    cheers
    Dave

    Process

    I assume the process is:
    - Create the field (stored, so it can be output)
    - Edit the existing XML parser
    - Add a new field mapping:
    : Parser field = content
    : Index field = my new one
    : Capture Reg.Exp = a standard XPath (like /aaa/bbb)
    : Analyser = empty
    - Recrawl the files (after touching them)

     
  • Alexandre Toyer - 2014-05-26

    Hello Dave,

    You can definitely use XPath in OpenSearchServer 1.5.3, but not exactly with the process you are describing.
    You need to create a new parser by selecting "XML (XPATH)" from the list of available parser types. You then have to fill in the field "XPATH request for documents" with the XPath that selects the documents. For example, if your XML file contains several RSS items you can use /rss/channel/item (not a particularly good example, since RSS feeds have their own dedicated parser in OpenSearchServer :) ).
    Then, in the "Field mapping" tab, you can map an XPath request to a specific field of your schema. These requests are resolved relative to each matched document node: for example, writing "title" would map each /rss/channel/item/title node of your XML file. The sketch below shows the idea.
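
    To make that concrete, here is a minimal sketch, outside of OpenSearchServer, that evaluates the same two-level XPath scheme with Python's lxml (the RSS snippet is made up for illustration):

    from lxml import etree

    # Toy feed: each <item> plays the role of one indexed document.
    xml = b"""<rss><channel>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
    </channel></rss>"""

    root = etree.fromstring(xml)

    # "XPATH request for documents": one match = one document.
    for item in root.xpath("/rss/channel/item"):
        # A field-mapping XPath such as "title" is resolved relative to
        # the matched node, i.e. /rss/channel/item/title.
        print(item.xpath("string(title)"))
    # -> First post
    # -> Second post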

    Please tell me if you need more information, I would be glad to help you further.

    Regards,
    Alexandre

     
  • Dave Walton - 2014-05-26

    Hi, and thanks for the help. Sorry for all the questions :-)

    I started down this route before editing the XML parser, but I gave up because I assumed I would need to set a supported extension of xml on my custom parser, which it wouldn't let me do since the original parser already has it.

    I still can't get it to work. Let me summarise what I do:

    1) I have files which include content like

    <Sect>
    <CountryClassifier>Australia, Belgium</CountryClassifier>
    </Sect>

    2) I have a custom indexed/stored field called “CountryClassifier”

    3) Create a new XML (XPATH) parser called CountryParser. Leave EVERYTHING default except:
    a) ParserAttributes: Size limit = 100000
    b) ParserAttributes: XPATH request for documents = /Sect
    c) FieldMapping(1): ParserField = content
    d) FieldMapping(1): IndexField = CountryClassifier
    e) FieldMapping(1): CaptureReg.Exp = CountryClassifier

    NB: I assume this means "use parser field 'content' as the source, apply the XPath /Sect/CountryClassifier, and save the results in the field CountryClassifier". A quick sanity check of those XPaths appears below.
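
    Checked outside OSS with Python's lxml (using the sample file from point 1), the two XPaths do select what I expect:

    from lxml import etree

    xml = b"""<Sect>
    <CountryClassifier>Australia, Belgium</CountryClassifier>
    </Sect>"""

    root = etree.fromstring(xml)
    # "XPATH request for documents" = /Sect -> one document node
    for sect in root.xpath("/Sect"):
        # relative field XPath, as in FieldMapping(1)
        print(sect.xpath("string(CountryClassifier)"))
    # -> Australia, Belgium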

    I reload the touched file. I can see the fileSystemDate of the returned result update, so the content is being reloaded.

    It is confusing not to set anything on the other tabs of the custom XPATH parser, as it seems logical to have to set the supported extension/MIME type. Is an XPATH parser automatically applied to all XML files, or do I have to do anything else to get it processing in the pipeline?

    thanks for your help again

    cheers
    Dave.

    NB:
    What determines the order of parser execution when multiple parsers read the same file?
    How do I set the order?
    I assume that there is a standard set of parser fields, like "content", that get set automatically at the beginning of the parser processing pipeline?
    I assume that there are two XPath values because the first navigates to the main node, and the multiple field mappings are then expected to come from within that piece of the document, using the sub-node names only?

     
  • Dave Walton - 2014-05-26

    Oops. Point 1 above showed an XML file example with root node = "Sect" and subnode = "CountryClassifier", but the formatting was eaten by the comment box :-)

     
  • Alexandre Toyer - 2014-05-27

    Hello Dave,

    You will find a quick tutorial in the attached file. I will turn this into a page in our documentation.

    Only one parser can handle each file: you cannot create several parsers handling the same file extension.

    Please tell me if this helps you and if things are clearer now.

    EDIT: I just saw that I used the title "Document 2" twice instead of "Document 3" for the third item; don't be surprised :)

    Regards,
    Alexandre

     
    Last edit: Alexandre Toyer 2014-05-27
  • Dave Walton - 2014-05-27

    Hi,

    Thanks, I got it working. Of course, once I went down the XPATH parser route, I had to add other mappings to extract the full content for text searching as well.

    thank you for your help.

    Dave.

    ps. I have another question, but I will open another topic.

     
  • mzeid - 2017-04-10

    Hi Alexandre and Dave,

    Sorry for jumping in. Can I use the same steps with XPath on web pages? For example, can I map XPath expressions like these from a web page:

    //*[contains(concat( " ", @class, " " ), concat( " ", "field--label-hidden", " " ))]
    

    or

    //ul[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]

    or

    //strong

    to fields in OSS and have these fields indexed and mapped to Solr fields? Is this doable, or does this apply only to XML files stored on the server?
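
    For what it's worth, those expressions do evaluate against ordinary HTML; here is a quick sketch with Python's lxml and a made-up snippet (whether the OSS web crawler exposes this kind of mapping is exactly the question):

    from lxml import html

    page = html.fromstring("""
    <html><body>
    <div class="field field--label-hidden">field value</div>
    <p><strong>bold text</strong></p>
    </body></html>""")

    # Standard idiom for matching one whitespace-separated token in @class:
    # padding with spaces prevents partial matches like "label-hidden".
    cls = '//*[contains(concat(" ", @class, " "), " field--label-hidden ")]'
    print([e.text for e in page.xpath(cls)])        # ['field value']
    print([e.tag for e in page.xpath("//strong")])  # ['strong']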

    Thanks

     
    Last edit: mzeid 2017-04-10
