Unknown system function substring-after-match

Help
batis
2012-07-13
2012-09-04
  • batis

    batis - 2012-07-13

    I got the error above when trying to use that XQuery function!

    What am I missing?

     
  • batis

    batis - 2012-07-13

    To be specific, I extract names like: "SOPASKY DAVID Jr ( ZINGO )"

    What I want is to split these words into separate variables:

    name = SOPASKY (always the 1st word)

    surname = DAVID Jr (1 or more till the opening parenthesis)

    knownAs = ZINGO (words within the parenthesis)

    I tried many XQuery functions, but I still get the same "unknown
    function" error.

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-13

    Paste example code so we can see.

     
  • batis

    batis - 2012-07-13

    Here's the code:

    <?xml version="1.0" encoding="UTF-8"?>
    
    <config charset="UTF-8">
    
        <include path="functions.xml"/>
    
        <!-- collects all tables for individual products -->
        <var-def name="user">    
            <call name="download-multipage-list">
                <call-param name="pageUrl">${url}</call-param>
                <call-param name="nextXPath">//div[starts-with(., 'next')]/@href</call-param>
                <call-param name="itemXPath">//div[@id="content"]/table/tbody/tr/td[1]/div[1]/a/@href</call-param>
                <call-param name="maxloops">10</call-param>
            </call>
        </var-def>
    
        <!-- iterates over all collected products and extract desired data -->
        <file action="write" path="${xmlPath}" charset="UTF-8">
            <![CDATA[
                <?xml version="1.0" encoding="UTF-8"?>
                <users>
            ]]>
            <loop item="item" index="i">
                <list><var name="user"/></list>
                <body>
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            declare variable $item as node() external;
    
                            let $fullName := substring-after(normalize-space(data($item//div[@id="content"]/div/div[2]/h1)), 'Full Name :')
                            let $p1 := substring-before(data($fullName), ' ')
                            let $p2 := substring-before(substring-after(data($fullName), ' '), '(')
                            let $p3 := substring-before(substring-after(data($fullName), '('), ')')
                            return
                                <user>
                                    <name>{normalize-space($p1)}</name>
                                    <surname>{normalize-space($p2)}</surname>
                                    <nickname>{normalize-space($p3)}</nickname>
                                </user>
                        ]]></xq-expression>
                    </xquery>
                </body>
            </loop>
            <![CDATA[</users>]]>
        </file>
    </config>
    

    The problem is that "Full Name: " is sometimes written in different ways:
    upper/lower case, with or without a space or ':' ... How can I handle all
    of these cases? A regex would help a lot, but when I try functions that use
    regex, like substring-after-match(), it says these functions are not
    supported!
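
    One possible workaround, sketched here under assumptions: as far as I know,
    substring-after-match() is not part of the final XPath 2.0 / XQuery 1.0
    function library, but replace() is, and it accepts an 'i' flag for
    case-insensitive matching. Assuming the XQuery engine bundled with
    Web-Harvest supports it, the first "let" in the expression above could be
    swapped for something like this (the $raw variable and the exact pattern
    are only illustrative):

```xml
<xq-expression><![CDATA[
    declare variable $item as node() external;

    (: same heading as in the config above, whitespace collapsed :)
    let $raw := normalize-space(data($item//div[@id="content"]/div/div[2]/h1))

    (: assumption: the label is always some spelling of "Full Name",
       with optional spaces and an optional colon, in any letter case :)
    let $fullName := replace($raw, '^\s*full\s*name\s*:?\s*', '', 'i')

    return <fullName>{$fullName}</fullName>
]]></xq-expression>
```

    The rest of the expression ($p1, $p2, $p3) can stay as it is, since it
    only depends on $fullName.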

     
  • batis

    batis - 2012-07-13

    Also, to test the code, I use <loop item="item" index="i" maxloops="10"> ...
    And the code is working, but when I changed maxloops to 100 I got an error:

    SXXP0003: Error reported by XML parser: Element type "span" must be followed by either attribute specifications, ">" or "/>".
    org.webharvest.exception.ScraperXQueryException: Error executing XQuery expression (XQuery = [declare variable................
    

    If I understand correctly, that means the XML is dirty. Isn't the output of <html-to-xml/> supposed to be well-formed?

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-13

    Maybe the problem is that there aren't that many pages. You have to set the
    number of loops according to the pager.

     
  • batis

    batis - 2012-07-13

    Absolutely not, the webpage contains more than 500 links to retrieve...

    Here's the actual config for a similar website that contains more than 2000
    links, which works with maxloops="200" and gives the same error for 300:
    http://pastebin.com/CtBhsDA1

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-14

    I ran the config from the last link, and the error happens at:
    http://www.dictionaryofarthistorians.org/brisacc.htm

    As you can see, the HTML is not well formed and you have tags like
    or />.

    You can add a try/catch block so the loop gets past that page.

    I couldn't run the first config that you pasted because I don't know the URL
    that you are crawling, but I think the problem is similar.

     
  • batis

    batis - 2012-07-14

    Oh my, that's great you found the link causing all this xD

    I tried to add a try/catch block, but it's not working... Unfortunately
    there are no examples of how it works (other than the manual), so could you
    please suggest where to put it in the same example as above:
    http://pastebin.com/CtBhsDA1

     
  • batis

    batis - 2012-07-14

    I tried many ways to do it but I still get the same error... Could you please
    help me make it work so that it also displays the problematic URLs?

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-14

    Put the whole <xquery>...</xquery> element in the body part of the try block. It should work.

    <loop item="elmtURL" index="i" maxloops="300">
        <list>
            <xpath expression="//table/tbody/tr/td/ul/li/font/a/@href">
                <html-to-xml omitcomments="true" outputtype="pretty">
                    <http url="${url}"/>
                </html-to-xml>
            </xpath>
        </list>
        <body>
            <try>
                <body>
                    <xquery>
                        <xq-param name="doc" type="node()">
                            <html-to-xml omitcomments="true" outputtype="pretty">
                                <http url="${sys.fullUrl(url.toString(), elmtURL.toString())}"/>
                            </html-to-xml>
                        </xq-param>
                        <xq-expression>
                                    <DateBorn>{normalize-space($dateBorn)}</DateBorn>
                                </pers>
                        ]]></xq-expression>
                    </xquery>
                </body>
                <catch>
                </catch>
            </try>
        </body>
    </loop>
     
  • batis

    batis - 2012-07-16

    Ow, I tried it in many ways, but not like that x_o

    Thanks

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-16

    No problem :) You have to set it up like this because the body part is
    executed on every loop cycle. Now you can put some code in the catch block
    if you want to handle the items that throw an exception.
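
    As a rough sketch of what the catch block could contain, here is one way to
    keep track of the problematic URLs: return a placeholder record instead of
    nothing. This assumes the result of the catch block is used in place of the
    body result when an exception occurs, and that <template> evaluates ${...}
    expressions as in the rest of the config. The <pers failed="true"> element
    and its <url> child are made up for illustration only.

```xml
<try>
    <body>
        <!-- the whole <xquery>...</xquery> element from the config above goes here -->
    </body>
    <catch>
        <!-- hypothetical placeholder record: marks the page that failed to parse.
             Note: a URL containing '&' would need escaping to keep the output well-formed. -->
        <template><![CDATA[
            <pers failed="true">
                <url>${sys.fullUrl(url.toString(), elmtURL.toString())}</url>
            </pers>
        ]]></template>
    </catch>
</try>
```

    Searching the output file for failed="true" afterwards would then list the
    pages that need a closer look.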

     
