I got that error when trying to use that XQuery function!
What am I missing?
To be specific, I extract names like: "SOPASKY DAVID Jr ( ZINGO )"
What I want is to split these words into separate variables:
name = SOPASKY (always the first word)
surname = DAVID Jr (one or more words up to the opening parenthesis)
knownAs = ZINGO (the words inside the parentheses)
I tried many XQuery functions, but I still get the same "unknown function" error.
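For what it's worth, the splitting itself only needs the standard XQuery string functions. A minimal standalone sketch (the sample string is the one from the question, the element names are illustrative), using the same substring-before/substring-after approach as the config posted further down:

let $full := 'SOPASKY DAVID Jr ( ZINGO )'
let $name := substring-before($full, ' ')                                              (: always the first word :)
let $surname := normalize-space(substring-before(substring-after($full, ' '), '('))    (: words up to '(' :)
let $knownAs := normalize-space(substring-before(substring-after($full, '('), ')'))    (: words inside the parentheses :)
return
    <person>
        <name>{$name}</name>
        <surname>{$surname}</surname>
        <knownAs>{$knownAs}</knownAs>
    </person>

None of this needs regex support, so it should work even on an engine that only provides the XQuery 1.0 core functions.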
Paste example code so we can see.
Here's the code:

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <include path="functions.xml"/>

    <!-- collects all tables for individual products -->
    <var-def name="user">
        <call name="download-multipage-list">
            <call-param name="pageUrl">${url}</call-param>
            <call-param name="nextXPath">//div[starts-with(., 'next')]/@href</call-param>
            <call-param name="itemXPath">//div[@id="content"]/table/tbody/tr/td[1]/div[1]/a/@href</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extract desired data -->
    <file action="write" path="${xmlPath}" charset="UTF-8">
        <![CDATA[ <?xml version="1.0" encoding="UTF-8"?><users> ]]>
        <loop item="item" index="i">
            <list><var name="user"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;
                        let $fullName := substring-after(normalize-space(data($item//div[@id="content"]/div/div[2]/h1)), 'Full Name :')
                        let $p1 := substring-before(data($fullName), ' ')
                        let $p2 := substring-before(substring-after(data($fullName), ' '), '(')
                        let $p3 := substring-before(substring-after(data($fullName), '('), ')')
                        return
                            <user>
                                <name>{normalize-space($p1)}</name>
                                <surname>{normalize-space($p2)}</surname>
                                <nickname>{normalize-space($p3)}</nickname>
                            </user>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </users> ]]>
    </file>
</config>
The problem is that "Full Name: " is sometimes written in different ways:
upper/lower case, with/without a space or the ':'... How can I handle all
these cases? A regex would help a lot, but when I try regex functions such as
substring-after-match(), it says those functions are not supported!
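As far as I know, substring-after-match() is not a built-in XQuery 1.0 function (it looks like one of the FunctX helper functions), which would explain the "unknown function" error. The standard regex functions matches(), replace() and tokenize() are part of XQuery 1.0 / XPath 2.0, so if the engine bundled with Web-Harvest exposes them (I haven't checked which version it ships), a single case-insensitive replace() can strip all the label variants. A sketch only; the sample string and the regex are illustrative:

(: a heading as it might appear on the page, with messy label casing/spacing :)
let $heading := 'full name :  SOPASKY DAVID Jr ( ZINGO )'

(: strip any variant of the label: any case, optional spaces, optional colon :)
let $fullName := normalize-space(replace($heading, '^\s*full\s*name\s*:?\s*', '', 'i'))

return $fullName   (: -> "SOPASKY DAVID Jr ( ZINGO )" :)

If even replace() is reported as unknown, the engine is limited to the 1.0-style core functions and the label variants have to be handled case by case with substring-after().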
Also, to test the code, I use <loop item="item" index="i" maxloops="10">...
The code works, but when I changed maxloops to 100 I got an error:
If I understand correctly, that means the XML is dirty, but isn't the output of <html-to-xml/> supposed to be well-formed?
Maybe the problem is that there aren't that many pages. You have to set the
number of loops according to the pager.
Absolutely not, the webpage contains more than 500 links to retrieve...
Here's the actual config for a similar website that contains more than 2000
links; it works with maxloops="200" and gives the same error for 300:
http://pastebin.com/CtBhsDA1
I ran the config from the last link, and the error happens at:
http://www.dictionaryofarthistorians.org/brisacc.htm
As you can see, the HTML there is not well formed and contains broken tags (stray "/>" and the like).
You can set some try-catch blocks to get past that loop iteration (see the skeleton below).
I couldn't run the first config that you pasted, because I don't know the url
that you are crawling, but I think the problem is similar.
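For reference, the try/catch construct used later in this thread is just a <try> element with a <body> part and a <catch> part: the processors in <body> run normally, and if one of them throws, execution continues in <catch> instead of aborting the whole config. A bare skeleton (the URL variable is a placeholder):

<try>
    <body>
        <!-- processors that may fail, e.g. fetching and cleaning one problematic page -->
        <html-to-xml>
            <http url="${someItemUrl}"/>
        </html-to-xml>
    </body>
    <catch>
        <!-- runs only when something in <body> throws; leave it empty to simply skip that item -->
    </catch>
</try>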
Oh my, that's great you found the link causing all this xD
I tried to put in a try/catch block, but it's not working... Unfortunately there are
no examples of how it works (other than the manual), so could you please
suggest where to put it in the same example above:
http://pastebin.com/CtBhsDA1
I tried many ways to do it but I'm still getting the same error... Could you please
help me make it so that it also displays the problematic URLs?
Put the whole <xquery>...</xquery> in the body part of the try block. It should work:
<loop item="elmtURL" index="i" maxloops="300">
    <list>
        <xpath expression="//table/tbody/tr/td/ul/li/font/a/@href">
            <html-to-xml omitcomments="true" outputtype="pretty">
                <http url="${url}"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <try>
            <body>
                <xquery>
                    <xq-param name="doc" type="node()">
                        <html-to-xml omitcomments="true" outputtype="pretty">
                            <http url="${sys.fullUrl(url.toString(), elmtURL.toString())}"/>
                        </html-to-xml>
                    </xq-param>
                    <xq-expression><![CDATA[
                        ...
                        <DateBorn>{normalize-space($dateBorn)}</DateBorn>
                        </pers>
                    ]]></xq-expression>
                </xquery>
            </body>
            <catch>
            </catch>
        </try>
    </body>
</loop>

Ow, I tried it in many ways, but not like that x_o
Thanks
No problem :) You have to set it up like this because the body part is executed for
every loop cycle. Now you can put some code in the catch block if you want to
handle the items that throw an exception.
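To tie this back to the earlier request to display the problematic URLs: a sketch of one way to do it (the errors.log path is made up, and it assumes the file processor's append action is available in your version) is to record the current loop variable inside <catch>:

<catch>
    <!-- sketch: log the URL that failed, then let the loop continue with the next item -->
    <file action="append" path="errors.log" charset="UTF-8">
        <template>Failed to process: ${elmtURL}
</template>
    </file>
</catch>

Since elmtURL is the loop variable from the config above, the log ends up with one line per page that could not be cleaned and queried.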