I got that error when trying to use that XQuery function!
What am I missing?
To be specific, I extract names like: "SOPASKY DAVID Jr ( ZINGO )"
What I want is to split these words into separate variables:
name = SOPASKY (always the first word)
surname = DAVID Jr (one or more words up to the opening parenthesis)
knownAs = ZINGO (the words inside the parentheses)
I tried many XQuery functions, but I still get the same "unknown function" error.
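For what it's worth, the splitting itself only needs the standard XQuery string functions. A minimal standalone sketch (the sample string is the one from the question, the element names are illustrative), using the same substring-before/substring-after approach as the config posted further down:

let $full := 'SOPASKY DAVID Jr ( ZINGO )'
let $name := substring-before($full, ' ')                                              (: always the first word :)
let $surname := normalize-space(substring-before(substring-after($full, ' '), '('))    (: words up to '(' :)
let $knownAs := normalize-space(substring-before(substring-after($full, '('), ')'))    (: words inside the parentheses :)
return
    <person>
        <name>{$name}</name>
        <surname>{$surname}</surname>
        <knownAs>{$knownAs}</knownAs>
    </person>

None of this needs regex support, so it should work even on an engine that only provides the XQuery 1.0 core functions.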
Paste example code so we can see.
Here's the code:

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <include path="functions.xml"/>

    <!-- collects all tables for individual products -->
    <var-def name="user">
        <call name="download-multipage-list">
            <call-param name="pageUrl">${url}</call-param>
            <call-param name="nextXPath">//div[starts-with(., 'next')]/@href</call-param>
            <call-param name="itemXPath">//div[@id="content"]/table/tbody/tr/td[1]/div[1]/a/@href</call-param>
            <call-param name="maxloops">10</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extract desired data -->
    <file action="write" path="${xmlPath}" charset="UTF-8">
        <![CDATA[ <?xml version="1.0" encoding="UTF-8"?><users> ]]>
        <loop item="item" index="i">
            <list><var name="user"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $item as node() external;
                        let $fullName := substring-after(normalize-space(data($item//div[@id="content"]/div/div[2]/h1)), 'Full Name :')
                        let $p1 := substring-before(data($fullName), ' ')
                        let $p2 := substring-before(substring-after(data($fullName), ' '), '(')
                        let $p3 := substring-before(substring-after(data($fullName), '('), ')')
                        return
                            <user>
                                <name>{normalize-space($p1)}</name>
                                <surname>{normalize-space($p2)}</surname>
                                <nickname>{normalize-space($p3)}</nickname>
                            </user>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </users> ]]>
    </file>
</config>
The problem is that "Full Name: " is sometimes written in different ways:
upper/lower case, with/without a space or the ':'... How can I handle all
these cases? A regex would help a lot, but when I try regex functions such as
substring-after-match(), it says those functions are not supported!
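As far as I know, substring-after-match() is not a built-in XQuery 1.0 function (it looks like one of the FunctX helper functions), which would explain the "unknown function" error. The standard regex functions matches(), replace() and tokenize() are part of XQuery 1.0 / XPath 2.0, so if the engine bundled with Web-Harvest exposes them (I haven't checked which version it ships), a single case-insensitive replace() can strip all the label variants. A sketch only; the sample string and the regex are illustrative:

(: a heading as it might appear on the page, with messy label casing/spacing :)
let $heading := 'full name :  SOPASKY DAVID Jr ( ZINGO )'

(: strip any variant of the label: any case, optional spaces, optional colon :)
let $fullName := normalize-space(replace($heading, '^\s*full\s*name\s*:?\s*', '', 'i'))

return $fullName   (: -> "SOPASKY DAVID Jr ( ZINGO )" :)

If even replace() is reported as unknown, the engine is limited to the 1.0-style core functions and the label variants have to be handled case by case with substring-after().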
Also, to test the code, I use <loop item="item" index="i" maxloops="10">...
The code works, but when I changed maxloops to 100 I got an error:
If I understand correctly, that means the XML is dirty, but isn't the output of <html-to-xml/> supposed to be well-formed?
Maybe the problem is that there aren't that many pages. You have to set the
number of loops according to the pager.
Absolutely not, the webpage contains more than 500 links to retrieve...
Here's the actual config for a similar website that contains more than 2000
links; it works with maxloops="200" and gives the same error for 300:
http://pastebin.com/CtBhsDA1
I ran the config from the last link, and the error happens at:
http://www.dictionaryofarthistorians.org/brisacc.htm
As you can see, the HTML there is not well formed and contains broken tags (stray "/>" and the like).
You can set some try-catch blocks to get past that loop iteration (see the skeleton below).
I couldn't run the first config that you pasted, because I don't know the url
that you are crawling, but I think the problem is similar.
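For reference, the try/catch construct used later in this thread is just a <try> element with a <body> part and a <catch> part: the processors in <body> run normally, and if one of them throws, execution continues in <catch> instead of aborting the whole config. A bare skeleton (the URL variable is a placeholder):

<try>
    <body>
        <!-- processors that may fail, e.g. fetching and cleaning one problematic page -->
        <html-to-xml>
            <http url="${someItemUrl}"/>
        </html-to-xml>
    </body>
    <catch>
        <!-- runs only when something in <body> throws; leave it empty to simply skip that item -->
    </catch>
</try>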
Oh my, that's great you found the link causing all this xD
I tried to put in a try/catch block, but it's not working... Unfortunately there are
no examples of how it works (other than the manual), so could you please
suggest where to put it in the same example above:
http://pastebin.com/CtBhsDA1
I tried many ways to do it but I'm still getting the same error... Could you please
help me make it so that it also displays the problematic URLs?
Put the whole <xquery>...</xquery> in the body part of the try block. It should work:
<loop item="elmtURL" index="i" maxloops="300">
    <list>
        <xpath expression="//table/tbody/tr/td/ul/li/font/a/@href">
            <html-to-xml omitcomments="true" outputtype="pretty">
                <http url="${url}"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <try>
            <body>
                <xquery>
                    <xq-param name="doc" type="node()">
                        <html-to-xml omitcomments="true" outputtype="pretty">
                            <http url="${sys.fullUrl(url.toString(), elmtURL.toString())}"/>
                        </html-to-xml>
                    </xq-param>
                    <xq-expression><![CDATA[
                        ...
                        <DateBorn>{normalize-space($dateBorn)}</DateBorn>
                        </pers>
                    ]]></xq-expression>
                </xquery>
            </body>
            <catch>
            </catch>
        </try>
    </body>
</loop>

Ow, I tried it in many ways, but not like that x_o
Thanks
No problem :) You have to set it up like this because the body part is executed for
every loop cycle. Now you can put some code in the catch block if you want to
handle the items that throw an exception.
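To tie this back to the earlier request to display the problematic URLs: a sketch of one way to do it (the errors.log path is made up, and it assumes the file processor's append action is available in your version) is to record the current loop variable inside <catch>:

<catch>
    <!-- sketch: log the URL that failed, then let the loop continue with the next item -->
    <file action="append" path="errors.log" charset="UTF-8">
        <template>Failed to process: ${elmtURL}
</template>
    </file>
</catch>

Since elmtURL is the loop variable from the config above, the log ends up with one line per page that could not be cleaned and queried.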