
I need help regarding extraction of data

Help · Malik · 2012-06-27 to 2012-09-04
  • Malik

    Malik - 2012-06-27

    Harvest relevant data from http://www.bhg.com/:

    For each and every page of the site, crawl the following data (examples in
    brackets are from http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/):

    1) Page title (e.g. "Grow Your Own Apples")

    2) Meta Name Description (e.g. "It's fun and easy to grow apples in your own
    backyard. Follow the tips below to ensure beautifully grown apples!")

    3) Breadcrumb part 2 (e.g. "Gardening")

    4) Breadcrumb part 3 (e.g. "Edible Gardening")

    5) Breadcrumb part 4 (e.g. "Fruit")

     
  • Malik

    Malik - 2012-06-27

    I have to extract this from all pages of the site. How can it be done?

    Kind regards,

    Malik

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Go through the example configuration files, and also take a look at how to use
    XQuery. You just need to use the correct paths to those elements and make some
    minor changes to one of the example configurations; I don't know exactly which
    one. Take a look at them on the official website.
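
    For example, extracting the five fields listed above could look roughly like the
    sketch below. It assumes the page has already been downloaded and converted with
    html-to-xml into a variable (here called wholepage). //title and
    //meta[@name='description'] are standard HTML, but the breadcrumb path (the div
    class name) is only a guess and has to be checked against the real bhg.com
    markup:

        <xquery>
            <!-- "wholepage" is assumed to hold the page after html-to-xml -->
            <xq-param name="page" type="node()"><var name="wholepage"/></xq-param>
            <xq-expression><![CDATA[
                declare variable $page as node() external;

                let $title  := data($page//title)
                let $desc   := data($page//meta[@name='description']/@content)
                let $crumbs := $page//div[@class='breadcrumb']//a   (: hypothetical path :)
                return
                    <page>
                        <title>{normalize-space($title)}</title>
                        <description>{normalize-space($desc)}</description>
                        <breadcrumb2>{normalize-space(data($crumbs[2]))}</breadcrumb2>
                        <breadcrumb3>{normalize-space(data($crumbs[3]))}</breadcrumb3>
                        <breadcrumb4>{normalize-space(data($crumbs[4]))}</breadcrumb4>
                    </page>
            ]]></xq-expression>
        </xquery>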

     
  • Malik

    Malik - 2012-06-27

    Hi,

    Yes, I did, and I also tried to modify the "Yahoo Shopping Example" from the
    site, but I was not able to get it working. I need the above-mentioned task done
    for the whole website.

     
  • Malik

    Malik - 2012-06-27

    This is the code I am using for a single page. It works, but I need it to work
    for all the pages of the website.

    <config charset="ISO-8859-1">

        <!-- download the page and convert it to well-formed XML -->
        <var-def name="wholepage">
            <html-to-xml>
                <http url="http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/"/>
            </html-to-xml>
        </var-def>

        <!-- run the XQuery over the downloaded page and write the result to a file -->
        <file action="write" path="fg/catalog.xml" charset="UTF-8">
            <loop item="item" index="i">
                <list><var name="wholepage"/></list>
                <body>
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            declare variable $item as node() external;

                            let $name := data($item//title)
                            let $desc := data($item//meta[@name='description']/@content)
                            return
                                <product>
                                    <name>{normalize-space($name)}</name>
                                    <description>{normalize-space($desc)}</description>
                                </product>
                        ]]></xq-expression>
                    </xquery>
                </body>
            </loop>
        </file>

    </config>

     
  • Malik

    Malik - 2012-06-27

    Now I am using this approach:

    <config charset="ISO-8859-1">

        <var-def name="startUrl">http://www.bhg.com/</var-def>

        <file action="write" path="asdasd/nytimes${sys.date()}.xml" charset="UTF-8">

            <template>
                <!-- opening tag of the output document, matching the
                     <newyourk_times date="..."> root element seen in the output -->
                <![CDATA[ <newyourk_times date="${sys.date()}"> ]]>
            </template>

            <loop item="articleUrl" index="i">
                <list>
                    <!-- article links taken from the home page; both XPaths come from
                         the New York Times sample and are unlikely to match bhg.com
                         markup, which would explain the empty output -->
                    <xpath expression="//div/h5/a/@href">
                        <html-to-xml>
                            <http url="${startUrl}"/>
                        </html-to-xml>
                    </xpath>
                    <xpath expression="//div/a/@href">
                        <html-to-xml>
                            <http url="${startUrl}"/>
                        </html-to-xml>
                    </xpath>
                </list>
                <body>
                    <xquery>
                        <xq-param name="doc">
                            <html-to-xml>
                                <!-- ?pagewanted=print is left over from the NYT sample
                                     and does nothing on bhg.com -->
                                <http url="${sys.fullUrl(startUrl, articleUrl)}?&amp;pagewanted=print"/>
                            </html-to-xml>
                        </xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;

                            let $text := data($doc//div)
                            return
                                <article>{$text}</article>
                        ]]></xq-expression>
                    </xquery>
                </body>
            </loop>

            <![CDATA[ </newyourk_times> ]]>

        </file>

    </config>

     
  • Malik

    Malik - 2012-06-27

    I even tried to replicate the New York Times news example, but when I run it I
    get empty output; only this appears:

    <newyourk_times date="27.06.2012"> </newyourk_times>

    Why is it not working?

    I copied the code from the website examples:

    http://web-harvest.sourceforge.net/samples.php?num=4

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Check the simple crawler example
    (http://web-harvest.sourceforge.net/samples.php?num=6), and then integrate it
    with the code that works for one site.

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Sorry, I meant for one page, not for the whole site. You can set up some kind of
    function which parses a URL and checks whether it is one you want. Then, when
    you have that URL and its content, you do the work you actually want, i.e.
    extract the data.
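
    A minimal sketch of that idea is below. The variable names (allLinks, pageUrl,
    wantedUrls) and the startsWith rule are only assumptions; allLinks stands for
    the list of hrefs you already collected with an xpath expression, as in the
    configs above:

        <var-def name="wantedUrls">
            <loop item="pageUrl" filter="unique">
                <!-- "allLinks" is a placeholder for the list of collected hrefs -->
                <list><var name="allLinks"/></list>
                <body>
                    <case>
                        <!-- keep only links that point into the site; anything else
                             produces nothing and is dropped from the list -->
                        <if condition='${pageUrl.toString().startsWith("http://www.bhg.com/")}'>
                            <var name="pageUrl"/>
                        </if>
                    </case>
                </body>
            </loop>
        </var-def>

    Each URL that survives this filter can then be downloaded and fed to the same
    xquery extraction that already works for a single page.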

     
  • Malik

    Malik - 2012-06-29

    Can you give me an example or some hint with code? I really can't understand
    what to do.

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-30

    Sorry, I just saw this. I don't have the full example on this computer, but I
    will send it to you tomorrow. Is that OK?

     
  • Malik

    Malik - 2012-07-02

    Yes! I will be really thankful to you.

     
  • Malik

    Malik - 2012-07-02

    I'll wait for your reply!

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-02

    OK, here is an example of a config file which I used to extract product
    information from product pages across a whole site.

    I will try to explain it briefly. In the first part of the config file I
    download the home page, which starts out in the unvisited list, then extract all
    links from it and add the wanted URLs to a newLinks list. After I have gone
    through the whole unvisited list, I add all the newLinks to it and repeat the
    process. In this way you can go through all the pages, and when you reach a link
    that you want to extract information from, you save it to a product list. In the
    other part of the config file I go through the product list and use XQuery to
    extract the information to XML. Sorry for my English, but I hope you understand.
    Analyze this config file and ask me if you have any questions. Kind regards!

    I couldn't paste it here, it is large, so I uploaded it to:

    http://www.sendspace.com/file/qua6hm
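
    A minimal sketch of that two-phase approach is below. It is not the uploaded
    file, just an illustration of the algorithm described above: the variable names
    (unvisited, newLinks, product) follow the description, while the URL test, the
    XPaths, and the output fields are placeholders to be adapted to the target site:

        <config charset="UTF-8">

            <!-- phase 1: breadth-first crawl -->
            <var-def name="unvisited">http://www.bhg.com/</var-def>
            <var-def name="visited"/>
            <var-def name="product"/>

            <!-- crawl at most 10 link levels; raise maxloops for a deeper crawl -->
            <while condition="${unvisited.toString().trim().length() != 0}" maxloops="10" index="depth">
                <var-def name="newLinks"/>

                <loop item="currUrl" index="i" filter="unique">
                    <list><var name="unvisited"/></list>
                    <body>
                        <!-- download and normalize the current page -->
                        <var-def name="content">
                            <html-to-xml>
                                <http url="${currUrl}"/>
                            </html-to-xml>
                        </var-def>

                        <!-- remember the URL if it is a page we want to extract from
                             (the startsWith rule is only a placeholder) -->
                        <case>
                            <if condition='${currUrl.toString().startsWith("http://www.bhg.com/gardening")}'>
                                <var-def name="product">
                                    <var name="product"/>
                                    <var name="currUrl"/>
                                </var-def>
                            </if>
                        </case>

                        <!-- collect the links for the next round -->
                        <var-def name="newLinks">
                            <var name="newLinks"/>
                            <loop item="link" filter="unique">
                                <list>
                                    <xpath expression="//a/@href"><var name="content"/></xpath>
                                </list>
                                <body>
                                    <template>${sys.fullUrl(currUrl, link)}</template>
                                </body>
                            </loop>
                        </var-def>

                        <!-- mark the current page as visited -->
                        <var-def name="visited">
                            <var name="visited"/>
                            <var name="currUrl"/>
                        </var-def>
                    </body>
                </loop>

                <!-- next round: everything found in this round (a real crawl, like the
                     uploaded config, must also drop URLs already in visited) -->
                <var-def name="unvisited">
                    <var name="newLinks"/>
                </var-def>
            </while>

            <!-- phase 2: go through the collected pages and extract the data -->
            <file action="write" path="output/pages.xml" charset="UTF-8">
                <loop item="pageUrl" index="j" filter="unique">
                    <list><var name="product"/></list>
                    <body>
                        <xquery>
                            <xq-param name="doc">
                                <html-to-xml><http url="${pageUrl}"/></html-to-xml>
                            </xq-param>
                            <xq-expression><![CDATA[
                                declare variable $doc as node() external;
                                (: title only as a placeholder; add the fields you need :)
                                let $title := data($doc//title)
                                return <page><title>{normalize-space($title)}</title></page>
                            ]]></xq-expression>
                        </xquery>
                    </body>
                </loop>
            </file>

        </config>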

     
  • Malik

    Malik - 2012-07-03

    Hi,

    I hope your day is going well.

    I tried your example and wanted to see the output. The first part works
    correctly: it creates the text files for the product, visited, and unvisited
    links.

    But then at list item 82 it freezes, and I never get to see the "petmeds" file
    because it never reaches the end.

    So my question:

    1. Do I have to execute the code all at once, or do I have to execute it in two
    parts?

     
  • Malik

    Malik - 2012-07-03

    This is what needs to be done:

    Harvest relevant data from http://www.bhg.com/:

    For each and every page of the site, crawl the following data (examples in
    brackets are from http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/):

    1) Page title (e.g. "Grow Your Own Apples")

    2) Meta Name Description (e.g. "It's fun and easy to grow apples in your own
    backyard. Follow the tips below to ensure beautifully grown apples!")

    3) Breadcrumb part 2 (e.g. "Gardening")

    4) Breadcrumb part 3 (e.g. "Edible Gardening")

    5) Breadcrumb part 4 (e.g. "Fruit")

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-04

    Well, crawling that site takes a long time, maybe 2-3 hours, because there are a
    lot of products to be crawled.

    You can change it to execute everything in one run: just modify it to extract
    the data as soon as it finds and downloads a product page (see the sketch after
    this post).

    I know what needs to be done, and it can be done using the config that I sent
    you, but you have to modify it yourself, because I don't have time to analyze
    that site, its links, the XPaths to its elements, etc.
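
    Roughly, "extract while crawling" means running the extraction inside the crawl
    loop instead of only remembering the URL. Using the variable names from the
    crawl sketch a few posts up (currUrl, content, depth, i), the product branch
    could be changed to something like this, writing one small file per matching
    page:

        <case>
            <!-- placeholder rule and fields; adapt the condition and the query -->
            <if condition='${currUrl.toString().startsWith("http://www.bhg.com/gardening")}'>
                <file action="write" path="output/page${depth}_${i}.xml" charset="UTF-8">
                    <xquery>
                        <xq-param name="doc"><var name="content"/></xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;
                            let $title := data($doc//title)
                            return <page><title>{normalize-space($title)}</title></page>
                        ]]></xq-expression>
                    </xquery>
                </file>
            </if>
        </case>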

     
  • Malik

    Malik - 2012-07-04

    Thanks a lot!

    I will try to implement it and make some changes based on your code. Then I will
    let you know. I really appreciate you and your team, and the effort and time you
    give to my queries.

     
  • Malik

    Malik - 2012-07-04

    Hi,

    I am trying your example, but when the iteration reaches loop 82 it freezes. I
    waited for 5-6 hours, but it stays stuck at loop 82. Are you able to run the
    code to completion?

    Do I have to change anything? I am trying to run your code so I can see the
    results and then update it according to my requirements.

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-05

    You can uncomment the following line to do a partial crawl and see some results:

    if(subCats.size()<3) - instead of 3 you can put 5; it depends on how long you
    want to wait.

    Regarding the freezing, I suppose the problem is the Web-Harvest heap space, so
    if you run it from the command line and set -Xmx1024M and -Xms1024M, that
    problem probably won't appear. I didn't run my crawl using Web-Harvest, so I
    forgot to tell you about that problem.
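
    For reference, those heap options go on the java command line when launching
    Web-Harvest. In the line below, the jar name and the config/workdir parameters
    are assumptions to check against the Web-Harvest command-line documentation; the
    -Xms/-Xmx flags are the standard JVM heap settings mentioned above:

        java -Xms1024M -Xmx1024M -jar webharvest.jar config=crawler-config.xml workdir=output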

     
