
I need help regarding extraction of data

Help · Malik · 2012-06-27 to 2012-09-04
  • Malik

    Malik - 2012-06-27

    Harvest relevant data from http://www.bhg.com/:

    For each and every page of the site, crawl the following data (examples in
    brackets are from http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/):

    1) Page title (e.g. "Grow Your Own Apples")

    2) Meta Name Description (e.g. "It's fun and easy to grow apples in your own
    backyard. Follow the tips below to ensure beautifully grown apples!")

    3) Breadcrumb part 2 (e.g. "Gardening")

    4) Breadcrumb part 3 (e.g. "Edible Gardening")

    5) Breadcrumb part 4 (e.g. "Fruit")

     
  • Malik

    Malik - 2012-06-27

    I have to extract this from all pages of the site. How can it be done?

    Kind regards,

    Malik

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Go through the example configuration files, and also take a look at how to use
    XQuery. You just need to use the correct paths to those elements and make some
    minor changes to one of the example configurations; I don't know exactly which
    one. Take a look at them on the official website.
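
    For example, extracting the five fields listed above could look roughly like the
    sketch below. It assumes the page has already been downloaded and converted with
    html-to-xml into a variable (here called wholepage). //title and
    //meta[@name='description'] are standard HTML, but the breadcrumb path (the div
    class name) is only a guess and has to be checked against the real bhg.com
    markup:

        <xquery>
            <!-- "wholepage" is assumed to hold the page after html-to-xml -->
            <xq-param name="page" type="node()"><var name="wholepage"/></xq-param>
            <xq-expression><![CDATA[
                declare variable $page as node() external;

                let $title  := data($page//title)
                let $desc   := data($page//meta[@name='description']/@content)
                let $crumbs := $page//div[@class='breadcrumb']//a   (: hypothetical path :)
                return
                    <page>
                        <title>{normalize-space($title)}</title>
                        <description>{normalize-space($desc)}</description>
                        <breadcrumb2>{normalize-space(data($crumbs[2]))}</breadcrumb2>
                        <breadcrumb3>{normalize-space(data($crumbs[3]))}</breadcrumb3>
                        <breadcrumb4>{normalize-space(data($crumbs[4]))}</breadcrumb4>
                    </page>
            ]]></xq-expression>
        </xquery>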

     
  • Malik

    Malik - 2012-06-27

    Hi,

    Yes, I did, and I also tried to modify the "Yahoo Shopping Example" from the
    site, but I was not able to get it working. I need the above-mentioned task done
    for the whole website.

     
  • Malik

    Malik - 2012-06-27

    This is the code I am using for a single page. It works, but I need it to work
    for all the pages of the website.

    <config charset="ISO-8859-1">

        <!-- download the page and convert it to well-formed XML -->
        <var-def name="wholepage">
            <html-to-xml>
                <http url="http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/"/>
            </html-to-xml>
        </var-def>

        <!-- run the XQuery over the downloaded page and write the result to a file -->
        <file action="write" path="fg/catalog.xml" charset="UTF-8">
            <loop item="item" index="i">
                <list><var name="wholepage"/></list>
                <body>
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            declare variable $item as node() external;

                            let $name := data($item//title)
                            let $desc := data($item//meta[@name='description']/@content)
                            return
                                <product>
                                    <name>{normalize-space($name)}</name>
                                    <description>{normalize-space($desc)}</description>
                                </product>
                        ]]></xq-expression>
                    </xquery>
                </body>
            </loop>
        </file>

    </config>

     
  • Malik

    Malik - 2012-06-27

    Now I am using this approach:

    <config charset="ISO-8859-1">

        <var-def name="startUrl">http://www.bhg.com/</var-def>

        <file action="write" path="asdasd/nytimes${sys.date()}.xml" charset="UTF-8">

            <template>
                <!-- opening tag of the output document, matching the
                     <newyourk_times date="..."> root element seen in the output -->
                <![CDATA[ <newyourk_times date="${sys.date()}"> ]]>
            </template>

            <loop item="articleUrl" index="i">
                <list>
                    <!-- article links taken from the home page; both XPaths come from
                         the New York Times sample and are unlikely to match bhg.com
                         markup, which would explain the empty output -->
                    <xpath expression="//div/h5/a/@href">
                        <html-to-xml>
                            <http url="${startUrl}"/>
                        </html-to-xml>
                    </xpath>
                    <xpath expression="//div/a/@href">
                        <html-to-xml>
                            <http url="${startUrl}"/>
                        </html-to-xml>
                    </xpath>
                </list>
                <body>
                    <xquery>
                        <xq-param name="doc">
                            <html-to-xml>
                                <!-- ?pagewanted=print is left over from the NYT sample
                                     and does nothing on bhg.com -->
                                <http url="${sys.fullUrl(startUrl, articleUrl)}?&amp;pagewanted=print"/>
                            </html-to-xml>
                        </xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;

                            let $text := data($doc//div)
                            return
                                <article>{$text}</article>
                        ]]></xq-expression>
                    </xquery>
                </body>
            </loop>

            <![CDATA[ </newyourk_times> ]]>

        </file>

    </config>

     
  • Malik

    Malik - 2012-06-27

    I even tried to replicate the New York Times news example, but when I run it I
    get empty output; only this appears:

    <newyourk_times date="27.06.2012"> </newyourk_times>

    Why is it not working?

    I copied the code from the website examples:

    http://web-harvest.sourceforge.net/samples.php?num=4

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Check the simple crawler example
    (http://web-harvest.sourceforge.net/samples.php?num=6), and then integrate it
    with the code that works for one site.

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-27

    Sorry, I meant for one page, not for the whole site. You can set up some kind of
    function which parses a URL and checks whether it is one you want. Then, when
    you have that URL and its content, you do the work you actually want, i.e.
    extract the data.
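
    A minimal sketch of that idea is below. The variable names (allLinks, pageUrl,
    wantedUrls) and the startsWith rule are only assumptions; allLinks stands for
    the list of hrefs you already collected with an xpath expression, as in the
    configs above:

        <var-def name="wantedUrls">
            <loop item="pageUrl" filter="unique">
                <!-- "allLinks" is a placeholder for the list of collected hrefs -->
                <list><var name="allLinks"/></list>
                <body>
                    <case>
                        <!-- keep only links that point into the site; anything else
                             produces nothing and is dropped from the list -->
                        <if condition='${pageUrl.toString().startsWith("http://www.bhg.com/")}'>
                            <var name="pageUrl"/>
                        </if>
                    </case>
                </body>
            </loop>
        </var-def>

    Each URL that survives this filter can then be downloaded and fed to the same
    xquery extraction that already works for a single page.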

     
  • Malik

    Malik - 2012-06-29

    Can you give me an example or some hint with code? I really can't understand
    what to do.

     
  • Selvin Fehric

    Selvin Fehric - 2012-06-30

    Sorry, I just saw this. I don't have the full example on this computer, but I
    will send it to you tomorrow. Is that OK?

     
  • Malik

    Malik - 2012-07-02

    Yes! I will be really thankful to you.

     
  • Malik

    Malik - 2012-07-02

    I'll wait for your reply!

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-02

    OK, here is an example of a config file which I used to extract product
    information from product pages across a whole site.

    I will try to explain it briefly. In the first part of the config file I
    download the home page, which starts out in the unvisited list, then extract all
    links from it and add the wanted URLs to a newLinks list. After I have gone
    through the whole unvisited list, I add all the newLinks to it and repeat the
    process. In this way you can go through all the pages, and when you reach a link
    that you want to extract information from, you save it to a product list. In the
    other part of the config file I go through the product list and use XQuery to
    extract the information to XML. Sorry for my English, but I hope you understand.
    Analyze this config file and ask me if you have any questions. Kind regards!

    I couldn't paste it here, it is large, so I uploaded it to:

    http://www.sendspace.com/file/qua6hm
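
    A minimal sketch of that two-phase approach is below. It is not the uploaded
    file, just an illustration of the algorithm described above: the variable names
    (unvisited, newLinks, product) follow the description, while the URL test, the
    XPaths, and the output fields are placeholders to be adapted to the target site:

        <config charset="UTF-8">

            <!-- phase 1: breadth-first crawl -->
            <var-def name="unvisited">http://www.bhg.com/</var-def>
            <var-def name="visited"/>
            <var-def name="product"/>

            <!-- crawl at most 10 link levels; raise maxloops for a deeper crawl -->
            <while condition="${unvisited.toString().trim().length() != 0}" maxloops="10" index="depth">
                <var-def name="newLinks"/>

                <loop item="currUrl" index="i" filter="unique">
                    <list><var name="unvisited"/></list>
                    <body>
                        <!-- download and normalize the current page -->
                        <var-def name="content">
                            <html-to-xml>
                                <http url="${currUrl}"/>
                            </html-to-xml>
                        </var-def>

                        <!-- remember the URL if it is a page we want to extract from
                             (the startsWith rule is only a placeholder) -->
                        <case>
                            <if condition='${currUrl.toString().startsWith("http://www.bhg.com/gardening")}'>
                                <var-def name="product">
                                    <var name="product"/>
                                    <var name="currUrl"/>
                                </var-def>
                            </if>
                        </case>

                        <!-- collect the links for the next round -->
                        <var-def name="newLinks">
                            <var name="newLinks"/>
                            <loop item="link" filter="unique">
                                <list>
                                    <xpath expression="//a/@href"><var name="content"/></xpath>
                                </list>
                                <body>
                                    <template>${sys.fullUrl(currUrl, link)}</template>
                                </body>
                            </loop>
                        </var-def>

                        <!-- mark the current page as visited -->
                        <var-def name="visited">
                            <var name="visited"/>
                            <var name="currUrl"/>
                        </var-def>
                    </body>
                </loop>

                <!-- next round: everything found in this round (a real crawl, like the
                     uploaded config, must also drop URLs already in visited) -->
                <var-def name="unvisited">
                    <var name="newLinks"/>
                </var-def>
            </while>

            <!-- phase 2: go through the collected pages and extract the data -->
            <file action="write" path="output/pages.xml" charset="UTF-8">
                <loop item="pageUrl" index="j" filter="unique">
                    <list><var name="product"/></list>
                    <body>
                        <xquery>
                            <xq-param name="doc">
                                <html-to-xml><http url="${pageUrl}"/></html-to-xml>
                            </xq-param>
                            <xq-expression><![CDATA[
                                declare variable $doc as node() external;
                                (: title only as a placeholder; add the fields you need :)
                                let $title := data($doc//title)
                                return <page><title>{normalize-space($title)}</title></page>
                            ]]></xq-expression>
                        </xquery>
                    </body>
                </loop>
            </file>

        </config>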

     
  • Malik

    Malik - 2012-07-03

    Hi,

    I hope your day is going well.

    I tried your example and wanted to see the output. The first part works
    correctly: it creates the text files for the product, visited, and unvisited
    links.

    But then at list item 82 it freezes, and I never get to see the "petmeds" file
    because it never reaches the end.

    So my question:

    1. Do I have to execute the code all at once, or do I have to execute it in two
    parts?

     
  • Malik

    Malik - 2012-07-03

    This is what needs to be done:

    Harvest relevant data from http://www.bhg.com/:

    For each and every page of the site, crawl the following data (examples in
    brackets are from http://www.bhg.com/gardening/vegetable/fruit/grow-your-own-apples/):

    1) Page title (e.g. "Grow Your Own Apples")

    2) Meta Name Description (e.g. "It's fun and easy to grow apples in your own
    backyard. Follow the tips below to ensure beautifully grown apples!")

    3) Breadcrumb part 2 (e.g. "Gardening")

    4) Breadcrumb part 3 (e.g. "Edible Gardening")

    5) Breadcrumb part 4 (e.g. "Fruit")

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-04

    Well, crawling that site takes a long time, maybe 2-3 hours, because there are a
    lot of products to be crawled.

    You can change it to execute everything in one run: just modify it to extract
    the data as soon as it finds and downloads a product page (see the sketch after
    this post).

    I know what needs to be done, and it can be done using the config that I sent
    you, but you have to modify it yourself, because I don't have time to analyze
    that site, its links, the XPaths to its elements, etc.
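
    Roughly, "extract while crawling" means running the extraction inside the crawl
    loop instead of only remembering the URL. Using the variable names from the
    crawl sketch a few posts up (currUrl, content, depth, i), the product branch
    could be changed to something like this, writing one small file per matching
    page:

        <case>
            <!-- placeholder rule and fields; adapt the condition and the query -->
            <if condition='${currUrl.toString().startsWith("http://www.bhg.com/gardening")}'>
                <file action="write" path="output/page${depth}_${i}.xml" charset="UTF-8">
                    <xquery>
                        <xq-param name="doc"><var name="content"/></xq-param>
                        <xq-expression><![CDATA[
                            declare variable $doc as node() external;
                            let $title := data($doc//title)
                            return <page><title>{normalize-space($title)}</title></page>
                        ]]></xq-expression>
                    </xquery>
                </file>
            </if>
        </case>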

     
  • Malik

    Malik - 2012-07-04

    Thanks a lot!

    I will try to implement it and make some changes based on your code. Then I will
    let you know. I really appreciate you and your team, and the effort and time you
    give to my queries.

     
  • Malik

    Malik - 2012-07-04

    Hi,

    I am trying your example, but when the iteration reaches loop 82 it freezes. I
    waited for 5-6 hours, but it stays stuck at loop 82. Are you able to run the
    code to completion?

    Do I have to change anything? I am trying to run your code so I can see the
    results and then update it according to my requirements.

     
  • Selvin Fehric

    Selvin Fehric - 2012-07-05

    You can uncomment the following line to do a partial crawl and see some results:

    if(subCats.size()<3) - instead of 3 you can put 5; it depends on how long you
    want to wait.

    Regarding the freezing, I suppose the problem is the Web-Harvest heap space, so
    if you run it from the command line and set -Xmx1024M and -Xms1024M, that
    problem probably won't appear. I didn't run my crawl using Web-Harvest, so I
    forgot to tell you about that problem.
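
    For reference, those heap options go on the java command line when launching
    Web-Harvest. In the line below, the jar name and the config/workdir parameters
    are assumptions to check against the Web-Harvest command-line documentation; the
    -Xms/-Xmx flags are the standard JVM heap settings mentioned above:

        java -Xms1024M -Xmx1024M -jar webharvest.jar config=crawler-config.xml workdir=output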

     
