#10 wizard for creating data mining/harvesters

Backlog
open
nobody
5
2014-08-16
2010-11-01
Anonymous
No

It would be useful if the GUI had support for a wizard to create new data mining applications, so that the GUI could generate the corresponding source code wrappers and then parameterize them.

Also, ideally there would be another tab showing the HTML source code to be scraped, so that the page source and the configuration can be viewed side by side.
I think this is important because you will inevitably need to look at a page's source code when creating a new configuration.

Another view could then show the rendered HTML in a browser-like tab, where the user could mark elements for scraping and assign each marked element to a corresponding WebHarvest variable.

It would then be possible to automatically derive the correct markup for extracting certain elements, just by feeding in 2-3 different HTML files and marking the relevant output in each. The correct XPath expressions for extracting the data could then be created automatically.

Thanks for considering this request

Discussion

  • Nobody/Anonymous

    regarding automated field identification, this approach is supported and used by templatemaker:
    http://code.google.com/p/templatemaker/

     
  • Nobody/Anonymous

    templatemaker uses pattern matching for deriving the similarities between different versions of an input source:

    t1 = "green"
    t2 = "red"
    t3 = "blue"
    t.learn(t1)
    t.learn(t2)
    t.learn(t3)

    When all three inputs are used to 'teach' the same template processor instance, the processor knows that it needs to extract all content in between the tags. The very same technique can also be used for complex HTML code and nested variables, so that pattern matching (in the functional programming sense) can be used to populate variables in a very powerful fashion.
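
The learning step described above can be sketched in a few lines. This is not templatemaker's actual implementation or API: `SingleHoleTemplate` is a hypothetical class that only handles one variable region, found via the common prefix and suffix of all learned samples. The tagged inputs (`<b>green</b>` and so on) are also an assumption, since the sample above shows the values without any surrounding markup.

```python
import os
import re


class SingleHoleTemplate:
    """Hypothetical, simplified learner: one variable region per template."""

    def __init__(self):
        self.samples = []

    def learn(self, text):
        # Remember every sample; the template is derived from all of them.
        self.samples.append(text)

    def _fixed_parts(self):
        # Text shared by every sample at the start ...
        prefix = os.path.commonprefix(self.samples)
        # ... and at the end. The prefix is cut off first so that the
        # two fixed parts can never overlap.
        rests = [s[len(prefix):] for s in self.samples]
        suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
        return prefix, suffix

    def as_text(self, hole="{}"):
        prefix, suffix = self._fixed_parts()
        return prefix + hole + suffix

    def extract(self, text):
        # Turn the learned template into a regex with one capture group.
        prefix, suffix = self._fixed_parts()
        pattern = re.escape(prefix) + "(.*)" + re.escape(suffix)
        match = re.fullmatch(pattern, text, re.DOTALL)
        if match is None:
            raise ValueError("text does not fit the learned template")
        return match.group(1)


t = SingleHoleTemplate()
t.learn("<b>green</b>")
t.learn("<b>red</b>")
t.learn("<b>blue</b>")
learned = t.as_text()
value = t.extract("<b>yellow</b>")
```

After the three `learn` calls, `learned` holds the fixed markup with a single hole, and `extract` pulls the variable part out of a document the learner has not seen before. Handling several holes or nested variables would need a longest-common-substring style algorithm instead of the prefix/suffix shortcut used here.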

     
  • Nobody/Anonymous

    to make this useful for WebHarvest, one could think about a way of inlining variable extraction, i.e. using sprintf-style format strings:

    r = html.match("$desc", url, title, desc);

    The variables url, title, desc would then be populated accordingly.
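
A minimal sketch of how such inline extraction could work, assuming `$name` placeholders inside a literal markup pattern. `match_template` and the example anchor markup are hypothetical and not part of WebHarvest; the function returns a dictionary rather than filling caller variables, because that is the idiomatic way to return several named values in Python.

```python
import re


def match_template(template, text):
    """Match text against a template containing $name placeholders.

    Returns a dict mapping placeholder names to the extracted strings,
    or None if the text does not fit the template.
    """
    # re.split with one capture group yields literal text at even
    # indices and placeholder names at odd indices.
    parts = re.split(r"\$([A-Za-z_]\w*)", template)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)           # literal markup
        else:
            regex += "(?P<%s>.*?)" % part      # lazy named group
    match = re.fullmatch(regex, text, re.DOTALL)
    return match.groupdict() if match else None


html = '<a href="http://example.com/" title="Example">An example link</a>'
r = match_template('<a href="$url" title="$title">$desc</a>', html)
```

Each placeholder name may appear only once per template in this sketch, since Python rejects duplicate named groups.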

     
  • Nobody/Anonymous

    There is a Windows freeware tool that does most of this (and actually a lot more): you can create scraping robots using a GUI wizard, where you mark certain elements (pieces of text) and assign them to a variable.
    Your actions are recorded by the program all the while, and the program in turn creates a scripted robot that repeats exactly what you have done in an automated fashion.

    It supports multiple actions (such as extract data, extract table, extract table row, etc.).

    "IRobot (named for Internet Robot) is a visual automation tool to create robot agents, or irobots, for Web data collection. An irobot agent is able to navigate Web sites, fill in Web forms, extract Web data and compute and integrate Web data with local databases. Using the user-friendly interfaces, you don't need to have programming skills to create irobots; but with some programming skills, you can create more powerful irobots. IRobot is the ultimate Web automation tool you would need to analyze and aggregate data from the Web"

    http://irobotsoft.com/

    It also comes with extensive docs:
    http://www.irobotsoft.com/help/irobot-manual.pdf
    http://irobotsoft.com/help/irobot-manual-advanced.pdf

     
  • Nobody/Anonymous

    FWIW, the coolest visual/GUI solution that I have seen for creating scraping bots is Mozenda, but it's very pricey: http://www.mozenda.com/

    Make sure to check out the demo video or even the trial; it's a VERY compelling approach and product.

     
  • Nobody/Anonymous

    Marking a piece of rendered HTML and then coming up with the underlying HTML source code is possible: it involves determining the bounding rectangle and then doing a DOM lookup. Basically, you can do a reverse XPath lookup, where you feed in the DOMs of several valid input documents and "mark" the relevant area of interest in each. The proper XPath expression for producing the marked output would then be the greatest common denominator, i.e. the most specific expression that still works for all provided inputs.
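
The reverse lookup can be sketched with Python's standard `xml.etree.ElementTree`, assuming well-formed (XHTML-like) input and assuming the marked elements are already known as DOM nodes; the bounding-rectangle step is left out. `indexed_path`, `generalize` and the two sample documents are hypothetical names and data for illustration only.

```python
import xml.etree.ElementTree as ET


def indexed_path(root, target):
    """Return [(tag, position), ...] leading from root down to target.

    position is 1-based among siblings with the same tag.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node) + 1))
        node = parent
    steps.reverse()
    return steps


def generalize(paths):
    """Keep a position index only where all marked paths agree on it."""
    first = paths[0]
    tags = [tag for tag, _ in first]
    if any([tag for tag, _ in p] != tags for p in paths):
        raise ValueError("marked elements do not share a tag path")
    steps = []
    for i, (tag, pos) in enumerate(first):
        if all(p[i][1] == pos for p in paths):
            steps.append("%s[%d]" % (tag, pos))
        else:
            steps.append(tag)  # index differs between inputs: drop it
    return "./" + "/".join(steps)


doc_a = ET.fromstring(
    "<html><body><div><p>ad</p></div>"
    "<div><h1>Title A</h1><span>10 EUR</span></div></body></html>")
doc_b = ET.fromstring(
    "<html><body><div><h1>Title B</h1><span>12 EUR</span></div></body></html>")

# Simulate the user's marking by picking the element with the marked text.
mark_a = next(e for e in doc_a.iter() if e.text == "10 EUR")
mark_b = next(e for e in doc_b.iter() if e.text == "12 EUR")

xpath = generalize([indexed_path(doc_a, mark_a), indexed_path(doc_b, mark_b)])
prices = [e.text for doc in (doc_a, doc_b) for e in doc.findall(xpath)]
```

The resulting expression is relative to the root element and keeps an index wherever the inputs agree, so it is as specific as the samples allow. Real pages would first need an HTML parser that tolerates malformed markup, and more samples reduce the chance of an expression that only fits by coincidence.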

     
  • Nobody/Anonymous

    And to add even more to this: you can already do this with Firefox and a number of XPath-related add-ons. For example, XPather + DOM Inspector lets you mark a piece of text (rendered HTML or HTML code) and then tells you the proper XPath expression.

     
  • Nobody/Anonymous

    I second this request; I would also find this very useful!

     
  • Robert Bala

    Robert Bala - 2012-09-21
    • milestone: --> Backlog
     
  • Anonymous

    Anonymous - 2014-06-11
    Post awaiting moderation.
