#10 Wizard for creating data miners/harvesters

Milestone: Backlog

Status: open

Owner: nobody

Labels: Interface Improvements (example) (2)

Priority: 5

Updated: 2014-08-16

Created: 2010-11-01

Creator: Anonymous

Private: No

It would be useful if the GUI had support for a wizard to create new data mining applications, so that the GUI could be used to generate the corresponding source code wrappers and then parameterize them.

Also, ideally there would be another tab showing the HTML source code to be scraped. This would provide a good way of viewing the page source and the configuration side by side.
I think this is important because you will inevitably need to look at a page's source code when creating a new configuration.

Another view could then show the rendered HTML in a browser-like tab, where the user could mark elements for scraping and assign each marked element to a corresponding WebHarvest variable.

It would then be possible to derive the correct markup for extracting certain elements automatically, just by feeding in 2-3 different HTML files and marking the relevant output. The correct XPath expressions for extracting the data could then be generated automatically.

Thanks for considering this request.

Discussion

Nobody/Anonymous - 2010-11-01

Regarding automated field identification: this approach is supported and used by templatemaker:
http://code.google.com/p/templatemaker/

Nobody/Anonymous - 2010-11-01

A web-based query language such as YQL would also seem relevant here: http://developer.yahoo.com/yql/

Nobody/Anonymous - 2010-11-01

templatemaker uses pattern matching for deriving the similarities between different versions of an input source:

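from templatemaker import Template  # assumes the templatemaker package is installed

t = Template()  # the template instance that the calls below rely on
t1 = "green"
t2 = "red"
t3 = "blue"
t.learn(t1)
t.learn(t2)
t.learn(t3)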

When all three inputs are used to 'teach' the same template processor instance, the processor knows that it needs to extract all content in between the tags. The very same technique can also be used for complex HTML code and nested variables, so that pattern matching (in the functional programming sense) can be used to populate variables in a very powerful fashion.
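
To illustrate the idea, here is a minimal sketch in plain Python. It is not templatemaker's actual algorithm: it only keeps the longest prefix and suffix shared by all samples and treats everything in between as a single variable. The `<b>` tags in the sample strings and the names `learn_template` and `extract` are assumptions made up for this example.

```python
import os


def learn_template(samples):
    """Return the (prefix, suffix) shared by all samples."""
    prefix = os.path.commonprefix(samples)
    # Strip the prefix first so that the suffix can never overlap it.
    rests = [s[len(prefix):] for s in samples]
    # The common suffix is the common prefix of the reversed strings.
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    return prefix, suffix


def extract(template, text):
    """Return the variable part of text, or None if text does not fit."""
    prefix, suffix = template
    if len(text) < len(prefix) + len(suffix):
        return None
    if not (text.startswith(prefix) and text.endswith(suffix)):
        return None
    return text[len(prefix):len(text) - len(suffix)]


# Hypothetical sample inputs: the same markup around different values.
template = learn_template(["<b>green</b>", "<b>red</b>", "<b>blue</b>"])
print(template)
print(extract(template, "<b>yellow</b>"))
```

A real tool would have to handle several variables per document and samples whose values happen to share leading or trailing characters, which this sketch does not attempt.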

Nobody/Anonymous - 2010-11-01

to make this useful for WebHarvest, one could think about a way of inlining variable extraction, i.e. using sprintf-style format strings:

r = html.match("$desc", url, title, desc);

The variables url, title, desc would then be populated accordingly.
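
As a rough sketch of how such inline extraction could work, the pattern can be translated into a regular expression with one named group per `$name` placeholder. This is plain Python, not WebHarvest code; the function names `compile_pattern` and `match_pattern`, the link markup, and the example URL are all assumptions for illustration.

```python
import re


def compile_pattern(pattern):
    """Turn a pattern with $name placeholders into a regex with named groups."""
    # With one capture group, re.split alternates literal text and placeholder names.
    parts = re.split(r"\$([A-Za-z_]\w*)", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)         # literal text must match exactly
        else:
            regex += "(?P<%s>.*?)" % part    # placeholder: shortest possible match
    return re.compile(regex, re.DOTALL)


def match_pattern(pattern, text):
    """Return a dict of extracted variables, or None if nothing matches."""
    m = compile_pattern(pattern).search(text)
    return m.groupdict() if m else None


# Hypothetical input and pattern.
html = '<p><a href="http://example.com/" title="Example">An example link</a></p>'
r = match_pattern('<a href="$url" title="$title">$desc</a>', html)
print(r)
```

Returning a dictionary instead of filling in the passed variables is a design choice of this sketch, since Python cannot assign to the caller's variables. Note that a placeholder at the very end of a pattern would match the empty string, because the groups are non-greedy.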

Nobody/Anonymous - 2010-11-02

There is a Windows freeware tool that does most of this (and actually a lot more). You can create scraping robots using a GUI wizard, where you mark certain elements (pieces of text) and assign them to variables.
All the while, your actions are recorded by the program, which in turn creates a scripted robot that repeats exactly what you did in an automated fashion.

It supports multiple actions (such as extract data, extract table, extract table row, etc.).

"IRobot (named for Internet Robot) is a visual automation tool to create robot agents, or irobots, for Web data collection. An irobot agent is able to navigate Web sites, fill in Web forms, extract Web data and compute and integrate Web data with local databases. Using the user-friendly interfaces, you don't need to have programming skills to create irobots; but with some programming skills, you can create more powerful irobots. IRobot is the ultimate Web automation tool you would need to analyze and aggregate data from the Web"

http://irobotsoft.com/

It also comes with extensive docs:
http://www.irobotsoft.com/help/irobot-manual.pdf
http://irobotsoft.com/help/irobot-manual-advanced.pdf

Nobody/Anonymous - 2010-11-02

FWIW, the coolest visual/GUI solution that I have seen for creating scraping bots is Mozenda, but it's very pricey: http://www.mozenda.com/

Make sure to check out the demo video or even the trial. It's a VERY compelling approach and product.

Nobody/Anonymous - 2010-11-03

Marking a piece of rendered HTML and then finding the underlying HTML source code is possible: it involves determining the bounding rectangle of the selection and then doing a DOM lookup. Basically, you can do a reverse XPath lookup, where you feed in the DOMs of several valid input documents and "mark" the relevant area of interest in each. The proper XPath expression for producing the marked output would then be the greatest common denominator that works for all provided inputs.
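
A minimal sketch of that "greatest common denominator" step, using only Python's standard library. It assumes the inputs are well-formed XML (real-world HTML would need a tolerant parser first) and that the marked node has already been located in each DOM. The helper names `steps_to` and `common_xpath` and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def steps_to(root, target):
    """Return [(tag, position), ...] leading from root down to target."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node) + 1))  # XPath positions are 1-based
        node = parent
    steps.reverse()
    return steps


def common_xpath(paths):
    """Merge the step lists of several documents into one relative XPath."""
    if len({len(p) for p in paths}) != 1:
        return None  # different depths: this simple sketch gives up
    parts = ["."]
    for steps in zip(*paths):
        tags = {tag for tag, _ in steps}
        positions = {pos for _, pos in steps}
        if len(tags) != 1:
            return None  # different element names at the same depth
        tag = tags.pop()
        # Keep the position only where every document agrees on it.
        parts.append("%s[%d]" % (tag, positions.pop()) if len(positions) == 1 else tag)
    return "/".join(parts)


# Two hypothetical versions of a page; the first has an extra <div>.
doc_a = ET.fromstring("<html><body><div>ad</div><div><span>Price: 10</span></div></body></html>")
doc_b = ET.fromstring("<html><body><div><span>Price: 12</span></div></body></html>")
marked = [(doc, doc.find(".//span")) for doc in (doc_a, doc_b)]
xpath = common_xpath([steps_to(doc, node) for doc, node in marked])
for doc, _ in marked:
    print(xpath, [e.text for e in doc.findall(xpath)])
```

Dropping a position predicate makes the expression more general, so it may also match elements that were not marked; a real implementation would want to verify the result against every input document.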

Nobody/Anonymous - 2010-11-03

I forgot: please also look up "reverse XPath" (Google it):

http://blogs.sun.com/rajeshthekkadath/entry/reverse_xpath_finding_the_xpath

Nobody/Anonymous - 2010-11-03

And to add even more to this: you can already do this with Firefox and a number of XPath-related add-ons. For example, XPather + DOM Inspector allows you to mark a piece of text (rendered HTML or HTML code), and it will then tell you the proper XPath expression.

Nobody/Anonymous - 2011-08-30

I second this request; I would also find this very useful!

Robert Bala - 2012-09-21

milestone: --> Backlog