The initial goal of this project was to grab vacancies and resumes from websites.
But you can use Spider to grab any kind of information from web pages, e.g. emails or contact info,
and persist it to your database.
You can freely use the source code in your projects, but do not delete the header comments.
Spider uses the 'config\resources' folder, which holds one configuration per website.
For each such configuration Spider creates a Java thread.
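The one-thread-per-configuration idea can be sketched roughly as follows. This is only an illustration and not the actual Spider code: the class name ConfigThreads, the method createWorkers and the printed message are hypothetical, and it assumes each website configuration is a subfolder of the configuration root.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ConfigThreads {

    /** Returns one (not yet started) worker thread per subfolder of the configuration root. */
    static List<Thread> createWorkers(File configRoot) {
        List<Thread> workers = new ArrayList<>();
        File[] siteDirs = configRoot.listFiles(File::isDirectory);
        if (siteDirs == null) {
            return workers; // root is missing or is not a directory
        }
        for (File siteDir : siteDirs) {
            // Hypothetical task: a real worker would load the site's
            // .properties files here and start grabbing pages.
            Runnable task = () ->
                    System.out.println("grabbing with configuration " + siteDir.getName());
            workers.add(new Thread(task, "spider-" + siteDir.getName()));
        }
        return workers;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> workers = createWorkers(new File("config/resources"));
        for (Thread t : workers) {
            t.start();
        }
        for (Thread t : workers) {
            t.join(); // wait until every website has been processed
        }
    }
}
```

Plain files in the root folder are ignored, so only folders such as rabotaua would get a thread in this sketch.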
To create a configuration for a website you need the following files.
(Please see the working example for a Russian website in the folder config/resources/rabotaua, attached to the sources.)
1) categories.properties - maps website category names to the IDs of your database categories.
2) engine.properties - with the properties:
2.1) resource.driver - driver class (in some cases you can extend js.spider.impl.driver.ResourceDriver)
2.2) resource.name - website name
2.3) resource.encoding - website encoding
2.4) resource.vac.href - template for vacancy page href
2.5) resource.res.href - template string for resume page href
2.6) resource.grabAttempts - number of empty pages after which Spider stops grabbing the resource
2.7) enabled - (true/false) whether the configuration is enabled
2.8) showInfo - log all information about each resume/vacancy
3) xpaths.properties - xpaths to all fields of a resume/vacancy. Each xpath is specified with the following flags:
3.1) vac/res/comm - xpath for vacancy/resume/both
3.2) man - mandatory field; Spider skips the page if this xpath returns nothing
3.3) a, b, c, d ... - precedence flags; Spider concatenates the resulting strings
3.4) the last flag is the name of the vacancy/resume field, e.g. title, description, email
4) state.properties - Spider reads and stores the state of the last processing run here. Before the first run you should specify the initial IDs.
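To illustrate how engine.properties could be read and how resource.grabAttempts might be applied, here is a minimal sketch. It is not the actual Spider code: the class EngineConfigSketch, the methods parse and grab, and all sample values are invented for illustration. It also assumes that pages are addressed by increasing numeric IDs and that the limit counts empty pages in a row.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.function.IntPredicate;

public class EngineConfigSketch {

    /** Parses engine.properties content; the keys are the ones listed above. */
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    /**
     * Walks IDs upward from startId and stops after grabAttempts empty pages in a row.
     * Returns the last ID that produced a non-empty page, or startId - 1 if none did.
     * Counting only consecutive empty pages is an assumption of this sketch.
     */
    static int grab(int startId, int grabAttempts, IntPredicate pageHasContent) {
        int lastGood = startId - 1;
        int emptyInRow = 0;
        for (int id = startId; emptyInRow < grabAttempts; id++) {
            if (pageHasContent.test(id)) {
                lastGood = id;
                emptyInRow = 0; // a non-empty page resets the counter
            } else {
                emptyInRow++;
            }
        }
        return lastGood;
    }

    public static void main(String[] args) throws IOException {
        // Sample values are invented; a real file is read from the website's folder.
        String sample = String.join("\n",
                "resource.name=example",
                "resource.encoding=UTF-8",
                "resource.grabAttempts=3",
                "enabled=true",
                "showInfo=false");
        Properties props = parse(sample);
        if (!Boolean.parseBoolean(props.getProperty("enabled", "false"))) {
            return; // a disabled configuration is skipped
        }
        int attempts = Integer.parseInt(props.getProperty("resource.grabAttempts"));
        // Fake website for the demo: only pages 100..104 have content.
        int last = grab(100, attempts, id -> id <= 104);
        System.out.println("last grabbed id: " + last);
    }
}
```

The ID returned by grab is the kind of value that would be written back to state.properties, so that the next run can continue from there. The key names used in state.properties are not shown here; see the rabotaua example for them.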
You can launch Spider directly with Main.java, or at a scheduled start time with MainSchedule.java
(see the cronExpression property in the Spring Framework context.xml).
A working example configuration is attached to the sources.
If you have any questions, please contact me by email at email@example.com.
I can help you enhance Spider for the web sources you are interested in.