Name            Modified    Size    Downloads / Week  Status
Totals: 2 Items             5.2 MB  2
Spider-src.zip  2011-10-23  5.2 MB  11
README.txt      2011-10-23  2.5 kB  11
Spider project

The initial goal of this project was to grab vacancies and resumes from websites. But you can use Spider to grab any kind of information from web pages (e.g. emails or contact info) and persist it to your database. You may use the source code freely in your projects, but do not delete the header comments.

Spider uses the 'config\resources' folder, where the configuration for each website is located. For each such configuration Spider creates a Java thread.

To create a configuration for a website you need the following files. (Please see the working example in the folder config/resources/rabotaua for the Russian website, attached to the sources.)

1) categories.properties - mapping of website category names to the ids of your database categories.

2) engine.properties - with the properties:
   2.1) resource.driver - driver class (in some cases you can extend js.spider.impl.driver.ResourceDriver)
   2.2) resource.name - website name
   2.3) resource.encoding - website encoding
   2.4) resource.vac.href - template string for the vacancy page href
   2.5) resource.res.href - template string for the resume page href
   2.6) resource.grabAttempts - number of empty pages after which Spider stops grabbing the resource
   2.7) enabled - (true/false) whether the configuration is enabled
   2.8) showInfo - log all information about each resume/vacancy

3) xpaths.properties - xpaths to all fields of a resume/vacancy. Flags are used to qualify each xpath:
   3.1) vac/res/comm - the xpath applies to a vacancy / a resume / both
   3.2) man - mandatory field; Spider skips the page if this xpath returns nothing
   3.3) a, b, c, d ... - precedence flags; Spider concatenates the strings
   3.4) the last flag is the name of the vacancy/resume field, e.g. title, description, email

4) state.properties - Spider uses it to read and store the state of the last processing run. For the first run you should specify the initial ids.

You can launch Spider directly with Main.java, or at a scheduled start time with MainSchedule.java (cronExpression property in the Spring Framework context.xml).

The example of a working configuration is attached to the sources.
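As a rough illustration of the engine.properties file described above, the sketch below parses a few of its keys with the standard java.util.Properties class. Only the key names come from this README. The sample values, the class name EngineConfigSketch, and the fallback defaults are assumptions made for the example, and this is not how Spider itself reads its configuration. The href template keys are left out because their placeholder syntax is not documented here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the Spider sources.
public class EngineConfigSketch {

    // Sample values are made up; only the key names are taken from the README.
    static final String SAMPLE =
        "resource.name=example-site\n" +
        "resource.encoding=UTF-8\n" +
        "resource.grabAttempts=5\n" +
        "enabled=true\n" +
        "showInfo=false\n";

    // Parse properties text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Assumption: a configuration without an 'enabled' key is treated as disabled.
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty("enabled", "false").trim());
    }

    // Assumption: fall back to 1 attempt when the key is missing.
    static int grabAttempts(Properties props) {
        return Integer.parseInt(props.getProperty("resource.grabAttempts", "1").trim());
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("name=" + props.getProperty("resource.name"));
        System.out.println("enabled=" + isEnabled(props));
        System.out.println("grabAttempts=" + grabAttempts(props));
    }
}
```

In a real setup the text would come from the engine.properties file of a website folder rather than from an inline string; the inline sample just keeps the sketch self-contained.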
For any questions, please contact me by email at workinua.custom.support@gmail.com. I can help you extend Spider for the web sources you are interested in.
Source: README.txt, updated 2011-10-23
