I'm building WH from trunk as a Maven project, and when I run my own test app
against the snapshot JAR I get the following warning:
WARN org.webharvest.runtime.Scraper - You are using the DEPRECATED scraper
configuration version. We urge you to migrate to a newer one! Please visit http://web-harvest.sourceforge.net/release.php for details.
I've read that you are reworking the variable mechanism, so is there a way to
use the new context by default?
Yep, in WH 2.1 a lot of improvements have been made to the variable handling;
in particular, a real dynamic scope was introduced. That required some syntax
cleanup. The new dynamic scope also breaks compatibility with existing WH
scrapers. For both of these reasons we decided to introduce scraper
configuration versioning, so that WH knows how to interpret a configuration,
either in the "old" or the "new" way. The work on this has just been finished
and it needs a lot of QA. Any help is appreciated.
Any way to post this in the wiki (i.e. the 2.1 differences) with examples? I
figured out a lot by converting existing scripts (thanks for no more <empty>
tags for var defs) after reading your responses and looking at
wh-core-2.0.xsd. One thing that would be nice is if I could get back
non-wrapped objects (not org.webharvest.runtime.variables.Variable), say a
List<Map<String,Object>> (or any other non-wrapped object). Consider the
following code: I have to convert the Variable list to a Java List, then
iterate that list to extract my Map objects.
/**
 * Return List of Map<String, Object> containing key/value pairs of data
 * after Web Harvest script completes.
 *
 * @param sourceFilePath Web Harvest script file path
 * @param workingDir Web Harvest working directory path
 * @return List of Map<String, Object> containing key/value pairs of data
 * @throws FileNotFoundException Possible exception
 */
public List<Map<String, Object>> getList(final String sourceFilePath,
        final String workingDir) throws FileNotFoundException {
    final ScraperConfiguration config = new ScraperConfiguration(
            sourceFilePath);
    final Scraper scraper = new Scraper(config, workingDir);
    scraper.setDebug(false);
    log.info(String.format("Executing script %s", sourceFilePath));
    final long startTime = System.currentTimeMillis();
    scraper.execute();
    log.info(String.format("Executed in: %d ms", System.currentTimeMillis()
            - startTime));
    // Script is expected to return Web Harvest Variable "mapList" which
    // should be a List of Map objects
    final Variable listVar = (Variable) scraper.getContext().getVar(
            "mapList");
    // List to return
    List<Map<String, Object>> mapList = new ArrayList<Map<String, Object>>();
    // If list returned from scraper is null there was a problem with the
    // script, so return null mapList
    if (listVar != null) {
        // Convert Web Harvest list to Java List of Web Harvest Variable
        // objects
        final List<Variable> list = listVar.toList();
        for (Variable var : list) {
            // Get wrapped object (Map in this case)
            mapList.add((Map) var.getWrappedObject());
        }
    } else {
        mapList = null;
    }
    return mapList;
}
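Until the library hands back unwrapped objects itself, the conversion loop above could be pulled into a small generic helper. What follows is only a minimal sketch under stated assumptions, not Web-Harvest code: Wrapped is a hypothetical stand-in for org.webharvest.runtime.variables.Variable that models nothing but getWrappedObject(), and UnwrapSketch and unwrapAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for org.webharvest.runtime.variables.Variable;
// only the getWrappedObject() accessor used in the post is modelled.
interface Wrapped {
    Object getWrappedObject();
}

final class UnwrapSketch {

    private UnwrapSketch() {
    }

    // Unwraps every element and casts it to the expected type.
    // Returns null for null input, mirroring the method in the post.
    // Throws ClassCastException if an element has an unexpected type.
    static <T> List<T> unwrapAll(List<? extends Wrapped> wrapped, Class<T> type) {
        if (wrapped == null) {
            return null;
        }
        List<T> result = new ArrayList<>(wrapped.size());
        for (Wrapped w : wrapped) {
            result.add(type.cast(w.getWrappedObject()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("title", "example");

        // One wrapped element whose payload is the map built above.
        List<Wrapped> input = new ArrayList<>();
        input.add(() -> row);

        List<Map> rows = unwrapAll(input, Map.class);
        System.out.println(rows.get(0).get("title"));
    }
}
```

Passing the expected class keeps the cast in one place, so a script that returns something other than a Map fails at the unwrapping step rather than later in the caller.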
Regarding unwrapping variables, I have answered in another thread. I've been
thinking about it for a while and, yes, it needs to be changed. I've put it on
the TODO list.
I figured out how to use the new context by looking at the source: the root
element has to declare the 2.1 schema namespace.
<config xmlns="http://web-harvest.sourceforge.net/schema/2.1/core" charset="UTF-8">
Otherwise the 2.1 trunk source falls back to the 1.0 syntax.
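To catch a script that is still missing the namespace declaration above before running it, a quick pre-flight check with the JDK's own XML parser might look like the sketch below. This is an assumption-laden illustration, not how Web-Harvest detects the version internally (the thread does not show that); ConfigNamespaceCheck and declares21Namespace are hypothetical names, and only the namespace URI comes from the post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ConfigNamespaceCheck {

    // Namespace URI quoted in the post above.
    static final String NS_2_1 = "http://web-harvest.sourceforge.net/schema/2.1/core";

    private ConfigNamespaceCheck() {
    }

    // Returns true if the root element of the given XML is in the 2.1 core namespace.
    static boolean declares21Namespace(String configXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getNamespaceURI()
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
            return NS_2_1.equals(doc.getDocumentElement().getNamespaceURI());
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String legacy = "<config charset=\"UTF-8\"></config>";
        String current = "<config xmlns=\"" + NS_2_1 + "\" charset=\"UTF-8\"></config>";
        System.out.println(declares21Namespace(legacy));  // false
        System.out.println(declares21Namespace(current)); // true
    }
}
```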