I just started using Web Harvest a few days ago and would love for it to
support the If-Modified-Since HTTP header in scripts like the following:
<file action="write" type="binary" path="images/${searchKey}.jpg">
    <http url="${sys.fullUrl(url,link)}">
        <http-header name="If-Modified-Since">Sat, 15 Oct 2011 01:00:00 GMT</http-header>
    </http>
</file>
There are two issues standing in the way:
1) Line 145 of org/webharvest/runtime/processors/HttpProcessor.java contains
the line:
text = new String(responseBody, charset);
If the webserver doesn't respond with content (which would be the expected
case if the requested resource hasn't changed), the responseBody is null and
the execution of the Web Harvest script dies with a NullPointerException.
I fixed this in my local src copy by wrapping the call with:
if (responseBody != null) {
    text = new String(responseBody, charset);
}
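As a standalone illustration of the same guard, here is a minimal sketch. The class BodyDecoder and its decode method are hypothetical names of mine, not part of Web Harvest, and returning an empty string for a missing body is my assumption (the change above simply leaves text at whatever value it already had).

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Web Harvest: decodes an HTTP response
// body while tolerating responses that carry no body at all (e.g. a
// "not modified" reply to an If-Modified-Since request).
final class BodyDecoder {
    static String decode(byte[] responseBody, String charset) {
        if (responseBody == null) {
            // Assumption: treat a missing body as empty text instead of
            // letting new String(null, ...) throw a NullPointerException.
            return "";
        }
        return new String(responseBody, Charset.forName(charset));
    }
}
```

With this, decode(null, "UTF-8") yields an empty string, and a real body is decoded as before.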
Fixing the null pointer exception results in a clean execution, except that
the FileProcessor sees a body with zero length and overwrites any existing
output file.
2) org/webharvest/runtime/processors/FileProcessor.java creates the
FileOutputStream before checking the length of the data. I've changed the
logic in my local copy to:
if (data.length > 0) {
    FileOutputStream out = new FileOutputStream(fullPath, append);
    out.write(data);
    out.flush();
    out.close();
}
Notice that the creation of the FileOutputStream has been moved to after the
data processing and included in the body of the if statement. It seems as
though an additional attribute would be needed to specify the desired behavior
for creating zero-length files because sometimes you want the file created and
sometimes you don't.
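To make the suggested attribute concrete, here is a rough sketch of how such a switch could drive the write. The class ConditionalFileWriter and the writeEmpty flag are hypothetical names of mine; nothing like them exists in Web Harvest as far as I know, and the real FileProcessor would need the flag wired up from the config element.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Web Harvest code: the output stream is only
// opened once we know the file should actually be touched.
final class ConditionalFileWriter {
    // Returns true if the file was opened and written, false if skipped.
    static boolean write(String fullPath, byte[] data, boolean append, boolean writeEmpty) {
        if (data.length == 0 && !writeEmpty) {
            // Nothing to write and empty files are not wanted:
            // leave any existing file untouched.
            return false;
        }
        try (FileOutputStream out = new FileOutputStream(fullPath, append)) {
            out.write(data);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

With writeEmpty set to false, an empty body (the unchanged-resource case) leaves a previously downloaded file as it is; with it set to true, the file is created or truncated as it is today.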
I did try wrapping my config elements inside a try processor, but it continued
to fail in the same way as without it. I'm guessing that's because the try
processor doesn't catch the NullPointerException, so it still propagates up the
hierarchy and causes the execution to fail.