Are there any non-backward-compatible changes between 2.0 beta and the code in
the trunk? I ask because my config works fine in 2.0 beta, but with the trunk
version it can't manage to log in to the website.
Well, unrelated to my login problem, but the database tag seems to be broken
in the trunk. From what I can see, the value from the database processor is
never returned to the parent processor (set/def for instance), because when
CommonUtil.createVariable is called in BodyProcessor.execute() to create a
variable from the tag body's execution, that method checks if that variable is
empty first, and DbRowVariable seems to always be "empty". This is because
DbRowVariable's isEmpty() method is inherited from NodeVariable, and so it
checks NodeVariable's own "data" member variable, which is always null in the
case of a DbRowVariable (!) because that class redefines (?) the data member
as an object array (Java allows this?). Weird behavior there.
Anyway, this used to work because, from the SVN log, it looks like
NodeVariable.isEmpty() used to also check if the toString() method returned an
empty string, which DbRowVariable.toString() does not. That condition was
removed though, so now database processors always return an EmptyVariable.
...Unless I'm doing something wrong.
Actually, it does make sense that Java allows DbRowVariable to have a member
called data, since NodeVariable.data is private, so the child class has no
visibility on it. That's not really the issue anyway. I guess the best
solution would be for DbRowVariable to implement its own isEmpty() method?
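For anyone who wants to see the field-hiding effect in isolation, here is a minimal sketch. BaseVariable, RowVariable and FixedRowVariable are hypothetical stand-ins I made up for illustration; they are not the real NodeVariable and DbRowVariable classes, and the emptiness rule in the override is only an assumption about what a sensible check might look like.

```java
// Hypothetical stand-ins, NOT the real Web-Harvest classes.
class BaseVariable {
    private Object data;               // private: invisible to subclasses

    BaseVariable(Object data) {
        this.data = data;
    }

    boolean isEmpty() {
        return data == null;           // always reads BaseVariable.data
    }
}

class RowVariable extends BaseVariable {
    private Object[] data;             // a separate field that merely shares the name

    RowVariable(Object[] row) {
        super(null);                   // the parent's field is never populated
        this.data = row;
    }
    // isEmpty() is inherited, so it still looks at the parent's null field
}

class FixedRowVariable extends BaseVariable {
    private Object[] data;

    FixedRowVariable(Object[] row) {
        super(null);
        this.data = row;
    }

    @Override
    boolean isEmpty() {
        // Assumed rule: a row is empty when it has no columns at all
        return data == null || data.length == 0;
    }
}

class FieldHidingDemo {
    public static void main(String[] args) {
        Object[] row = {"id", 42};
        System.out.println(new RowVariable(row).isEmpty());      // prints true
        System.out.println(new FixedRowVariable(row).isEmpty()); // prints false
    }
}
```

The subclass that keeps the inherited isEmpty() reports "empty" even though it holds a row, which matches what I described above; the one with its own override does not.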
I found out why my login wasn't working. There are indeed changes in the way
the http processor works that can affect scripts that worked in 2.0 beta.
First is that this processor now supports a followRedirects attribute that, as
the name implies, determines if Web Harvest should follow redirection requests
returned from the website. In 2.0, this behavior was automatic and
redirections would always be followed. In the trunk (2.1?), if the new
attribute is not specified, the default behavior is to not follow
redirections, which seems a bit counter-intuitive to me. It seems like in most
cases you'd want redirections to be followed, while not following them would
be the exception?
But anyway, adding that attribute fixed my login issue, but then I would be
logged out on the next request. It seemed like a cookie issue, so looking
through the code, I found a new section in HttpClientManager.execute() that
sets the expiry date of cookies that didn't have an expiry date to the current
date, apparently to avoid an issue with HttpClient 3.1. I don't know about
this issue, but I didn't have any problem with the previous version, while now
my cookies are not working in a few of the sites I'm trying to crawl, and
commenting out that section solves it so... I don't know, maybe making this
"fix" optional with another attribute would be a good idea?
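Here is a small sketch of why stamping a cookie with the current date might log you out. It assumes the expiry check has the form "expiry is not later than now", which is how I understand HttpClient 3.1 to behave but which I have not verified against its source. ExpiryDemo, isExpired and plusOneDay are hypothetical names for illustration only.

```java
import java.util.Calendar;
import java.util.Date;

class ExpiryDemo {
    // Assumed rule: no expiry date means a session cookie that never expires by date;
    // otherwise the cookie is expired once its date is not after "now".
    static boolean isExpired(Date expiry, Date now) {
        return expiry != null && !expiry.after(now);
    }

    // Returns a date one calendar day after the given one.
    static Date plusOneDay(Date now) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(now);
        cal.add(Calendar.DATE, 1);
        return cal.getTime();
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(isExpired(null, now));            // prints false
        System.out.println(isExpired(now, now));             // prints true
        System.out.println(isExpired(plusOneDay(now), now)); // prints false
    }
}
```

Under that assumption, a cookie whose expiry is set to "now" is already expired by the time the next request goes out, while one set a day ahead survives.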
I found another odd discrepancy between the two versions. Take the following
code for example:
    <function name="func">
        <script>boolean x = true;</script>
        <call name="func2"/>
        <case>
            <if condition="${x}">
                <template>${x}</template>
            </if>
        </case>
    </function>

    <function name="func2">
        <script>boolean x = false;</script>
    </function>

    <call name="func"/>
In version 2.0, the variable scopes of the two functions are independent; the
script variable assignment in func2 doesn't affect the variable with the same
name in func. In the trunk version though, func doesn't enter the if statement
because func2 can apparently change the value of its x variable. Is this
intentional?
I see that functions now also inherit variables from their parent scopes, i.e.
in my example above, func2 would have access to any variables defined in func
as well as in the "global" scope (outside functions). This is a pretty cool
feature, and it would have saved me many headaches had I started using the
trunk right away.
The behavior explained in my previous post is probably related to this, but
I'm still not sure if it's intentional... Actually, scratch that, I just did
some more testing; it's probably intentional after all. I see now the
difference between <set> and <def>. If func2 defs the x variable, it will have
its own local copy and func will be unaffected, while if it sets it, then
func's x will be overwritten. Script variables are set by default, it seems.
Interesting.
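The set/def difference can be modelled with a simple scope chain. This is only a sketch of the behavior as I observed it, not the actual Web-Harvest implementation; the Scope class and its methods are hypothetical, and having set fall back to a local definition when the name exists nowhere is my own assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of nested variable scopes, NOT Web-Harvest code.
class Scope {
    private final Scope parent;
    private final Map<String, Object> vars = new HashMap<>();

    Scope(Scope parent) {
        this.parent = parent;
    }

    // def: always creates (or replaces) the variable in this scope only
    void def(String name, Object value) {
        vars.put(name, value);
    }

    // set: overwrites the nearest existing definition up the chain;
    // assumed to define locally when the name is not found anywhere
    void set(String name, Object value) {
        for (Scope s = this; s != null; s = s.parent) {
            if (s.vars.containsKey(name)) {
                s.vars.put(name, value);
                return;
            }
        }
        vars.put(name, value);
    }

    // get: looks the name up in this scope first, then in the parents
    Object get(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            if (s.vars.containsKey(name)) {
                return s.vars.get(name);
            }
        }
        return null;
    }
}

class ScopeDemo {
    public static void main(String[] args) {
        Scope func = new Scope(null);
        func.def("x", true);

        Scope withDef = new Scope(func);     // func2 using def
        withDef.def("x", false);
        System.out.println(func.get("x"));   // prints true

        Scope withSet = new Scope(func);     // func2 using set
        withSet.set("x", false);
        System.out.println(func.get("x"));   // prints false
    }
}
```

In this model the callee that defs x leaves the caller's value alone, and the callee that sets x overwrites it, which is the same outcome as in my func/func2 example.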
I'll note another curious behavior. It's not a very important thing, but it
can still break scripts in some cases when switching versions. It seems that
the (admittedly deprecated) <var-def> cannot overwrite a variable set as a
<loop>'s item variable. For instance, the following code:
In 2.0, the execution always goes through the last <else>, because the i
variable is set to 5 at the beginning of the loop's body. In the trunk
version, each if is executed once, the var-def seemingly ignored. If you use
<set> or <def> instead, it behaves as expected, like in 2.0.
lord_samael, you are doing a thorough job! Thank you :)
I will go through your posts more carefully and check everything; hopefully
I'll find a couple of hours for this next week, not earlier unfortunately :(
The lack of testing is apparently the main reason why WH 2.1 is not yet
released. That's why I encourage users here to switch to the newest one and
share their feedback.
Briefly answering a few of your questions:
Yes, as you have seen, there are a lot of changes in variable and scoping handling in ver 2.1. And...
Yes, backward compatibility has been broken; not much, but in some cases yes :( No matter how badly I wished to keep it, there are places where it's too difficult to accomplish with little blood. There was no scoping before, and there were many known bugs caused by attempts to emulate it in some spots. After decent scoping was introduced, trying to support all the weirdness of the old behaviour just to make old scripts run on the new tool quickly became a nightmare. You know, it was not like supporting an old contract with a new one in place; rather, it looked like an attempt to support (reimplement?) old conceptual mistakes in a new implementation which does not have those mistakes anymore :)
<var> and <var-def> used to operate in a non-scoped context, and hence, after introducing scopes, they cannot continue to operate equally predictably in both scope-aware and scope-unaware manners. In a scoped context they create a lot of confusion; that's why I decided to deprecate them. Since it's impossible to sit comfortably on two chairs, I thought it better to leave <var> and <var-def> where they used to belong, provide backward compatibility for them to a required and limited extent, and never try to mix the new and old approaches. I mean that old scripts which have <var-def> never have <set> or <def>, so if we guarantee that newer scripts will only use <set> and <def> and never <var*>, we can easily separate the two approaches and, when implementing one, not care about the other. Do you know what I mean?
Anyway, there are still some challenges :) I'll play with your loop example
and see how I can make it work as before in WH 2.1.
To be honest, I would rather rename trunk to 3.0 and drop all the backward
compatibility crap and create a shiny new engine, but not sure I would have
enough time to do it...
Yes, I get you. It's unfortunate to break backward-compatibility, but
sometimes it's inevitable in order to progress. The new scoping is certainly
an improvement, and definitely worth the change even though some scripts will
need to be fixed. Just make sure you have a comprehensive changelog when
release time comes so that people know what to adjust.
I put in a quick-fix hack for the cookie issue in HttpClientManager.execute()
(the 2.1 trunk is indeed broken):
    // Requires java.util.Calendar, java.util.Date and
    // org.apache.commons.httpclient.Cookie.
    // If the cookie expiry date is not specified in the response, HttpClient 3.1 doesn't send the cookie back.
    // This leads to an inability to log in to some sites: you are always redirected to the login page.
    // Workaround: give cookies with a null expiry date an expiry of one day from now,
    // rather than the current date, which appears to make them expire right away.
    // todo: remove this code if the next version fixes the problem
    Calendar cal = Calendar.getInstance();
    cal.setTime(new Date());
    cal.add(Calendar.DATE, 1);          // now + 1 day
    Cookie[] cookies = clientState.getCookies();
    if (cookies != null) {
        for (Cookie cookie : cookies) {
            if (cookie.getExpiryDate() == null) {
                cookie.setExpiryDate(cal.getTime());
            }
        }
    }
Thanks for the patch, it is applied.
What else is broken in trunk?
I think the cookie issue and follow-redirects not defaulting to true were the
main ones I've found so far, as mentioned above by lord_samael. The database
tag doesn't work, but that's OK since I don't think it was optimal for larger
data sets, and it has no connection pooling. I just standardized on
List<Map<String,Object>> to handle sending back parsed data to a persistence
facade. I'll try to pull the latest trunk and run it against my current unit
tests. The saxon issue is that there is no legit public site for it, even
though it's floating around out there:
http://maven.40175.n5.nabble.com/saxon-and-maven-td117856.html