Breaking changes in trunk version?

Help
2011-02-17
2012-09-04
  • Lord Samael

    Lord Samael - 2011-02-17

    Are there any non-backward compatible changes between 2.0 beta and code in the
    trunk? Because my config that works fine in 2b can't manage to successfully
    log in to the website with the trunk version.

     
  • Lord Samael

    Lord Samael - 2011-02-17

    Well, unrelated to my login problem, but the database tag seems to be broken
    in the trunk. From what I can see, the value from the database processor is
    never returned to the parent processor (set/def for instance), because when
    CommonUtil.createVariable is called in BodyProcessor.execute() to create a
    variable from the tag body's execution, that method checks if that variable is
    empty first, and DbRowVariable seems to always be "empty". This is because
    DbRowVariable's isEmpty() method is inherited from NodeVariable, and so it
    checks NodeVariable's own "data" member variable, which is always null in the
    case of a DbRowVariable (!) because that class redefines (?) the data member
    as an object array (Java allows this?). Weird behavior there.

    Anyway, this used to work because, from the SVN log, it looks like
    NodeVariable.isEmpty() used to also check if the toString() method returned an
    empty string, which DbRowVariable.toString() does not. That condition was
    removed though, so now database processors always return an EmptyVariable.

    ...Unless I'm doing something wrong.

     
  • Lord Samael

    Lord Samael - 2011-02-17

    Actually, it does make sense that Java allows DbRowVariable to have a member
    called data, since NodeVariable.data is private, so the child class has no
    visibility on it. That's not really the issue anyway. I guess the best
    solution would be for DbRowVariable to implement its own isEmpty() method?
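
To make the field-hiding problem concrete, here is a minimal sketch. The classes are hypothetical stand-ins (BaseVariable for NodeVariable, RowVariable* for DbRowVariable), not the Web-Harvest sources, and it assumes the subclass passes null to the parent constructor. It shows why the inherited isEmpty() reports "empty", and how an override in the subclass would avoid that.

```java
// Hypothetical stand-ins for NodeVariable / DbRowVariable; not the Web-Harvest sources.
class BaseVariable {
    private final Object data;          // private, so subclasses cannot see it

    BaseVariable(Object data) { this.data = data; }

    // Only ever inspects BaseVariable.data.
    public boolean isEmpty() { return data == null; }
}

class RowVariableBroken extends BaseVariable {
    private final Object[] data;        // a separate field that merely shares the name

    RowVariableBroken(Object[] row) {
        super(null);                    // assumption: the parent's field is left null
        this.data = row;
    }
    // Inherits isEmpty(), which looks at the parent's null field.
}

class RowVariableFixed extends BaseVariable {
    private final Object[] data;

    RowVariableFixed(Object[] row) {
        super(null);
        this.data = row;
    }

    // The override checks the subclass's own field instead.
    @Override
    public boolean isEmpty() { return data == null || data.length == 0; }
}

class FieldHidingDemo {
    public static void main(String[] args) {
        Object[] row = {"id", 42};
        System.out.println(new RowVariableBroken(row).isEmpty()); // prints true
        System.out.println(new RowVariableFixed(row).isEmpty());  // prints false
    }
}
```

Whether an empty row should count as "empty" in the real class is a design choice I'm only guessing at here.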

     
  • Lord Samael

    Lord Samael - 2011-02-17

    I found out why my login wasn't working. There are indeed changes in the way
    the http processor works that can affect scripts that worked in 2.0 beta.
    First is that this processor now supports a followRedirects attribute that, as
    the name implies, determines if Web Harvest should follow redirection requests
    returned from the website. In 2.0, this behavior was automatic and
    redirections would always be followed. In the trunk (2.1?), if the new
    attribute is not specified, the default behavior is to not follow
    redirections, which seems a bit counter-intuitive to me. It seems like in most
    cases you'd want redirections to be followed, while not following them would
    be the exception?

    But anyway, adding that attribute fixed my login issue, but then I would be
    logged out on the next request. It seemed like a cookie issue, so looking
    through the code, I found a new section in HttpClientManager.execute() that
    sets the expiry date of cookies that didn't have an expiry date to the current
    date, apparently to avoid an issue with HttpClient 3.1. I don't know about
    this issue, but I didn't have any problem with the previous version, while now
    my cookies are not working in a few of the sites I'm trying to crawl, and
    commenting out that section solves it so... I don't know, maybe making this
    "fix" optional with another attribute would be a good idea?

     
  • Lord Samael

    Lord Samael - 2011-02-21

    I found another odd discrepancy between the two versions. Take the following
    code for example:

    <function name="func">
        <script>boolean x = true;</script>
        <call name="func2" />
        <case>
            <if condition="${x}">
                <template>${x}</template>
            </if>
        </case>
    </function>
    
    <function name="func2">
        <script>boolean x = false;</script>
    </function>
    
    <call name="func" />
    

    In version 2.0, the variable scopes of the two functions are independent; the
    script variable assignment in func2 doesn't affect the variable with the same
    name in func. In the trunk version though, func doesn't enter the if statement
    because func2 can apparently change the value of its x variable. Is this
    intentional?

     
  • Lord Samael

    Lord Samael - 2011-02-21

    I see that functions now also inherit variables from their parent scopes, i.e.
    in my example above, func2 would have access to any variables defined in func
    as well as in the "global" scope (outside functions). This is a pretty cool
    feature, and it would have saved me many headaches had I started using the
    trunk right away.

    The behavior explained in my previous post is probably related to this, but
    I'm still not sure if it's intentional... Actually, scratch that, I just did
    some more testing; it's probably intentional after all. I see now the
    difference between <set> and <def>. If func2 defs the x variable, it will have
    its own local copy and func will be unaffected, while if it sets it, then
    func's x will be overwritten. Script variables are set by default, it seems.
    Interesting.
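
For anyone trying to picture the difference, here is a rough model of the behavior as I observed it. The Scope class is hypothetical and is not how Web-Harvest implements it; in particular, what "set" does when no outer variable exists is my assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the observed scoping; not Web-Harvest's own classes.
class Scope {
    private final Scope parent;
    private final Map<String, Object> vars = new HashMap<>();

    Scope(Scope parent) { this.parent = parent; }

    // Like <def>: always binds in this scope, shadowing any outer variable.
    void def(String name, Object value) { vars.put(name, value); }

    // Like <set>: overwrites the nearest existing binding.
    // Assumption: if none exists, a local binding is created.
    void set(String name, Object value) {
        for (Scope s = this; s != null; s = s.parent) {
            if (s.vars.containsKey(name)) { s.vars.put(name, value); return; }
        }
        vars.put(name, value);
    }

    // Looks the name up through the chain of enclosing scopes.
    Object get(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            if (s.vars.containsKey(name)) return s.vars.get(name);
        }
        return null;
    }
}

class ScopeDemo {
    public static void main(String[] args) {
        Scope func = new Scope(null);
        func.def("x", true);

        Scope func2 = new Scope(func);     // the callee sees the caller's variables
        func2.set("x", false);             // overwrites func's x
        System.out.println(func.get("x")); // prints false

        func.def("x", true);               // reset
        Scope func2b = new Scope(func);
        func2b.def("x", false);            // local copy only
        System.out.println(func.get("x")); // prints true
    }
}
```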

     
  • Lord Samael

    Lord Samael - 2011-02-22

    I'll note another curious behavior. It's not a very important thing, but it
    can still break scripts in some cases when switching versions. It seems that
    the (admittedly deprecated) <var-def> cannot overwrite a variable set as a
    <loop>'s item variable. For instance, the following code:

            <loop item="i">
            <list>
                1
                2
                3
                4
                5
            </list>
            <body>
                <var-def name="i" overwrite="true">5</var-def>
                <case>
                    <if condition="${i.toInt() == 1}">1</if>
                    <else>
                        <case>
                            <if condition="${i.toInt() == 2}">2</if>
                            <else>
                                <case>
                                    <if condition="${i.toInt() == 3}">3</if>
                                    <else>
                                        <case>
                                            <if condition="${i.toInt() == 4}">4</if>
                                            <else>5</else>
                                        </case>
                                    </else>
                                </case>
                            </else>
                        </case>
                    </else>
                </case>
            </body>
            </loop>
    

    In 2.0, the execution always goes through the last <else>, because the i
    variable is set to 5 at the beginning of the loop's body. In the trunk
    version, each if is executed once, the var-def seemingly ignored. If you use
    <set> or <def> instead, it behaves as expected, like in 2.0.

     
  • Alex Wajda

    Alex Wajda - 2011-02-23

    lord_samael, you are doing a thorough job! Thank you :)
    I will go through your posts more carefully and check everything; hopefully
    I'll find a couple of hours for this next week, not earlier unfortunately :( The
    lack of testing is apparently the main reason why WH 2.1 is not yet released.
    That's why I encourage users here to switch to the newest one and share their
    feedback.

    Briefly answering a few of your questions:

    • Yes, as you have seen there are a lot of changes in variable and scoping handling in ver 2.1. And...
    • Yes, backward compatibility has been broken; not much, but in some cases yes :( However badly I wished to keep it, there are places where it's too difficult to accomplish with little blood. There was no scoping before, and many known bugs were caused by attempts to emulate it in some spots. After decent scoping was introduced, trying to support all the weirdness of the old behaviour just to make old scripts run on the new tool quickly became a nightmare. You know, it was not like supporting an old contract with a new one in place; rather, it looked like an attempt to support (reimplement?) old conceptual mistakes in a new implementation which does not have those mistakes anymore :)

    • <var> and <var-def> used to operate in a non-scoped context, and hence after scopes were introduced they cannot continue to operate equally predictably in both scope-aware and scope-unaware manners. In a scoped context they create a lot of confusion; that's why I decided to deprecate them. Since it's impossible to sit comfortably on two chairs, I thought it better to leave <var> and <var-def> where they used to belong, provide backward compatibility for them to a required and limited extent, and never try to mix the new and old approaches. I mean that old scripts which have <var-def> never have <set> or <def>, so if we guarantee that newer scripts will only use <set> and <def> and never <var*>, we could easily separate both approaches and, when implementing one, not care about the other. Do you know what I mean?

    Anyway, there are still some challenges :) I'll play with your loop example
    and see how I could make it work as before in WH 2.1.

    To be honest, I would rather rename trunk to 3.0 and drop all the backward
    compatibility crap and create a shiny new engine, but not sure I would have
    enough time to do it...

     
  • Lord Samael

    Lord Samael - 2011-02-23

    Yes, I get you. It's unfortunate to break backward-compatibility, but
    sometimes it's inevitable in order to progress. The new scoping is certainly
    an improvement, and definitely worth the change even though some scripts will
    need to be fixed. Just make sure you have a comprehensive changelog when
    release time comes so that people know what to adjust.

     
  • Steven P. Goldsmith

    I put in a quick-fix hack for the cookie issue in HttpClientManager.execute
    (the 2.1 trunk is indeed broken):

            // If the cookie expiry date is not specified in the response, HttpClient 3.1 doesn't send it back.
            // This leads to an inability to log in to some sites, being always redirected to the login page.
            // Workaround here is to set cookies with null expiry dates to expire one day from now.
            // todo: remove this code if next version fixes the problem
    
            Calendar cal = Calendar.getInstance();
            cal.setTime(new Date());
            cal.add(Calendar.DATE, 1);
    
            Cookie[] cookies = clientState.getCookies();
            if (cookies != null) {
                for (Cookie cookie : cookies) {
                    if (cookie.getExpiryDate() == null) {
                        cookie.setExpiryDate(cal.getTime());
                    }
                }
            }
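
To illustrate why the one-day offset matters, here is a small standalone sketch. ExpiryCheck is a hypothetical helper, not part of HttpClient or Web-Harvest, and the expiry rule in isExpired() is an assumption about how such a check typically works. Under that rule, a cookie stamped with the current date is already expired by the next request, while one stamped a day ahead survives.

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical helper; not part of HttpClient or Web-Harvest.
class ExpiryCheck {
    // Assumed rule: no expiry date means a session cookie that does not expire
    // on its own; otherwise the cookie is expired once 'now' reaches the date.
    static boolean isExpired(Date expiry, Date now) {
        return expiry != null && expiry.getTime() <= now.getTime();
    }

    // Same Calendar arithmetic as the patch above: base date plus N days.
    static Date plusDays(Date base, int days) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(base);
        cal.add(Calendar.DATE, days);
        return cal.getTime();
    }

    public static void main(String[] args) {
        Date received = new Date();
        Date nextRequest = new Date(received.getTime() + 1000); // one second later

        System.out.println(isExpired(null, nextRequest));                  // prints false
        System.out.println(isExpired(received, nextRequest));              // prints true
        System.out.println(isExpired(plusDays(received, 1), nextRequest)); // prints false
    }
}
```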
    
     
  • Alex Wajda

    Alex Wajda - 2011-11-17

    Thanks for the patch, it is applied.

    What else is broken in trunk?

     
  • Steven P. Goldsmith

    I think the cookie issue and follow-redirects not defaulting to true were the
    main ones I've found so far, as mentioned above by lord_samael. The database tag
    doesn't work, but that's OK since I don't think it was optimal for larger data
    sets, and it has no connection pooling. I just standardized on
    List<Map<String,Object>> to handle sending parsed data back to a persistence
    facade. I'll try to pull the latest trunk and run it against my current unit
    tests. The saxon issue is that there is no legit public site even though it's
    floating around out there:
    http://maven.40175.n5.nabble.com/saxon-and-maven-td117856.html

     
