WebHarvest - web data extraction tool / Discussion / Help: Optimize <script> blocks

Is there any way to improve the performance of <script> blocks? I noticed
that adding them doubled the run time. I use them because I want a List of
Maps returned so the data can be processed further in a Java program.

    <!-- List of Maps will be returned -->

    <script>
        <![CDATA[
        List list = new ArrayList();
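        // Assumption: the time field is hours:minutes, so the pattern uses mm (minutes); ss would be read as seconds.
        java.text.SimpleDateFormat simpleDateFormat = new java.text.SimpleDateFormat("MM/dd/yyyy hh:mm");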
        ]]>
    </script>

     <!-- Parse field elements into variable and store in Map -->

    <loop item="item" index="i" empty="true">
        <list>
            <var name="callXml"/>
        </list>
        <body>
            <script>
                <![CDATA[
                Map map = new HashMap();
                ]]>
            </script>

            <var-def name="incident">
                <xpath expression="//span[@name='INCIDENT']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <script>
                <![CDATA[
                map.put("event_num", incident.toString());
                ]]>
            </script>

            <var-def name="date">
                <xpath expression="//span[@name='DATE_TIME']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <var-def name="time">
                <xpath expression="//span[@name='eTIME']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <!-- Convert date and time strings to Java Date -->
            <script>
                <![CDATA[
                Date timestamp = simpleDateFormat.parse(date.toString()+" "+time.toString());
                map.put("event_time", timestamp);
                ]]>
            </script>

            <var-def name="description">
                <xpath expression="//span[@name='DESCRIPTION']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <script>
                <![CDATA[
                map.put("description", description.toString());
                ]]>
            </script>

            <var-def name="street">
                <xpath expression="//span[@name='STREET']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <script>
                <![CDATA[
                map.put("location_main", street.toString());
                ]]>
            </script>

            <var-def name="subdivision">
                <xpath expression="//span[@name='SUBDIVISION']/text()">
                    <var name="item"/>
                </xpath>
            </var-def>

            <script>
                <![CDATA[
                if (subdivision.toString().equals("")) {
                    map.put("location_alt1", null);
                } else {
                    map.put("location_alt1", subdivision.toString());
                }
                list.add(map);
                ]]>
            </script>            
        </body>
    </loop>

    <!-- Put list into WebHarvest variable context -->
    <script>
        <![CDATA[
        sys.defineVariable("callList", list);
        ]]>
    </script>

Alex Wajda - 2011-11-17

I think the problem is in how collection-type variables are handled in WH.
According to the original design, when a collection is put into the context a
shallow copy is created and each item of the copy is wrapped in a Variable
instance. That is what happens when you call the sys.defineVariable(..)
method.
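
To make the cost concrete, here is a minimal Java sketch of that copy-and-wrap step. The `Wrapped` class and the `defineVariable` helper are hypothetical stand-ins written for illustration, not WebHarvest's actual classes; the point is only that every element costs one extra allocation each time the collection enters the context.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowCopyDemo {
    // Hypothetical stand-in for a per-item wrapper; not WebHarvest's real Variable class.
    static final class Wrapped {
        final Object value;
        Wrapped(Object value) { this.value = value; }
        @Override public String toString() { return String.valueOf(value); }
    }

    // Mimics the behaviour described above: copy the list and wrap every item.
    static List<Wrapped> defineVariable(List<?> source) {
        List<Wrapped> copy = new ArrayList<>(source.size());
        for (Object item : source) {
            copy.add(new Wrapped(item)); // one extra allocation per element
        }
        return copy;
    }

    public static void main(String[] args) {
        List<Object> list = new ArrayList<>();
        list.add("foo");
        list.add("bar");

        List<Wrapped> stored = defineVariable(list);
        list.add("baz"); // later changes to the source do not reach the stored copy

        System.out.println(stored.size());
        // The copy is shallow: the wrapped value is the same object as the source item.
        System.out.println(stored.get(0).value == list.get(0));
    }
}
```

With a long List of Maps, this loop runs over every element on each hand-off, which would be consistent with the slowdown reported above.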

This issue was partially addressed in WH 2.1 by introducing ScriptingVariable
and integrating the native dynamic-language context with the WH context, so
there is no need to explicitly pass variables to and from the <script> blocks.
All variables created at the top level of the <script> scope are passed by
reference to the corresponding WH scope. That not only gives you flexibility
and removes the clutter of passing data around the <script> blocks, but also
performs better, because no shallow copies are created for collection-type
variables passed to or from scripts.

Consider the example:

    <script>
        list = ["foo", "bar"]
    </script>
    <script>
        list += "baz"  // the 'list' variable is passed by reference
    </script>
    <get var="list"/>

Also note that a <script> block returns the value of its last statement, and
that value is passed to the WH context in the usual way, i.e. a shallow copy
of that value is created. Currently there is no way to tell <script> to
ignore the execution result (that's another item for my todo list), so if
'list' is a long collection you need to make sure it is not implicitly
returned from the <script> block. In the example above, simply adding a
dummy empty-value statement (like '' or null) at the end of the script block
would do the trick.

    <script>
        list = ["foo", "bar"]
        ''
    </script>
    <script>
        list += "baz"
        null
    </script>
    <get var="list"/>

Alex Wajda - 2011-11-17

I'm thinking of deprecating the idea of wrapping every variable in a
Variable object. The author says it was done that way so that users would not
have to care about NPEs or variable types when massaging the data. But in
practice it doesn't work as expected. You get a "variable not defined" error
when you try to access a non-existent variable, and in <script> as well as
${} blocks you have to explicitly unwrap any variable with one of the
.toXXX() methods to get the actual value before you can use it, which implies
that you have to be aware of the variable type. In other words, to me,
exposing the Variable object to the user creates more problems than it solves,
and it should be either removed or hidden from the user, so that instead of
writing ${x.toInt() < y.toInt()} I would simply write ${x < y} and so forth.
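
The same point in plain Java, as a rough sketch: the `Var` class below is a hypothetical wrapper loosely modelled on the idea described above, not WebHarvest's actual Variable class. It shows that the caller has to choose the conversion method, and therefore has to know the type anyway.

```java
public class UnwrapDemo {
    // Hypothetical wrapper for illustration only; not WebHarvest's real Variable class.
    static final class Var {
        private final Object value;
        Var(Object value) { this.value = value; }

        // The caller must choose this conversion, so it must already know the type.
        int toInt() { return Integer.parseInt(String.valueOf(value).trim()); }
    }

    // With wrappers, every comparison needs an explicit unwrapping step.
    static boolean lessThanWrapped(Var x, Var y) {
        return x.toInt() < y.toInt();
    }

    // With plain values, the comparison reads the way one would like to write it.
    static boolean lessThanPlain(int x, int y) {
        return x < y;
    }

    public static void main(String[] args) {
        System.out.println(lessThanWrapped(new Var("3"), new Var(10)));
        System.out.println(lessThanPlain(3, 10));
    }
}
```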
