
#482 Function call blocks and local variable/function names

Status: closed-accepted
Owner: nobody
Priority: 5
Updated: 2022-10-21
Created: 2018-09-26
Private: No

gnuplot currently provides three general methods for supporting code modularization: function calls, 'load', and 'call'.

  1. The 'load' command reads the named file and executes its commands sequentially until end-of-file, then returns to the originating script and executes any subsequent commands. The general effect is similar to the #include "filename" feature in C and C-like languages: the contents of the loaded script are executed as if the commands were written verbatim at the point of the load command.
  2. The 'call' command is similar to the 'load' command except that it accepts arguments that are passed to the called script, where they are stored in the ARG[0-9] variables and an ARGV array. The ARG[0-9] and ARGV variables are local to the called script. That is, given two scripts, script1.gp (which calls script2.gp) and the command call 'script1.gp' 42, the ARG1 variable inside script1.gp will contain 42 both before and after script1.gp calls script2.gp, even if script2.gp receives a different ARG1 value.
  3. A function call conforms to the function-call behavior of most languages in that it can be named and referred to anywhere after its definition. It accepts arguments whose type is maintained across the call; that is, given the definition foo(x)=... and a call foo(42), the parameter x contains the number 42. Function parameters are local to the function and are not injected into the calling context: given a definition of foo(x)=..., the parameter x is not visible outside of the function call. Function calls are expressions in that a value is implicitly returned as the result of evaluating the call.

Each of these mechanisms has limitations that hinder generally accepted code-modularity practices.

1) Both the 'load' and 'call' commands allow the execution of arbitrarily complex commands; however, any names introduced during execution are visible in the parent context. That is, there is no way to define a local variable or function, nor are there return values. This means that using 'load' and 'call' always comes with the side effect of polluting or overwriting variables or functions in the parent namespace, whether welcome or not. It is possible to work around some of these limitations. For example, given call 'myscript.gp' 42 and the contents:

# preface local variables with '_[scriptname]_' to avoid name clashes
_myscript_foo=ARG1+1

# do something interesting with _myscript_foo

# 'return' value is stored in the variable name 'RETURNVAL'
RETURNVAL=_myscript_foo

undefine _myscript_foo

As long as local variables and functions are named to avoid clashes and are undefined at the end of the script, all is well, but this requires discipline and is error-prone. That is, it is a style convention rather than part of the scripting language.

Likewise, both 'load' and 'call' are commands and cannot be used in expressions. For example: x=call 'foo.gp' is invalid.

2) Function calls partially overcome the local-variable problem through their parameters, but arbitrarily complex operations cannot be performed inside a function call (for example, a command). The introduction of additional variables is not local to the function and can only be achieved via a serial operation. For example:

gnuplot> foo(x)=(bar=5*x,10*x)
gnuplot> print foo(10)
100
gnuplot> # bar is visible outside of the function
gnuplot> print bar
50

Proposed solutions:

I propose two solutions to the above limitation for discussion and consideration.

1) Introduction of the 'local' keyword that can be applied to variable and function definitions. For example:

local foo=42;
local bar(x)=...

Specifying the 'local' keyword indicates that the variable or function is only visible inside the current execution context (current script for example). By making locality explicit, users can choose whether or not a variable or function definition is injected into the global namespace. Therefore, users can choose when to adopt this feature or not at all.

This could be implemented in at least a couple of different ways; a typical call stack (optimal) or by gnuplot silently prefacing local names and their access with a context-unique identifier (less than ideal)--essentially a built-in version of the workaround above.

2) Introduction of a code block and 'return' keyword as part of a function call definition. For example:

foo(x,y,z) {
  # something interesting

  # optional return value
  return _somevalue_
}

By allowing a code block, arbitrary gnuplot commands could be executed inside of a function call to include variables definitions (potentially prefaced with the 'local' keyword as the code block would be its own execution context).

By introducing a 'return' keyword, the function could return arbitrary values to the calling context. If the return value is optional, it will require some changes to how function calls are evaluated. Currently it is not possible to create an "empty" function, that is, a function that has no value.

gnuplot> # nothing after the '='
gnuplot> foo(x)=
                ^
         function definition expected

Whereas if the return keyword were optional, the following would need to succeed:

gnuplot> foo(x) {}

But then fail depending on its use:

gnuplot> # invalid
gnuplot> bar = foo(42)
               ^
         value required (or something similar)
gnuplot> # OK
gnuplot> foo(42)

Similarly, by supporting both function styles, foo(x)=... and foo(x) {...}, users can choose when to adopt this feature, or not at all.

Advantages of the above suggestions.

In my opinion, increasing gnuplot's support for modular code will allow future plotting behavior to be implemented as part of a "standard library" rather than as internal functions. Additionally, this provides a mechanism for loading or calling scripts, or executing function blocks, that do not pollute the calling context unless desired (through omission of the 'local' keyword).

For example, the following separates the plotting of the data (which could include many, many style customizations, multiplotting, etc.) from the extraction of the data to be plotted. Granted, complex data manipulation is outside the scope of gnuplot, but simple extraction is not.

File "binaryop_dataset.gp"

# extract two columns of data by referencing the csv file columnnames
# and apply the given binary operand

binaryop_dataset(filename,colname1,colname2,binop) {
  set style data points
  set datafile separator ','
  set key autotitle columnhead

  eval(sprintf("local _binop(x,y)=%s(x,y)",binop))

  set xrange [*:*]
  set yrange [*:*]

  set table $BLOCK
  plot filename using 0:(_binop(column(colname1),column(colname2))) with table
  unset table

  return $BLOCK
}

File "myplot.gp"

# Plot the contents of 'datablock' according to the style arguments
# 'datablock' must have two columns, doesn't care where the
# data came from
myplot(datablock,...style_args...) {
  local my_local_var....
  # plot datablock according to style_args
}
gnuplot> multiply(x,y)=x*y
gnuplot> $FOOBLOCK=binaryop_dataset('_somefile','foocol','barcol','multiply')
gnuplot> # setup output terminal, styles, etc
gnuplot> load "myplot.gp"
gnuplot> myplot($FOOBLOCK,...)

The above is intended to start a discussion so I'm happy to change my mind about any of the above.

Discussion

  • Ethan Merritt

    Ethan Merritt - 2018-10-02

    Musing #1
    Leaving aside the question of how one might internally implement local variables, I think your overview fails to cover all the necessary scope decisions. Suppose
    top level code does A=5 and calls (or loads or evaluates or whatever) SUB1
    SUB1 declares local A=7 (does not destroy caller's A) and calls SUB2
    If SUB2 refers to A, whose A does it get? The top level A or the SUB1 A?
    That is, does the "local" property correspond to a scope that is "this level only" or "this level and below"?

    Musing #2
    Gnuplot is written in C, which doesn't provide the sort of automatic garbage collection you would want to clean up local variables when you error-exit out of the scope they were created in, or even to clean up on a normal exit. Right now all variables are stored in a single linked list. Would you start a new linked list for each scope? How do you ensure it is freed on all exit paths? Which is harder: writing a garbage-collection mechanism de novo, or porting to a language that already has one?

    Musing #3
    Although gnuplot has evolved towards being a full-blown programming language, I don't think it has gotten all the way there, and those final steps are non-trivial. Your last set of examples essentially introduces a lambda calculus, but does so using operations like "eval" that are hideously inefficient. Kind of perversely clever though, using the program to format new ASCII source code that it then executes on its own behalf. If we become serious about adding this I think it could be done some other way. For example, there could be new syntax for defining a binary operator so that the evaluation stack can use it directly rather than your sprintf+eval hack.

     
    • Mike Tegtmeyer

      Mike Tegtmeyer - 2018-10-02

      Re-musing #1:

      In my opinion, the scoping of local variables should conform to normal programming-language conventions. That is, given a variable definition local A, the variable should be visible in the current and any nested context, whereas an undecorated definition of A explicitly places the variable in the global context, visible everywhere. A reference to A should resolve to the most immediately scoped definition of A.

      Re-musing #2:

      I understand that the phrase 'garbage collection' can mean different things depending on context and to different people, but I would specifically not recommend implementing local-variable cleanup using 'garbage collection' in the classic sense of the phrase. Typically, a language that supports garbage collection means that you can forget/lose/ignore your reference to some chunk of memory, and some mechanism later (or in a different thread) sweeps your address space or allocation storage looking for unreferenced memory, performs any cleanup, and then marks it as available for reallocation. For typical language parsing, this behavior is not needed; it is instead handled through a call stack and a symbol table. For the examples below, I'll use C or a C-like syntax. Many of these steps are typically handled during compilation, but a straightforward implementation of a dynamic language basically does the same thing.

      You may be very familiar with the concepts below so I apologize in advance but at a minimum it will give us examples for further discussion.

      A straightforward implementation of a typical programming language uses a 'symbol table' to manage the bookkeeping. This is a stack (in the Comp Sci sense of the word) where each element is a table containing the current symbols in the call stack. This table contains a mapping from a name (eg a string) to the necessary details of the symbol itself (what type it is, current value, etc). Each time a new scoping context is created, a new (potentially empty) symbol table is pushed onto the stack (created) and then popped when the scope is exited (destroyed/cleaned up). So to answer your question, cleanup happens when the scope exits. It is a well-defined point for cleanup of any local variables and allocated memory. Variable access is a straightforward lookup for the first variable with the name starting at the top of the stack.

      I am not familiar with the internals of gnuplot, so this pseudocode may be overly simplistic, but suppose the call stack takes lists of structures containing the mapping:

      /*
        assume 'definition_type' is an enumeration of 'STRING', 'NUMBER',
        'ARRAY', 'FUNCTION', 'UNDEFINED', whatever.
        'value' is the current value, void * for simplicity.
      */
      typedef struct symbol {
        char *name;
        definition_type type;
        void *value;
        struct symbol *next;
      } symbol;
      

      Then the stack simply contains the head of the lists:

      typedef struct call_stack {
        symbol *head;
        struct call_stack *parent;
      } call_stack;
      

      Define the following functions:

      /*
        'stack' contains the top of the stack. malloc a new call_stack
        element containing 'head' and set 'parent' to 'stack'. Return the
        new top of the stack.
      */
      call_stack * push(call_stack *stack);

      /*
        'stack' contains the top of the stack. Cleanup/deallocate the list
        pointed to by 'stack->head'. Dealloc 'stack' and return 'stack->parent'
        as the new top of the stack.
      */
      call_stack * pop(call_stack *stack);

      /*
        allocate a new symbol structure, copy 'name' to 'symbol->name', set
        'symbol->type' to 'type'. Add the new symbol to the list pointed to
        by 'head' and return the new symbol.
      */
      symbol * make_symbol(char *name, definition_type type, symbol *head);

      /*
        traverse the list pointed to by 'top->head' looking for 'name'. If it
        does not exist, move on to 'top->parent' and so forth until either
        the symbol is found or the bottom of the stack is reached
        (parent == NULL or "not found"). Return the result.
      */
      symbol * lookup_symbol(char *name, call_stack *top);
      

      Then suppose you have the following (pseudo C-ish + gnuplot)

      # global scope
      {
        # scope 1
        local A;
        B;
      
        A = 5;
      
        {
          # scope 2
          local A;
      
          B = 21;
          A = 42;
        }
      }
      

      Suppose the top of the call stack is declared as follows. Keep track of the global context.

      /*
        Init call stack
      */
      call_stack *current_context;
      current_context = push(NULL);
      call_stack *global_context;
      global_context = current_context;
      

      Prior to entering scope 1, the call stack 'current_context' contains one element--the symbol table for all the global symbols. On entry to 'scope 1', a new element is added:

      current_context = push(current_context);
      

      'local A' is parsed, add it to the symbol table of the current context

      A_sym = make_symbol("A",UNDEFINED,current_context->head);
      

      'B' is parsed, it is injected into the global symbol table since it omits the 'local' keyword.

      B_sym = make_symbol("B",UNDEFINED,global_context->head);
      

      'A = 5' is parsed. Lookup a symbol named 'A' starting in the current context. Finds "A" in the current context

      sym = lookup_symbol("A",current_context);
      

      Enter scope 2, push a new context onto the call stack

      current_context = push(current_context);
      

      'local A' is parsed, add it to the symbol table of the current context. There are two symbols named "A" now. One in this context and another in the parent.

      A_sym = make_symbol("A",UNDEFINED,current_context->head);
      

      'B = 21' is parsed. Lookup a symbol named 'B' starting in the current context. Ultimately finds "B" in the global context.

      'A = 42' is parsed. Lookup a symbol named 'A' starting in the current context. Finds the symbol named "A" in the current context, not the one in the parent context.

      Exit scope 2: pop the current context, deallocating the symbols in its table in the process. The new current context is scope 1.

      Exit scope 1: pop the current context, deallocating the symbols in its table in the process. The new current context is the global context.

      I apologize again if I am telling something that you already know--I did compiler research in my grad student days :)

      Re-musing #3

      I expected that the use of 'eval' is inefficient and that some other mechanism should probably be used, but I didn't want to throw too much at you and the feature request at once. One solution is embedded in the above and in our other discussions: treating a function name as just another symbol that can be passed around. For example, a function name could be passed as a function parameter:

      foo(x,y)=x*y
      bar(x,y,binop)=binop(x,y)
      

      That is, the value of the 'binop' argument is a symbol, and that symbol is preserved across the function call.

      Creating programming languages is tricky and often done for the wrong reasons, for example "I just don't like the syntax...". Programming languages are also used to do things that are way outside their intended usage. So ultimately I am not necessarily advocating that gnuplot evolve into a full programming language for its own sake, but rather that it address (I believe) a reasonable pattern of use.

      I think a reasonable goal would be to evolve to the point where a plot script (or function) can be standalone and called upon without being a built-in. In the "old days" it seemed the pattern was "one script = one plot", and any new major plotting behavior relied on being implemented as a built-in. To me, gnuplot would gain a lot of advantages by moving away from this. Any large-scale data and analysis workflow that involves the plotting of many figures ultimately begs for modularization.

      Suppose someone created a very custom plot; what would gnuplot need for that person to give it to someone else to use? The word 'use' is important here because 'use' doesn't require that the recipient understand all the internal details. To me, the following would be required:

      1) A clear and concise interface. Recipients only interact with this interface.

      2) Any variables or other symbols internal to the script don't overwrite or pollute the user's namespace (i.e. our local-variable discussion).

      3) Upon exit, the current "plotting/graphics/settings" state is reasonable (where the caller left it?). Discussion for another thread, I suspect...

      4) Data gathering that is distinctly separate from data usage. Again, I agree that complex data manipulation is not the purpose of gnuplot, but I do believe that semi-complex "data gathering" is. If "data gathering" is manipulations inside of a 'plot using' statement, then someone receiving a plot script must go into the script, understand it, and change the 'using' statement appropriately based on how their input data is structured. This is bad, in my opinion.

      The current way I handle the separation of data gathering and data use is through datablocks. That is, if I create a custom plot that requires X columns of data, I make a script that requires a datablock with X columns in it. This is "data usage" in that the plot script doesn't care how it was gathered, as long as it receives a datablock with the correct number of columns. I likewise have generic scripts that gather data from tables, 'csv's, etc, perform simple manipulation, and output them into the appropriate datablock. This is "data gathering" that doesn't care how the data will ultimately be used.

      Use is simply a wrapper script that executes:

      ASCII      \
      csv         -> "gather" script -> $DATABLOCK -> "plot" script -> output
      functions  /
      

      Each of my "gather" and "plot" scripts strive to be self-contained and distributable. My 'binaryop_dataset.gp' script above is an example of this. Unfortunately, this uses 'eval' a lot more than I would like.

       
      • Ethan Merritt

        Ethan Merritt - 2018-10-02

        Re: scoping variables

        In my opinion, scoping local variables should conform to normal programming language conventions.

        I am not aware of any general convention. Each language gets to choose its own poison.

        Declarations:

        In C there is already quite a mixture. Outside of curly brackets "static" has a scope of current file only, bare variable name has global scope. Inside curly brackets automatic variables have scope that descends into nested curly brackets but not into subroutine or function calls.

        By contrast in javascript automatic variables inside curly brackets have global scope unless marked otherwise.

        Julia uses keyword blocks rather than curly brackets. "for" blocks limit the scope of variables introduced in the block, but "if" and "begin" blocks do not.

        The point is, whatever model you can imagine has probably already been used by some language. I think we are better off looking at what is practical rather than what some particular other language has done.

        Subroutine arguments:

        In C you can pass variables into a subroutine either by value (the default) or by reference (using a pointer). In the former case the scope is not inherited, in the latter it is (sort of).

        In the old Fortran that I learned as a youngster variables were always passed by reference rather than by value so the scope was always inherited.

        In contrast to both of these, gnuplot deals with variables by name (rather than by reference or by value). If a subroutine contains a line C = A+B the evaluation stack looks into a big list of known variables to find ones named "A" "B" and "C". It is not clear to me how you would attach any scope at all to this. I'm sure it could be done somehow (separate variable lists for each possible scope? named scopes that are stored as part of the variable?) but it sounds complicated. Introduction of the ARGV[n] convention bypasses this as a special case but it only works for those 9 reserved variable names.

         
        • Mike Tegtmeyer

          Mike Tegtmeyer - 2018-10-02

          I am not aware of any general convention. Each language gets to choose its own poison.

          This is true, but the comment I was referring to concerned which variable is referenced when there are multiple in the load/eval stack.

          Suppose top level code does A=5 and calls (or loads or evaluates or whatever) SUB1
          SUB1 declares local A=7 (does not destroy caller's A) and calls SUB2
          If SUB2 refers to A, whose A does it get? The top level A or the SUB1 A?

          I'm sure there is some obscure language out there that would refer to the top-level A in your example, but I believe the vast majority would refer to SUB1's A.

          You are correct though that I didn't discuss function calls. A quick mental survey leads me to believe that most languages only inherit the global scope.

          The point is, whatever model you can imagine has probably already been used by some language. I think we are better off looking at what is practical rather than what some particular other language has done.

          One key thing, though, is that some language behavior in this regard is dictated by other factors. For example, it is true that the 'static' keyword limits a symbol's scope, but only to a compilation unit, and that has nothing to do with parsing. Try it: define a static variable and then include a file with the same variable name; your compiler will complain. gnuplot doesn't compile anything, so this entire class of behavior/issues doesn't apply. Said another way, I suspect that the problem space can be pruned significantly simply because gnuplot 1) doesn't compile anything, 2) is a runtime language, and 3) is not intended to be a general-purpose language.

          Likewise, in languages that support an "include/import"-style feature where 'including' works as if the imported file were copied verbatim, there is nothing special about how symbol scoping behaves during compilation/execution. I know of at least one language whose import feature essentially creates a new execution context with new symbol scoping. In terms of 'load'/'call', it seems gnuplot works mostly like the former. It could work like the latter if desired; the implementation is nearly the same given a full symbol-table implementation.

          There are other examples.

          So I agree, it is best to pick the behavior and features that make sense, but in my opinion I wouldn't stray too far or get too creative. Anything that breaks folks' mental model of tried-and-true behavior is going to annoy at best and go unused at worst.

          In C you can pass variables into a subroutine either by value (the default) or by reference (using a pointer). In the former case the scope is not inherited, in the latter it is (sort of).

          So pass by value vs. pass by reference is something I hadn't yet considered, but it is definitely an appropriate discussion. From a user's view, it seems to me that gnuplot is pass by value. That is, given:

          foo=42
          doit(a)=a*10;
          # prints 420
          print doit(foo)
          # prints 42
          print foo
          

          A few observations:

          • Pass by value simply means that a copy of the variable is made at the call site with a scope local to the function.
          • Pass by reference simply addresses efficiency issues with pass by value. That is, there is nothing that can be done using pass by reference that you can't do using pass by value, albeit with more copying.
          • C is 100% pass by value. It has the appearance of being pass by reference if you give a function a pointer, but in actuality all you are doing is passing a pointer by value. The original pointer and the copied pointer both point to the same location in memory. The called function cannot actually change the caller's pointer.

          I would recommend gnuplot be pass by value. The advantages are not having to introduce 'references' as in C++ or differentiate between when and where pass by value vs. pass by reference happens. Pass by reference could be added later if the efficiency gain justified it.

          It is not clear to me how you would attach any scope at all to this.

          Again, by having a symbol-table stack that mirrors your execution stack. See my pseudocode above. Unfortunately Wikipedia does not have a good article on symbol tables, but simply googling "symbol table" will provide some info. Most results are focused on compilation, but runtime languages are essentially the same.

          Can you point me to where this is currently handled in the source code and maybe I can get a better idea of the challenges?

           
  • Ethan Merritt

    Ethan Merritt - 2018-10-02

    It is not clear to me how you would attach any scope at all to this.

    Again, having a symbol table stack that mirrors your execution stack. See my pseudocode above.
    Can you point me to where this is currently handled in the source code and maybe I can get a better idea of the challenges?

    Yeah.
    On entry the program establishes a baseline environment and then calls SETJMP
    (plot.c:502). The program itself is executing from the usual stack. However program state variables are global; user-defined variables and functions are stored in linked lists managed by alloc/free. Error handling is funneled through common_error_exit (util.c:1180) which performs a limited set of cleanup operations on global state variables and then calls bail_to_command_line() which invokes LONGJMP to restore the original baseline environment and stack (plot.c: 257).

    The SETJMP/LONGJMP gets you back to a clean execution stack after an error. But there is no provision for unwinding any changes to the state variables or user-defined variables. And in general you would not want this anyhow. If you have typed a whole series of correct commands and then botch typing the next one, you don't want the resulting error message to wipe out all your previous work.

    User-defined variables are in general created and accessed via the routines add_udv_by_name (eval.c:748) and get_udv_by_name (eval.c:769). User-defined functions are managed via add_udf (parse.c: 1209). We don't currently have "by_name" variants for function access.

    The remaining relevant structure and access routines are lf_push / lf_pop (misc.c:449). This is probably closest to what you had in mind as an execution stack. But it is only used for "call" and "load", not for normal function evaluation. The saved state does not include anything to do with user variables, only the ARGV[n] values and some context information for the parser.

     
  • Ethan Merritt

    Ethan Merritt - 2022-10-21
    • status: open --> closed-accepted
     
  • Ethan Merritt

    Ethan Merritt - 2022-10-21

    All of this is now supported in the development version. See "function blocks".

     
