
#482 Function call blocks and local variable/function names

Status: closed-accepted
Owner: nobody
Priority: 5
Updated: 2022-10-21
Created: 2018-09-26
Private: No

gnuplot currently provides three general methods for supporting code modularization: function calls, 'load', and 'call'.

  1. The 'load' command reads the named file and executes its commands sequentially until end-of-file, then returns to the originating script and executes any subsequent commands. The general effect is similar to the #include "filename" feature in C and C-like languages: the contents of the loaded script are executed as if the commands were written verbatim at the point of the load command.
  2. The 'call' command is similar to the 'load' command except that it accepts arguments that are passed to the called script, where they are stored in the ARG[0-9] variables and an ARGV array. The ARG[0-9] and ARGV variables are local to the called script. That is, given two scripts, script1.gp (which calls script2.gp) and the command call 'script1.gp' 42, the ARG1 variable inside script1.gp will contain 42 both before and after script1.gp calls script2.gp, even if script2.gp receives a different ARG1 value.
  3. A function call conforms to the function-call behavior of most languages in that it can be named and referred to anywhere after its definition. It accepts arguments whose type is maintained across the call; that is, given the definition foo(x)=... and a call foo(42), the parameter x contains the number 42. Function parameters are local to the function and are not injected into the calling context: given a definition of foo(x)=..., the parameter x is not visible outside of the function call. Function calls are expressions in that a value is implicitly returned as the result of evaluating the call.

Each of these mechanisms has limitations that hinder generally accepted code-modularity practices.

1) Both the 'load' and 'call' commands allow the execution of arbitrarily complex commands; however, any names introduced during execution are visible in the parent context. That is, there is no way to define a local variable or function, nor are there return values. This means that using 'load' and 'call' always comes with the side effect of polluting or overwriting variables or functions in the parent namespace, whether welcome or not. It is possible to work around some of these limitations. For example, given call 'myscript.gp' 42 and the contents:

# preface local variables with '_[scriptname]_' to avoid name clashes
_myscript_foo=ARG1+1

# do something interesting with _myscript_foo

# 'return' value is stored in the variable name 'RETURNVAL'
RETURNVAL=_myscript_foo

undefine _myscript_foo

As long as local variables and functions are named to avoid clashes and are undefined at the end of the script, all is well, but this requires discipline and is error-prone. That is, it is a style convention rather than part of the scripting language.

Likewise, both 'load' and 'call' are commands and cannot be used in expressions. For example: x=call 'foo.gp' is invalid.

2) Function calls partially overcome the local-variable problem through their parameters, but arbitrarily complex operations cannot be performed inside a function call (for example, a command). The introduction of additional variables is not local to the function and can only be achieved via a serial operation. For example:

gnuplot> foo(x)=(bar=5*x,10*x)
gnuplot> print foo(10)
100
gnuplot> # bar is visible outside of the function
gnuplot> print bar
50

Proposed solutions:

I propose two solutions to the above limitation for discussion and consideration.

1) Introduction of the 'local' keyword that can be applied to variable and function definitions. For example:

local foo=42;
local bar(x)=...

Specifying the 'local' keyword indicates that the variable or function is only visible inside the current execution context (current script for example). By making locality explicit, users can choose whether or not a variable or function definition is injected into the global namespace. Therefore, users can choose when to adopt this feature or not at all.

This could be implemented in at least a couple of different ways; a typical call stack (optimal) or by gnuplot silently prefacing local names and their access with a context-unique identifier (less than ideal)--essentially a built-in version of the workaround above.

2) Introduction of a code block and 'return' keyword as part of a function call definition. For example:

foo(x,y,z) {
  # something interesting

  # optional return value
  return _somevalue_
}

By allowing a code block, arbitrary gnuplot commands could be executed inside of a function call to include variables definitions (potentially prefaced with the 'local' keyword as the code block would be its own execution context).

By introducing a 'return' keyword, the function could return arbitrary values to the calling context. If the return value is optional, it will require some changes to how function calls are evaluated. Currently it is not possible to create an "empty" function, that is, a function that has no value.

gnuplot> # nothing after the '='
gnuplot> foo(x)=
                ^
         function definition expected

Whereas if the return keyword were optional, the following would need to succeed:

gnuplot> foo(x) {}

But then fail depending on its use:

gnuplot> # invalid
gnuplot> bar = foo(42)
               ^
         value required (or something similar)
gnuplot> # OK
gnuplot> foo(42)

Similarly, by supporting both function styles, foo(x)=... and foo(x) {...}, users can choose when to adopt this feature, or not at all.

Advantages of the above suggestions.

In my opinion, increasing gnuplot's support for modular code will allow future plotting behavior to be implemented as part of a "standard library" rather than as internal functions. Additionally, this provides a mechanism for loading or calling scripts, or executing function blocks, that do not pollute the calling context unless desired (through omission of the 'local' keyword).

For example, the following separates the plotting of the data (which could include many, many style customizations, multiplotting, etc.) from the extraction of the data to be plotted. Granted, complex data manipulation is outside the scope of gnuplot, but simple extraction is not.

File "binaryop_dataset.gp"

# extract two columns of data by referencing the csv file columnnames
# and apply the given binary operand

binaryop_dataset(filename,colname1,colname2,binop) {
  set style data points
  set datafile separator ','
  set key autotitle columnhead

  eval(sprintf("local _binop(x,y)=%s(x,y)",binop))

  set xrange [*:*]
  set yrange [*:*]

  set table $BLOCK
  plot filename using 0:(_binop(column(colname1),column(colname2))) with table
  unset table

  return $BLOCK
}

File "myplot.gp"

# Plot the contents of 'datablock' according to the style arguments
# 'datablock' must have two columns, doesn't care where the
# data came from
myplot(datablock,...style_args...) {
  local my_local_var....
  # plot datablock according to style_args
}
gnuplot> multiply(x,y)=x*y
gnuplot> $FOOBLOCK=binaryop_dataset('_somefile','foocol','barcol','multiply')
gnuplot> # setup output terminal, styles, etc
gnuplot> load "myplot.gp"
gnuplot> myplot($FOOBLOCK,...)

The above is intended to start a discussion so I'm happy to change my mind about any of the above.

Discussion

  • Ethan Merritt

    Ethan Merritt - 2018-10-02

    Musing #1
    Leaving aside the question of how one might internally implement local variables, I think your overview fails to cover all the necessary scope decisions. Suppose
    top level code does A=5 and calls (or loads or evaluates or whatever) SUB1
    SUB1 declares local A=7 (does not destroy caller's A) and calls SUB2
    If SUB2 refers to A, whose A does it get? The top level A or the SUB1 A?
    That is, does the "local" property correspond to a scope that is "this level only" or "this level and below"?

    Musing #2
    Gnuplot is written in C, which doesn't provide the sort of automatic garbage collection you would want to clean up local variables when you error-exit out of the scope they were created in, or even to clean up on a normal exit. Right now all variables are stored in a single linked list. Would you start a new linked list for each scope? How do you ensure it is freed on all exit paths? Which is harder: writing a garbage-collection mechanism de novo, or porting to a language that already has one?

    Musing #3
    Although gnuplot has evolved towards being a full-blown programming language, I don't think it has gotten all the way there, and those final steps are non-trivial. Your last set of examples essentially introduces a lambda calculus, but does so using operations like "eval" that are hideously inefficient. Kind of perversely clever though, using the program to format new ASCII source code that it then executes on its own behalf. If we become serious about adding this I think it could be done some other way. For example, there could be new syntax for defining a binary operator so that the evaluation stack can use it directly rather than your sprintf+eval hack.

     
    • Mike Tegtmeyer

      Mike Tegtmeyer - 2018-10-02

      Re-musing #1:

      In my opinion, the scoping of local variables should conform to normal programming-language conventions. That is, given a variable definition local A, the variable should be visible in the current and any nested context, whereas an undecorated definition of A explicitly places the variable in the global context, visible everywhere. A reference to A should resolve to the most immediately scoped definition of A.

      Re-musing #2:

      I understand that the phrase 'garbage collection' can mean different things depending on context and to different people, but I would specifically not recommend implementing local-variable cleanup using 'garbage collection' in the classic sense of the phrase. Typically, a language that supports garbage collection means that you can forget/lose/ignore your reference to some chunk of memory, and some mechanism later (or in a different thread) sweeps your address space or allocation storage looking for unreferenced memory, performs any cleanup, and then marks it as available for reallocation. For typical language parsing, this behavior is not needed; it is instead handled through a call stack and a symbol table. For the examples below, I'll use C or a C-like syntax. Many of these steps are typically handled during compilation, but a straightforward implementation of a dynamic language basically does the same thing.

      You may be very familiar with the concepts below so I apologize in advance but at a minimum it will give us examples for further discussion.

      A straightforward implementation of a typical programming language uses a 'symbol table' to manage the bookkeeping. This is a stack (in the Comp Sci sense of the word) where each element is a table containing the current symbols in the call stack. This table contains a mapping from a name (eg a string) to the necessary details of the symbol itself (what type it is, current value, etc). Each time a new scoping context is created, a new (potentially empty) symbol table is pushed onto the stack (created) and then popped when the scope is exited (destroyed/cleaned up). So to answer your question, cleanup happens when the scope exits. It is a well-defined point for cleanup of any local variables and allocated memory. Variable access is a straightforward lookup for the first variable with the name starting at the top of the stack.

      I am not familiar with the internals of gnuplot, so this pseudocode may be overly simplistic, but suppose the call stack takes lists of structures containing the mapping:

      /*
        assume 'definition_type' is an enumeration of 'STRING', 'NUMBER',
        'ARRAY', 'FUNCTION', 'UNDEFINED', whatever.
        'value' is the current value, void * for simplicity.
      */
      typedef struct symbol {
        char *name;
        definition_type type;
        void *value;
        struct symbol *next;
      } symbol;
      

      Then the stack simply contains the head of the lists:

      typedef struct call_stack {
        symbol *head;
        struct call_stack *parent;
      } call_stack;
      

      Define the following functions:

      /*
        'stack' contains the top of the stack. malloc a new call_stack
        element containing 'head' and set 'parent' to 'stack'. Return the
        new top of the stack.
      */
      call_stack * push(call_stack *stack);

      /*
        'stack' contains the top of the stack. Cleanup/deallocate the list
        pointed to by 'stack->head'. Dealloc 'stack' and return 'stack->parent'
        as the new top of the stack.
      */
      call_stack * pop(call_stack *stack);

      /*
        allocate a new symbol structure, copy 'name' to 'symbol->name', set
        'symbol->type' to 'type'. Add the new symbol to the list pointed to
        by 'head' and return the new symbol.
      */
      symbol * make_symbol(char *name, definition_type type, symbol *head);

      /*
        traverse the list pointed to by 'top->head' looking for 'name'. If it
        does not exist, move on to 'top->parent' and so forth until either
        the symbol is found or the bottom of the stack is reached
        (parent == NULL or "not found"). Return the result.
      */
      symbol * lookup_symbol(char *name, call_stack *top);
      

      Then suppose you have the following (pseudo C-ish + gnuplot)

      # global scope
      {
        # scope 1
        local A;
        B;
      
        A = 5;
      
        {
          # scope 2
          local A;
      
          B = 21;
          A = 42;
        }
      }
      

      Suppose the top of the call stack is declared as follows. Keep track of the global context.

      /*
        Init call stack
      */
      call_stack *current_context;
      current_context = push(NULL);
      call_stack *global_context;
      global_context = current_context;
      

      Prior to entering scope 1, the call stack 'current_context' contains one element--the symbol table for all the global symbols. On entry to 'scope 1', a new element is added:

      current_context = push(current_context);
      

      'local A' is parsed, add it to the symbol table of the current context

      A_sym = make_symbol("A",UNDEFINED,current_context->head);
      

      'B' is parsed, it is injected into the global symbol table since it omits the 'local' keyword.

      B_sym = make_symbol("B",UNDEFINED,global_context->head);
      

      'A = 5' is parsed. Lookup a symbol named 'A' starting in the current context. Finds "A" in the current context

      sym = lookup_symbol("A",current_context);
      

      Enter scope 2, push a new context onto the call stack

      current_context = push(current_context);
      

      'local A' is parsed, add it to the symbol table of the current context. There are two symbols named "A" now. One in this context and another in the parent.

      A_sym = make_symbol("A",UNDEFINED,current_context->head);
      

      'B = 21' is parsed. Lookup a symbol named 'B' starting in the current context. Ultimately finds "B" in the global context.

      'A = 42' is parsed. Lookup a symbol named 'A' starting in the current context. Finds the symbol named "A" in the current context, not the one in the parent context.

      Exit scope 2: pop the current context, deallocating the symbols in its table in the process. The new current context is scope 1.

      Exit scope 1: pop the current context, deallocating the symbols in its table in the process. The new current context is the global context.

      I apologize again if I am telling something that you already know--I did compiler research in my grad student days :)

      Re-musing #3

      I expected that the use of 'eval' is inefficient and that some other mechanism should probably be used, but I didn't want to throw too much at you and the feature request at once. One solution is embedded in the above and in our other discussions: treating a function name as just another symbol that can be passed around. For example, a function name could be passed as a function parameter:

      foo(x,y)=x*y
      bar(x,y,binop)=binop(x,y)
      

      That is, the value of the 'binop' argument is a symbol, and that symbol is preserved across the function call.

      Creating programming languages is tricky and often done for the wrong reasons, for example "I just don't like the syntax...". Programming languages are also used to do things that are way outside their intended usage. So ultimately I am not necessarily advocating that gnuplot evolve into a full programming language for its own sake, but rather that it address (I believe) a reasonable pattern of use.

      I think a reasonable goal would be to evolve to the point where a plot script (or function) can be standalone and called upon without being a built-in. In the "old days" it seemed the pattern was "one script = one plot", and any new major plotting behavior relied on being implemented as a built-in. To me, gnuplot would gain a lot of advantages by moving away from this. Any large-scale data and analysis workflow that involves the plotting of many figures ultimately begs for modularization.

      Suppose someone created a very custom plot; what would gnuplot need for that person to give it to someone else to use? The word 'use' is important here because 'use' doesn't require that the recipient understand all the internal details. To me, the following would be required:

      1) A clear and concise interface. Recipients only interact with this interface.

      2) Any variables or other symbols internal to the script don't overwrite or pollute the user's namespace (i.e. our local-variable discussion).

      3) Upon exit, the current "plotting/graphics/settings" state is reasonable (where the caller left it?). Discussion for another thread, I suspect...

      4) Data gathering that is distinctly separate from data usage. Again, I agree that complex data manipulation is not the purpose of gnuplot, but I do believe that semi-complex "data gathering" is. If "data gathering" is manipulations inside of a 'plot using' statement, then someone receiving a plot script must go into the script, understand it, and change the 'using' statement appropriately based on how their input data is structured. This is bad, in my opinion.

      The current way I handle the separation of data gathering and data use is through datablocks. That is, if I create a custom plot that requires X columns of data, I make a script that requires a datablock with X columns in it. This is "data usage" in that the plot script doesn't care how it was gathered, as long as it receives a datablock with the correct number of columns. I likewise have generic scripts that gather data from tables, 'csv's, etc, perform simple manipulation, and output them into the appropriate datablock. This is "data gathering" that doesn't care how the data will ultimately be used.

      Use is simply a wrapper script that executes:

      ASCII      \
      csv         -> "gather" script -> $DATABLOCK -> "plot" script -> output
      functions  /
      

      Each of my "gather" and "plot" scripts strive to be self-contained and distributable. My 'binaryop_dataset.gp' script above is an example of this. Unfortunately, this uses 'eval' a lot more than I would like.

       
      • Ethan Merritt

        Ethan Merritt - 2018-10-02

        Re: scoping variables

        In my opinion, scoping local variables should conform to normal programming language conventions.

        I am not aware of any general convention. Each language gets to choose its own poison.

        Declarations:

        In C there is already quite a mixture. Outside of curly brackets "static" has a scope of current file only, bare variable name has global scope. Inside curly brackets automatic variables have scope that descends into nested curly brackets but not into subroutine or function calls.

        By contrast in javascript automatic variables inside curly brackets have global scope unless marked otherwise.

        Julia uses keyword blocks rather than curly brackets. "for" blocks limit the scope of variables introduced in the block, but "if" and "begin" blocks do not.

        The point is, whatever model you can imagine has probably already been used by some language. I think we are better off looking at what is practical rather than what some particular other language has done.

        Subroutine arguments:

        In C you can pass variables into a subroutine either by value (the default) or by reference (using a pointer). In the former case the scope is not inherited, in the latter it is (sort of).

        In the old Fortran that I learned as a youngster variables were always passed by reference rather than by value so the scope was always inherited.

        In contrast to both of these, gnuplot deals with variables by name (rather than by reference or by value). If a subroutine contains a line C = A+B the evaluation stack looks into a big list of known variables to find ones named "A" "B" and "C". It is not clear to me how you would attach any scope at all to this. I'm sure it could be done somehow (separate variable lists for each possible scope? named scopes that are stored as part of the variable?) but it sounds complicated. Introduction of the ARGV[n] convention bypasses this as a special case but it only works for those 9 reserved variable names.

         
        • Mike Tegtmeyer

          Mike Tegtmeyer - 2018-10-02

          I am not aware of any general convention. Each language gets to choose its own poison.

          This is true, but the comment I was referring to concerned which variable is referenced when there are multiple in the load/eval stack.

          Suppose top level code does A=5 and calls (or loads or evaluates or whatever) SUB1
          SUB1 declares local A=7 (does not destroy caller's A) and calls SUB2
          If SUB2 refers to A, whose A does it get? The top level A or the SUB1 A?

          I'm sure there is some obscure language out there that would refer to the top-level A in your example, but I believe the vast majority would refer to SUB1's A.

          You are correct though that I didn't discuss function calls. A quick mental survey leads me to believe that most languages only inherit the global scope.

          The point is, whatever model you can imagine has probably already been used by some language. I think we are better off looking at what is practical rather than what some particular other language has done.

          One key thing, though, is that some language behavior in this regard is dictated by other factors. For example, it is true that the 'static' keyword limits a symbol's scope, but only to a compilation unit, and that has nothing to do with parsing. Try it: define a static variable and then include a file with the same variable name; your compiler will complain. gnuplot doesn't compile anything, so this entire class of behavior/issues doesn't apply. Said another way, I suspect that the problem space can be pruned significantly simply because gnuplot 1) doesn't compile anything, 2) is a runtime language, and 3) is not intended to be a general-purpose language.

          Likewise, in languages that support an "include/import"-style feature where 'including' works as if the imported file were copied verbatim, there is nothing special about how symbol scoping behaves during compilation/execution. I know of at least one language whose import feature essentially creates a new execution context with new symbol scoping. In terms of 'load'/'call', it seems gnuplot works mostly like the former. It could work like the latter if desired; the implementation is nearly the same given a full symbol-table implementation.

          There are other examples.

          So I agree, it is best to pick the behavior and features that make sense, but in my opinion I wouldn't stray too far or get too creative. Anything that breaks folks' mental model of tried-and-true behavior is going to annoy at best and go unused at worst.

          In C you can pass variables into a subroutine either by value (the default) or by reference (using a pointer). In the former case the scope is not inherited, in the latter it is (sort of).

          So pass by value vs. pass by reference is something I hadn't yet considered, but it is definitely an appropriate discussion. From a user's view, it seems to me that gnuplot is pass by value. That is, given:

          foo=42
          doit(a)=a*10;
          # prints 420
          print doit(foo)
          # prints 42
          print foo
          

          A few observations:

          • Pass by value simply means that a copy of the variable is made at the call site with a scope local to the function.
          • Pass by reference simply addresses efficiency issues with pass by value. That is, there is nothing that can be done using pass by reference that you can't do using pass by value, albeit with more copying.
          • C is 100% pass by value. It has the appearance of being pass by reference if you give a function a pointer, but in actuality all you are doing is passing a pointer by value. The original pointer and the copied pointer both point to the same location in memory. The called function cannot actually change the caller's pointer.

          I would recommend gnuplot be pass by value. The advantages are not having to introduce 'references' as in C++ or differentiate between when and where pass by value vs. pass by reference happens. Pass by reference could be added later if the efficiency gain justified it.

          It is not clear to me how you would attach any scope at all to this.

          Again, by having a symbol-table stack that mirrors your execution stack. See my pseudocode above. Unfortunately Wikipedia does not have a good article on symbol tables, but simply googling "symbol table" will provide some info. Most results are focused on compilation, but runtime languages are essentially the same.

          Can you point me to where this is currently handled in the source code and maybe I can get a better idea of the challenges?

           
  • Ethan Merritt

    Ethan Merritt - 2018-10-02

    It is not clear to me how you would attach any scope at all to this.

    Again, having a symbol table stack that mirrors your execution stack. See my pseudocode above.
    Can you point me to where this is currently handled in the source code and maybe I can get a better idea of the challenges?

    Yeah.
    On entry the program establishes a baseline environment and then calls SETJMP
    (plot.c:502). The program itself is executing from the usual stack. However program state variables are global; user-defined variables and functions are stored in linked lists managed by alloc/free. Error handling is funneled through common_error_exit (util.c:1180) which performs a limited set of cleanup operations on global state variables and then calls bail_to_command_line() which invokes LONGJMP to restore the original baseline environment and stack (plot.c: 257).

    The SETJMP/LONGJMP gets you back to a clean execution stack after an error. But there is no provision for unwinding any changes to the state variables or user-defined variables. And in general you would not want this anyhow. If you have typed a whole series of correct commands and then botch typing the next one, you don't want the resulting error message to wipe out all your previous work.

    User-defined variables are in general created and accessed via the routines add_udv_by_name (eval.c:748) and get_udv_by_name (eval.c:769). User-defined functions are managed via add_udf (parse.c: 1209). We don't currently have "by_name" variants for function access.

    The remaining relevant structure and access routines are lf_push / lf_pop (misc.c:449). This is probably closest to what you had in mind as an execution stack. But it is only used for "call" and "load", not for normal function evaluation. The saved state does not include anything to do with user variables, only the ARGV[n] values and some context information for the parser.

     
  • Ethan Merritt

    Ethan Merritt - 2022-10-21
    • status: open --> closed-accepted
     
  • Ethan Merritt

    Ethan Merritt - 2022-10-21

    All of this is now supported in the development version. See "function blocks".

     
