#542 findword function

Milestone: None
Status: closed-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-03-16
Created: 2022-08-11
Private: No

There are 'word' and 'words' functions that tokenize strings and extract tokens. Add a 'findword' function that finds a token in a string and returns its index, or 0 when it is not found.

This function would make it easy to handle the 'string value to color/position' and similar use cases. Today I do it in a very ugly way (not the full code):

# Collect unique values of a column in one string
addToList(list,col) = list.( strstrt(list,' "'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')

# Classes
Classes=''
stats $FILE u 1:(Classes=addToList(Classes,1)) nooutput
array Classes_idx[strlen(Classes)]
i=1
# Array of positions of substrings in a string
do for [Class in Classes] {
    Classes_idx[strstrt(Classes,Class)]=i
    i=i+1
}
# helper function for Classes_idx - returns index or NaN
c_idx_n(col, ii)=(Classes_idx[strstrt(Classes,strcol(col))] == ii ? ii : NaN)

...
# n_idx is doing same for Y position

 plot for [ii = 1:words(Classes)] $FILE \
       u 3:(n_idx(2)):3:4:(n_idx(2)):(n_idx(2)+0.95):(c_idx_n(1, ii)) \
       w boxxyerror fs solid lc var title word(Classes, ii), \

Discussion

  • Ethan Merritt

    Ethan Merritt - 2022-08-11

    I have in mind an alternative approach, based on a proof-of-principle implementation in a private branch of the development source where I have been playing with a larger set of possible array operators. These include array operations "split" and "join" analogous to those in other scripting languages.

    Brief summary

    array A = split("original string", "separator")
           - e.g. split("Aa Bb C", " ") produces array A = ["Aa", "Bb", "C"]
           - perl has a third parameter that limits the number of resulting pieces
           - perl also treats separator as a regexp, not just a character
    
    join( A, "sep" [, "format"] )
           - inverse operation to split(). Joins string array elements into a single string alternating with separator. The format, if present, would convert numerical array elements to strings.
           - join( A, "") is a pure "cat" operation
           - join( A, ";", "%.3g" ) could be used to generate lines in a csv file from a numerical array
    

    To get the index of a word "Target" in a string you would then be able to do

    A = split( String, " ")
    i = index(A, "Target")
    

    Larger context
    The split and join operations themselves do not seem particularly problematic, although the exact syntax is up for discussion. My enthusiasm for a larger set of possible array operations has foundered on uncertainty about a fundamental decision. There is a basic issue here of whether the order of elements in an array is considered immutable. Algorithms that use a pair of arrays to associate two properties depend on the order remaining fixed. union/intersection/sort operators would break this.

        A = sort(A)                     # sort the defined entries, collapse size
        C = union(A,B)                  # A ∪ B
        D = intersection(A,B)           # A ∩ B
    
        E = insert(E,<value>)
        push(A,<value>)                 # insert new value at the end
        value = pop(A)                  # return last element and remove it
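
    The ordering concern can be illustrated with a minimal sketch that uses only array syntax already available in gnuplot. The array names and the manual swap are illustrative assumptions; the swap stands in for what a hypothetical sort() applied to only one of the two arrays would do.

    ```gnuplot
    # Two parallel arrays: element i of Ship pairs with element i of Class.
    array Ship[3]  = ["Unicorn", "Belfast", "Charity"]
    array Class[3] = ["Aircraft Carrier", "Cruiser", "Destroyer"]
    n = |Ship|

    # The association is carried only by the shared index.
    do for [i=1:n] { print Ship[i], " -> ", Class[i] }

    # Reorder only one of the arrays (done by hand here).
    tmp = Ship[1]; Ship[1] = Ship[2]; Ship[2] = tmp

    # The same loop now pairs "Belfast" with "Aircraft Carrier".
    do for [i=1:n] { print Ship[i], " -> ", Class[i] }
    ```

    Any operator that reorders or collapses one array would need a way to carry the partner array along with it, or the pairing is silently lost.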
    

    Comments and suggestions welcome.

     
  • Piotr Winiarczyk

    My unfinished code is below. I was trying to make it simpler by using data sets and index. It did not work because I cannot use an "index name" as the title for a data set.
    From my perspective, a Gnuplot script is hard to run without additional parameters (ARGs). For example, reading points and the title from the same data file is hard.
    One data set in a file could be a key-value definition of variables, while the following data sets in the file would be the data to plot. ARGs would then not be needed, and the script plus the data file would recreate a figure. Storing the ARGs, script and data triplet is less convenient than storing just a script and a data file.
    Anything that allows reading variables, arrays or strings and using them in a script is a good choice.
    The array approach is better than string hacking. In my case "push" and "index" would probably make the code much simpler (push unique 'class' or 'name' values to an array, then use the 'name' array as Y labels and the 'class' array as key titles and colors).

    reset session
    
    $FILE << EOD
    Class   Name    Start date  End date
    Naval Headquarters  Ladybird    1 September 1950    30 April 1953
    Naval Headquarters  Tyne    1 April 1953    31 July 1953
    Aircraft Carrier    Unicorn 1 July 1950 31 July 1953
    Aircraft Carrier    Triumph 1 July 1950 30 September 1950
    Aircraft Carrier    Theseus 1 October 1950  30 April 1951
    Aircraft Carrier    Glory   1 April 1951    30 September 1951
    Aircraft Carrier    Glory   1 January 1952  30 September 1952
    Aircraft Carrier    Glory   1 November 1952 31 May 1953
    Aircraft Carrier    Ocean   1 May 1952  31 October 1952
    Aircraft Carrier    Ocean   1 May 1953  31 July 1953
    Cruiser Belfast 1 July 1950 31 August 1950
    Cruiser Belfast 1 January 1951  30 September 1952
    Cruiser Jamaica 1 June 1950 31 October 1950
    Cruiser Kenya   1 July 1950 31 August 1952
    Cruiser Ceylon  1 August 1950   31 July 1952
    Cruiser Newcastle   1 July 1952 31 July 1953
    Cruiser Birmingham  1 September 1952    31 July 1953
    Destroyer   Charity 1 July 1950 31 January 1951
    Destroyer   Charity 1 July 1951 30 September 1951
    Destroyer   Charity 1 December 1951 31 March 1952
    Destroyer   Charity 1 August 1952   30 November 1952
    Destroyer   Charity 1 February 1953 30 April 1953
    Destroyer   Charity 1 June 1953 31 July 1953
    Destroyer   Cockade 1 July 1950 30 November 1950
    Destroyer   Cockade 1 March 1951    31 August 1951
    Destroyer   Cockade 1 October 1951  31 December 1951
    Destroyer   Cockade 1 January 1952  31 March 1952
    Destroyer   Cockade 1 December 1952 28 February 1953
    Destroyer   Cockade 1 April 1953    31 July 1953
    Destroyer   Comus   1 July 1950 30 November 1950
    Destroyer   Comus   1 March 1951    31 August 1951
    Destroyer   Comus   1 October 1951  31 December 1951
    Destroyer   Comus   1 May 1952  30 September 1952
    Destroyer   Comus   1 November 1952 28 February 1953
    Destroyer   Concord 1 September 1950    31 January 1951
    Destroyer   Concord 1 April 1951    31 May 1951
    Destroyer   Concord 1 August 1951   30 November 1951
    Destroyer   Concord 1 January 1952  30 April 1952
    Destroyer   Concord 1 July 1952 31 August 1952
    Destroyer   Concord 1 May 1953  31 July 1953
    Destroyer   Consort 1 June 1950 30 April 1951
    Destroyer   Consort 1 June 1951 30 September 1951
    Destroyer   Consort 1 May 1952  31 August 1952
    Destroyer   Consort 1 November 1952 28 February 1953
    Destroyer   Consort 1 March 1953    31 May 1953
    Destroyer   Constance   1 October 1950  31 March 1951
    Destroyer   Constance   1 June 1951 31 July 1951
    Destroyer   Constance   1 November 1951 28 February 1952
    Destroyer   Constance   1 June 1952 31 December 1952
    Destroyer   Cossack 1 June 1950 31 October 1951
    Destroyer   Cossack 1 February 1952 31 May 1952
    Destroyer   Cossack 1 July 1952 31 July 1952
    Destroyer   Cossack 1 September 1952    31 January 1953
    Destroyer   Cossack 1 May 1953  31 July 1953
    Frigate Alacrity    1 June 1950 31 August 1950
    Frigate Alacrity    1 February 1951 30 June 1951
    Frigate Alacrity    1 December 1951 28 February 1952
    Frigate Alert   1 August 1950   31 October 1950
    Frigate Alert   1 October 1951  31 October 1951
    Frigate Amethyst    1 February 1951 30 June 1951
    Frigate Amethyst    1 September 1951    31 January 1952
    Frigate Amethyst    1 April 1952    31 July 1952
    Frigate Black Swan  1 June 1950 31 August 1950
    Frigate Black Swan  1 February 1951 30 June 1951
    Frigate Black Swan  1 September 1951    30 November 1951
    Frigate Cardigan Bay    1 November 1950 31 January 1951
    Frigate Cardigan Bay    1 June 1951 30 September 1951
    Frigate Cardigan Bay    1 January 1952  30 April 1952
    Frigate Cardigan Bay    1 June 1952 30 September 1952
    Frigate Cardigan Bay    1 January 1953  31 July 1953
    Frigate Crane   1 March 1952    30 June 1952
    Frigate Crane   1 August 1952   30 September 1952
    Frigate Crane   1 November 1952 31 March 1953
    Frigate Crane   1 July 1953 31 July 1953
    Frigate Hart    1 June 1950 31 August 1950
    Frigate Hart    1 February 1951 31 March 1951
    Frigate Modeste 1 April 1953    30 June 1953
    Frigate Morecambe Bay   1 October 1950  31 January 1951
    Frigate Morecambe Bay   1 June 1951 30 September 1951
    Frigate Morecambe Bay   1 March 1952    31 May 1952
    Frigate Morecambe Bay   1 August 1952   30 November 1952
    Frigate Morecambe Bay   1 May 1953  31 July 1953
    Frigate Mounts Bay  1 August 1950   30 November 1950
    Frigate Mounts Bay  1 December 1950 31 January 1951
    Frigate Mounts Bay  1 June 1951 30 September 1951
    Frigate Mounts Bay  1 December 1951 30 April 1952
    Frigate Mounts Bay  1 June 1952 31 October 1952
    Frigate Mounts Bay  1 March 1953    30 June 1953
    Frigate Opossum 1 November 1952 30 April 1953
    Frigate St Bride's Bay  1 December 1950 31 January 1951
    Frigate St Bride's Bay  1 August 1951   31 December 1951
    Frigate St Bride's Bay  1 July 1952 31 October 1952
    Frigate St Bride's Bay  1 April 1953    30 June 1953
    Frigate Sparrow 1 December 1952 28 February 1953
    Frigate Sparrow 1 April 1953    30 June 1953
    Frigate Whitesand Bay   1 August 1950   31 December 1950
    Frigate Whitesand Bay   1 June 1951 31 July 1951
    Frigate Whitesand Bay   1 October 1951  28 February 1952
    Frigate Whitesand Bay   1 April 1953    31 July 1953
    EOD
    
    set title "Royal Fleet deployment in Korea war"
    
    set encoding utf8
    set datafile separator tab
    set datafile columnheaders 
    set datafile missing "NaN"
    
    # helper func
    addToList(list,col) = list.( strstrt(list,' "'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')
    
    # Classes
    Classes=''
    stats $FILE u (Classes=addToList(Classes,1)) nooutput
    array Classes_idx[strlen(Classes)]
    i=1
    do for [Class in Classes] {
        Classes_idx[strstrt(Classes,Class)]=i
        i=i+1
    }
    
    c_idx_n(col, i)=(Classes_idx[strstrt(Classes,strcol(col))] == i ? i : NaN)
    
    # Names
    Names=''
    stats $FILE u (Names=addToList(Names,2))  nooutput
    array Names_idx[strlen(Names)]
    i=1
    do for [Name in Names] {
        Names_idx[strstrt(Names,Name)]=words(Names)-i
        i=i+1
    }
    
    n_idx(col)=Names_idx[strstrt(Names,strcol(col))]
    
    # Axes
    set xdata time
    set timefmt '%d %B %Y'
    set format x '%b %Y'
    set xrange ['1 May 1950': '31 July 1953']
    set lmargin 12
    
    # Tics
    set xtics out nomirror
    unset ytics
    set border 1
    
    # Key
    set key outside
    set rmargin 25
    
    plot for [i = 1:words(Classes)] $FILE \
          u 3:(n_idx(2)):3:4:(n_idx(2)):(n_idx(2)+0.95):(c_idx_n(1,i)) \
          w boxxyerror fs solid lc var title word(Classes,i) , \
         $FILE \
          u ("15 May 1950"):(n_idx(2)+0.5):2 \
          w labels right notitle 
    
     
  • Ethan Merritt

    Ethan Merritt - 2022-08-13

    It doesn't work for me to cut-and-paste your data sample because it does not preserve tabs. Add one attachment for the data and another for the script?

    Also, and perhaps most important - could you attach a figure showing what you want your final plot to look like? A hand-drawn sketch or a link to someone else's plot would be fine. You are obviously far along a particular path but it may be more productive for me to step back and see if there is a simpler path, once I know where you're going.

     
  • theozh

    theozh - 2022-08-13

    Such a "findword()" function would be nice. In your case you don't necessarily need extra arrays for that. For this type of recurrent task, I guess a hash or dictionary would be the desired feature.

    Although, you can create a lookup or hash table by misusing the sum() function.

    Check the two links which are similar to your task:
    https://stackoverflow.com/a/72289393/7295599
    https://stackoverflow.com/a/67710390/7295599
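
    A minimal sketch of the sum() trick, assuming the tokens in the list are unique and contain no spaces. The function name findword and the sample list are illustrative assumptions, not an existing gnuplot built-in.

    ```gnuplot
    # findword(s, list): 1-based position of token s in the space-separated
    # string list, or 0 when s is not present.
    # Each term of the sum is _i for a matching word and 0 otherwise, so with
    # unique tokens the total equals the index of the single match.
    findword(s, list) = (sum [_i=1:words(list)] (s eq word(list, _i) ? _i : 0))

    Classes = "Cruiser Destroyer Frigate"
    print findword("Destroyer", Classes)    # "Destroyer" is the second token
    print findword("Submarine", Classes)    # no match, every term is zero
    ```

    If a token occurs more than once, the matching indices add up, so the result is only meaningful for lists of unique values.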

     
    • Piotr Winiarczyk

      Thanks for the simpler solution.

      Mapping strings read from columns to integers (indexes) is a very common feature in other drawing software. IMO Gnuplot should have a demo file for it and simple functions to obtain this result. Today you need to use external tools or be as creative as theozh is.

      For a standard user it is hard to find information on how to do anything outside the demo files, so having a variety of demo files is important. Maybe theozh can contribute some plots to the demo section?

       
  • Piotr Winiarczyk

    Gnuplot file.

     
  • Piotr Winiarczyk

    rawgraphs.io version of the figure - not yet fully translated.
    It took around 30 minutes to find out how to do it. The key is automatically sorted and I don't like this particular feature.

     
  • Piotr Winiarczyk

    Gnuplot version.
    It works but the code is ugly due to lack of an equivalent of a "findword()" function.
    The learning curve for Gnuplot is steep for anything that is not in demos.

     
  • Piotr Winiarczyk

    After some refining the code is smaller. Using strings to draw such a plot still seems inappropriate. Using arrays seems a more natural way for a person with basic programming skills.

    # helper functions
    addToList(list,col) = list.( strstrt(list,' "'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')
    Lookup(s,str) = (sum [_i=1:words(str)] (s eq word(str,_i)) ? _i : 0)
    
    # Classes
    Classes=''
    stats $FILE u (Classes=addToList(Classes,1)) nooutput
    c_idx_n(col, i)=(Lookup(strcol(col),Classes) == i ? i : NaN )
    
    # Names
    Names=''
    stats $FILE u (Names=addToList(Names,2)) nooutput
    n_idx(col)=(words(Names) - Lookup(strcol(col),Names) )
    

    There are 'xticlabels' columns that do something similar: they build a value-to-string array. Maybe something like this idea can be added (uniq is a new function working in the spirit of xticlabels):

    stats $FILE u (uniq(A,1)) nooutput

    This will produce an A array with unique values from column 1. With an addition of

    index(A, "Target")

    function, it will be much easier to find out how to handle categories in Gnuplot.

    In my case the first column of my data could be removed if I were able to use the name of the

    index "<name>"

    as defined by a multi-data-set comment. Something like:

    title index(i)

    This would also simplify the c_idx_n function, since the NaN case would not be needed. Right now I need to draw four times and use the NaN from c_idx_n to build the right key. Having title index(i) would turn the drawing into an iteration over individual data sets.

    To sum up: anything that makes arrays more useful (loading from data sets, finding values; regexp would be a dream), together with a demo of how to use those new array functions, will make Gnuplot better and easier to use.

     
  • Ethan Merritt

    Ethan Merritt - 2022-08-14

    I think that in practice the standard linux approach would be to use universal tools like awk and uniq to pre-process the data. But I can sympathize with not feeling comfortable with that, since I myself have never bothered to learn awk. On my own I would probably tackle this whole thing in a perl script and call gnuplot from inside it. But there again I am sympathetic to not being already familiar with perl.

    So here is how I would approach it using only gnuplot and only syntax already available in version 5.4.

    # Create a table with one line for each ship name.
    # The name is in column 3 of each line.
    # The first 5 lines of the table are header records.
    # The last line is blank.
    # Therefore the total number of unique names is |$Nametable|-6
    #
    set datafile separator tab
    Name = ""
    i = 0
    set table $Nametable
    plot 'korea.dat' using (i):(strcol(2) eq Name ? NaN : i=i+1):(Name = strcol(2)) with labels
    unset table
    
    # Now make an array of names
    names = |$Nametable| - 6
    array Names[names]
    do for [i=1:names] { Names[i] = word( $Nametable[i+5], 3) }
    print "Names = ", Names
    
    # Same procedure to create an array of classes
    Class = ""
    i = 0
    set table $Classtable
    plot 'korea.dat' using (i):(strcol(1) eq Class ? NaN : i=i+1):(Class = strcol(1)) with labels
    unset table
    classes = |$Classtable| - 6
    array Classes[classes]
    do for [i=1:classes] { Classes[i] = word( $Classtable[i+5], 3) }
    print "Classes = ", Classes
    

    Running this script gives

    [~/temp] gnuplot foo.gp
    Names = ["Ladybird","Tyne","Unicorn","Triumph","Theseus","Glory","Ocean ","Belfast","Jamaica","Kenya","Ceylon","Newcastle","Birmingham","Charity","Cockade","Comus","Concord","Consort","Constance","Cossack","Alacrity","Alert","Amethyst","Black Swan","Cardigan Bay","Crane","Hart","Modeste","Morecambe Bay","Mounts Bay","Opossum","St Bride's Bay","Sparrow","Whitesand Bay"]
    
    Classes =  ["Naval Headquarters","Aircraft Carrier","Cruiser","Destroyer","Frigate"]
    

    At that point you can retrieve indices directly from the array, e.g. i = index(Classes, "Destroyer")

    I didn't try to convert your full script, but maybe that starting point is helpful.

     
    • Ethan Merritt

      Ethan Merritt - 2022-08-15

      And here's a simpler version using syntax that is in version 5.5. In fact it is also in 5.4 although it is marked EXPERIMENTAL. This does away with the extra columns and the blank lines in the intermediate table by using the syntax plot ... with table if (condition)

      # Create a table with one line for each ship name.
      # The total number of unique names is then |$Nametable|
      
      set datafile columnheaders
      set datafile separator tab
      Name = ""
      set table $Nametable
      plot 'korea.dat' using (strcol(2)) with table if (strcol(2) ne Name) && (Name = strcol(2),1)
      unset table
      
      # Now make an array of names
      names = |$Nametable|
      array Names[names]
      do for [i=1:names] { Names[i] = $Nametable[i] }
      
      print "Names = ", Names
      
       
  • Piotr Winiarczyk

    Thank you for the solutions. While arrays are conceptually the proper objects for this problem, using them with Gnuplot is not convenient. Look at the Korea2.gp file.
    There are just two helper functions defined, and thanks to the flexibility of string objects one line is enough to load the data. The string version also has the advantage that the input does not need to be sorted before loading, due to its "hash"-like behavior.
    The helper functions are quite a hack. I really appreciate the imagination of the author in using the sum function for index finding.
    The point is that both solutions are complicated. Look at https://www.rawgraphs.io/ to see how easy it is to load data and use unique values as colors.
    Why not extend the syntax of stats a bit and expose to the user an array with the result of an operation similar to "labels" (it could be named uniqlabels and the array name could be STATS_uniq_labels)?

    Anyway, the idea of adding more functions to arrays (as you presented it at the beginning of this ticket) is a good step forward.

     
  • Ethan Merritt

    Ethan Merritt - 2022-08-16

    Could you give a more complete description or pointer to the mechanism you like in rawgraphs.io? I looked on that web site and did not find anything relevant.

    Of course gnuplot also suffers difficulties in finding potentially useful features. A large part of this is because a feature added to satisfy a particular task the developer had in mind may also be relevant for applications they didn't even think of... which is great, but those unthought-of applications are obviously not listed in an index or provided with demos or examples.

    For instance stats already does something potentially relevant to the sort of string processing you are interested in. set datafile columnheaders; stats "FOO.dat" name "FOO" will in addition to the usual numerical analysis of the contents of file FOO.dat also load an array FOO_column_header that is a string array containing the strings found in the first row of the file. I realize that in your case you would want an array of strings found in a particular column, not row, but that's the sort of thing you had in mind, right?

    Another possibly relevant hard-to-find feature in gnuplot is the set of "smoothing" operations, many of which are not really smoothing at all. smooth unique can do something close to what you want (collect unique values in a particular column). Unfortunately it only works for numerical values, not strings. Extending the functionality to strings might be feasible; I've never thought about it.

     
  • Piotr Winiarczyk

    I was referring to the Aggregation feature. Unfortunately the RAWGraphs docs return 404 as of now.
    These are quick and simple aggregation operations on a particular column. Since this is JS, you can get erratic results when you apply sum to strings. :-)
    They can be very useful for quickly getting some information without processing a data file with an external tool - see the attached pdf for examples.

    I realize that in your case you would want an array of strings found in a particular column, not row, but that's the sort of thing you had in mind, right?

    Right, that was one of the ideas. xticlabels() can load unique values, but the result is not exposed via an array. Something along these lines would be helpful, plus a new array index function.

    Since I use AWK, I am used to arrays that can have a string as an index. Gnuplot does not support this, but you can always use two int-indexed arrays to overcome this deficiency.
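
    A minimal sketch of the two-array workaround, assuming unique keys. The array names, the color values and the helper names key_index and lookup are illustrative assumptions.

    ```gnuplot
    # Keys[i] holds the string key, Vals[i] the value stored under that key.
    array Keys[3] = ["Cruiser", "Destroyer", "Frigate"]
    array Vals[3] = ["red", "blue", "dark-green"]
    nkeys = |Keys|

    # key_index(s): position of s in Keys, or 0 when s is absent.
    key_index(s) = int(sum [_i=1:nkeys] (Keys[_i] eq s ? _i : 0))

    # lookup(s): value stored under key s, or an empty string when absent.
    # Only the selected branch of ?: is evaluated, so Vals[0] is never accessed.
    lookup(s) = (key_index(s) > 0 ? Vals[key_index(s)] : "")

    print lookup("Destroyer")
    print lookup("Submarine")
    ```

    Each lookup is a linear scan over Keys, which is fine for a handful of categories but is not a real hash table.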

    As for smooth: anything that allows you to process the data a bit and produce an array that can be accessed will help in such cases.

    I am fully aware that all of these can be achieved using external tools, but having a bit of aggregation in Gnuplot will not hurt.

     
  • Piotr Winiarczyk

    There is one more difference between an array and a string that can be tokenized: an array cannot be passed to a function, while a string can. This makes strings a bit more useful.

     
    • Ethan Merritt

      Ethan Merritt - 2022-08-22

      The development version has more complete support for arrays and array operations. You can pass an array to a function or return an array from a function. I have now added split() and join() to the set of supported array functions.

       `split("string", "sep")` uses the character sequence in "sep" as a
       field separator to split the content of "string" into individual fields.
       It returns an array of strings, each corresponding to one field of the
       original string. The second parameter "sep" is optional.  If "sep" is
       omitted or if it contains a single space character the fields are split
       by any amount of whitespace (space, tab, formfeed, newline, return).
       Otherwise the full sequence of characters in "sep" must be matched.
      
       The three examples below each produce an array [ "A", "B", "C", "D" ]
           t1 = split( "A B C D" )
           t2 = split( "A B C D", " ")
           t3 = split( "A;B;C;D", ";")
      
       However the command
           t4 = split( "A;B; C;D", "; " )
       produces an array containing only two strings [ "A;B", "C;D" ] because
       the two-character field separator sequence "; " is found only once.
      
       Note: Breaking the string into an array of single characters using an empty
       string for sep is not currently implemented.  You can instead accomplish
       this using single character substrings:     Array[i] = "string"[i:i]
      
       `join(array, "sep")` concatenates the string elements of an array into a
       single string containing fields delimited by the character sequence in "sep".
       Non-string array elements generate an empty field.
       Example:
           array A = ["A", "B", 5.0, 7, "E"]
           print join(A,";")
                 A;B;;;E
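
      The single-character workaround mentioned in the note can be sketched as a short loop. This uses only substring and array syntax that does not depend on split(); the variable names are illustrative assumptions.

      ```gnuplot
      # Break a string into an array of single-character strings.
      s = "string"
      n = strlen(s)
      array Chars[n]
      # s[i:i] is the one-character substring at position i (1-based).
      do for [i=1:n] { Chars[i] = s[i:i] }
      print Chars
      ```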
      
       
  • Ethan Merritt

    Ethan Merritt - 2023-02-23
    • status: open --> pending-accepted
    • Group: -->
     
  • Ethan Merritt

    Ethan Merritt - 2023-03-16
    • Status: pending-accepted --> closed-accepted
     
