gnuplot / Feature Requests / #542 findword function

I have in mind an alternative approach, based on a proof-of-principle implementation in a private branch of the development source where I have been playing with a larger set of possible array operators. These include array operations "split" and "join" analogous to those in other scripting languages.

Brief summary

array A = split("original string", "separator")
       - e.g. split("Aa Bb C", " ") produces array A = ["Aa", "Bb", "C"]
       - perl has a third parameter that limits the number of resulting pieces
       - perl also treats separator as a regexp, not just a character

join( A, "sep" [, "format"] )
       - inverse operation to split(). Joins string array elements into a single string alternating with separator. The format, if present, would convert numerical array elements to strings.
       - join( A, "") is a pure "cat" operation
       - join( A, ";", "%.3g" ) could be used to generate lines in a csv file from a numerical array
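
The three-argument join() is only a proposal at this point, but the CSV line it would generate can already be built with existing syntax. The following is a minimal sketch, assuming gnuplot 5.2 or later for array support; the array contents are made-up sample values.

```gnuplot
# Emulates the proposed join(A, ";", "%.3g") with syntax available today.
# The array contents are made-up sample values.
array A[3] = [1.0, 2.5, 3.14159]
line = ""
do for [i=1:|A|] {
    # prepend the separator to every element except the first
    line = line . (i > 1 ? ";" : "") . sprintf("%.3g", A[i])
}
print line    # 1;2.5;3.14
```

A built-in join() would replace the loop with a single expression, which is the main attraction of the proposal.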

To get the index of a word "Target" in a string you would then be able to do

A = split( String, " ")
i = index(A, "Target")

Larger context
The split and join operations themselves do not seem particularly problematic, although the exact syntax is up for discussion. My enthusiasm for a larger set of possible array operations has foundered on uncertainty about a fundamental decision. There is a basic issue here of whether the order of elements in an array is considered immutable. Algorithms that use a pair of arrays to associate two properties depend on the order remaining fixed. union/intersection/sort operators would break this.

    A = sort(A)                     # sort the defined entries, collapse size
    C = union(A,B)                  # A∪B
    D = intersection(A,B)           # A∩B

    E = insert(E,<value>)
    push(A,<value>)                 # insert new value at the end
    value = pop(A)                  # return last element and remove it

Comments and suggestions welcome.

My unfinished code is below. I was trying to simplify it by using data sets and index. It did not work, since I cannot use an "index name" as the title for a data set.
From my perspective, a Gnuplot script is hard to run without additional parameters (ARGs). For example, reading points and a title from the same data file is hard.
One data set in a file could hold key-value definitions of variables, while the following data sets in the file would hold the data to plot. ARGs would then not be needed, and the script plus the data file would be enough to recreate a figure. Storing the triplet of ARGs, script and data is less convenient than storing just a script and a data file.
Anything that allows reading variables, arrays or strings and using them in a script is a good choice.
The array approach is better than string hacking. In my case "push" and "index" would probably make the code much simpler (push unique 'class' or 'name' values to an array, then use the 'name' array as labels (Y) and the 'class' array as key title and color).

reset session

$FILE << EOD
Class   Name    Start date  End date
Naval Headquarters  Ladybird    1 September 1950    30 April 1953
Naval Headquarters  Tyne    1 April 1953    31 July 1953
Aircraft Carrier    Unicorn 1 July 1950 31 July 1953
Aircraft Carrier    Triumph 1 July 1950 30 September 1950
Aircraft Carrier    Theseus 1 October 1950  30 April 1951
Aircraft Carrier    Glory   1 April 1951    30 September 1951
Aircraft Carrier    Glory   1 January 1952  30 September 1952
Aircraft Carrier    Glory   1 November 1952 31 May 1953
Aircraft Carrier    Ocean   1 May 1952  31 October 1952
Aircraft Carrier    Ocean   1 May 1953  31 July 1953
Cruiser Belfast 1 July 1950 31 August 1950
Cruiser Belfast 1 January 1951  30 September 1952
Cruiser Jamaica 1 June 1950 31 October 1950
Cruiser Kenya   1 July 1950 31 August 1952
Cruiser Ceylon  1 August 1950   31 July 1952
Cruiser Newcastle   1 July 1952 31 July 1953
Cruiser Birmingham  1 September 1952    31 July 1953
Destroyer   Charity 1 July 1950 31 January 1951
Destroyer   Charity 1 July 1951 30 September 1951
Destroyer   Charity 1 December 1951 31 March 1952
Destroyer   Charity 1 August 1952   30 November 1952
Destroyer   Charity 1 February 1953 30 April 1953
Destroyer   Charity 1 June 1953 31 July 1953
Destroyer   Cockade 1 July 1950 30 November 1950
Destroyer   Cockade 1 March 1951    31 August 1951
Destroyer   Cockade 1 October 1951  31 December 1951
Destroyer   Cockade 1 January 1952  31 March 1952
Destroyer   Cockade 1 December 1952 28 February 1953
Destroyer   Cockade 1 April 1953    31 July 1953
Destroyer   Comus   1 July 1950 30 November 1950
Destroyer   Comus   1 March 1951    31 August 1951
Destroyer   Comus   1 October 1951  31 December 1951
Destroyer   Comus   1 May 1952  30 September 1952
Destroyer   Comus   1 November 1952 28 February 1953
Destroyer   Concord 1 September 1950    31 January 1951
Destroyer   Concord 1 April 1951    31 May 1951
Destroyer   Concord 1 August 1951   30 November 1951
Destroyer   Concord 1 January 1952  30 April 1952
Destroyer   Concord 1 July 1952 31 August 1952
Destroyer   Concord 1 May 1953  31 July 1953
Destroyer   Consort 1 June 1950 30 April 1951
Destroyer   Consort 1 June 1951 30 September 1951
Destroyer   Consort 1 May 1952  31 August 1952
Destroyer   Consort 1 November 1952 28 February 1953
Destroyer   Consort 1 March 1953    31 May 1953
Destroyer   Constance   1 October 1950  31 March 1951
Destroyer   Constance   1 June 1951 31 July 1951
Destroyer   Constance   1 November 1951 28 February 1952
Destroyer   Constance   1 June 1952 31 December 1952
Destroyer   Cossack 1 June 1950 31 October 1951
Destroyer   Cossack 1 February 1952 31 May 1952
Destroyer   Cossack 1 July 1952 31 July 1952
Destroyer   Cossack 1 September 1952    31 January 1953
Destroyer   Cossack 1 May 1953  31 July 1953
Frigate Alacrity    1 June 1950 31 August 1950
Frigate Alacrity    1 February 1951 30 June 1951
Frigate Alacrity    1 December 1951 28 February 1952
Frigate Alert   1 August 1950   31 October 1950
Frigate Alert   1 October 1951  31 October 1951
Frigate Amethyst    1 February 1951 30 June 1951
Frigate Amethyst    1 September 1951    31 January 1952
Frigate Amethyst    1 April 1952    31 July 1952
Frigate Black Swan  1 June 1950 31 August 1950
Frigate Black Swan  1 February 1951 30 June 1951
Frigate Black Swan  1 September 1951    30 November 1951
Frigate Cardigan Bay    1 November 1950 31 January 1951
Frigate Cardigan Bay    1 June 1951 30 September 1951
Frigate Cardigan Bay    1 January 1952  30 April 1952
Frigate Cardigan Bay    1 June 1952 30 September 1952
Frigate Cardigan Bay    1 January 1953  31 July 1953
Frigate Crane   1 March 1952    30 June 1952
Frigate Crane   1 August 1952   30 September 1952
Frigate Crane   1 November 1952 31 March 1953
Frigate Crane   1 July 1953 31 July 1953
Frigate Hart    1 June 1950 31 August 1950
Frigate Hart    1 February 1951 31 March 1951
Frigate Modeste 1 April 1953    30 June 1953
Frigate Morecambe Bay   1 October 1950  31 January 1951
Frigate Morecambe Bay   1 June 1951 30 September 1951
Frigate Morecambe Bay   1 March 1952    31 May 1952
Frigate Morecambe Bay   1 August 1952   30 November 1952
Frigate Morecambe Bay   1 May 1953  31 July 1953
Frigate Mounts Bay  1 August 1950   30 November 1950
Frigate Mounts Bay  1 December 1950 31 January 1951
Frigate Mounts Bay  1 June 1951 30 September 1951
Frigate Mounts Bay  1 December 1951 30 April 1952
Frigate Mounts Bay  1 June 1952 31 October 1952
Frigate Mounts Bay  1 March 1953    30 June 1953
Frigate Opossum 1 November 1952 30 April 1953
Frigate St Bride's Bay  1 December 1950 31 January 1951
Frigate St Bride's Bay  1 August 1951   31 December 1951
Frigate St Bride's Bay  1 July 1952 31 October 1952
Frigate St Bride's Bay  1 April 1953    30 June 1953
Frigate Sparrow 1 December 1952 28 February 1953
Frigate Sparrow 1 April 1953    30 June 1953
Frigate Whitesand Bay   1 August 1950   31 December 1950
Frigate Whitesand Bay   1 June 1951 31 July 1951
Frigate Whitesand Bay   1 October 1951  28 February 1952
Frigate Whitesand Bay   1 April 1953    31 July 1953
EOD

set title "Royal Fleet deployment in Korea war"

set encoding utf8
set datafile separator tab
set datafile columnheaders 
set datafile missing "NaN"

# helper func
addToList(list,col) = list.( strstrt(list,' "'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')

# Classes
Classes=''
stats $FILE u (Classes=addToList(Classes,1)) nooutput
array Classes_idx[strlen(Classes)]
i=1
do for [Class in Classes] {
    Classes_idx[strstrt(Classes,Class)]=i
    i=i+1
}

c_idx_n(col, i)=(Classes_idx[strstrt(Classes,strcol(col))] == i ? i : NaN)

# Names
Names=''
stats $FILE u (Names=addToList(Names,2))  nooutput
array Names_idx[strlen(Names)]
i=1
do for [Name in Names] {
    Names_idx[strstrt(Names,Name)]=words(Names)-i
    i=i+1
}

n_idx(col)=Names_idx[strstrt(Names,strcol(col))]

# Axes
set xdata time
set timefmt '%d %B %Y'
set format x '%b %Y'
set xrange ['1 May 1950': '31 July 1953']
set lmargin 12

# Tics
set xtics out nomirror
unset ytics
set border 1

# Key
set key outside
set rmargin 25

plot for [i = 1:words(Classes)] $FILE \
      u 3:(n_idx(2)):3:4:(n_idx(2)):(n_idx(2)+0.95):(c_idx_n(1,i)) \
      w boxxyerror fs solid lc var title word(Classes,i) , \
     $FILE \
      u ("15 May 1950"):(n_idx(2)+0.5):2 \
      w labels right notitle

Ethan Merritt - 2022-08-13

It doesn't work for me to cut-and-paste your data sample because it does not preserve tabs. Add one attachment for the data and another for the script?

Also, and perhaps most important - could you attach a figure showing what you want your final plot to look like? A hand-drawn sketch or a link to someone else's plot would be fine. You are obviously far along a particular path but it may be more productive for me to step back and see if there is a simpler path, once I know where you're going.


theozh - 2022-08-13

Such a "findword()" function would be nice. In your case you don't necessarily need extra arrays for that. For this type of recurrent task, I guess a hash or dictionary would be the desired feature.

Although, you can create a lookup or hash table by misusing the sum() function.

Check the two links which are similar to your task:
https://stackoverflow.com/a/72289393/7295599
https://stackoverflow.com/a/67710390/7295599
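
The sum() trick mentioned above can be sketched as follows. This is a minimal illustration rather than the code from the linked answers: the name findword() and the sample word list are assumptions made for the example. It relies only on words(), word() and sum, which are available in gnuplot 5.x.

```gnuplot
# findword() is a hypothetical name; the sample list is illustrative.
# Returns the 1-based position of s in a space-separated word list, 0 if absent.
# Each term of the sum is _i when word _i matches and 0 otherwise.
findword(s, list) = (sum [_i=1:words(list)] ((s eq word(list, _i)) ? _i : 0))

# Quoted substrings count as single words, so multi-word entries are possible.
Classes = '"Naval Headquarters" "Aircraft Carrier" Cruiser Destroyer Frigate'
print findword("Destroyer", Classes)
print findword("Aircraft Carrier", Classes)
print findword("Submarine", Classes)
```

With this list the three calls should report positions 4, 2 and 0. Note that the sum adds up the positions of all matches, so the result is only meaningful when every entry in the list is unique.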

- Piotr Winiarczyk - 2022-08-13
  
  Thanks for the simpler solution.
  
  Mapping strings read from columns to integers (indexes) is a very common feature in other drawing software. IMO Gnuplot should have a demo file for it and simple functions to obtain this result. Today you need to use external tools or be as creative as theozh is.
  
  For a standard user it is hard to find information on how to do anything outside the demo files, so having a variety of demo files is important. Maybe theozh can contribute some plots to the demo section?
  

Piotr Winiarczyk - 2022-08-13

Gnuplot file.

Korea.gp


Piotr Winiarczyk - 2022-08-13

rawgraphs.io version of the figure - not yet fully translated.
It took around 30 minutes to find out how to do it. The key is automatically sorted and I don't like this particular feature.

Korea2.png


Piotr Winiarczyk - 2022-08-13

Gnuplot version.
It works but the code is ugly due to lack of an equivalent of a "findword()" function.
The learning curve for Gnuplot is steep for anything that is not in demos.

Korea2_gp.png


Piotr Winiarczyk - 2022-08-14

After some refining the code is smaller. Using strings to draw such a plot still seems inappropriate. Using arrays seems a more natural way for a person with basic programming skills.

# helper functions
addToList(list,col) = list.( strstrt(list,' "'.strcol(col).'"') > 0 ? '' : ' "'.strcol(col).'"')
Lookup(s,str) = (sum [_i=1:words(str)] (s eq word(str,_i)) ? _i : 0)

# Classes
Classes=''
stats $FILE u (Classes=addToList(Classes,1)) nooutput
c_idx_n(col, i)=(Lookup(strcol(col),Classes) == i ? i : NaN )

# Names
Names=''
stats $FILE u (Names=addToList(Names,2)) nooutput
n_idx(col)=(words(Names) - Lookup(strcol(col),Names) )

The 'xticlabels' columns do something similar: they build a value-to-string array. Maybe something along these lines could be added (uniq is a new function working in the spirit of xticlabels):

stats $FILE u (uniq(A,1)) nooutput

This will produce an A array with unique values from column 1. With an addition of

index(A, "Target")

function, it will be much easier to find out how to handle categories in Gnuplot.

In my case the first column of my data could be removed if I were able to use the name of the

index "<name>"

as defined by a multi-data-set comment. Something like:

title index(i)

This would also simplify the c_idx_n function, since the NaN case would not be needed. Right now I need to draw four times and use NaN from c_idx_n to build the right key. Having title index(i) would turn the drawing into an iteration over individual data sets.

To sum up: anything that makes arrays more useful (loading from data sets, finding values - regexp would be a dream), together with a demo showing how to use those new array functions, will make Gnuplot better and easier to use.

korea2.gp

I think that in practice the standard linux approach would be to use universal tools like awk and uniq to pre-process the data. But I can sympathize with not feeling comfortable with that, since I myself have never bothered to learn awk. On my own I would probably tackle this whole thing in a perl script and call gnuplot from inside it. But there again I am sympathetic to not being already familiar with perl.

So here is how I would approach it using only gnuplot and only syntax already available in version 5.4.

# Create a table with one line for each ship name.
# The name is in column 3 of each line.
# The first 5 lines of the table are header records.
# The last line is blank.
# Therefore the total number of unique names is |$Nametable|-6
#
set datafile separator tab
Name = ""
i = 0
set table $Nametable
plot 'korea.dat' using (i):(strcol(2) eq Name ? NaN : i=i+1):(Name = strcol(2)) with labels
unset table

# Now make an array of names
names = |$Nametable| - 6
array Names[names]
do for [i=1:names] { Names[i] = word( $Nametable[i+5], 3) }
print "Names = ", Names

# Same procedure to create an array of classes
Class = ""
i = 0
set table $Classtable
plot 'korea.dat' using (i):(strcol(1) eq Class ? NaN : i=i+1):(Class = strcol(1)) with labels
unset table
classes = |$Classtable| - 6
array Classes[classes]
do for [i=1:classes] { Classes[i] = word( $Classtable[i+5], 3) }
print "Classes = ", Classes

Running this script gives

[~/temp] gnuplot foo.gp
Names = ["Ladybird","Tyne","Unicorn","Triumph","Theseus","Glory","Ocean ","Belfast","Jamaica","Kenya","Ceylon","Newcastle","Birmingham","Charity","Cockade","Comus","Concord","Consort","Constance","Cossack","Alacrity","Alert","Amethyst","Black Swan","Cardigan Bay","Crane","Hart","Modeste","Morecambe Bay","Mounts Bay","Opossum","St Bride's Bay","Sparrow","Whitesand Bay"]

Classes =  ["Naval Headquarters","Aircraft Carrier","Cruiser","Destroyer","Frigate"]

At that point you can retrieve indices directly from the array, e.g. i = index(Classes, "Destroyer")

I didn't try to convert your full script, but maybe that starting point is helpful.

And here's a simpler version using syntax that is in version 5.5. In fact it is also in 5.4 although it is marked EXPERIMENTAL. This does away with the extra columns and the blank lines in the intermediate table by using the syntax plot ... with table if (condition)

# Create a table with one line for each ship name.
# The table has no header records and no blank lines,
# so the total number of unique names is simply |$Nametable|

set datafile columnheaders
set datafile separator tab
Name = ""
set table $Nametable
plot 'korea.dat' using (strcol(2)) with table if (strcol(2) ne Name) && (Name = strcol(2),1)
unset table

# Now make an array of names
names = |$Nametable|
array Names[names]
do for [i=1:names] { Names[i] = $Nametable[i] }

print "Names = ", Names

Piotr Winiarczyk - 2022-08-16

Thank you for the solutions. While arrays are conceptually the proper objects for the problem, using them in Gnuplot is not convenient. Look at the Korea2.gp file.
There are just two helper functions defined, and thanks to the flexibility of string objects one line is enough to load the data. The string version also has the advantage that the input does not need to be sorted before loading, due to its "hash"-like behavior.
The helper functions are quite a hack. I really appreciate the imagination of the author in using the sum function for index finding.
The point is that both solutions are complicated. Look at https://www.rawgraphs.io/ to see how easy it is to load data and use unique values as colors.
Why not try to extend the syntax of stats a bit and expose to the user an array with the result of an operation similar to "labels" (it could be named uniqlabels and the array name could be STATS_uniq_labels)?

Anyway, the idea of adding more functions to arrays (as you presented it at the beginning of this ticket) is a good step forward.


Ethan Merritt - 2022-08-16

Could you give a more complete description or pointer to the mechanism you like in rawgraphs.io? I looked on that web site and did not find anything relevant.

Of course gnuplot also suffers from the difficulty of finding potentially useful features. A large part of this is because a feature added to satisfy a particular task the developer had in mind may also be relevant for applications they didn't even think of... which is great, but those unthought-of applications are obviously not listed in an index or provided with demos or examples.

For instance stats already does something potentially relevant to the sort of string processing you are interested in. set datafile columnheaders; stats "FOO.dat" name "FOO" will in addition to the usual numerical analysis of the contents of file FOO.dat also load an array FOO_column_header that is a string array containing the strings found in the first row of the file. I realize that in your case you would want an array of strings found in a particular column, not row, but that's the sort of thing you had in mind, right?

Another possibly relevant hard-to-find feature in gnuplot is the set of "smoothing" operations, many of which are not really smoothing at all. smooth unique can do something close to what you want (collect unique values in a particular column). Unfortunately it only works for numerical values, not strings. Extending the functionality to strings might be feasible; I've never thought about it.


Piotr Winiarczyk - 2022-08-16

I was referring to the Aggregation feature. Unfortunately the RAWGraphs docs return a 404 as of now.
These are quick and simple aggregation operations on a particular column. Since this is JS, you can get erratic results when you apply sum to strings. :-)
They can be very useful for quickly getting some information without processing a data file using an external tool - see the attached pdf for examples.

I realize that in your case you would want an array of strings found in a particular column, not row, but that's the sort of thing you had in mind, right?

Right, that was one of the ideas. The xticlabels() can load unique values, but the result is not exposed via an array. Something along these lines would be helpful, plus a new array index function.

Since I use AWK I am used to arrays that can have an index that is a string. Gnuplot does not support this, but you can always use two int-indexed arrays to overcome this deficiency.
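
The two-array workaround can be sketched like this. It is a minimal illustration, assuming gnuplot 5.2 or later for array syntax; the names Keys, Vals, keyidx() and lookup() and the sample values are all hypothetical. The functions refer to the arrays as globals, so no array has to be passed as a function argument.

```gnuplot
# Keys/Vals emulate a string-indexed array; names and values are illustrative.
array Keys[3] = ["Cruiser", "Destroyer", "Frigate"]
array Vals[3] = [10, 20, 30]

# 1-based position of k in the global Keys array, 0 if absent
keyidx(k) = int(sum [_i=1:|Keys|] ((Keys[_i] eq k) ? _i : 0))

# Guard against a missing key before indexing Vals
lookup(k) = (keyidx(k) > 0 ? Vals[keyidx(k)] : NaN)

print lookup("Destroyer")
print lookup("Submarine")
```

The first call should return the value stored alongside "Destroyer" and the second NaN. The pairing only holds as long as both arrays keep the same element order.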

As for smooth: anything that allows you to process the data a bit and produce an array that can be accessed will help in such cases.

I am fully aware that all of these can be achieved using external tools, but having a bit of aggregation in Gnuplot will not hurt.

Gnuplot_542.pdf


There is one more difference between array and string that can be tokenized: the array cannot be passed to a function while a string can. This makes strings a bit more useful.

The development version has more complete support for arrays and array operations. You can pass an array to a function or return an array from a function. I have now added split() and join() to the set of supported array functions.

 `split("string", "sep")` uses the character sequence in "sep" as a
 field separator to split the content of "string" into individual fields.
 It returns an array of strings, each corresponding to one field of the
 original string. The second parameter "sep" is optional.  If "sep" is
 omitted or if it contains a single space character the fields are split
 by any amount of whitespace (space, tab, formfeed, newline, return).
 Otherwise the full sequence of characters in "sep" must be matched.

 The three examples below each produce an array [ "A", "B", "C", "D" ]
     t1 = split( "A B C D" )
     t2 = split( "A B C D", " ")
     t3 = split( "A;B;C;D", ";")

 However the command
     t4 = split( "A;B; C;D", "; " )
 produces an array containing only two strings [ "A;B", "C;D" ] because
 the two-character field separator sequence "; " is found only once.

 Note: Breaking the string into an array of single characters using an empty
 string for sep is not currently implemented.  You can instead accomplish
 this using single character substrings:     Array[i] = "string"[i:i]

 `join(array, "sep")` concatenates the string elements of an array into a
 single string containing fields delimited by the character sequence in "sep".
 Non-string array elements generate an empty field.
 Example:
     array A = ["A", "B", 5.0, 7, "E"]
     print join(A,";")
           A;B;;;E
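
Putting the pieces together, the word lookup that opened this ticket might look like the sketch below. It assumes a gnuplot build that provides split(), join() and index() (the development version at the time of this thread), and it assumes index() returns the position of the matching element as in the earlier index(Classes, "Destroyer") example. The sample string is made up.

```gnuplot
# Requires a build that provides split(), join() and index().
# The sample string is illustrative only.
String = "Alpha Beta Target Gamma"
A = split(String, " ")      # array of the four words
i = index(A, "Target")      # position of the matching element
print i
print join(A, ";")          # words rejoined with ";" as delimiter
```

Compared with the string-based helpers earlier in the thread, no quoting tricks or sum() idiom are needed once the words are held in an array.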

Ethan Merritt - 2023-02-23

status: open --> pending-accepted

Group: -->

Ethan Merritt - 2023-03-16

Status: pending-accepted --> closed-accepted
