Selecting subsets of data by ranked magnitude

Developers
Stubaan
2012-01-25
2013-10-17
1 2 3 > >> (Page 1 of 3)
  • Stubaan
    Stubaan
    2012-01-25

    Hi folks - I suspect masking may offer a solution to following problem but would very much appreciate some experienced input in the matter.  Thus far I have not been able to figure out how to apply a mask that achieves my ends.

    file1_$month.nc contains the average number of rainfall events (lets call it $avg_no_of_events) for a  given month at each grid-point, where an event is defined as the presence of rainfall (of any amount) in a given 3-hour observation window.

    file2_$month.nc contains the average amount of rainfall for each 3-hour observation window at each grid-point in a month (calculated from 20 years of data).

    I need to:
    - create a file3.nc which has the same spatial and temporal dimensions as file2_$month,
    - but which is populated with the n largest rainfall values from file2_$month, with n being the average number of rainfall events per month ($avg_no_of_events) (out of a total of 28 or 30 or 31 x 8 events per month); i.e. somehow referencing file2_$month in such a way that the values are ranked from largest to smallest
    - and which has all other rainfall values set to zero.

    Many thanks, again!

     
  • Stubaan
    Stubaan
    2012-01-25

    Just a quick follow-up of my current thoughts on this - comments welcome!

    This method essentially forgoes any NCO acrobatics in lieu of Bash editing ASCII versions of the file.

    1) ncdump the file1_$month.nc and file2_$month.nc to ascii format
    2) read n = $avg_no_of_events (from file1_$month.txt)
    3) read the rainfall values in file2_$month.txt and sort them using sort -g or some such
    4) identify the set of n ($avg_no_of_events) largest values from 3)
    5) perform a check-and-replace of the values in file2_$month.txt: if the value is within the range identified in 4 then leave it, else overwrite with 0.0 (there is some risk of duplicate values resulting in a final n > $avg_no_of_events but given given the precision of the values being used this is an unlikely/negligible concern)
    6) ncgen the revised ascii file into .nc

     
  • Charlie Zender
    Charlie Zender
    2012-01-29

    another way would be to put the file1 info into file2, then read file2 in ncap2 and use its sort features and do the zeroing and write to file3. in other words i see the key issues as sorting and zeroing, and masking could be involved in zeroing, but as far as NCO solutions, ncap2 can do all of this. whatever method works is of course the right way :)
    cz

     
  • Stubaan
    Stubaan
    2012-02-02

    Thanks cz

    I have resolved the zeroing issue and have been able to generate a the avg_no_events_per_month for each grid-point.

    I would now like to sort, through time not space, the original event_values in descending order (presumably sort() followed by reverse() ?), for each grid-point individually, and then save these sorted values as a new variable.  I cannot figure out how to sort cell by cell though; I can only seem to sort every element in the file.

    I would then like to find a way to use the avg_no_events_per_month as a mask, such that I can set the sorted event_values to zero after the "avg_no_events_per_month"th value, for each grid point independently.  And to then save those updated values.

    Now that I know more about the built-in sort method I'm sure this doable, but I can't figure it out from the examples or documentation. Any clues would be very much appreciated.

    Also, what exactly is the difference between sort() and dsort(), and why do I keep seeing " * " before some of these variables when used in conjunction with .sort() (amongst other methods)?

     
  • Stubaan
    Stubaan
    2012-02-02

    Figured out that dsort is for sorting with maps, and that * denotes a RAM variable. Sloppy document-mining on my part *blush*

    I expect that it will be a map of some sort that I need for specifying the sort across the time dimension, independently for each grid-point, but I've been battling to make sense of maps for a while now.  Once more into the fray…

     
  • Stubaan
    Stubaan
    2012-02-02

    Are there any further examples/articles that discuss/demonstrate the sort method and the use of maps with it?

    The NCO documentation seems sparse on the subject - I've been unable to make real progress using the examples I've found (in the documentation, ncap2_tst.nco, and this forum).

    Thanks folks

     
  • Stubaan
    Stubaan
    2012-02-07

    Hi again folks

    Well, I have explored everything I can find, and I still cannot figure out how to perform what I assume (perhaps foolishly) is a relatively simple sort operation. So, two questions:

    1) Can anybody demonstrate the syntax to perform the following, assuming it is feasible, and if not then please offer some brief insight into why.

    Task: i) sort a variable, ii) in decreasing magnitude, iii) along the time dimension, iv) without any spatial amalgamation

    This would seem to me like a fairly common sorting operation, so I am working on the assumption this functionality is built into sort.  However, I have read much about its 2D limitations, though I have been able to master little of it, and so perhaps this is not as simple as I am hoping it is. 

    If not, is there a way to sort a single location that I might then be able to implement in a loop over all the locations?

    2) Assuming that I were able to sort a particular grid-cell's values over time in descending order, is there any way to restore those sorted values to their original temporal location once I have made some adjustments to the values? This would require that a variable's location at a given point in space and time is conserved throughout the sorting and value-adjustment process, and that NCO could use this to "unsort" the adjusted values back into their original location in time and space. 

    Thanks as always!
    Stu

     
  • Charlie Zender
    Charlie Zender
    2012-02-09

    sorry the documentation is not more helpful.
    i am working on an example for you but keep getting interrupted.
    hopefully by this weekend.
    cz

     
  • Charlie Zender
    Charlie Zender
    2012-02-09

    Disclaimer: I don't know what you mean by "spatial amalgamation".
    Here's a script to chew on.
    It's based on my understanding of your needs, which may be inaccurate.
    Nevertheless, it illustrates one or two methods you will probably find useful.
    And maybe we can iterate a bit.

    cat > ~/ncap2_foo.nco << 'EOF'
    // Purpose: Generate and sort multi-dimensional array along a single dimension
    // This example uses the variable three_dmn_rec_var(time,lat,lon)

    /* NB: Single quotes around EOF above turn off shell parameter
        expansion in "here documents". This in turn prevents the
        need for protecting dollarsign characters in NCO scripts with
        backslashes when the script is cut-and-pasted (aka "moused")
        from an editor or e-mail into a shell console window */

    // Randomize the time coordinate
    time=10.0*gsl_rng_uniform(time);

    print("original randomized time =\n");print(time);

    /* The sort() routine is overloaded, and takes one or two arguments
       The first argument is a one dimensional array
       The second argument (optional) is the "sort map" (srt_map below)
       Pass the sort map by reference, i.e., prefix with an ampersand
       If the sort map does not yet exist, then it will be created and
       returned as an integer type the same shape as the input variable.
       The output of sort(), on the LHS, is a sorted version of the input */

    time=sort(time,&srt_map);

    print("sorted time (ascending order) and associated sort map =\n");print(time);print(srt_map);

    /* sort() always sorts in ascending order
       The associated sort map therefore re-arranges the original,
       randomized time array into ascending order.
       There are two methods to obtain the descending order the user wants
       1) We could solve the problem in ascending order (the default)
       and then apply the reverse() method to re-arrange the results.
       2) We could change the sort map to return things in descending
       order of time and solve the problem directly in descending order. */

    // Following shows how to do method one:

    /* Expand the sort map to srt_map_3d, the size of the data array
       1. Use data array to provide right shape for the expanded sort map
       2. Coerce data array into an integer so srt_map_3d is an integer
       3. Multiply data array by zero so 3-d map elements are all zero
       4. Add the 1-d sort map to the 3-d sort map (NCO automatically resizes)
       5. Add the spatial (lat,lon) offsets to each time index
       6. de-sort using the srt_map_3d
       7. Use reverse to obtain descending in time order
       This could also be accomplished with loops (exercise left for reader) */

    // Copy 1-d sort map into 3-d sort map
    srt_map_3d=(0*int(three_dmn_rec_var))+srt_map;
    // Multiply base offset by factorial of lesser dimensions
    srt_map_3d*=$lat.size*$lon.size;
    lon_idx=array(0,1,$lon);
    lat_idx=array(0,1,$lat)*$lon.size;
    lat_lon_idx=lat_idx+lon_idx;
    srt_map_3d+=lat_lon_idx;

    print("sort map 3d =\n");print(srt_map_3d);

    // Use dsort() to "de-sort" the data using the sort map
    three_dmn_rec_var=dsort(three_dmn_rec_var,srt_map_3d);

    // Finally, reverse the data so the time coordinate is descending
    time=time.reverse($time);
    three_dmn_rec_var=three_dmn_rec_var.reverse($time);

    // Method two is identical to method one except the reverse() methods are applied at the start of the algorithm

    print("sorted time (descending order) =\n");print(time);
    EOF
    ncap2 -O -v -S ~/ncap2_foo.nco ~/nco/data/in.nc ~/foo.nc

     
  • Stubaan
    Stubaan
    2012-02-09

    Thanks cz - very much appreciated!!

    I'll dig into this and let you know how it goes. I promise you won't hear from me again until next week :-)

     
  • Charlie Zender
    Charlie Zender
    2012-02-09

    And yes, the complexity of the script shows that the existing sort() and dsort() commands, as I understand them, have room for improvement, i.e., in handling multidimensional variables.

     
  • Stubaan
    Stubaan
    2012-02-09

    Yeah I see that. My current workaround has been to break my file up into individual files for each grid point, and I was then finally able to sort the 1-D variable that remained. Then I permuted the record variable to latitude and ncrcatted them all back together again along that dimension, then permuted the record variable to longitude and repeated the process.

    While I have you, is the only way to ncecat files with completely different variables to first add new variables to each file such that they then both have the same suite of variables?

     
  • Charlie Zender
    Charlie Zender
    2012-02-09

    > While I have you, is the only way to ncecat files with completely different variables to first add new variables to each file such > that they then both have the same suite of variables?

    yes

     
  • Stubaan
    Stubaan
    2012-02-16

    Hi again cz

    You refer in your example to using dsort() to "de-sort" a sorted variable. This is something I very much need since the whole reason I am sorting the variable to begin with is in order to set the lowest x number of values to zero (the exact number of x comes from another variable I have calculated). I then need to return them to their original locations.

    However, I am unable to get this work, and having dug further into the existing examples I'm not sure this is possible, though it seems like it should be.

    For example, when I try the following sort/dsort pairing I do not get back to where I started

    // VAR is a 1-D variable
    VAR=sort(VAR,&srt_map)  // VAR sorts correctly
    VAR=dsort(VAR,srt_map)  // VAR is unsorted, but values do not occur in the same order as they originally were

    What am I missing?

    Thanks!

     
  • Charlie Zender
    Charlie Zender
    2012-02-16

    hi mullers,
    You are missing nothing, you are exactly right. ncap2 dsort() should not be called "de-sort".
    dsort() applies the same map that sort() returns. so sort() followed by dsort() shifts everything
    by the same map twice, rather than shifting and unshifting back to the original.
    We need to develop a real "de-sort" or un-sort function that undoes what srt_map did in the original call to sort().
    Then sort();unsort() will produce an identity mapping like you want.
    That is now TODO nco1032.
    In the meantime, you could populate your own such map within a for loop, and then apply it.
    Or maybe we should overload sort() with an optional third argument that returns the de-sorting map.

     
  • Charlie Zender
    Charlie Zender
    2012-02-16

    p.s. here is my simple script for testing, in case anyone finds it useful…

    cat > ~/ncap2_foo.nco << 'EOF'
    // Randomize the time coordinate
    time=int(10.0*gsl_rng_uniform(time));
    print(time);
    // Sort time
    time=sort(time,&srt_map);
    print(time);
    print(srt_map);
    // "De-sort" time
    time=dsort(time,srt_map);
    print(time);
    EOF
    ncap2 -O -v -S ~/ncap2_foo.nco ~/nco/data/in.nc ~/foo.nc

     
  • Stubaan
    Stubaan
    2012-02-16

    Yes I think this functionality would prove very useful.

    If dsort() is unable to unsort, is there any way of doing it at all? It seems like you're suggesting one possible option when you said: "In the meantime, you could populate your own such map within a for loop, and then apply it."  Could you elaborate upon this?

    My stab:

    // prcp is the variable of interest

    prcp_sort=sort(prcp,&prcp_sort_map);
    idx_end=prcp_sort.size($time);  // For the for-loop

    for(*idx=1;idx_end;idx++)
          prcp_idx=prcp_sort_map(idx);  // Original index of the now sorted prcp. Check whether count starts at 0 or 1
          prcp(prcp_idx)=prcp_sort(idx);   //  Set prcp to the sorted value at it's pre-sorting index.

    I think another overloaded option that returns the desorting map would be fantastic.

     
  • henry Butowsky
    henry Butowsky
    2012-02-16

    HI there,
    should be able to implement an unsort function. Maybe rename dsort or add an alias - any suggestions
    1) sort is overloaded and remains the same
    2) apply sort ( currently dsort)   asort ?
    3) apply un-sort ??  usort ?

    Any suggestions ?
    … Henry

     
  • Charlie Zender
    Charlie Zender
    2012-02-16

    Hi Henry,

    > should be able to implement an unsort function. Maybe rename dsort or add an
    > alias - any suggestions
    > 1) sort is overloaded and remains the same

    yes, i think sort() should be overloaded with three versions

    var_out=sort(var_in); // already exists
    var_out=sort(var_in,&srt_map); // already exists
    var_out=sort(var_in,&srt_map,&dsr_map); // proposed

    where dsr_map is the "de-sorting map" such that

    var_out=sort(var_in,&srt_map,&dsr_map);
    var_out=remap(var_out,dsr_map);

    results in var_out==var_in

    > 2) apply sort ( currently dsort)   asort ?

    as above, i favor renaming dsort() to remap().
    we will deprecate dsort() from the docs now and code eventually.
    note that remap() has no overloads.
    it always requires two arguments, an array and a map.

    > 3) apply un-sort ??  usort ?

    i don't see the need for this. sort() returns one or two maps.
    asort() applies a single map. what else is there?

    it would be nice to have a way to specify the sort() direction,
    ascending or descending. the order affects the returned maps as well.
    a direction option would obviate need to use .reverse() (which reverses
    arrays, but does not "work" on index maps).
    one way to do this would be two different sort() functions, asort()
    and dsort(), for ascending and descending sorts, respectively.
    they could both call the same underlying sort() routine.
    this is actually the solution i favor.
    yes there is a name conflict with dsort() but few people use dsort(),
    and the prototypes are incompatible so there is no danger of old
    scripts silently working and producing wrong answers in the future.

    how does this sound to you?

    c

     
  • Charlie Zender
    Charlie Zender
    2012-02-16

    i would do it this way for now:

    pcp_srt=sort(pcp,&srt_map);
    *idx_end=pcp_srt.size($time);
    for(*idx=0;idx<idx_end;idx++){
      // zero or whatever you want to do to specific elements
      if(idx < 20) prp_srt(idx)=0.0;
    }

    for(*idx=0;idx<idx_end;idx++){
      // put sorted (and some zeroed) entries back into original pcp locations
      pcp(idx)=pcp_srt(srt_map(idx));
    }

     
  • henry Butowsky
    henry Butowsky
    2012-02-17

    HI Charlie,
    I don't understand the need for the de-sorting map.
    we could have an an unmap function that takes the regular map and does a reverse map ?
    e.g  var_out=unmap(var_in, srt_map);
    So to sumarise:
    we have:

    sort(var_in, &srt_map );  // accending sort
    asort(var_in,&srt_map);  // accending sort 
    dsort(var_in,&srt_map); // desending sort     
    remap(var_in,srt_map);  
    unmap(var_in,srt_map);
    

     
    …. Henry

     
  • Charlie Zender
    Charlie Zender
    2012-02-18

    Hi Henry,

    > I don't understand the need for the de-sorting map.

    A fair question. Providing the de-sorting map directly to the user
    would give the user the opportunity to examine/modify it directly.
    Then the desorting map could, like the sorting map, be used multiple
    times, i.e., to de-sort multiple variables. It's really a matter of
    symmetry. We give the "forward" sorting map, and should give the same
    access to the "reverse" sorting map, because it is just as useful.

    > we could have an an unmap function that takes the regular map and does a reverse
    > map ?
    > e.g  var_out=unmap(var_in, srt_map);

    Yes, the forward sorting map uniquely determines the reverse map, so
    an unmap() function is well-defined and would solve all problems
    except if users wanted to tweak the map itself before using it,
    perhaps multiple times.
    My instinct is to give the reverse map directly to users who want it.
    If you don't want to overload the sort() functions, then how about
    providing a function that simply returns the de-sort map corresponding
    to the input, i.e.,

    dsr_map=reverse_map(srt_map);

    then users can always get the desort map, and apply it with remap().
    This would extend your summary by one function from

    >  sort(var_in,&srt_map); // accending sort
    > asort(var_in,&srt_map); // accending sort
    > dsort(var_in,&srt_map); // desending sort    
    > remap(var_in,srt_map); 
    > unmap(var_in,srt_map); 

    to

    var_out=sort(var_in,&srt_map); // accending sort
    var_out=asort(var_in,&srt_map); // accending sort
    var_out=dsort(var_in,&srt_map); // desending sort    
    var_out=remap(var_in,srt_map); 
    var_out=unmap(var_in,srt_map); 
    dsr_map=reverse_map(srt_map); 

    cz

     
  • Stubaan
    Stubaan
    2012-02-20

    Hi cz

    I'm sure the cause of this is something trivial, but I am having trouble getting the nco script you suggested (or my own version for that matter) to execute the if statement correctly.

    I have used your code with the addition of one more variable for the if-loop condition as follows:

    pcp_srt=sort(pcp,&srt_map);
    *idx_end=pcp_srt.size($time);
    *idx_cutoff=idx_end-pcp_events    // pcp_events is another vairable in the file that I use to determine the cutoff point

    for(*idx=0;idx<idx_end;idx++){

               if(idx < idx_cutoff) prp_srt(idx)=0.0;
    }

    If, like you did, I give the if-loop a condition with a value (e.g. 20) instead of a variable then it works fine.  Unfortunately in my case I must use a variable for the condition.  I've dug around online and it seems like my syntax is correct, yet the result is always the same: the entire pcp_srt variable is set to zero and the cutoff is never implemented.

    I must be missing something obvious but I'll be damned if I can spot it…

     
  • Stubaan
    Stubaan
    2012-02-21

    Well I spent a good 5 hours experimenting and trying to resolve this today, without success.

    I wonder if I am not missing something in my implementations of the script.  Currently I am executing a bash script that contains the ncap2 call to the nco script.  Specifically:

    ncap2 -O -v -S script.nco in.nc out.nc

    My nco script contains, exactly, the following:

    // in.nc contains the following variables: prcp (248 values); prcp_events (1 value)
    prcp_sort=sort(prcp,&sort_map)
    *idx_end=prcp_sort.size($time)   // note that I get WARNING size(): Function has been called with more than one argument
    *idx_cutoff=idx_end-prcp_events

    for(idx=0;idx<idx_end;idx++){
             print(idx)    // this confirms that idx is indeed changing
             print(idx_cutoff)       // this confirms that idx_cutoff is indeed constant and < idx_end
             if(idx < idx_cutoff) prcp_sort(idx)=0.0
             }

    The only way I have been able to get this to work is by setting idx_cutoff to an actual value.

    Do I perhaps need a header that specifies the script language? The documentation implies that C syntax and formatting should work. I should add that this is one of my first forays into C scripting, which is why I suspect some trivial error on my part, but after so many hours I suspect I am either braindead, a moron, or dealing with a bug (in that order :-).

    Tomorrow is another day…

     
1 2 3 > >> (Page 1 of 3)