Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

[ee22fe]: PDL / Book / Manipulation.pod Maximize Restore History

Download this file

Manipulation.pod    662 lines (526 with data), 25.7 kB

=head1 Selection and Location in PDLs


Indexing and manipulating pieces of arrays is central to many PDL 
operations. Slicing, dicing, and indexing are selection 
operations -- they select (or extract) subfields from a source 
array and arrange them for use by other operations. Slicing is 
the act of selecting affine chunks -- linear or rectangular 
N-dimensional subfields that are regularly sampled; normal array 
subfields are called “slices”. Dicing is similar but without the 
affine constraint: selection of an arbitrary set of locations 
along one or more axes of an array. Indexing is the selection of 
a completely arbitrary collection of elements from an array. 

PDL treats selection operators slightly differently from most 
other array languages. Array selections, including slices, dices, 
and indexed selections, maintain their connection to the original 
parent variable unless they are explicitly severed (via the copy 
or sever operators). This is possible because PDL distinguishes 
the operations of global assignment (C<=>) and computed assignment 
(C<.=>) (See Section [sec:Controlling-Dataflow:-copy]). That 
behavior lets you represent your data multiple ways 
simultaneously, depending on which form is most convenient. 

Slicing, dicing, and indexing are so basic to data extraction and 
manipulation that PDL slightly modifies the syntax of Perl to 
make these operations more convenient. The modified syntax is 
called C<NiceSlice> syntax, and you can enable it with the Perl 
command C<use PDL::NiceSlice>. Slicing syntax and methods are 
described in detail in Section [sec:Selection-Operators], below.

The opposite of selection is location, which generates indices 
where a particular condition is true in an array. PDL has several 
location opeators, including the unique C<where> operator that 
selects corresponding elements from related arrays. These 
operations are described in Section [sec:Location-Operators].

=head2 A quick tour of selection

Here is a simple example that illustrates some of the selection 
and indexing operations in PDL. Consider a color image of a 
starfield:

  $starfield = rim('starfield.fits');

might read in the starfield as a 1000x1000x3 image. Then 

  $subfield=$starfield->(500:599,500:599);

is a 100x100x3 subfield of the original image, and 

  $red   = $starfield->(:,:,(0));
  $blue  = $starfield->(:,:,(1));
  $green = $starfield->(:,:,(2));

lets you access the individual color planes as 1000x1000 PDLs 
(the parentheses around the C<(0)>, C<(1)>, and C<(2)> indicate that the 
final dimension is to be dropped -- without the parentheses you'd 
get three 1000x1000x1 PDLs). 

You can then change the color balance, for example, by modifying 
the red color plane:

  $red *= 2; 

will affect not just the variable C<$red>, but also the original 
C<$starfield> too (and C<$subfield> and any other selection you have 
made from C<$starfield>). The selections are merely different 
representations of the original data in C<$starfield>. To make a 
separate PDL you can make an explicit copy in the initial 
assignment, as in:

  $red = $starfield->(:,:,(0))->copy;

or, after the fact, use the sever method on C<$red> ( [sec:Controlling-Dataflow:-copy]
):

  $red->sever; 
  $red /= 2;

will not affect C<$starfield> or C<$stars>, because sever breaks the 
connection between C<$red> and its source PDL C<$starfield> even after 
the initial assignment.

If you have a list of star locations as a 2xn PDL (called, say, 
C<$xylist>), you can extract a subfield around each star all at 
once:

  $stars = $starfield->range($xylist - 5, 11, 'truncate');

will return an nx11x11x3 PDL that contains an 11x11-pixel 
subfield centered around each star. That is handy if you want to 
do the same thing to the neighborhood around each star -- for 
example,

  $stars->mv(0,3) *= rvals(11,11) + 0.1;

will amplify the tail of the brightness distribution around each 
star: the C<mv(0,3)> shifts the color-plane index out of the active 
dimension at the beginning of the dim list, to a thread dimension 
at the end, making an 11x11x3xn array. The C<rvals> routine creates 
an 11x11 PDL whose elements contain distance (in pixels) from the 
center of the image, so the region around each star is amplified 
far from the central star, and the central star itself is reduced 
in brightness.

The opposite of selection is location. Here's an example of how 
to use location to generate an C<$xylist> to find all the red stars.

  $starthresh = 500;
  $red_simple_xy = indexND( $red >= $starthresh );

That makes C<$red_simple_xy> a C<2xn> list of all the pixel coordinates 
for which the red color plane exceeds some brightness threshold. 
One minor problem is that C<$red_simple_xy> may contain multiple 
entries for a single star, if that star has more than one pixel 
brighter than the threshold. One solution is to find only local 
maxima in the image. You can use range to extract the region 
around each pixel in the entire image, and then use the threading 
engine to find which pixels are local maxima:


  $ndc = ndcoords(3,3)-1;
  $starthresh = 500;
  $redmax = $red > $starthresh and
            $red ==
               $red->
               range( $ndc, [$red->dims], 't')->
               clump(2)->
               maximum;
  $red_xy = indexND($redmax);

Here, C<ndcoords> returns a C<2x3x3> index array, each row of which is 
a vector containing the coordinates of that row in dims 1 and 2. 
The C<range> call returns a C<3x3x1000x1000> array; clumping the first 
two dims yields a C<9x1000x1000> array, which is reduced to a 
C<1000x1000> array by the maximum call. Thus the right hand side of 
the C<==> is an image, each pixel of which has the value of the 
brightest pixel in its C<3x3> neighborhood within C<$red>, so C<$redmax> 
gets a Boolean image with true pixels wherever C<$red> exceeds the 
threshold and is also a local maximum. Finally, the C<indexND> 
operator returns a C<2xn> array containing the locations of all the 
true pixels in C<$redmax>.

=head1 Selection Operators

PDL is extremely flexible in its ability to reshape, cut up, 
reconstruct, and represent data in multiple ways. Most vectorized 
languages feature a way to cut slices out of a large array and 
copy them to a new variable; PDL goes one step farther, by 
allowing you to represent the original data in multiple ways 
simultaneously. Conceptually, a slice, index, or transpose of an 
array B<remains attached to the original array> unless you 
explicitly sever it. That connection is referred to as I<dataflow>, 
because data flows between the original PDL and its children.

The basic slicing syntax in PDL is supplied with the special 
module C<PDL::NiceSlice>, which modifies the way the Perl compiler 
parses your script, to add new syntax for slicing. Slicing, 
dicing (selection of particular rows/columns), indexing 
(selection of particular elements), and ranging (selection of an 
arbitrary collection of slices) are all supported.

=head2 NiceSlice - array subfield syntax


Subfields of a PDL are selected with the C<NiceSlice> operator, 
which takes two forms: juxtaposed and null method. The juxtaposed 
syntax looks like this: C<< $a(<slicing-stuff>) >>, while the null 
method syntax looks like this: C<< $a->(<slicing stuff>) >>. The 
juxtaposed syntax only works on variables; the null method syntax 
works on both variables and expressions that return a PDL, as in 
C<< $a->sumover->(3) >>. The C<< <slicing-stuff> >> is a comma-separated list 
of slice specifiers, as in C<< $a->(3:5,(4),$b,*2) >>. Each slice 
specifier indicates what should happen to the corresponding 
dimension of the output, as follows:

=over

=item *

B< C<n> > - a lone number means that the single corresponding 
generalized row of C<$a> is used, making this a trivial dim (of 
size 1). For example, if C<$a> is a 3x4-PDL, then C<< $a->(1) >> is a 
1x4-PDL.

=item *

B< C<(n)> > - a lone number (or single-element PDL) in parentheses 
means that the single corresponding generalized row of C<$a> is 
used, but that dimension (which is trivial -- it has a size of 
just 1) is omitted from the output dim list. For example, if C<$a> 
is a 3x4-PDL , then C<< $a->(1) >> is a 1x4-PDL and C<< $a->((1)) >> is a 
4-PDL.

=item *

B< C<$pdl> > - a PDL with 1 or more elements uses the corresponding 
generalized rows of C<$a>, in the same dimensional structure as 
the PDL. For example, C<< $a=sequence(5); $b=pdl(4,1); print $a->($b); >> 
prints C<[ 4 1 ]>.

=item *

B< C<n:m> > - two numbers (or variables) separated by a colon is a 
range to include from the corresponding dimension. Negative 
numbers are interpreted modulo the last element, so (e.g.) C<2:-1> 
grabs everything from the third element to the last one.

=item *

B< C<n:m:s> > - three numbers separated by two colons is an affine 
range: the C<s> is a step value, allowing sparse slices through 
the source PDL. Negative values of C<s> step backwards, so (for 
example) C<-1:0:-1> reverses the order of the elements along a 
particular dimension.

=item *

B< C<:> > - uses the whole corresponding dimension

=item *

B< C<*n> > - inserts a dummy dimension of the given size. 

=back

=head2 NiceSlice Examples


Here are some interactive examples of how to use NiceSlice, in 
the perldl shell:

  pdl> $a=xvals(5,4)+10*yvals(5,4); print $a;
  [
   [ 0  1  2  3  4]
   [10 11 12 13 14]
   [20 21 22 23 24]
   [30 31 32 33 34]
  ]
  pdl> print $a->(:,2);
  [
   [20 21 22 23 24]
  ]
  pdl> print $a->(:,(2));
  [20 21 22 23 24]
  pdl> print $a->(0:-1:2,(0));
  [0 2 4]
  pdl> $a->(0:-1:2,(0)) .= 99;
  pdl> print $a->(0:-1:2,(0));
  [99 99 99]
  pdl> print $a->(:,(0));
  [99 1 99 3 99]
  pdl> $b = pdl(3,4); print $a->($b,(1));
  [13 14]
  pdl> print $a->((2),(3),*4);
  [32 32 32 32]

B< I< A warning > >

Nice slicing is, well, very nice -- but it does have some warts 
because of how Perl 5 implements language modifications. 

In particular, if you use the nice slice syntax in any file, 
script, or perl module, you need to include the command C<use PDL::NiceSlice;> somewhere near the top of the file, to ensure 
that the file is parsed correctly. The C<PDL::NiceSlice> module will 
preprocess your code on-the-fly, identify nice slicing syntax, 
and convert it to a normal Perl method call to the method C<nslice>
, before Perl can parse it. This normally works well, but because 
Perl's quoting syntax is so complicated, C<PDL::NiceSlice> doesn't 
properly recognize most quote constructs. So saying C<print "myval is $val ($units)\n";> will give you something different than you 
want. You can avoid that by not using as much string 
interpolation: C<print "myval is $val (".$units.")";> or 
C<printf "myval is %s (%s)",$val,$units;> should work fine. You can also 
shut off nice slicing with C<no PDL::NiceSLice;>, and resume by 
using it again just after your quote.

=head2 Slice - string-conrolled subfields of a PDL

The C<slice> method works almost exactly like C<NiceSlice>, except that 
it accepts a single string that contains the arguments. The 
string should consist of the same arguments that you would pass 
to C<NiceSlice>, with the exception of PDL indexing. Only numeric 
values and ranges are accepted. C<slice> was once the main way to 
create subarrays of PDLs, but once C<NiceSlice> became available it 
is mainly kept around for legacy reasons. 

=head2 Dice - pull arbitrary rows from a PDL

The C<dice> method performs the function of PDL indexing with 
C<NiceSlice>: it allows you to pull arbitrary collections of 
generalized rows from a source PDL. Dicing with C<dice> is 
deprecated, because the C<NiceSlice> syntax (or even C<slice>) is 
preferred.

=head2 Index - select elements from a 1-D PDL

This is used for extracting arbitrary elements from a 1-D PDL. 
For example:

  pdl> $a = xvals(100); print $a->index(pdl(43,10,21));
  [43 10 21]

The counterpart of C<index> is which, which extracts indices from a 
1-D PDL wherever a particular condition is met (see [sub:which]).

=head2 IndexND - select elements from an N-D PDL

You can extract and manipulate an arbitrary collection of 
elements from an C<N>-dimensional PDL with C<indexND>. C<IndexND> is a 
reduce operator: it collapses an index PDL by one dimension, 
using the vector in each row to look up a single value in a 
source PDL. Each row of the index PDL is treated as a vector that 
indexes an element of the source PDL, and you get back the 
collection of locations pointed to by the index. That makes 
C<indexND> a reduce operator on the index PDL.

C<indexND> is handy both for extracting data and for marking the 
source data set via dataflow: if you have a collection of image 
coordinates as a C<2xN> PDL, you can assign to the index PDL and 
mark the original image. C<IndexND> can accept and handle boundary 
conditions, in case your index might run off the edge of the 
source PDL - see the writeup for C<range>, below, for details. 

  pdl> $a = xvals(5,4)+10*yvals(5,4); print $a;
  [
   [ 0  1  2  3  4]
   [10 11 12 13 14]
   [20 21 22 23 24]
   [30 31 32 33 34]
  ]
  pdl> $idx = pdl([[2,3],[4,3]],
  ..(    > [[0,0],[0,1]],
  ..(    > [[0,2],[3,3]],
  ..(    > [[1,3],[0,3]]);
  pdl> print ($b = $a->indexND($idx));
  [
   [32 34]
   [ 0 10]
   [20 33]
   [31 30]
  ]
  pdl> $b .= 99; print $a;
  [
   [99  1  2  3  4]
   [99 11 12 13 14]
   [99 21 22 23 24]
   [99 99 99 99 99]
  ]


The C<indexND> call returns the elements addressed in each row of 
C<$idx>. C<$idx> is a C<2x2x4> PDL, so the elements are returned as a
C<2x4> PDL. They remain connected to C<$a>, so setting them updates the 
elements of C<$a>.

C<IndexND> is implemented via a convenience interface to the 
slightly more general range; please read the discussion of range, 
below, for more information on the limits of C<indexND>. If you want 
to interpolate values from arbitrary locations, you should look 
for C<interpND>, which is discussed in Chapter [cha:Basic-mathematics].

=head2 Range - select subfields from an N-D PDL

The most general selection operator in PDL is C<range>, which 
selects an arbitrary collection of subfields from the original 
PDL and returns them collated in a form suitable for threading. 
It is useful for interpolation, convolution, averaging, marking 
arbitrary locations in an original data set, or performing local 
operations at a set of arbitrary locations in a data set. C<range> 
works similarly to C<indexND> (above), except that each indexed 
location can refer not only to a scalar but also to an C<N>-D 
rectangular subfield of the original source array. This is handy, 
for example, for vectorizing some types of image processing: it 
is possible to "stack up" subregions of a large data set for 
threaded processing by a vectorized algorithm.

You call range with a source PDL and an index, just like C<indexND> 
-- but two optional arguments can follow -- a I<size array>, and a 
I<boundary condition>:

  $out = $source->range($index, $size, $boundary);

will extract a collection of ranges from C<$index>, and return them 
in C<$out>. The C<$index> must have at least one dimension, and each 
row of C<$index> is treated as a single vector pointing at a 
particular value in C<$source>. If you specify a single index 
location as a row vector, then range is essentially an expensive 
slice, with controllable boundary conditions. If C<$index>'s 0th 
dimension has size higher than the number of dimensions in 
C<$source>, then C<$source> is treated as though it had trivial dummy 
dimensions of size 1, up to the required number to be indexed by 
C<$index> -- so if your source array is 1-D and your index array is 
a list of 3-vectors, you get B<two> dummy dimensions of size 1 on 
the end of your source array.

B<Range sizes>

The C<$size> field allows you to extract C<N>-D rectangular ranges from 
C<$source>. If C<$size> is undef or zero, then you get a single sample 
out of C<$source> for each row of C<$index>. This behavior is similar 
to C<indexND>. If C<$size> is positive then you get a range of values 
from C<$source> at each location, and the output has extra 
dimensions allocated for them. 

C<$size> can be a scalar, in which case it applies to all 
dimensions, or an N-vector, in which case each element is applied 
independently to the corresponding dimension in C<$source>. Each 
element of C<$size> should be nonnegative. 

If an element of C<$size> is positive, then the corresponding output 
dim is made to have the indicated size. If an element is zero, 
then the corresponding output dim is omitted entirely. This 
allows you to distinguish, for example, between a C<3x1x2> output 
range at each location and a C<3x2> output range at each location 
(with the last output coordinate running over the third input 
coordinate).

B<Boundary conditions>

The C<$boundary> is a number, string, or list ref indicating the 
type of boundary conditions to use when the extracted ranges 
reach beyond the boundaries of C<$source>. If you specify no 
boundary conditions the default is to forbid boundary violations 
on all axes. If you specify exactly one boundary condition, it 
applies to all axes. If you specify more (for example, as 
elements of a list ref), then they apply to dimensions in the 
order in which they appear, and the last one applies to all 
subsequent dimensions. 

=over

=item *

B<0> or C<"f"> C<"forbid"> (default) Ranges are not allowed to cross the 
boundary of the original PDL. Disallowed ranges throw an error. 
The errors are thrown at evaluation time, not at the time of 
the range call (this is the same behavior as slice).

=item *

B<1> or C<"t"> C<"truncate> - Values outside the original piddle get the 
special value BAD if you've got bad value support compiled into 
your PDL and set the badflag for the source PDL; or 0 if you 
haven't (you must set the badflag if you want BADs for 
out-of-bound values, otherwise you get 0). Reverse dataflow 
works OK for the portion of the child that is in-bounds. The 
out-of-bounds part of the child is reset to (BAD|0) during each 
dataflow operation, but execution continues.

=item *

B<2> or C<"e"> C<"extend> - Values that would be outside the original PDL 
point instead to the nearest allowed value within the PDL. 

=item *

B<3> or C<"p"> C<"periodic> - Periodic boundary conditions apply: the 
numbers in C<$index> are applied, strict-modulo the corresponding 
dimensions of C<$source>. This is equivalent to duplicating the 
C<$source> piddle throughout C<N>-D space. 

=item *

B<4> or C<"m"> C<"mirror"> - Mirror-reflection periodic boundary 
conditions apply. 

=back

B<Output Dimensionality>

C<range> threads over both C<$index> and C<$source>. The returned 
dimension list is stacked up like this:

  (index thread dims), (index dims (size)), (source thread dims)

The first few dims of the output correspond to the thread dims of 
C<$index> (beyond the 0 dim). They allow you to pick out individual 
ranges from a large, threaded collection, so that the output 
normally has the same dimensionality as the C<$index>, but collapsed 
by one dimension.

The middle few dims of the output correspond to the size dims 
specified in C<$size>, and contain the range of values that is 
extracted at each location in C<$source>. Every nonzero element of 
C<$size> is copied to the dimension list here, so that if you feed 
in (for example) C<"$size = [2,0,1]"> you get an index dim list of 
C<"(2,1)">.

The last few dims of the output correspond to extra dims of 
C<$source> beyond the number of dims indexed by C<$index>. These dims 
act like ordinary thread dims, because adding more dims to 
C<$source> just tacks extra dims on the end of the output. Each 
source thread dim ranges over the entire corresponding dim of 
C<$source>.

B<Examples>


Here are basic examples of range operation, showing how to get 
ranges out of a small matrix. The first few examples show 
extraction and selection of individual chunks. The last example 
shows how to mark loci in the original matrix (using dataflow).

  pdl> $src = 10*xvals(10,5)+yvals(10,5) 
  pdl> print $src->range([2,3]) # Cut out a single element 
  23 
  pdl> print $src->range([2,3],1) # Cut out a single 1x1 block 
  [
   [23] 
  ] 
  pdl> print $src->range([2,3], [2,1]) # Cut a 2x1 chunk 
  [
   [23 33]
  ] 
  pdl> print $src->range([[2,3]],[2,1]) # Trivial list of 1 chunk 
  [
   [
    [23] [33] 
   ] 
  ] 
  pdl> print $src->range([[2,3],[0,1]], [2,1]) # two 2x1 chunks 
  [
   [ 
    [23 1] 
    [33 11] 
   ] 
  ] 
  pdl> # A 2x2 collection of 2x1 chunks 
  pdl> print $src->range([[[1,1],[2,2]],[[2,3],[0,1]]],[2,1]) 
  [
   [ 
    [ 
     [11 22] 
     [23 1] 
    ] 
    [ 
     [21 32] 
     [33 11] 
    ] 
   ] 
  ] 

=head1 Location Operators


Location operators are the opposite of indexing operators: they 
return the elements or locations where a particular expression is 
true, allowing you to filter a large array and act on an 
arbitrary subset of it. PDL's location and filtering operators 
are C<where>, C<which>, and C<whichND>. Each operator accepts a PDL filter 
expression, and returns either source elements or index values 
corresponding to the locations where the filter expression is 
true.

The C<where> operator combines location and selection into a single 
step: it returns the actual elements of the source PDL, so that 
you can copy them out or act on them directly (via dataflow, the 
elements remain connected to the original data). The other two 
operators return indices of the locations where the source 
expression is true.

Location operators use PDLs to represent sets (the set of 
elements for which a condition is true). The null set is 
correctly represented -- if the filtering condition is false 
everywhere, C<where>, C<which>, and C<whichND> will each return the 
special I<empty> PDL, which has 0 elements.

=head2 The C<where> operator

The C<where> operator rolls up the operations of location and 
selection in a single routine. You can say:

  $out = $source->where( condition($source) )

to retrieve all (and only) the elements of $out that correspond 
to true elements of the expression C<condition($source)>. The data 
remain connected back to the original C<$source>. For example:

  $source = - (xvals(10) % 2);
  print “Source is $source.\n”;
  $zeroes = $source->where( !$source );
  $zeroes .= xvals($zeroes);
  print “Source is now $source.\n”;

outputs:

  Source is [0 -1 0 -1 0 -1 0 -1 0 -1].
  Source is now [0 -1 1 -1 2 -1 3 -1 4 -1].

Often, you'd like to select the same collection of elements from 
several PDLs at once, and where can handle that. For example, if 
C<$x> and C<$y> contain coordinates of a collection of rocks, and
C<$mass> 
contains the mass of each rock, then you can say:

  ($xsub, $ysub, $msub) = where($x, $y, $mass, $mass > 10 & $x < 0 );

to select the coordinates and mass of just the rocks with a mass 
greater than 10, that also happen to be placed to be left of the 
origin. (Notice that the example uses the bitwise-and operator C<&>, 
not the more familiar logical-and operator C<&&> - to find out why, 
check out [sec:PDLs-as-logical], below).

=head2 The C<which> operator

The simplest indexing function is C<which>. It accepts a PDL 
expression and returns a list of all the offset locations where 
the expression is true. If the source expression has more than 
one dimension, then it is B<flattened> to one dimension first using 
C<flat> ([sub:Rearranging-a-dim]) so that it can be indexed with a 
single number. Note that this may or may not be a useful way to 
address thread dimensions, depending on your application. You can 
use the returned index list either in C<index> or in a C<NiceSlice> 
expression, to get access to the elements. For exmple:

  $dex = which( $source==5 );
  $fives = $source->($dex); # niceslice
  $fives = $source->index($dex); # index

=head2 The C<whichND> operator

For any kind of indexing that is more sophisticated than C<which>, 
you probably want C<whichND>, which returns a collection of vectors 
into the source expression, rather than simple offsets. If you 
call C<whichND> in scalar context, you get back an C<nxm> PDL 
whose 0 dim runs across dimension in the vectors and whose 1 dim 
runs across found locations. If you call it in array context, you 
get back C<n> separate C<m>-PDLs, each of which contains one dimension 
of the entire list of vectors. The scalar output of C<whichND> is 
suitable for use with C<indexND> and range. 


Here is an example of using C<whichND> to extract purple areas from 
an RGB image. The demo PDL logo is a JPEG image, with pixel 
values running from 0-255. We first generate a mask expression 
that is true when the red and blue components are over 40 and 
100, respectively, and green is under 70 -- the purple portion of 
the RGB palette. The C<whichND> operator yields the coordinates of 
all nonzero pixels in the mask.

  pdl> #wget http://pdl.perl.org/images/pdllogo.jpg
  pdl> $a=rim('pdllogo.jpg'); # $a is 250(X) x 150(Y) x 3(RGB)
  pdl> $mask = andover ( ($a->mv(2,0) * pdl(1,-1,1)) > pdl(40,-70,100) );
  pdl> $coords = whichND($mask);  

Now C<$pixels> is a C<2x4298> PDL containing the coordinates of every 
purple pixel in the image. After extracting them with C<indexND>, 
it is possible to change the original image:

  pdl> $pixels = $a->indexND($coords) # 4298 x 3 
  pdl> $pixels .= pdl(10,40,200)->(*1); # flows to $a
  pdl> wim($a,'pdllogo-blue.jpg');

[float Figure:
[Figure 3.1:
whichND example. Left: original image. Right: after pixel 
selection and processing with whichND and indexND ([sub:whichND]
).
]