Read Me

This is the readme for the rdf-kd program, written by D Haley.

This program is intended for use on atom probe datasets, though it could probably be adapted to other kinds of data fairly easily.

Description:
The program attempts to find the statistical separation distances between atoms in the dataset by generating a histogram of the NN distances, from the 1st NN up to some specified value. The inputs to the program are a flat file in "pos" format (4*(4 byte IEEE 754 float, little endian), {x,y,z,label}), giving the co-ordinates of each atom and the atom's mass to charge ratio, as well as a "rng" (text) file giving the mass to charge ranges that identify particular atoms.
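
For illustration, a minimal sketch of a reader for this flat "pos" layout is given below. The struct and function names are examples only (not taken from the rdf-kd source), and the sketch assumes the host machine is little endian, matching the on-disk format described above. Each record is 16 bytes, so the number of atoms is simply the file size divided by 16.

// Minimal sketch of a "pos" reader: 4 x (4-byte IEEE 754 float,
// little endian) per atom, ordered {x, y, z, mass-to-charge}.
// Names are illustrative, not taken from rdf-kd itself.
#include <cstdio>
#include <vector>

struct PosAtom
{
	float x, y, z;
	float massToCharge; // the "label" used for ranging
};

// Read every 16-byte record in the file; assumes a little endian host.
bool readPos(const char *filename, std::vector<PosAtom> &atoms)
{
	FILE *f = fopen(filename, "rb");
	if (!f)
		return false;

	float record[4];
	while (fread(record, sizeof(float), 4, f) == 4)
	{
		PosAtom a;
		a.x = record[0];
		a.y = record[1];
		a.z = record[2];
		a.massToCharge = record[3];
		atoms.push_back(a);
	}
	fclose(f);
	return true;
}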

Major Caveats:
As this program is optimised for speed, it uses an expanding search sphere method to locate nearby points. As a result, if two points are equidistant from a search point, the second point will be skipped during the search. This means that the amplitudes of RDFs for noiseless data will not be sensible, although the positions at which peaks occur will be. This can easily be circumvented by applying a small quantity of additive noise to your dataset first.
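
If you need to apply that additive noise yourself, something along the following lines would do. This is only a sketch; the amplitude (0.01 units here) is an arbitrary example and should be kept small relative to the interatomic spacing in your data.

// Sketch: add a small uniform jitter to each coordinate so that no two
// points are exactly equidistant from a search point. The amplitude below
// is an arbitrary example value.
#include <random>
#include <vector>

struct Point { float x, y, z; };

void addJitter(std::vector<Point> &pts, float amplitude = 0.01f)
{
	std::mt19937 rng(12345); // fixed seed, so runs are reproducible
	std::uniform_real_distribution<float> d(-amplitude, amplitude);
	for (Point &p : pts)
	{
		p.x += d(rng);
		p.y += d(rng);
		p.z += d(rng);
	}
}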

Code exists to perform exact matching (see 3depict's K3DTree-mk2), but it has not yet been integrated into RDF-KD.


Outline of the algorithm:

The algorithm employed is relatively straightforward, with the major steps outlined below:

1) Load the POS file
2) Load the range file
3) Trim out atoms from the POS file if they do not fall into one of the mass to charge ranges specified in the range file
4) Calculate which atoms are considered "surface" atoms and which are considered "core" atoms. Surface atoms can be found by the nearest neighbour search, but are not used to initiate one. This is done to prevent surface atoms from skewing the NN distribution: the material outside the analysis volume is unknown, so a search started from a surface atom would turn up NNs that are not valid.
4a) To remove surface atoms, the convex hull of the dataset is calculated. The midpoint of the hull (the average of all points that form the hull; I am unsure of the validity of this, as it seems to me that one should intuitively use the centre of mass for such calculations) is then found, along with the shortest distance D from that midpoint to the hull surface. Using this shortest distance, a scaling factor s is calculated such that D - sD = k, where k is the "depth from surface" parameter. The hull is then translated to centre on the midpoint, scaled by s, and the translation is undone (a sketch of this scaling follows the list).
4b) Given this scaled or "reduced" hull, all points that lie within it are defined to be "core" and those outside to be "surface" (also referred to as "non-core"). As previously stated, the core atoms are used as search start locations, whilst both surface and core atoms are used for matching.
5) Now all atoms are placed in a single balanced K-D tree. 
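
The scaling in step 4a amounts to a translate-scale-untranslate about the hull midpoint. A minimal sketch is shown below, assuming the midpoint and the shortest midpoint-to-hull-surface distance D have already been computed; the names are illustrative and not taken from the rdf-kd source.

// Sketch of the step 4a scaling: given the hull midpoint, the shortest
// distance D from the midpoint to the hull surface and the desired
// "depth from surface" k, shrink the hull points about the midpoint by
// the factor s chosen so that D - s*D = k. Assumes 0 < k < D.
#include <vector>

struct Point { float x, y, z; };

void shrinkHull(std::vector<Point> &hullPts, const Point &midpoint,
                float shortestDist, float depthFromSurface)
{
	float s = 1.0f - depthFromSurface / shortestDist; // so D - s*D = k

	for (Point &p : hullPts)
	{
		// translate to the midpoint, scale, then undo the translation
		p.x = midpoint.x + s * (p.x - midpoint.x);
		p.y = midpoint.y + s * (p.y - midpoint.y);
		p.z = midpoint.z + s * (p.z - midpoint.z);
	}
}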

If specifying NN Max
	6) A find-nearest-Q-neighbours search is run from each core atom, with exact matching disabled by the creation of a "dead-zone" (a region inside which atoms will be ignored for the purposes of NN matching) whose radius is the machine's floating point epsilon. The resulting distances are placed into a giant list which is later histogrammed.
	7) By a combination of scanning storage and duplicate calculation, the largest NN distance is found, which determines the histogram bin width. Once this is known, the distances from any single calculation can simply be dumped into the histogram rather than requiring additional memory to store (a sketch of this follows the output description below).
	8) Results are then written to STDOUT in a tabular format.

	Output:

	The output comes in the following format:

	Bin width:
		v1	v2 	v3 ... vn overall
	Histogram:
	v1_1	v1_2	v1_3 ...	v1_n	v1_overall
	v2_1	...				v2_overall

	...

	vnumBins_1	...			vnumBins_overall

	Where each v stands for a floating point value in the bin width row, and an integer count in the histogram rows.

	The bin numbers go down the columns, and the NNs go across, with the exception of the final column, which gives the overall results.
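
As referenced in step 7, the binning and tabular output for the NN max mode might look roughly like the sketch below. It assumes the NN distances have already been gathered into one list per query point, and that the "overall" column is binned using the largest distance seen across all NNs; names and details are illustrative, not taken from the rdf-kd source.

// Sketch of the histogramming and tabular output described in steps 7
// and 8. nnDists[i][q] is assumed to hold the distance from query point
// i to its (q+1)-th nearest neighbour.
#include <cstddef>
#include <cstdio>
#include <vector>

void dumpHistogram(const std::vector<std::vector<float> > &nnDists,
                   size_t numBins)
{
	if (nnDists.empty() || numBins == 0)
		return;
	size_t numNN = nnDists[0].size();

	// One maximum (and hence bin width) per NN, plus an "overall" value
	// taken over every NN combined.
	std::vector<float> maxDist(numNN + 1, 0.0f);
	for (size_t i = 0; i < nnDists.size(); i++)
		for (size_t q = 0; q < numNN; q++)
		{
			if (nnDists[i][q] > maxDist[q])
				maxDist[q] = nnDists[i][q];
			if (nnDists[i][q] > maxDist[numNN])
				maxDist[numNN] = nnDists[i][q];
		}

	std::vector<float> binWidth(numNN + 1);
	for (size_t q = 0; q <= numNN; q++)
		binWidth[q] = (maxDist[q] > 0.0f) ? maxDist[q] / numBins : 1.0f;

	// hist[bin][q]: counts per bin and per NN; the final column is "overall".
	std::vector<std::vector<unsigned int> > hist(
		numBins, std::vector<unsigned int>(numNN + 1, 0));
	for (size_t i = 0; i < nnDists.size(); i++)
		for (size_t q = 0; q < numNN; q++)
		{
			size_t b = (size_t)(nnDists[i][q] / binWidth[q]);
			if (b >= numBins)
				b = numBins - 1;
			hist[b][q]++;

			size_t bAll = (size_t)(nnDists[i][q] / binWidth[numNN]);
			if (bAll >= numBins)
				bAll = numBins - 1;
			hist[bAll][numNN]++;
		}

	// Tabular layout as described above: bin widths first, then one row
	// per bin, one column per NN plus the overall column.
	printf("Bin width:\n");
	for (size_t q = 0; q <= numNN; q++)
		printf("\t%f", binWidth[q]);
	printf("\nHistogram:\n");
	for (size_t b = 0; b < numBins; b++)
	{
		for (size_t q = 0; q <= numNN; q++)
			printf("\t%u", hist[b][q]);
		printf("\n");
	}
}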
If specifying Maximum Distance
	6) A single NN search is used and is continually expanded by increasing the dead-zone around the query point, feeding in the last matched distance as the new dead-zone radius. This means that once the 1st NN is matched, the dead-zone is grown to include it, and this is repeated until the maximum distance is exceeded, at which point the final result is discarded.
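
A minimal sketch of this expanding dead-zone loop is given below. For self-containment the neighbour query is a brute-force stand-in; in rdf-kd the query is answered by the balanced K-D tree, and all names here are illustrative.

// Sketch of the expanding dead-zone loop for the "maximum distance" mode.
// findNearest is a brute-force stand-in for the k-d tree query: it returns
// the distance to the nearest point strictly outside the dead-zone radius,
// or a negative value if no such point exists.
#include <cmath>
#include <vector>

struct Point { float x, y, z; };

float findNearest(const std::vector<Point> &pts, const Point &query,
                  float deadZone)
{
	float best = -1.0f;
	for (const Point &p : pts)
	{
		float dx = p.x - query.x, dy = p.y - query.y, dz = p.z - query.z;
		float d = std::sqrt(dx * dx + dy * dy + dz * dz);
		if (d > deadZone && (best < 0.0f || d < best))
			best = d;
	}
	return best;
}

// Gather every NN distance from 'query' up to 'maxDist', expanding the
// dead-zone by feeding the last matched distance back in each iteration.
std::vector<float> nnDistancesWithin(const std::vector<Point> &pts,
                                     const Point &query, float maxDist)
{
	std::vector<float> dists;
	float deadZone = 0.0f; // excludes the query point itself (distance 0)
	for (;;)
	{
		float d = findNearest(pts, query, deadZone);
		if (d < 0.0f || d > maxDist)
			break; // no more neighbours, or past the cutoff: result discarded
		dists.push_back(d);
		deadZone = d;
	}
	return dists;
}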


There now exists an option to perform a 3D voxel based RDF, where the pair information is stored per voxel. If you wish to use this, I currently highly recommend the "box" mode over the "sparse" mode. The box mode will not handle large numbers of bins, as you will rapidly run out of RAM; (2n+1)^3*4 is a large number of bytes, apparently. The sparse mode costs significantly more CPU time (dependent on the density, up to an order of magnitude or two), with the speed decreasing as the simulation continues, most likely in a logarithmic manner. The sparse mode will use less RAM ONLY IF 48.07...% or fewer of the boxes have pairs in them (the data structure for storing the hits in this mode is 4 times larger, but only data inside the sphere contained within the bin cube will ever be recorded), and it will always be slower.
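
To get a feel for the box mode memory cost, the (2n+1)^3*4 bytes figure can be evaluated directly; for example, n = 250 already needs roughly 480 MiB. A trivial sketch:

// Quick estimate of box-mode memory use: (2n+1)^3 voxels at
// 4 bytes (one unsigned int) each, as described above.
#include <cstdio>

int main()
{
	for (unsigned long n = 50; n <= 400; n += 50)
	{
		unsigned long side = 2 * n + 1;
		double bytes = (double)side * side * side * 4.0;
		printf("n = %lu -> %.1f MiB\n", n, bytes / (1024.0 * 1024.0));
	}
	return 0;
}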

The output is stored in a rectilinear raw format, where each voxel is an unsigned integer containing the number of pairs in that box. This can be viewed in an external data viewer; I recommend "paraview".
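
If you want to post-process the raw output outside of a viewer, a reader is straightforward. The sketch below assumes each voxel is a 4-byte unsigned integer in host byte order (consistent with the 4-bytes-per-box figure above) and that you know the cube side length used for the run; names are illustrative.

// Sketch: load the rectilinear raw output into memory.
// 'side' is the number of voxels along each axis, i.e. (2n+1).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

bool readRawVoxels(const char *filename, size_t side,
                   std::vector<uint32_t> &voxels)
{
	FILE *f = fopen(filename, "rb");
	if (!f)
		return false;
	voxels.resize(side * side * side);
	size_t nRead = fread(voxels.data(), sizeof(uint32_t), voxels.size(), f);
	fclose(f);
	return nRead == voxels.size();
}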


Some sensibility checks:
By approximating the pos file as a cone, the shrinking of the hull by k units should result in a volume reduction (v_new/v_old) of (1 - (4k)/h)^3, where h is the height of the cone (note that this seems independent of the taper angle; an interesting result, eh?). As an example, applying this formula to a 220nm pos file with a 1.5nm reduction should leave (1-(1.5*4)/220)^3, or about 92%, of the original volume. This is an approximation only.