Fast Autonomous Unsupervised Multidimensional (FAUM) Clustering.
1. INTRODUCTION
This is the proof-of-concept implementation of the FAUM Clustering method
presented in [1]. This implementation was used to perform the published results
and is now released in the hope that it will be useful.
2. COPYRIGHT AND DISCLAIMER
Copyright (C) 2015-2024 Hugo Javier Curti, Ruben Sergio Wainschenker
This file is part of FAUM.
FAUM is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
FAUM is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with FAUM. If not, see <https://www.gnu.org/licenses/>.
3. COMPONENTS INCLUDED IN THIS PACKAGE
README: Introduction and base documentation.
COPYING: Terms of Usage and Distribution.
PRODUCTION BINARIES:
clasif0:
Generate only Order 0 hyper-histogram of FAUM clustering. It can generate
the hyper-histogram summary, the hyper-histogram data or both.
The hyper-bin side size must be indicated in the command line, and the
input data in Portable Arbitrary Map (PAM) format must be provided,
either through standard input, or as a file name in the command line.
Use 'clasif0 --help' to consult synopsis and available options.
createPamImage:
Utility to create a Portable Arbitrary Map (PAM) with information
taken from different files, one per depth channel. PGM files, or crude
data files containing integers in big endian notation, one per depth
channel in western natural reading order (lef-right-up-down) may be used.
In the case of crude data, the parameters of the image must be specified
in the command line. Cropping options are provided.
Use 'createPamImage --help' to consult synopsis and available options.
See USAGE for typical usage scenario, and EXAMPLES for examples of PAM
generation from typical satellite imagery formats.
data2pam:
Utility to create a Portable Arbitrary Map (PAM) from a text file
containing a list of integer numbers separated by space. The parameters
of the Map must be specified in the command line. In particular, the
end of line is considered as normal space and no assumption is made
about its position and/or count. Numbers must be ordered in points
(one number per depth channel) in western natural reading order
(left-right-up-down).
Use 'data2pam --help' to consult synopsis and available options.
See USAGE for typical usage scenario, and EXAMPLES for examples of PAM
generation from text files.
faum:
Main program. Perform the full FAUM algorithm on a Portable Arbitrary
Map (PAM) provided through standard input or as file name in the command
line (which is more efficient). By default the automatic clustering
algorithm is applied, starting zero order clustering step from the
bigger hyper-bin side size, looking for the optimum size using both
Plural Hyper-bin count and Cardinality Dispersion empirical methods. The
first optimum found is used for the first order clustering step with
distance 1. Options may be used to either use a particular hyper-bin
side size and a different distance, or to choose the preferred empirical
method for zero order clustering step. Options may be used to indicate
a hint on the number of clusters to find and trigger the experimental
second order clustering step to find the optimum hyper-bin side size
and optimum distance. Options may also be used to avoid the first order
postclustering step, to use a finer grain point postclustering step, or
to use a distance function other than Chebyshev for the clustering or
postclustering steps (Euclidean and Minkowsky are available). The
implementation generates many different outputs, including a PNM image
using different colors to represent the 16 most populated cluster's
points, a full PGM with the classification as raster data and the full
data in csv format with a classification column. It can also dump the
cluster histogram, with seed, centroid and deviation.
Use 'faum --help' to consult synopsis and available options.
See USAGE for typical usage scenarios.
kmeans:
This is fast K-Means implementation, designed using the same optimization
strategy used in FAUM. Although its original purpose was to achieve a
fair quantitative comparation between FAUM and K-Means, it is now fully
functional. Its interface and output options are simmilar to FAUM. The
number of clusters to find is a mandatory argument. Several
initialization methods are implemented, including manual (indicating the
list of centroids), Forgy (random) and the default KMeans++. In
particular, the manual initialization may be used to feed the centroids
produced by FAUM into K-Means.
Use 'kmeans --help' to consult synopsis and available options.
hypercubeGenerator:
hypercubeGenerator16:
These utilities are used to generate the Big Data Hypercubes Set used
to test FAUM and perform a quality and timing comparison between FAUM
and K-Means. The shape of the dataset is hardcoded, but the actual points
are randomly created on invocation. hypercubeGenrator and
hypercubeGenerator16 create a dataset with 8 and 16 bits per sample
respectively.
Use 'hypercubeGenerator --help' and 'hypercubeGenerator16 --help' to
consult synopsis and available options.
readPamImage:
Small utility useful to extract any three depth channel from a Portable
Arbitrary Map (PAM) file and create a Portable Any Map (PNM) file, or
any depth channel in the PAM file to create a Portable Grey Map (PGM)
file. PNM/PGM tools available in the NetPBM package may be used to
export the generated Map to many standard image format (PNG, TIF, JPG,
etc.).
Use 'readPamImage --help' to consult synopsis and available options.
DEVELOPMENT BINARIES:
These binaries were used during FAUM research and implementation. They
are not ment to be used in production scenarios, and are distributed
for historical and completeness reasons only.
sizes:
Used to quickly get the basic type sizes during tests.
calculoScott:
Used to perform the Scott optimal bin size on histogram. It was tested
and then discarded during FAUM research.
clasif1_chebyshev_fixed:
First simple version of FAUM First Order Clustering Step, now superseded
by the faum utility.
prueba_rnorrexp:
Small utility used to test the quality of the Gaussian distribution
generator.
test_crc32 (available when configured with gcrypt-crc option):
During the Resarch, the use of a CRC as hashing function for FAUM was
proposed. This utility test the Gcrypt CRC function.
test_hash_speed (available when configured with gcrypt-crc option):
Timing comparison between the ad hoc hashing function used in FAUM and
the Gcrypt CRC function.
4. USAGE
The typical usage process implies the following steps:
a) Prepare input data:
- In case of image or raster data, use NetPBM tools to generate PGM files
or other tool to extract the Raw Binary Data or numbers into text file.
- Normalize data into 8, 16, 24 or 32 bit unsigned integer when needed.
- Generate a Portable Arbitrary Map (PAM) file from text or binary data.
b) Execute the clustering process:
- Choose the preferred clustering options.
- Choose the desired output(s).
- In case of FAUM as initialization to K-Means, collect the centroids
and invoke K-Means using them.
c) Prepare output data.
- In case of PNM Image with most populated clusters, use NetPBM tools to
generate an image in the preferred format.
- In case of raster output, use NetPBM tools to generate a raster in the
preferred format. If GeoTiff input is used, the GeoTiff Headers may be
extracted from the input file and installed in the output file.
- In case of text output, classification may be easily extracted from the
whole dataset.
5. EXAMPLES
Example 1: Generate a PAM file from Landsat 7 Image data files.
Landsat 7 images are presented in one 8 bit unsigned integers binary
file per band (a.k.a. depth channel). The image geometry data must be
extracted from the HRF file before using the createPamImage tool to create
the PAM file. An example with L71225086_08620000119 is illustrated here:
a) Extract image geometry from HRF file. The relevant part is shown below:
PIXELS PER LINE =7476 LINES PER BAND =7040 /7040
OUTPUT BITS PER PIXEL =8
Therefore: 7040x7476, 8 bits (1 byte) per pixel.
b) Execute createPamImage for the bands 1,2,3,4,5,7 (excluding Thermals):
createPamImage -l 1,2,3,4,5,7 -b 1 -w 7476 -h 7040 \
-o l71225086_08620000119.pam \
l71225086_08620000119_b10.fst \
l71225086_08620000119_b20.fst \
l71225086_08620000119_b30.fst \
l71225086_08620000119_b40.fst \
l71225086_08620000119_b50.fst \
l72225086_08620000119_b70.fst
The l71225086_08620000119.pam file is ready to be used as FAUM input. (See
Example 6)
Example 2: Generate a PAM file from a crop of a Landsat 8 Image.
Landsat 8 images are presented in one integer TIFF binary file per band
(a.k.a. depth channel). The createPamImage tool combined with tifftopnm
tool can be used to create the PAM file. There is no need to indicate the
geometry of the image, since it is read from the TIFF header.
An example of extrating a crop from the LC82240872013279LGN00 image is
illustrated here. The crop is 430x350 size and starts at point 2860x2440.
createPamImage -l 1,2,3,4,5,6,7 -x 2860 -y 2440 -c 430 -r 350 \
-o LC82240872013279LGN00_crop.pam \
<(tifftopnm LC82240872013279LGN00_B1.TIF) \
<(tifftopnm LC82240872013279LGN00_B2.TIF) \
<(tifftopnm LC82240872013279LGN00_B3.TIF) \
<(tifftopnm LC82240872013279LGN00_B4.TIF) \
<(tifftopnm LC82240872013279LGN00_B5.TIF) \
<(tifftopnm LC82240872013279LGN00_B6.TIF) \
<(tifftopnm LC82240872013279LGN00_B7.TIF)
The LC82240872013279LGN00_crop.pam file is ready to be used as FAUM input.
(See Example 5)
Example 3: Generate a PAM file from a raster 0-100 normalized GeoTIFF file
FAUM uses integer numbers for input, avoiding the floating point unit thus
working faster. In order to use floating point data, it must be translated/
normalized to integer before processing them.
This might seem difficult, but standard open source tools exist to translate
the data. In the case of GeoTIFF and other raster formats, the Geospatial
Data Abstraction Library and its tools may be used.
In this example, the file raster.gtiff is a 6 depth channel 0-100 normalized
floating point raster data to be exported into a 16 bit unsigned integer PAM
file. 24 or 32 bits could also be used, but no need of such precision
exists in this dataset. Use the gdal_translate tool (from the GDAL library)
to extract and normalize each depth channel. Then use the createPamImage
tool to generate the PAM file.
gdal_translate -ot int16 -scale 0 100 0 65535 -b 1 raster.gtiff raster_1.tif
gdal_translate -ot int16 -scale 0 100 0 65535 -b 2 raster.gtiff raster_2.tif
gdal_translate -ot int16 -scale 0 100 0 65535 -b 3 raster.gtiff raster_3.tif
gdal_translate -ot int16 -scale 0 100 0 65535 -b 4 raster.gtiff raster_4.tif
gdal_translate -ot int16 -scale 0 100 0 65535 -b 5 raster.gtiff raster_5.tif
gdal_translate -ot int16 -scale 0 100 0 65535 -b 6 raster.gtiff raster_6.tif
createPamImage -l 1,2,3,4,5,6 -o raster.pam \
<(tifftopnm raster_1.tif) \
<(tifftopnm raster_2.tif) \
<(tifftopnm raster_3.tif) \
<(tifftopnm raster_4.tif) \
<(tifftopnm raster_5.tif) \
<(tifftopnm raster_6.tif)
The GeoTIFF metadata may also be exported to be used in the output. The
'listgeo' tool from the GeoTIFF library is useful for that purpose:
listgeo raster.gtiff > raster_metadata.txt
The raster.pam file is ready to be used as FAUM input. (See Example 7)
Example 4: Generate a PAM image from integer numbers in a text file.
This example generates a PAM file from the 1024 dimension variance 100 G2
dataset [2]. The dataset contains 2048 points of 1024 dimension each.
Numbers are between 45 and 1056, thus fitting in a 16 bit PAM. Use the
data2pam tool to generate the PAM file, fitting the points in a 32x64
pseudoimage, using one depth channel per dimension:
data2pam -b 2 -w 32 -h 64 -d 1024 g2-1024-100.txt g2-1024-100.pam
The g2-1024-100.pam file is ready to be used as FAUM input. (See Example 8)
The g2-1024-100.txt file is included in the examples directory. The full
G2 dataset is available at https://cs.joensuu.fi/sipu/datasets/g2-txt.zip
Example 5: Default FAUM processing on a Satellite Image
In this example, FAUM is executed on the LC82240872013279LGN00_crop.pam
generated in example 2. Generate the pnm image with the 16 most populated
clusters. Use the default parameters. The summary generated by FAUM is
shown:
faum -o LC82240872013279LGN00_crop.pnm LC82240872013279LGN00_crop.pam
===
SUMMARY:
Clusterized bins: 2561
Unclusterized bins (before postclustering): 1251
Total clusters: 63
Cardinality dispersion: 44
===
The LC82240872013279LGN00_crop.pam file is included in the examples
directory.
Example 6: Execute FAUM indicating a hyper-bin side size and a distance.
In this example, FAUM is executed on the l71225086_08620000119.pam
generated in the example 1, using predefined hyper-bin side size of 4 (2^2)
and a distance 3. Generate the output with the most populated clusters in
png format. Dump the full class histogram. The summary and the first part
of the histogram generated by FAUM are shown:
faum -o >(pnmtopng > l71225086_08620000119.png) -h -f2,3 \
l71225086_08620000119.pam
===
SUMMARY:
Clusterized bins: 430578
Unclusterized bins (before postclustering): 275692
Total clusters: 549
Cardinality dispersion: 199
===
Histogram dump:
15401621: Seed: (0000 0000 0000 0000 0000 0000)
Centroid (0 0 0 0 0 0) Deviation (0 0 0 0 0 0)
11819558: Seed: (0013 0010 000F 0013 001C 000F)
Centroid (78 65 62 76 112 62) Deviation (8 11 11 14 12 12)
5709734: Seed: (0014 0011 0011 0013 0023 0014)
Centroid (83 71 74 74 135 81) Deviation (7 4 6 13 17 14)
.
. <other 545 clusters>
.
1: Seed: (003F 003F 003F 0033 0035 002C
Centroid (255 255 255 205 212 178) Deviation (0 0 0 0 0 0)
Example 7: Generate a raster with the classification data from a GeoTIFF.
In this example, FAUM is executed on the raster.pam file generated in the
example 3. The raster is transformed to tif and the metadata header is
incorporated.
faum -R >(pnmtotif > class_raster.tif) raster.pam
geotifcp -g raster_metadata.txt class_raster.tif class_raster.gtiff
Example 8: Generate a text output with classification data.
In this example, FAUM is executed on the g2-1024-100.pam file generated
in the example 4. A hint on the number of clusters to find is indicated
to FAUM (in this case, exactly 2 clusters), and Euclidean distance is used
in the postclustering step. A csv output is generated with a classification
column prepended, suitable for importing into a spreadsheet. A visual
representation of the classification is also generated.
faum -O g2-1024-100.csv -o g2-1024-100.pnm -m 2 -E g2-1024-100.pam
Example 9: Use FAUM as initialization to K-Means.
In this example, FAUM is executed on the g2-32-80.pam file generated
from the G2 dataset [2] in a simmilar manner than the one shown in
example 4.
This is a difficult dataset and the classification may be enhanced by a
normal K-Means processing using the centroids computed by FAUM. At the
moment, centroid information must be parsed from FAUM output. A csv output
is generated with a classification column prepended, suitable for importing
into a spreadsheet. A visual representation of the classification is also
generated.
faum -h -m 2 -E g2-32-80.pam | \
sed -ne 's/.*Centroid (\([0-9 ]\+\)).*/\1/p' > centroids.txt
kmeans -C centroids.txt -n 2 -O g2-32-80.csv -o g2-32-80.pnm -R 1 \
g2-32-80.pam
The g2-1024-100.txt and the g2-32-80.txt files (to generate the .pam files
as shown in example 4) are included in the examples directory. The full G2
dataset is available at http://cs.joensuu.fi/sipu/datasets/g2-txt.zip
Example 10: Run FAUM on the Big Data Hyper-Cubes Set.
This example shows the generation and processing of a Big Data Hypercubes
Set in one step. The -s option may be used to obtain a deterministic
dataset. Warning: a temporary file will be created. Generate a visual
representation of the ouput clustering.
hypercubeGenerator -s 1543554 | faum -o hypercube.pnm
BIBLIOGRAPHY
[1] H.J.Curti and R.S.Wainschenker, FAUM: Fast Autonomous Unsupervised
Multidimensional classification, Information Sciences, Volume 462,
2018, Pages 182-203, ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2018.06.008.
[2] P.Fränti, O.Virmajoki and V.Hautamäki, Fast Agglomerative Clustering Using
a K-Nearest Neighbor Graph, IEEE Trans. on Pattern Analysis and Machine
Intelligence, Volume 28(11), 2006, Pages 1875-1881.