Faum Code
Fast Autonomous Unsupervised Multidimiensional Classification
Status: Beta
Brought to you by:
hcurti
Fast Autonomous Unsupervised Multidimensional (FAUM) Clustering. 1. INTRODUCTION This is the proof-of-concept implementation of the FAUM Clustering method presented in [1]. This implementation was used to perform the published results and is now released in the hope that it will be useful. 2. COPYRIGHT AND DISCLAIMER Copyright (C) 2015-2024 Hugo Javier Curti, Ruben Sergio Wainschenker This file is part of FAUM. FAUM is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. FAUM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with FAUM. If not, see <https://www.gnu.org/licenses/>. 3. COMPONENTS INCLUDED IN THIS PACKAGE README: Introduction and base documentation. COPYING: Terms of Usage and Distribution. PRODUCTION BINARIES: clasif0: Generate only Order 0 hyper-histogram of FAUM clustering. It can generate the hyper-histogram summary, the hyper-histogram data or both. The hyper-bin side size must be indicated in the command line, and the input data in Portable Arbitrary Map (PAM) format must be provided, either through standard input, or as a file name in the command line. Use 'clasif0 --help' to consult synopsis and available options. createPamImage: Utility to create a Portable Arbitrary Map (PAM) with information taken from different files, one per depth channel. PGM files, or crude data files containing integers in big endian notation, one per depth channel in western natural reading order (lef-right-up-down) may be used. In the case of crude data, the parameters of the image must be specified in the command line. Cropping options are provided. Use 'createPamImage --help' to consult synopsis and available options. See USAGE for typical usage scenario, and EXAMPLES for examples of PAM generation from typical satellite imagery formats. data2pam: Utility to create a Portable Arbitrary Map (PAM) from a text file containing a list of integer numbers separated by space. The parameters of the Map must be specified in the command line. In particular, the end of line is considered as normal space and no assumption is made about its position and/or count. Numbers must be ordered in points (one number per depth channel) in western natural reading order (left-right-up-down). Use 'data2pam --help' to consult synopsis and available options. See USAGE for typical usage scenario, and EXAMPLES for examples of PAM generation from text files. faum: Main program. Perform the full FAUM algorithm on a Portable Arbitrary Map (PAM) provided through standard input or as file name in the command line (which is more efficient). By default the automatic clustering algorithm is applied, starting zero order clustering step from the bigger hyper-bin side size, looking for the optimum size using both Plural Hyper-bin count and Cardinality Dispersion empirical methods. The first optimum found is used for the first order clustering step with distance 1. Options may be used to either use a particular hyper-bin side size and a different distance, or to choose the preferred empirical method for zero order clustering step. Options may be used to indicate a hint on the number of clusters to find and trigger the experimental second order clustering step to find the optimum hyper-bin side size and optimum distance. Options may also be used to avoid the first order postclustering step, to use a finer grain point postclustering step, or to use a distance function other than Chebyshev for the clustering or postclustering steps (Euclidean and Minkowsky are available). The implementation generates many different outputs, including a PNM image using different colors to represent the 16 most populated cluster's points, a full PGM with the classification as raster data and the full data in csv format with a classification column. It can also dump the cluster histogram, with seed, centroid and deviation. Use 'faum --help' to consult synopsis and available options. See USAGE for typical usage scenarios. kmeans: This is fast K-Means implementation, designed using the same optimization strategy used in FAUM. Although its original purpose was to achieve a fair quantitative comparation between FAUM and K-Means, it is now fully functional. Its interface and output options are simmilar to FAUM. The number of clusters to find is a mandatory argument. Several initialization methods are implemented, including manual (indicating the list of centroids), Forgy (random) and the default KMeans++. In particular, the manual initialization may be used to feed the centroids produced by FAUM into K-Means. Use 'kmeans --help' to consult synopsis and available options. hypercubeGenerator: hypercubeGenerator16: These utilities are used to generate the Big Data Hypercubes Set used to test FAUM and perform a quality and timing comparison between FAUM and K-Means. The shape of the dataset is hardcoded, but the actual points are randomly created on invocation. hypercubeGenrator and hypercubeGenerator16 create a dataset with 8 and 16 bits per sample respectively. Use 'hypercubeGenerator --help' and 'hypercubeGenerator16 --help' to consult synopsis and available options. readPamImage: Small utility useful to extract any three depth channel from a Portable Arbitrary Map (PAM) file and create a Portable Any Map (PNM) file, or any depth channel in the PAM file to create a Portable Grey Map (PGM) file. PNM/PGM tools available in the NetPBM package may be used to export the generated Map to many standard image format (PNG, TIF, JPG, etc.). Use 'readPamImage --help' to consult synopsis and available options. DEVELOPMENT BINARIES: These binaries were used during FAUM research and implementation. They are not ment to be used in production scenarios, and are distributed for historical and completeness reasons only. sizes: Used to quickly get the basic type sizes during tests. calculoScott: Used to perform the Scott optimal bin size on histogram. It was tested and then discarded during FAUM research. clasif1_chebyshev_fixed: First simple version of FAUM First Order Clustering Step, now superseded by the faum utility. prueba_rnorrexp: Small utility used to test the quality of the Gaussian distribution generator. test_crc32 (available when configured with gcrypt-crc option): During the Resarch, the use of a CRC as hashing function for FAUM was proposed. This utility test the Gcrypt CRC function. test_hash_speed (available when configured with gcrypt-crc option): Timing comparison between the ad hoc hashing function used in FAUM and the Gcrypt CRC function. 4. USAGE The typical usage process implies the following steps: a) Prepare input data: - In case of image or raster data, use NetPBM tools to generate PGM files or other tool to extract the Raw Binary Data or numbers into text file. - Normalize data into 8, 16, 24 or 32 bit unsigned integer when needed. - Generate a Portable Arbitrary Map (PAM) file from text or binary data. b) Execute the clustering process: - Choose the preferred clustering options. - Choose the desired output(s). - In case of FAUM as initialization to K-Means, collect the centroids and invoke K-Means using them. c) Prepare output data. - In case of PNM Image with most populated clusters, use NetPBM tools to generate an image in the preferred format. - In case of raster output, use NetPBM tools to generate a raster in the preferred format. If GeoTiff input is used, the GeoTiff Headers may be extracted from the input file and installed in the output file. - In case of text output, classification may be easily extracted from the whole dataset. 5. EXAMPLES Example 1: Generate a PAM file from Landsat 7 Image data files. Landsat 7 images are presented in one 8 bit unsigned integers binary file per band (a.k.a. depth channel). The image geometry data must be extracted from the HRF file before using the createPamImage tool to create the PAM file. An example with L71225086_08620000119 is illustrated here: a) Extract image geometry from HRF file. The relevant part is shown below: PIXELS PER LINE =7476 LINES PER BAND =7040 /7040 OUTPUT BITS PER PIXEL =8 Therefore: 7040x7476, 8 bits (1 byte) per pixel. b) Execute createPamImage for the bands 1,2,3,4,5,7 (excluding Thermals): createPamImage -l 1,2,3,4,5,7 -b 1 -w 7476 -h 7040 \ -o l71225086_08620000119.pam \ l71225086_08620000119_b10.fst \ l71225086_08620000119_b20.fst \ l71225086_08620000119_b30.fst \ l71225086_08620000119_b40.fst \ l71225086_08620000119_b50.fst \ l72225086_08620000119_b70.fst The l71225086_08620000119.pam file is ready to be used as FAUM input. (See Example 6) Example 2: Generate a PAM file from a crop of a Landsat 8 Image. Landsat 8 images are presented in one integer TIFF binary file per band (a.k.a. depth channel). The createPamImage tool combined with tifftopnm tool can be used to create the PAM file. There is no need to indicate the geometry of the image, since it is read from the TIFF header. An example of extrating a crop from the LC82240872013279LGN00 image is illustrated here. The crop is 430x350 size and starts at point 2860x2440. createPamImage -l 1,2,3,4,5,6,7 -x 2860 -y 2440 -c 430 -r 350 \ -o LC82240872013279LGN00_crop.pam \ <(tifftopnm LC82240872013279LGN00_B1.TIF) \ <(tifftopnm LC82240872013279LGN00_B2.TIF) \ <(tifftopnm LC82240872013279LGN00_B3.TIF) \ <(tifftopnm LC82240872013279LGN00_B4.TIF) \ <(tifftopnm LC82240872013279LGN00_B5.TIF) \ <(tifftopnm LC82240872013279LGN00_B6.TIF) \ <(tifftopnm LC82240872013279LGN00_B7.TIF) The LC82240872013279LGN00_crop.pam file is ready to be used as FAUM input. (See Example 5) Example 3: Generate a PAM file from a raster 0-100 normalized GeoTIFF file FAUM uses integer numbers for input, avoiding the floating point unit thus working faster. In order to use floating point data, it must be translated/ normalized to integer before processing them. This might seem difficult, but standard open source tools exist to translate the data. In the case of GeoTIFF and other raster formats, the Geospatial Data Abstraction Library and its tools may be used. In this example, the file raster.gtiff is a 6 depth channel 0-100 normalized floating point raster data to be exported into a 16 bit unsigned integer PAM file. 24 or 32 bits could also be used, but no need of such precision exists in this dataset. Use the gdal_translate tool (from the GDAL library) to extract and normalize each depth channel. Then use the createPamImage tool to generate the PAM file. gdal_translate -ot int16 -scale 0 100 0 65535 -b 1 raster.gtiff raster_1.tif gdal_translate -ot int16 -scale 0 100 0 65535 -b 2 raster.gtiff raster_2.tif gdal_translate -ot int16 -scale 0 100 0 65535 -b 3 raster.gtiff raster_3.tif gdal_translate -ot int16 -scale 0 100 0 65535 -b 4 raster.gtiff raster_4.tif gdal_translate -ot int16 -scale 0 100 0 65535 -b 5 raster.gtiff raster_5.tif gdal_translate -ot int16 -scale 0 100 0 65535 -b 6 raster.gtiff raster_6.tif createPamImage -l 1,2,3,4,5,6 -o raster.pam \ <(tifftopnm raster_1.tif) \ <(tifftopnm raster_2.tif) \ <(tifftopnm raster_3.tif) \ <(tifftopnm raster_4.tif) \ <(tifftopnm raster_5.tif) \ <(tifftopnm raster_6.tif) The GeoTIFF metadata may also be exported to be used in the output. The 'listgeo' tool from the GeoTIFF library is useful for that purpose: listgeo raster.gtiff > raster_metadata.txt The raster.pam file is ready to be used as FAUM input. (See Example 7) Example 4: Generate a PAM image from integer numbers in a text file. This example generates a PAM file from the 1024 dimension variance 100 G2 dataset [2]. The dataset contains 2048 points of 1024 dimension each. Numbers are between 45 and 1056, thus fitting in a 16 bit PAM. Use the data2pam tool to generate the PAM file, fitting the points in a 32x64 pseudoimage, using one depth channel per dimension: data2pam -b 2 -w 32 -h 64 -d 1024 g2-1024-100.txt g2-1024-100.pam The g2-1024-100.pam file is ready to be used as FAUM input. (See Example 8) The g2-1024-100.txt file is included in the examples directory. The full G2 dataset is available at https://cs.joensuu.fi/sipu/datasets/g2-txt.zip Example 5: Default FAUM processing on a Satellite Image In this example, FAUM is executed on the LC82240872013279LGN00_crop.pam generated in example 2. Generate the pnm image with the 16 most populated clusters. Use the default parameters. The summary generated by FAUM is shown: faum -o LC82240872013279LGN00_crop.pnm LC82240872013279LGN00_crop.pam === SUMMARY: Clusterized bins: 2561 Unclusterized bins (before postclustering): 1251 Total clusters: 63 Cardinality dispersion: 44 === The LC82240872013279LGN00_crop.pam file is included in the examples directory. Example 6: Execute FAUM indicating a hyper-bin side size and a distance. In this example, FAUM is executed on the l71225086_08620000119.pam generated in the example 1, using predefined hyper-bin side size of 4 (2^2) and a distance 3. Generate the output with the most populated clusters in png format. Dump the full class histogram. The summary and the first part of the histogram generated by FAUM are shown: faum -o >(pnmtopng > l71225086_08620000119.png) -h -f2,3 \ l71225086_08620000119.pam === SUMMARY: Clusterized bins: 430578 Unclusterized bins (before postclustering): 275692 Total clusters: 549 Cardinality dispersion: 199 === Histogram dump: 15401621: Seed: (0000 0000 0000 0000 0000 0000) Centroid (0 0 0 0 0 0) Deviation (0 0 0 0 0 0) 11819558: Seed: (0013 0010 000F 0013 001C 000F) Centroid (78 65 62 76 112 62) Deviation (8 11 11 14 12 12) 5709734: Seed: (0014 0011 0011 0013 0023 0014) Centroid (83 71 74 74 135 81) Deviation (7 4 6 13 17 14) . . <other 545 clusters> . 1: Seed: (003F 003F 003F 0033 0035 002C Centroid (255 255 255 205 212 178) Deviation (0 0 0 0 0 0) Example 7: Generate a raster with the classification data from a GeoTIFF. In this example, FAUM is executed on the raster.pam file generated in the example 3. The raster is transformed to tif and the metadata header is incorporated. faum -R >(pnmtotif > class_raster.tif) raster.pam geotifcp -g raster_metadata.txt class_raster.tif class_raster.gtiff Example 8: Generate a text output with classification data. In this example, FAUM is executed on the g2-1024-100.pam file generated in the example 4. A hint on the number of clusters to find is indicated to FAUM (in this case, exactly 2 clusters), and Euclidean distance is used in the postclustering step. A csv output is generated with a classification column prepended, suitable for importing into a spreadsheet. A visual representation of the classification is also generated. faum -O g2-1024-100.csv -o g2-1024-100.pnm -m 2 -E g2-1024-100.pam Example 9: Use FAUM as initialization to K-Means. In this example, FAUM is executed on the g2-32-80.pam file generated from the G2 dataset [2] in a simmilar manner than the one shown in example 4. This is a difficult dataset and the classification may be enhanced by a normal K-Means processing using the centroids computed by FAUM. At the moment, centroid information must be parsed from FAUM output. A csv output is generated with a classification column prepended, suitable for importing into a spreadsheet. A visual representation of the classification is also generated. faum -h -m 2 -E g2-32-80.pam | \ sed -ne 's/.*Centroid (\([0-9 ]\+\)).*/\1/p' > centroids.txt kmeans -C centroids.txt -n 2 -O g2-32-80.csv -o g2-32-80.pnm -R 1 \ g2-32-80.pam The g2-1024-100.txt and the g2-32-80.txt files (to generate the .pam files as shown in example 4) are included in the examples directory. The full G2 dataset is available at http://cs.joensuu.fi/sipu/datasets/g2-txt.zip Example 10: Run FAUM on the Big Data Hyper-Cubes Set. This example shows the generation and processing of a Big Data Hypercubes Set in one step. The -s option may be used to obtain a deterministic dataset. Warning: a temporary file will be created. Generate a visual representation of the ouput clustering. hypercubeGenerator -s 1543554 | faum -o hypercube.pnm BIBLIOGRAPHY [1] H.J.Curti and R.S.Wainschenker, FAUM: Fast Autonomous Unsupervised Multidimensional classification, Information Sciences, Volume 462, 2018, Pages 182-203, ISSN 0020-0255, https://doi.org/10.1016/j.ins.2018.06.008. [2] P.Fränti, O.Virmajoki and V.Hautamäki, Fast Agglomerative Clustering Using a K-Nearest Neighbor Graph, IEEE Trans. on Pattern Analysis and Machine Intelligence, Volume 28(11), 2006, Pages 1875-1881.