Faum Code

Fast Autonomous Unsupervised Multidimensional Classification

Status: Beta

Brought to you by: hcurti

File	Date	Author	Commit
examples	2019-04-29	Hugo Javier Curti	[18c2ca] Minor bug correction
COPYING	2019-01-08	Hugo Javier Curti	[8488dd] Se ingresa la versión 1.0 de FAUM para su publi...
Makefile.am	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
Makefile.in	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
README	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
aclocal.m4	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
calculoScott.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
clasif0.c	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
clasif1_chebyshev_fixed.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
compile	2019-01-08	Hugo Javier Curti	[8488dd] Se ingresa la versión 1.0 de FAUM para su publi...
config.guess	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
config.h.in	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
config.sub	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
configure	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
configure.ac	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
createPamImage.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
data2pam.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
depcomp	2019-01-08	Hugo Javier Curti	[8488dd] Se ingresa la versión 1.0 de FAUM para su publi...
faum.c	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
faum_clusterscope.c	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
faum_mask.c	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
gaussianDatasetGenerator.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
hypercubeGenerator.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
hypercubeGenerator16.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
install-sh	2019-01-08	Hugo Javier Curti	[8488dd] Se ingresa la versión 1.0 de FAUM para su publi...
kmeans.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
missing	2019-01-08	Hugo Javier Curti	[8488dd] Se ingresa la versión 1.0 de FAUM para su publi...
perfectGaussian.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
perfectGaussian.h	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
prueba_perfectGaussian.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
prueba_rnorrexp.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
readPamImage.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
rnorrexp.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
rnorrexp.h	2024-01-11	Hugo Javier Curti	[83c9cb] New version 1.0-3: Research updates
sizes.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
test_crc32.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections
test_hash_speed.c	2024-02-02	Hugo Javier Curti	[91a804] Version 1.0-3 minor corrections

Read Me

Fast Autonomous Unsupervised Multidimensional (FAUM) Clustering.

1. INTRODUCTION

This is the proof-of-concept implementation of the FAUM Clustering method
presented in [1]. This implementation was used to produce the published results
and is now released in the hope that it will be useful.

2. COPYRIGHT AND DISCLAIMER

   Copyright (C) 2015-2024 Hugo Javier Curti, Ruben Sergio Wainschenker
   
   This file is part of FAUM.
   
   FAUM is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.
   
   FAUM is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.
   
   You should have received a copy of the GNU General Public License
   along with FAUM.  If not, see <https://www.gnu.org/licenses/>.

3. COMPONENTS INCLUDED IN THIS PACKAGE

   README: Introduction and base documentation.

   COPYING: Terms of Usage and Distribution.

   PRODUCTION BINARIES:

     clasif0:
       Generate only Order 0 hyper-histogram of FAUM clustering. It can generate
       the hyper-histogram summary, the hyper-histogram data or both.
       The hyper-bin side size must be indicated in the command line, and the
       input data in Portable Arbitrary Map (PAM) format must be provided,
       either through standard input, or as a file name in the command line.
       Use 'clasif0 --help' to consult synopsis and available options.

     createPamImage:
       Utility to create a Portable Arbitrary Map (PAM) with information
       taken from different files, one per depth channel. PGM files, or crude
       data files containing integers in big endian notation, one per depth
       channel in western natural reading order (left-right-up-down) may be used.
       In the case of crude data, the parameters of the image must be specified
       in the command line. Cropping options are provided.
       Use 'createPamImage --help' to consult synopsis and available options.
       See USAGE for typical usage scenario, and EXAMPLES for examples of PAM
       generation from typical satellite imagery formats.

     data2pam:
       Utility to create a Portable Arbitrary Map (PAM) from a text file
       containing a list of integer numbers separated by space. The parameters
       of the Map must be specified in the command line. In particular, the
       end of line is considered as normal space and no assumption is made
       about its position and/or count. Numbers must be ordered in points 
       (one number per depth channel) in western natural reading order
       (left-right-up-down).
       Use 'data2pam --help' to consult synopsis and available options.
       See USAGE for typical usage scenario, and EXAMPLES for examples of PAM
       generation from text files.

     faum:
       Main program. Perform the full FAUM algorithm on a Portable Arbitrary
       Map (PAM) provided through standard input or as file name in the command
       line (which is more efficient). By default the automatic clustering
       algorithm is applied, starting the zero order clustering step from the
       largest hyper-bin side size, looking for the optimum size using both
       Plural Hyper-bin count and Cardinality Dispersion empirical methods. The
       first optimum found is used for the first order clustering step with
       distance 1. Options may be used to either use a particular hyper-bin
       side size and a different distance, or to choose the preferred empirical
       method for zero order clustering step. Options may be used to indicate
       a hint on the number of clusters to find and trigger the experimental
       second order clustering step to find the optimum hyper-bin side size
       and optimum distance. Options may also be used to avoid the first order
       postclustering step, to use a finer grain point postclustering step, or
       to use a distance function other than Chebyshev for the clustering or
       postclustering steps (Euclidean and Minkowski are available). The
       implementation generates many different outputs, including a PNM image
       using different colors to represent the points of the 16 most populated
       clusters, a full PGM with the classification as raster data and the full
       data in csv format with a classification column. It can also dump the
       cluster histogram, with seed, centroid and deviation.
       Use 'faum --help' to consult synopsis and available options.
       See USAGE for typical usage scenarios.

     kmeans:
       This is a fast K-Means implementation, designed using the same
       optimization strategy used in FAUM. Although its original purpose was to
       achieve a fair quantitative comparison between FAUM and K-Means, it is
       now fully functional. Its interface and output options are similar to
       FAUM. The number of clusters to find is a mandatory argument. Several
       initialization methods are implemented, including manual (indicating the
       list of centroids), Forgy (random) and the default KMeans++. In
       particular, the manual initialization may be used to feed the centroids
       produced by FAUM into K-Means.
       Use 'kmeans --help' to consult synopsis and available options.

     hypercubeGenerator:
     hypercubeGenerator16:
       These utilities are used to generate the Big Data Hypercubes Set used
       to test FAUM and perform a quality and timing comparison between FAUM
       and K-Means. The shape of the dataset is hardcoded, but the actual points
       are randomly created on invocation. hypercubeGenerator and
       hypercubeGenerator16 create a dataset with 8 and 16 bits per sample
       respectively.
       Use 'hypercubeGenerator --help' and 'hypercubeGenerator16 --help' to
       consult synopsis and available options.

     readPamImage:
       Small utility useful to extract any three depth channels from a Portable
       Arbitrary Map (PAM) file and create a Portable Any Map (PNM) file, or
       any depth channel in the PAM file to create a Portable Grey Map (PGM)
       file. PNM/PGM tools available in the NetPBM package may be used to
       export the generated Map to many standard image formats (PNG, TIF, JPG,
       etc.).
       Use 'readPamImage --help' to consult synopsis and available options.

   DEVELOPMENT BINARIES:

       These binaries were used during FAUM research and implementation. They
       are not meant to be used in production scenarios, and are distributed
       for historical and completeness reasons only.

     sizes:
       Used to quickly get the basic type sizes during tests.

     calculoScott:
       Used to compute Scott's optimal bin size for a histogram. It was tested
       and then discarded during FAUM research.

     clasif1_chebyshev_fixed:
       First simple version of FAUM First Order Clustering Step, now superseded
       by the faum utility.
 
     prueba_rnorrexp:
       Small utility used to test the quality of the Gaussian distribution
       generator.

     test_crc32 (available when configured with gcrypt-crc option):
       During the research, the use of a CRC as hashing function for FAUM was
       proposed. This utility tests the Gcrypt CRC function.

     test_hash_speed (available when configured with gcrypt-crc option):
       Timing comparison between the ad hoc hashing function used in FAUM and
       the Gcrypt CRC function.

4. USAGE
  The typical usage process implies the following steps:

    a) Prepare input data:
       - In case of image or raster data, use NetPBM tools to generate PGM files
         or another tool to extract the raw binary data or numbers into a text
         file.
       - Normalize data into 8, 16, 24 or 32 bit unsigned integer when needed.
       - Generate a Portable Arbitrary Map (PAM) file from text or binary data.

    b) Execute the clustering process:
       - Choose the preferred clustering options.
       - Choose the desired output(s).
       - In case of FAUM as initialization to K-Means, collect the centroids
         and invoke K-Means using them.

    c) Prepare output data.
       - In case of PNM Image with most populated clusters, use NetPBM tools to
         generate an image in the preferred format.

       - In case of raster output, use NetPBM tools to generate a raster in the
         preferred format. If GeoTiff input is used, the GeoTiff Headers may be
         extracted from the input file and installed in the output file.

       - In case of text output, classification may be easily extracted from the
         whole dataset.

5. EXAMPLES

  Example 1: Generate a PAM file from Landsat 7 Image data files.

    Landsat 7 images are presented in one 8 bit unsigned integers binary
    file per band (a.k.a. depth channel). The image geometry data must be
    extracted from the HRF file before using the createPamImage tool to create
    the PAM file. An example with L71225086_08620000119 is illustrated here:

    a) Extract image geometry from HRF file. The relevant part is shown below:

       PIXELS PER LINE =7476  LINES PER BAND =7040 /7040
       OUTPUT BITS PER PIXEL =8

       Therefore: 7476x7040 (width x height), 8 bits (1 byte) per pixel.

    b) Execute createPamImage for the bands 1,2,3,4,5,7 (excluding Thermals):

       createPamImage -l 1,2,3,4,5,7 -b 1 -w 7476 -h 7040 \
                      -o l71225086_08620000119.pam \
                      l71225086_08620000119_b10.fst \
                      l71225086_08620000119_b20.fst \
                      l71225086_08620000119_b30.fst \
                      l71225086_08620000119_b40.fst \
                      l71225086_08620000119_b50.fst \
                      l72225086_08620000119_b70.fst

    The l71225086_08620000119.pam file is ready to be used as FAUM input. (See
    Example 6)
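
    The step from one raw file per band to a single multi-channel map can be
    pictured as an interleaving of samples. The following Python sketch is
    only an illustration of that idea, not the code of createPamImage: the
    function name interleave_bands and its parameters are hypothetical, and it
    assumes that every band holds the same number of samples stored in the
    same order.

```python
def interleave_bands(bands, bytes_per_sample=1):
    """Merge per-band rasters into one pixel-interleaved raster.

    bands: list of bytes objects, one per depth channel, all the same size.
    Returns bytes where each point carries one sample per band, in band order.
    Hypothetical helper written for illustration only.
    """
    if not bands:
        raise ValueError("at least one band is required")
    size = len(bands[0])
    if any(len(band) != size for band in bands):
        raise ValueError("all bands must have the same size")
    if size % bytes_per_sample:
        raise ValueError("band size is not a multiple of the sample size")
    out = bytearray()
    for offset in range(0, size, bytes_per_sample):
        # Emit the samples of one point: band 1, band 2, ... band N.
        for band in bands:
            out += band[offset:offset + bytes_per_sample]
    return bytes(out)


if __name__ == "__main__":
    # Two tiny single-byte bands of two points each stand in for .fst files.
    print(interleave_bands([b"\x01\x02", b"\x0a\x0b"]).hex())
```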

  Example 2: Generate a PAM file from a crop of a Landsat 8 Image.

    Landsat 8 images are presented in one integer TIFF binary file per band
    (a.k.a. depth channel). The createPamImage tool combined with tifftopnm
    tool can be used to create the PAM file. There is no need to indicate the
    geometry of the image, since it is read from the TIFF header. 

    An example of extracting a crop from the LC82240872013279LGN00 image is
    illustrated here. The crop is 430x350 pixels and starts at point 2860,2440.

    createPamImage -l 1,2,3,4,5,6,7 -x 2860 -y 2440 -c 430 -r 350 \
                   -o LC82240872013279LGN00_crop.pam \
                   <(tifftopnm LC82240872013279LGN00_B1.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B2.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B3.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B4.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B5.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B6.TIF) \
                   <(tifftopnm LC82240872013279LGN00_B7.TIF)

    The LC82240872013279LGN00_crop.pam file is ready to be used as FAUM input.
    (See Example 5)

  Example 3: Generate a PAM file from a raster 0-100 normalized GeoTIFF file

    FAUM uses integer numbers for input, which avoids the floating point unit
    and makes it faster. Floating point data must therefore be translated or
    normalized to integers before processing.

    This might seem difficult, but standard open source tools exist to translate
    the data. In the case of GeoTIFF and other raster formats, the Geospatial
    Data Abstraction Library and its tools may be used.

    In this example, the file raster.gtiff is a 6 depth channel 0-100 normalized
    floating point raster data to be exported into a 16 bit unsigned integer PAM
    file. 24 or 32 bits could also be used, but this dataset does not need such
    precision. Use the gdal_translate tool (from the GDAL library)
    to extract and normalize each depth channel. Then use the createPamImage
    tool to generate the PAM file.

    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 1 raster.gtiff raster_1.tif
    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 2 raster.gtiff raster_2.tif
    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 3 raster.gtiff raster_3.tif
    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 4 raster.gtiff raster_4.tif
    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 5 raster.gtiff raster_5.tif
    gdal_translate -ot UInt16 -scale 0 100 0 65535 -b 6 raster.gtiff raster_6.tif
    
    createPamImage -l 1,2,3,4,5,6 -o raster.pam \
      <(tifftopnm raster_1.tif) \
      <(tifftopnm raster_2.tif) \
      <(tifftopnm raster_3.tif) \
      <(tifftopnm raster_4.tif) \
      <(tifftopnm raster_5.tif) \
      <(tifftopnm raster_6.tif)

    The GeoTIFF metadata may also be exported to be used in the output. The
    'listgeo' tool from the GeoTIFF library is useful for that purpose:

    listgeo raster.gtiff > raster_metadata.txt

    The raster.pam file is ready to be used as FAUM input. (See Example 7)
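
    The normalization requested above is a linear mapping from the 0-100
    floating point range onto the 0-65535 unsigned integer range. A minimal
    Python sketch of that arithmetic follows. The function name scale_to_uint
    is hypothetical, and both the clamping of out-of-range values and the
    round-half-up rule are choices made for this sketch; they are not a
    statement about how gdal_translate treats such values.

```python
def scale_to_uint(value, src_min=0.0, src_max=100.0, bits=16):
    """Linearly map value from [src_min, src_max] to [0, 2**bits - 1].

    Hypothetical helper: out-of-range inputs are clamped and the result is
    rounded half up, both being assumptions of this sketch.
    """
    dst_max = (1 << bits) - 1
    clamped = min(max(value, src_min), src_max)
    scaled = (clamped - src_min) * dst_max / (src_max - src_min)
    return int(scaled + 0.5)


if __name__ == "__main__":
    for sample in (0.0, 25.0, 50.0, 100.0):
        print(sample, "->", scale_to_uint(sample))
```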

  Example 4: Generate a PAM image from integer numbers in a text file.

    This example generates a PAM file from the 1024 dimension variance 100 G2
    dataset [2]. The dataset contains 2048 points of 1024 dimensions each.
    Numbers are between 45 and 1056, thus fitting in a 16 bit PAM. Use the
    data2pam tool to generate the PAM file, fitting the points in a 32x64
    pseudoimage, using one depth channel per dimension:

    data2pam -b 2 -w 32 -h 64 -d 1024 g2-1024-100.txt g2-1024-100.pam

    The g2-1024-100.pam file is ready to be used as FAUM input. (See Example 8)

    The g2-1024-100.txt file is included in the examples directory. The full
    G2 dataset is available at https://cs.joensuu.fi/sipu/datasets/g2-txt.zip
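
    For readers unfamiliar with the PAM container, the conversion above can be
    sketched in Python as follows. This is an illustration under assumptions,
    not the code of data2pam: the names parse_numbers and build_pam are
    hypothetical, the header contains only the fields the Netpbm PAM format
    requires, MAXVAL is assumed to be the largest value the sample size can
    hold, and only 1 and 2 byte samples are handled.

```python
import struct


def parse_numbers(text):
    """Split text on any whitespace, so line breaks count as plain spaces."""
    return [int(token) for token in text.split()]


def build_pam(samples, width, height, depth, bytes_per_sample=2):
    """Return the bytes of a PAM file holding the given samples.

    samples must list the points in left-right-up-down order, with one
    number per depth channel for each point. Hypothetical helper.
    """
    if bytes_per_sample not in (1, 2):
        raise ValueError("this sketch only handles 1 or 2 bytes per sample")
    expected = width * height * depth
    if len(samples) != expected:
        raise ValueError("expected %d samples, got %d" % (expected, len(samples)))
    maxval = (1 << (8 * bytes_per_sample)) - 1  # assumption: full range
    header = "P7\nWIDTH %d\nHEIGHT %d\nDEPTH %d\nMAXVAL %d\nENDHDR\n" % (
        width, height, depth, maxval)
    # PAM stores multi-byte samples most significant byte first.
    code = "B" if bytes_per_sample == 1 else "H"
    raster = struct.pack(">%d%s" % (expected, code), *samples)
    return header.encode("ascii") + raster


if __name__ == "__main__":
    # A 2x1 pseudoimage with 2 depth channels per point.
    pam = build_pam(parse_numbers("1 2\n3 258"), width=2, height=1, depth=2)
    print(len(pam), "bytes")
```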

  Example 5: Default FAUM processing on a Satellite Image

    In this example, FAUM is executed on the LC82240872013279LGN00_crop.pam
    generated in example 2. Generate the pnm image with the 16 most populated
    clusters. Use the default parameters. The summary generated by FAUM is
    shown:

    faum -o LC82240872013279LGN00_crop.pnm LC82240872013279LGN00_crop.pam

    ===
    SUMMARY:
     Clusterized bins: 2561
     Unclusterized bins (before postclustering): 1251
     Total clusters: 63
     Cardinality dispersion: 44
    ===

    The LC82240872013279LGN00_crop.pam file is included in the examples
    directory.

  Example 6: Execute FAUM indicating a hyper-bin side size and a distance.

    In this example, FAUM is executed on the l71225086_08620000119.pam
    generated in Example 1, using a predefined hyper-bin side size of 4 (2^2)
    and a distance of 3. The output with the most populated clusters is
    generated in png format and the full class histogram is dumped. The summary
    and the first part of the histogram generated by FAUM are shown:

    faum -o >(pnmtopng > l71225086_08620000119.png) -h -f2,3 \
      l71225086_08620000119.pam

    ===
    SUMMARY:
     Clusterized bins: 430578
     Unclusterized bins (before postclustering): 275692
     Total clusters: 549
     Cardinality dispersion: 199
    ===
    Histogram dump:
    15401621: Seed: (0000 0000 0000 0000 0000 0000)
              Centroid (0 0 0 0 0 0) Deviation (0 0 0 0 0 0)
    11819558: Seed: (0013 0010 000F 0013 001C 000F)
              Centroid (78 65 62 76 112 62) Deviation (8 11 11 14 12 12)
    5709734: Seed: (0014 0011 0011 0013 0023 0014)
             Centroid (83 71 74 74 135 81) Deviation (7 4 6 13 17 14)
    .
    . <other 545 clusters>
    .
    1: Seed: (003F 003F 003F 0033 0035 002C)
       Centroid (255 255 255 205 212 178) Deviation (0 0 0 0 0 0)

  Example 7: Generate a raster with the classification data from a GeoTIFF.

    In this example, FAUM is executed on the raster.pam file generated in
    Example 3. The classification raster is converted to TIFF and the metadata
    header is incorporated.

    faum -R >(pnmtotiff > class_raster.tif) raster.pam
    geotifcp -g raster_metadata.txt class_raster.tif class_raster.gtiff

  Example 8: Generate a text output with classification data.

    In this example, FAUM is executed on the g2-1024-100.pam file generated
    in Example 4. A hint on the number of clusters to find is indicated
    to FAUM (in this case, exactly 2 clusters), and Euclidean distance is used
    in the postclustering step. A csv output is generated with a classification
    column prepended, suitable for importing into a spreadsheet. A visual
    representation of the classification is also generated.

    faum -O g2-1024-100.csv -o g2-1024-100.pnm -m 2 -E g2-1024-100.pam
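
    For reference, the three distance functions that faum can use (Chebyshev
    by default, Euclidean as selected in this example, and Minkowski) follow
    their textbook definitions, sketched here in Python. The function names
    are hypothetical, and the sketch says nothing about how faum computes
    them internally or which exponent it uses for Minkowski.

```python
import math


def chebyshev(a, b):
    """Largest absolute difference over all dimensions."""
    return max(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, p):
    """Generalized distance; p=1 is Manhattan and p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


if __name__ == "__main__":
    a, b = (0, 0), (3, 4)
    print(chebyshev(a, b), euclidean(a, b), minkowski(a, b, 3))
```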

  Example 9: Use FAUM as initialization to K-Means.

    In this example, FAUM is executed on the g2-32-80.pam file generated
    from the G2 dataset [2] in a manner similar to the one shown in
    Example 4.
    This is a difficult dataset and the classification may be enhanced by a
    normal K-Means processing using the centroids computed by FAUM. At the
    moment, centroid information must be parsed from FAUM output. A csv output
    is generated with a classification column prepended, suitable for importing
    into a spreadsheet. A visual representation of the classification is also
    generated.

    faum -h -m 2 -E g2-32-80.pam | \
      sed -ne 's/.*Centroid (\([0-9 ]\+\)).*/\1/p' > centroids.txt

    kmeans -C centroids.txt -n 2 -O g2-32-80.csv -o g2-32-80.pnm -R 1 \
           g2-32-80.pam

    The g2-1024-100.txt and the g2-32-80.txt files (to generate the .pam files
    as shown in example 4) are included in the examples directory. The full G2
    dataset is available at http://cs.joensuu.fi/sipu/datasets/g2-txt.zip
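
    Where GNU sed is not available, the centroid extraction can be done with a
    few lines of Python. The sketch below applies the same pattern as the sed
    command above to a histogram dump in the format shown in Example 6. The
    name extract_centroids is hypothetical, and the sketch makes no statement
    about the file format expected by the kmeans -C option beyond mirroring
    what the sed command writes (one centroid per line, numbers separated by
    spaces).

```python
import re

# Same pattern as the sed expression: digits and spaces inside "Centroid (...)".
CENTROID_RE = re.compile(r"Centroid \(([0-9 ]+)\)")


def extract_centroids(dump):
    """Return one list of integers per centroid found in the dump text."""
    return [[int(token) for token in match.group(1).split()]
            for match in CENTROID_RE.finditer(dump)]


if __name__ == "__main__":
    sample = (
        "11819558: Seed: (0013 0010 000F 0013 001C 000F)\n"
        "          Centroid (78 65 62 76 112 62) Deviation (8 11 11 14 12 12)\n"
    )
    for centroid in extract_centroids(sample):
        print(" ".join(str(value) for value in centroid))
```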

  Example 10: Run FAUM on the Big Data Hyper-Cubes Set.

    This example shows the generation and processing of a Big Data Hypercubes
    Set in one step. The -s option may be used to obtain a deterministic
    dataset. Warning: a temporary file will be created. Generate a visual
    representation of the output clustering.

    hypercubeGenerator -s 1543554 | faum -o hypercube.pnm

BIBLIOGRAPHY

[1] H.J.Curti and R.S.Wainschenker, FAUM: Fast Autonomous Unsupervised
    Multidimensional classification, Information Sciences, Volume 462,
    2018, Pages 182-203, ISSN 0020-0255,
    https://doi.org/10.1016/j.ins.2018.06.008.

[2] P.Fränti, O.Virmajoki and V.Hautamäki, Fast Agglomerative Clustering Using
    a K-Nearest Neighbor Graph, IEEE Trans. on Pattern Analysis and Machine
    Intelligence, Volume 28(11), 2006, Pages 1875-1881.
