#########################
#Author: Bill Andreopoulos, PhD
#Lawrence Berkeley National Laboratory
#Date: October 22, 2018
#########################
Retrieval and clustering of large categorical data sets with locality-sensitive hashing
____________________Overview_____________________
Hierdenc is a tool for retrieving nearest neighbors and clustering large categorical data sets represented in transactional or bag-of-words format.
The software addresses the need to retrieve the most similar items for a query in very large data sets, and to cluster large data sets efficiently.
Runtime scales well as data sets grow larger.
Retrieval and clustering of large categorical data sets are achieved via locality-sensitive hashing (LSH) and a backend database for speed and scalability.
The locality-sensitive hashing method implemented is described in the video lectures at www.mmds.org (Chapter 3).
The information needed for LSH, such as shingles/tokens, MinHash signatures, and band-to-bucket hashes,
is stored in several database tables, supporting scalability to very large data sets.
The information needed for clustering, such as the most significant pairwise object similarities and density-based similarities, is also stored in tables.
These are the tools the Hierdenc software comes with:
* ReadInput.py supports updating the database with a CSV text file of new objects in transactional (market-basket) format.
This tool converts each object to its shingle strings and from there to unsigned integer tokens (to save space) using the CRC32 checksum.
The tokens are then converted to signature vectors using MinHashing - specifically, with the murmurhash3_32 hashing function from scikit-learn.
The signature vectors are then grouped into bands and each band is hashed to a bucket (see the sketch after this list).
All arrays are stored as numpy vectors. For the constants used, see the file Constants.py.
* RetrieveNN.py supports fast retrieval of the objects most similar to a query item,
based on the Jaccard similarity coefficient.
* Cluster.py performs the clustering and outputs the clusters in GraphML format, so that they
can be loaded into a graph visualization tool. The edges are weighted by the Jaccard similarity coefficient.
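The following is a minimal sketch of this shingling / MinHash / banding pipeline. The constants SHINGLE_LEN, NUM_HASHES and NUM_BANDS are illustrative stand-ins for the real values in Constants.py, the helper names are hypothetical rather than actual Hierdenc functions, and the character shingling shown is illustrative (for transactional records the shingles may simply be the attribute values):

import zlib
import numpy as np
from sklearn.utils import murmurhash3_32

SHINGLE_LEN = 4   # assumed shingle length (see Constants.py for the real value)
NUM_HASHES = 100  # assumed MinHash signature length
NUM_BANDS = 20    # assumed number of LSH bands, i.e. 5 rows per band

def crc32_tokens(text):
    # Shingle the text and map each shingle to an unsigned 32-bit CRC32 token.
    shingles = {text[i:i + SHINGLE_LEN] for i in range(len(text) - SHINGLE_LEN + 1)}
    return {zlib.crc32(s.encode()) & 0xffffffff for s in shingles}

def minhash_signature(tokens):
    # MinHash: for each seeded hash function, keep the minimum hash over all tokens.
    sig = np.empty(NUM_HASHES, dtype=np.uint32)
    for seed in range(NUM_HASHES):
        sig[seed] = min(murmurhash3_32(t.to_bytes(4, "little"), seed=seed, positive=True)
                        for t in tokens)
    return sig

def band_buckets(sig):
    # Group the signature into bands; objects whose bands hash to the same
    # bucket become candidate nearest neighbors.
    rows = NUM_HASHES // NUM_BANDS
    return [hash(tuple(int(x) for x in sig[b * rows:(b + 1) * rows]))
            for b in range(NUM_BANDS)]

def jaccard(tokens_a, tokens_b):
    # Exact Jaccard similarity coefficient, used to rank the candidates.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

Candidates found through shared buckets are then ranked by their exact Jaccard similarity, which is the coefficient RetrieveNN.py reports.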
An early version of this fast, database-backed retrieval of nearest neighbors and clustering of large categorical data sets was published in:
Bill Andreopoulos et al. Efficient Layered Density-based Clustering of Categorical Data. Elsevier Journal of Biomedical Informatics, 2009.
The Plants dataset from the UCI Machine Learning repository is included for testing.
At just 34,781 objects it is not a very large data set; it is meant for testing only.
It is in transactional (market-basket) format, illustrated below,
so the tool can just as easily be applied to other categorical or text data sets.
https://archive.ics.uci.edu/ml/datasets/Plants
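Each line of the input CSV is one object: an identifier followed by its categorical attribute values. For the plants data, an illustrative (hypothetical) record is a species name followed by the state/province codes where it occurs:

abies_fraseri,nc,tn,va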
_____________________Instructions for setting up on Mac OS X__________________
________
1) Download the MySQL dmg file from Oracle and install MySQL:
Use legacy password encryption when prompted.
_________
2) Install brew, followed by gcc and mysql-connector-c:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install gcc
brew install mysql-connector-c
________
2.2) You may need to edit mysql_config if the mysql client does not work or you cannot log in:
vi /usr/local/bin/mysql_config
As described in this Stack Overflow answer:
https://stackoverflow.com/questions/43543483/pip-install-mysql-python-fails-with-indexerror/43543677
If mysql-connector-c was installed via brew, its configuration may be incorrect. In the mysql_config script, change:
# Create options
libs="-L$pkglibdir"
libs="$libs -l "
into:
# Create options
libs="-L$pkglibdir"
libs="$libs -lmysqlclient -lssl -lcrypto"
________
3) Update the Python packages:
pip install -U numpy
pip install -U scipy
_______
4) Check requirements.txt, then install:
sudo python setup.py install
To build the distribution and tag the source code in Git:
sudo python setup.py buildtag
______
5) Update DB connection parameters in Constants.py (DB, USER, PASS, HOST).
#Database connectivity information
USER="root"
PASS=""
HOST="localhost"
DB="hierdenc"
Run create_db (if it does not work, run python scripts/create_db.py, or try setting export PYTHONPATH=/Users/andreopo/Documents/PythonCourse/Hierdenc_project/hierdenc).
Re-run create_db to drop the database tables and re-create them empty.
To log on to mysql: mysql -u root -p
The create_db command creates the following tables under the database and host specified in Constants.py:
CREATE TABLE `shingles` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`name` varchar(100) not null,
`tokens` varchar(5000) not null,
PRIMARY KEY (`id`),
UNIQUE INDEX name_index (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `signatures` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`name` varchar(100) not null,
`hash` INTEGER UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
INDEX hash_index (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `band_hashes` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`name` varchar(100) not null,
`hash` INTEGER UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
INDEX hash_index (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `PAIRWISE_SIMS` (
`id` INT NOT NULL AUTO_INCREMENT,
`name_a` varchar(100) not null,
`name_b` varchar(100) not null,
`sim` int(4) not null,
PRIMARY KEY (`id`),
UNIQUE INDEX name_index (`name_a`, `name_b`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `RADIUS_SIM_DENSITY` (
`id` INT NOT NULL AUTO_INCREMENT,
`name` varchar(100) not null,
`radius_sim` int(4) not null,
`num_objects` INTEGER UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX name_radiussim_index (`name`, `radius_sim`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
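To verify the connection parameters and the created tables, here is a minimal sketch; it assumes the MySQLdb (mysqlclient) driver, which the mysql-connector-c step above supports, and imports the module-level names shown in step 5:

import MySQLdb
from Constants import USER, PASS, HOST, DB

# Connect with the parameters from Constants.py and list the tables create_db made.
conn = MySQLdb.connect(host=HOST, user=USER, passwd=PASS, db=DB)
cur = conn.cursor()
cur.execute("SHOW TABLES")
for (table,) in cur.fetchall():
    print(table)
conn.close()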
______
6) Run tests:
$ nosetests --nocapture test_ReadInput.py
$ nosetests --nocapture test_RetrieveNN.py
$ nosetests --nocapture test_Cluster.py
______
7) Run the tool from a Mac terminal:
To run this tool with the plants data set after the database is set up and working, do:
python ReadInput.py -i ./test/plants.data -t
python RetrieveNN.py -q QUERY,fl,ca,ny,or -t
python Cluster.py -n 100
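To inspect the GraphML clusters without a separate visualization tool, here is a minimal sketch using networkx; the file name clusters.graphml is a hypothetical placeholder for whatever file Cluster.py writes:

import networkx as nx

# Load the cluster graph and print each edge with its Jaccard-similarity weight.
G = nx.read_graphml("clusters.graphml")  # hypothetical output file name
for a, b, data in G.edges(data=True):
    print(a, b, data.get("weight"))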
*********
HIERDENC Copyright (c) 2015, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at IPO@lbl.gov referring to "HIERDENC (LBNL Ref 2015-036)."
NOTICE. This software was developed under funding from the U.S. Department of Energy. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, and perform publicly and display publicly. Beginning five (5) years after the date permission to assert copyright is obtained from the U.S. Department of Energy, and subject to any subsequent five (5) year renewals, the U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
*********