Name         Modified     Size        Downloads / Week
src          2018-04-27
_classpath   2018-04-27   1.4 kB
_project     2018-04-27   389 Bytes
ReadMe.txt   2018-04-27   5.0 kB
Totals: 4 Items           6.8 kB      0
/**
The package HealthDoubles implements the algorithm for
finding the health doubles of a subject. The algorithm was
tested on a dataset of 43 subjects. The static, medical and
dynamic features of each subject are included in
"SupplymentaryDatabase.xlsx". The external JAR file
poi-3.16-beta2 is used to read data from the xlsx file. Keep
the xlsx file and the whole package in the same folder; the
path used to load the file assumes this layout.

	The "main" class finds the top 10 health doubles of the
query vector. The input query vector is stored in the "Query
Vector" sheet of "SupplymentaryDatabase.xlsx". The number of
static features, medical features, vaccines, diseases, dynamic
features, days, LSH stages, LSH buckets and health doubles are
initialized in this class. The class extracts the static,
medical and dynamic features of the query vector into
static_features_query, medical_features_query and
dynamic_features_array respectively. It then finds the static
health doubles of the query vector using the class
"static_healthdoubles". Similarly, the medical and dynamic
health doubles lists are obtained using the
"medical_healthdoubles" and "dynamic_healthdoubles" classes
respectively. These lists are combined using median rank
aggregation, and "aggr_rank" holds the aggregated ordered
list. Its first 10 entries are reported as the top 10 health
doubles.
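
	The aggregation step described above can be sketched as
follows. This is a minimal illustration, not the package's own
code: the class and method names are hypothetical, each ranked
list is assumed to hold subject ids from best to worst, a
subject missing from a list is assumed to receive a rank equal
to that list's length, and ties are broken by the smaller
subject id. At least one ranked list is required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of median rank aggregation; not the package's code.
final class MedianRankSketch {

    // Position of a subject in one ranked list (0 = best).
    // Assumption: an absent subject gets rank = list length.
    static int positionOf(int subject, int[] rankedList) {
        for (int pos = 0; pos < rankedList.length; pos++) {
            if (rankedList[pos] == subject) return pos;
        }
        return rankedList.length;
    }

    // Returns subject ids ordered by the median of their per-list ranks.
    static int[] aggregate(int subjectCount, int[][] rankedLists) {
        double[] median = new double[subjectCount];
        for (int s = 0; s < subjectCount; s++) {
            int[] ranks = new int[rankedLists.length];
            for (int l = 0; l < rankedLists.length; l++) {
                ranks[l] = positionOf(s, rankedLists[l]);
            }
            Arrays.sort(ranks);
            int n = ranks.length;
            median[s] = (n % 2 == 1)
                    ? ranks[n / 2]
                    : (ranks[n / 2 - 1] + ranks[n / 2]) / 2.0;
        }
        List<Integer> order = new ArrayList<>();
        for (int s = 0; s < subjectCount; s++) order.add(s);
        // List.sort is stable, so equal medians keep ascending id order.
        order.sort((x, y) -> Double.compare(median[x], median[y]));
        int[] result = new int[subjectCount];
        for (int i = 0; i < subjectCount; i++) result[i] = order.get(i);
        return result;
    }

    public static void main(String[] args) {
        int[][] lists = { {0, 1, 2}, {1, 0, 2}, {0, 2, 1} };
        System.out.println(Arrays.toString(aggregate(3, lists))); // [0, 1, 2]
    }
}
```

Taking the first 10 entries of the returned array would give a
top 10 list in the sense used above.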

	The class "static_healthdoubles" computes the static
health doubles list of the query vector. It retrieves the
quantified static feature vector of each subject from the
"Quantified Static Features" sheet. It then hashes each vector
using the E2 Locality Sensitive Hash family, and also computes
the hash value of the query's static feature vector. For each
stage, if a static feature vector and the query vector hash
into the same bucket, their Euclidean distance (the
"difference value") is computed. This is repeated for each
type of static feature. The feature vectors are then ranked by
these difference values: the subject with the lowest value
takes the top position (0th) in the ranked list. Median rank
aggregation is used to combine the ranked lists obtained from
the different categories of features.
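
	The ranking step for one feature type can be sketched
as below. This is only an illustration under assumptions: the
names are hypothetical, and the boolean mask stands in for
"this subject shared a bucket with the query in some stage",
which the package derives from its LSH output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: rank bucket-mates of the query by Euclidean distance.
final class EuclideanRankSketch {

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of candidate subjects, closest first (position 0).
    // sharesBucket[s] is assumed to be true when subject s collided with the query.
    static List<Integer> rank(double[] query, double[][] subjects, boolean[] sharesBucket) {
        List<Integer> candidates = new ArrayList<>();
        for (int s = 0; s < subjects.length; s++) {
            if (sharesBucket[s]) candidates.add(s);
        }
        candidates.sort((x, y) -> Double.compare(
                distance(query, subjects[x]), distance(query, subjects[y])));
        return candidates;
    }

    public static void main(String[] args) {
        double[] query = {0.0, 0.0};
        double[][] subjects = { {3.0, 4.0}, {1.0, 0.0}, {0.0, 2.0} };
        boolean[] sharesBucket = {true, true, false};
        System.out.println(rank(query, subjects, sharesBucket)); // [1, 0]
    }
}
```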
	
	The class "Enhash" implements E2 Locality Sensitive
Hashing. The signature size (number of hash functions) is
derived from the number of stages and the threshold value.
Here, the hash functions are random projection lines, each
obtained by randomly generating two coordinates of the line.
Each static feature vector is projected onto those lines, and
the signature values are derived from the positions of the
projected points. The signature values are hashed for each
stage to obtain the bucket numbers.
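
	For orientation, the sketch below shows the textbook
E2LSH formulation, h(v) = floor((a.v + b) / w), with Gaussian
directions a and uniform offsets b. It is not taken from
"Enhash", whose line construction may differ; the class name,
the bucket width w, the seed and the way the signature is
folded into a bucket number are all assumptions made for
illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of textbook E2LSH; "Enhash" itself may differ.
final class E2LshSketch {
    private final double[][] a; // one Gaussian projection direction per hash function
    private final double[] b;   // one random offset in [0, w) per hash function
    private final double w;     // assumed bucket width along each projection line

    E2LshSketch(int signatureSize, int dim, double w, long seed) {
        Random rng = new Random(seed);
        this.w = w;
        this.a = new double[signatureSize][dim];
        this.b = new double[signatureSize];
        for (int i = 0; i < signatureSize; i++) {
            for (int j = 0; j < dim; j++) a[i][j] = rng.nextGaussian();
            b[i] = rng.nextDouble() * w;
        }
    }

    // Signature value i is the index of the width-w segment that the
    // projection of v onto line i falls into.
    int[] signature(double[] v) {
        int[] sig = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            double dot = b[i];
            for (int j = 0; j < v.length; j++) dot += a[i][j] * v[j];
            sig[i] = (int) Math.floor(dot / w);
        }
        return sig;
    }

    // Folds the whole signature into one bucket number in [0, bucketCount).
    // A multi-stage scheme would apply this to one slice of the signature per stage.
    int bucket(double[] v, int bucketCount) {
        return Math.floorMod(Arrays.hashCode(signature(v)), bucketCount);
    }

    public static void main(String[] args) {
        E2LshSketch lsh = new E2LshSketch(8, 3, 4.0, 42L);
        double[] v = {1.0, 2.0, 3.0};
        System.out.println(Arrays.toString(lsh.signature(v)));
        System.out.println(lsh.bucket(v, 10));
    }
}
```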

	Similarly, the "medical_healthdoubles" class finds the
medical health doubles list of the query vector. Here, each
vector is hashed using the Locality Sensitive Min Hash
family. After all vectors, including the query vector, have
been hashed, the Jaccard similarity is calculated for every
feature vector that resides in the same bucket as the query
vector (for a stage). Again, this is repeated for every type
of medical feature (such as vaccines and diseases). The
subjects are ranked by these similarity values, with the most
similar subject at the top of the list. The aggregated list is
obtained by applying median rank aggregation to those lists.
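
	The Jaccard similarity used here is the size of the
intersection divided by the size of the union. The sketch
below assumes each subject's medical features (for example the
vaccines received) are represented as a set of integer feature
indices; that representation, the names, and the convention of
returning 1.0 for two empty sets are assumptions, not details
taken from the package.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of Jaccard similarity on sets of feature indices.
final class JaccardSketch {

    // |A intersect B| / |A union B|; two empty sets are treated as identical.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Integer> query = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> subject = new HashSet<>(Arrays.asList(2, 3, 4));
        System.out.println(jaccard(query, subject)); // 0.5
    }
}
```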

	The class "LSH_MinHash" implements the Min Hash family.
Here, the hash functions have the form h = a*x + b. The values
'a' and 'b' are randomly generated to obtain 'signature size'
hash functions. The signature values are hashed into buckets,
and the bucket numbers are returned as hash values.
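
	A minimal Min Hash sketch built on hash functions of
the form h = a*x + b is shown below. The reduction modulo a
prime, the choice of prime, the seed, and all names are
assumptions added for illustration; x is assumed to be a
non-negative feature index. Signature value i is the minimum
of h_i over the features present in the set.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical Min Hash sketch; "LSH_MinHash" itself may differ.
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1, assumed modulus
    private final long[] a;
    private final long[] b;

    MinHashSketch(int signatureSize, long seed) {
        Random rng = new Random(seed);
        a = new long[signatureSize];
        b = new long[signatureSize];
        for (int i = 0; i < signatureSize; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            b[i] = rng.nextInt(Integer.MAX_VALUE);
        }
    }

    // featureIds: non-negative indices of the features present in the set.
    long[] signature(int[] featureIds) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : featureIds) {
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // h = a*x + b, reduced mod PRIME
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Fraction of matching signature positions, an estimate of Jaccard similarity.
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(64, 7L);
        long[] q = mh.signature(new int[] {1, 2, 3});
        long[] s = mh.signature(new int[] {2, 3, 4});
        System.out.println(estimate(q, s));
    }
}
```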

	Finally, the class "dynamic_healthdoubles" computes the
dynamic health doubles list. It uses the Random Projection
Hash family to find hash values for each dynamic feature
vector as well as for the dynamic feature vector of the query.
As in the static and medical health doubles algorithms, only
the vectors that hash into the same bucket as the query are
considered for similarity measurement. Similarity is measured
using cosine similarity. The similarity values are used to
rank the subjects, with the most similar subject in the
highest position. Those lists are aggregated using median rank
aggregation and returned as the dynamic health doubles list of
the query vector.
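
	The similarity ranking described above can be sketched
as follows. The names are hypothetical, the candidate array is
assumed to hold only vectors that already share a bucket with
the query, and returning 0.0 for a zero-length vector is a
convention chosen here, not one taken from the package.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: rank candidates by cosine similarity to the query.
final class CosineRankSketch {

    // Cosine of the angle between a and b; 0.0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns candidate indices ordered from most to least similar.
    static List<Integer> rank(double[] query, double[][] candidates) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.length; i++) order.add(i);
        order.sort((x, y) -> Double.compare(
                cosine(query, candidates[y]), cosine(query, candidates[x])));
        return order;
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        double[][] candidates = { {0.0, 1.0}, {1.0, 0.0}, {1.0, 1.0} };
        System.out.println(rank(query, candidates)); // [1, 2, 0]
    }
}
```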

	The class "random_projection" implements the random
projection hash family. Here, the hash functions are random
vectors whose values are each drawn from a normal
distribution. As in "LSH_MinHash", the signature values are
hashed into buckets, and the bucket numbers are returned as
the hash value.
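
	One common way to realise such a hash family is sign
random projection: each signature value records on which side
of a random Gaussian hyperplane the vector lies. The sketch
below shows that variant; whether "random_projection" uses
signs or another quantisation is not stated here, so the sign
rule, the bucket folding, the seed and all names are
assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sign random projection sketch; "random_projection" may differ.
final class RandomProjectionSketch {
    private final double[][] r; // one Gaussian random vector per hash function

    RandomProjectionSketch(int signatureSize, int dim, long seed) {
        Random rng = new Random(seed);
        r = new double[signatureSize][dim];
        for (int i = 0; i < signatureSize; i++) {
            for (int j = 0; j < dim; j++) r[i][j] = rng.nextGaussian();
        }
    }

    // Signature value i is true when the dot product with r[i] is non-negative.
    boolean[] signature(double[] v) {
        boolean[] sig = new boolean[r.length];
        for (int i = 0; i < r.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) dot += r[i][j] * v[j];
            sig[i] = dot >= 0.0;
        }
        return sig;
    }

    // Folds the signature into a bucket number in [0, bucketCount).
    int bucket(double[] v, int bucketCount) {
        return Math.floorMod(Arrays.hashCode(signature(v)), bucketCount);
    }

    public static void main(String[] args) {
        RandomProjectionSketch rp = new RandomProjectionSketch(8, 3, 11L);
        double[] v = {1.0, 2.0, 3.0};
        System.out.println(Arrays.toString(rp.signature(v)));
        System.out.println(rp.bucket(v, 10));
    }
}
```

Because only the sign of each projection is kept, a vector and
any positive multiple of it receive the same signature, which
matches the use of cosine similarity on the colliding vectors.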

	The classes "Difference_value" and "Similarityvalue"
are helper classes used by the implementation.
**/


Source: ReadMe.txt, updated 2018-04-27