[Rdkit-discuss] Clustering

Hi,

I am trying to cluster a database of roughly 4 million small molecules
and have hit several snags. Computing all pairwise similarities with
BulkTanimoto is not feasible because of the resources it requires, so I
am now using fpsim2 and chemfp to build a sparse distance matrix.
However, I am finding it hard to identify an appropriate clustering
algorithm. I have considered both k-medoids and DBSCAN, but each has a
limitation: k-medoids needs the number of clusters to be specified in
advance, and DBSCAN does not return centroids.
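
To make the DBSCAN limitation concrete: the algorithm only gives a
cluster label per molecule, so a representative has to be picked
afterwards, for example as a medoid. Below is a minimal pure-Python
sketch of that step. The function name cluster_medoids, the dict layout
for the sparse distances, and the rule that unstored pairs count as
distance 1.0 are all assumptions made for illustration; none of this
comes from fpsim2, chemfp, or RDKit.

```python
from collections import defaultdict


def cluster_medoids(labels, sparse_dist, missing=1.0):
    """Pick one medoid per cluster (hypothetical helper, illustration only).

    labels:      cluster id per molecule, with -1 meaning noise
                 (the convention DBSCAN implementations commonly use).
    sparse_dist: dict mapping (i, j) with i < j to a Tanimoto distance.
                 Pairs that are not stored are ASSUMED to lie at `missing`.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # noise points get no representative
            members[label].append(idx)

    def dist(i, j):
        if i == j:
            return 0.0
        key = (i, j) if i < j else (j, i)
        return sparse_dist.get(key, missing)

    medoids = {}
    for label, idxs in members.items():
        # Medoid: the member with the smallest summed distance to the rest.
        medoids[label] = min(idxs, key=lambda i: sum(dist(i, j) for j in idxs))
    return medoids


# Toy data: five molecules, two clusters, one noise point.
labels = [0, 0, 0, 1, -1]
sparse_dist = {(0, 1): 0.2, (0, 2): 0.4, (1, 2): 0.1}
medoids = cluster_medoids(labels, sparse_dist)
```

The cost is quadratic in the size of each cluster, so this is only
workable if no single cluster is very large.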

I was wondering whether there is an implementation of the stochastic
cluster analysis method described in
https://doi.org/10.1021/ci970056l .

I am new to the subject, so any suggestions on the best method for
clustering large datasets, ideally with code examples, would be greatly
appreciated.

Regards,
Tristan
