[Rdkit-discuss] Clustering
From: Tristan C. <tri...@um...> - 2022-05-01 13:20:16
Hi,

I am attempting to cluster a database of circa 4M small molecules and I have hit several snags. Using BulkTanimotoSimilarity is not possible because of the resources it requires. I am now working with FPSim2 and chemfp to get a sparse distance matrix.

However, I am finding it very challenging to identify an appropriate clustering algorithm. I have considered both k-medoids and DBSCAN. Each has its own limitation: k-medoids needs the number of clusters to be stated up front, and DBSCAN does not return centroids.

I was wondering whether there is an implementation of the stochastic clustering analysis method described in https://doi.org/10.1021/ci970056l .

Any suggestions on the best method for clustering large datasets, with code suggestions, would be greatly appreciated. I am new to the subject and would appreciate any help.

Regards,
Tristan
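
To make the DBSCAN limitation concrete: since DBSCAN only returns labels, one possible workaround is to pick a medoid for each cluster afterwards from the same sparse distance matrix. The following is only a minimal sketch in plain Python under stated assumptions. The function name `cluster_medoids`, the dict-of-pairs layout for the sparse matrix, and the choice to treat any pair missing from the matrix as distance 1.0 are all hypothetical and not taken from FPSim2, chemfp or any clustering library. Noise points are assumed to carry the label -1, as DBSCAN implementations commonly do.

```python
from collections import defaultdict


def cluster_medoids(labels, sparse_dist, missing=1.0):
    """Pick one medoid per cluster from DBSCAN-style labels.

    labels      : list of cluster ids, one per molecule; -1 means noise.
    sparse_dist : dict mapping (i, j) to a stored distance, each pair
                  listed once (hypothetical layout for the sparse matrix).
    missing     : distance assumed for pairs absent from the sparse
                  matrix (assumption: they lie beyond the cutoff).
    """
    # Group molecule indices by cluster, skipping noise points.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)

    # For every point, accumulate the stored distances to members of
    # its own cluster and count how many such pairs were stored.
    stored_sum = defaultdict(float)
    stored_n = defaultdict(int)
    for (i, j), d in sparse_dist.items():
        if labels[i] != -1 and labels[i] == labels[j]:
            stored_sum[i] += d
            stored_sum[j] += d
            stored_n[i] += 1
            stored_n[j] += 1

    # The medoid is the member with the smallest total distance to the
    # rest of its cluster; unstored pairs contribute `missing` each.
    medoids = {}
    for label, idxs in members.items():
        size = len(idxs)
        medoids[label] = min(
            idxs,
            key=lambda i: stored_sum[i] + missing * (size - 1 - stored_n[i]),
        )
    return medoids


# Toy example: six molecules, two clusters and one noise point.
labels = [0, 0, 0, 1, 1, -1]
sparse_dist = {(0, 1): 0.2, (1, 2): 0.3, (3, 4): 0.1}
print(cluster_medoids(labels, sparse_dist))
```

This makes a single pass over the stored pairs, so the cost grows with the number of non-zero entries rather than with the square of the number of molecules. Ties are broken in favour of the lowest index.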