[Rdkit-discuss] Clustering

Hi,

I want to cluster around 4 million structures.

The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:

"For large sets of molecules (more than 1000-2000), it's most efficient to use the Butina clustering algorithm"

However, it is quite a step up from a few thousand to several million, and I wondered whether anyone had used this algorithm on larger data sets.

As far as I can tell, it is not possible to define the number of clusters. Is this correct?
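
To make both concerns concrete, here is a minimal pure-Python sketch of a Butina-style procedure. It is an illustration under stated assumptions, not RDKit's implementation: the names butina_sketch, n_pairs, points and dist are hypothetical, and the toy 1-D integer "distances" stand in for fingerprint distances. It shows that the input is a distance cutoff (the number of clusters falls out of the data) and how many pairwise distances a full comparison of 4 million structures would need.

```python
# Illustrative sketch only; hypothetical names, not RDKit's code.

def n_pairs(n):
    """Number of distinct pairs among n items (size of a condensed distance matrix)."""
    return n * (n - 1) // 2


def butina_sketch(dist, n, cutoff):
    """Cluster items 0..n-1 given a pairwise distance function and a cutoff."""
    # Build neighbour lists: all items within `cutoff` of each item.
    # This step needs every pairwise distance, i.e. n_pairs(n) comparisons.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if dist(i, j) <= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Visit candidate centres from most to fewest neighbours (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    assigned = set()
    clusters = []
    for centre in order:
        if centre in assigned:
            continue
        # A cluster is the centre plus its not-yet-assigned neighbours.
        members = [centre] + sorted(neighbours[centre] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy data: six points on a line, with absolute difference as the distance.
points = [0, 1, 2, 50, 51, 90]


def dist(i, j):
    return abs(points[i] - points[j])


# The cluster count depends on the cutoff; it is not an input.
for cutoff in (0, 1, 100):
    print(cutoff, butina_sketch(dist, len(points), cutoff))

# Pairwise distances needed for 4 million structures.
print(n_pairs(4_000_000))
```

In RDKit itself, the call used in the Cookbook is rdkit.ML.Cluster.Butina.ClusterData, which likewise takes a distance threshold rather than a target number of clusters.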

Cheers,

Chris
