Re: [Rdkit-discuss] Clustering

Hi Chris,

There are other options for clustering. According to this:
http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
HDBSCAN and K-means scale well. HDBSCAN finds clusters based on density
and allows for outliers, but finding the right parameters can be fiddly.
As with Butina, you cannot specify the number of clusters. If you want to
specify the number of clusters, you can simply use K-means. The high
dimensionality of fingerprints might be a problem for memory consumption.
In that case you can use PCA to reduce the dimensions to something
manageable. To avoid memory issues with PCA and to speed things up, I would
fit the model on a random 100k compounds and then just use the transform
method (not fit_transform, which would refit the model) on the rest.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
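
A rough sketch of the PCA-then-K-means idea, in case it helps. It is only
an illustration under some assumptions: the fingerprints are already in a
NumPy bit matrix (random bits stand in for real fingerprints here), the
helper name reduce_and_cluster and all parameter values are made up for
the example, and I have swapped in scikit-learn's MiniBatchKMeans for
plain K-means since it is usually easier on memory for large sets.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA


def reduce_and_cluster(fps, n_components=50, n_clusters=100,
                       sample_size=100_000, chunk_size=50_000, seed=0):
    """Fit PCA on a random sample, project all rows in chunks, then cluster.

    Hypothetical helper for illustration; tune every parameter for real data.
    """
    rng = np.random.default_rng(seed)
    n = fps.shape[0]

    # Fit PCA on a random subset only, to keep memory use and runtime down.
    sample_idx = rng.choice(n, size=min(sample_size, n), replace=False)
    pca = PCA(n_components=n_components, random_state=seed)
    pca.fit(fps[sample_idx])

    # Project the full set with transform (not fit_transform), chunk by chunk.
    reduced = np.empty((n, n_components), dtype=np.float32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        reduced[start:stop] = pca.transform(fps[start:stop])

    # Cluster in the reduced space with a fixed number of clusters.
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    labels = km.fit_predict(reduced)
    return reduced, labels


# Toy stand-in data: 2000 random 64-bit "fingerprints".
demo_fps = np.random.default_rng(42).integers(0, 2, size=(2000, 64), dtype=np.uint8)
reduced, labels = reduce_and_cluster(
    demo_fps, n_components=8, n_clusters=5, sample_size=500, chunk_size=600
)
```

For 4 million compounds you would probably also want to build the
fingerprint matrix in chunks (or memory-map it) rather than hold
everything in RAM at once.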

Cheers,
Samo

On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:

> Hi,
>
> I want to do clustering on around 4 million structures
>
> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>
> "For large sets of molecules (more than 1000-2000), it's most efficient
> to use the Butina clustering algorithm"
>
>  However it is quite a step up from a few thousand to several million and
> I wondered if anyone had used this algorithm on larger data sets?
>
> As far as I can tell it is not possible to define the number of clusters,
> is this correct?
>
> Cheers,
>
> Chris
>
