Re: [Rdkit-discuss] Clustering
From: Samo T. <sam...@gm...> - 2017-06-04 10:24:51
Hi Chris,

There are other options for clustering. According to this benchmark:
http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
HDBSCAN and K-means scale well.

HDBSCAN finds clusters based on density and also allows for outliers, but it can be fiddly to find the right parameters. As with Butina, you cannot specify the number of clusters. If you want to specify the number of clusters, you can simply use K-means.

The high dimensionality of fingerprints might be a problem for memory consumption. In that case you can use PCA to reduce the dimensions to something manageable. To avoid memory issues with PCA and to speed things up, I would fit the model on a random 100k compounds and then just use the transform method (not fit_transform, which would refit the model) on the rest.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Cheers,
Samo

On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:
> Hi,
>
> I want to do clustering on around 4 million structures
>
> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>
> "For large sets of molecules (more than 1000-2000), it’s most efficient
> to use the Butina clustering algorithm"
>
> However it is quite a step up from a few thousand to several million and
> I wondered if anyone had used this algorithm on larger data sets?
>
> As far as I can tell it is not possible to define the number of clusters,
> is this correct?
>
> Cheers,
>
> Chris
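
A minimal sketch of the PCA-then-K-means route described in the reply, assuming scikit-learn and NumPy are installed. The fingerprints here are random bit vectors standing in for real ones, and every size (5000 rows, 256 bits, 20 components, 10 clusters, a 1000-row fitting subset) is a hypothetical placeholder, not a recommendation for a 4-million-compound set. The point it illustrates is that PCA is fitted once on a random subset and the remaining rows are only projected with transform, in chunks, before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for real fingerprints: random bit vectors (hypothetical sizes).
fps = rng.integers(0, 2, size=(5000, 256)).astype(np.float32)

# Fit PCA on a random subset only, to keep memory use and runtime down.
subset_idx = rng.choice(len(fps), size=1000, replace=False)
pca = PCA(n_components=20, random_state=0)
pca.fit(fps[subset_idx])

# Project the full set chunk by chunk with transform(); calling
# fit_transform() here would refit the model on each chunk.
chunk_size = 1000
reduced = np.vstack([
    pca.transform(fps[start:start + chunk_size])
    for start in range(0, len(fps), chunk_size)
])

# K-means lets the number of clusters be chosen up front.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(reduced.shape, np.bincount(labels))
```

For a real data set the chunks would be read from disk rather than held in one array, and scikit-learn's MiniBatchKMeans may be a better fit than KMeans at that scale; both are swapped-in suggestions, not something the original message specifies.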