Re: [Rdkit-discuss] Clustering
From: Maciek W. <ma...@wo...> - 2017-06-04 13:13:17
Is there a big difference in the quality of the final dataset between K-means and random under-sampling of a big database (~20M)?

----
Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
ma...@wo...

2017-06-04 12:24 GMT+02:00 Samo Turk <sam...@gm...>:

> Hi Chris,
>
> There are other options for clustering. According to this:
> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on density
> and allows for outliers, but it can be fiddly to find the right
> parameters. You cannot specify the number of clusters (as is also the
> case with Butina). If you want to specify the number of clusters, you can
> simply use K-means. The high dimensionality of fingerprints might be a
> problem for memory consumption. In that case you can use PCA to reduce the
> dimensions to something manageable. To avoid memory issues with PCA and
> speed things up, I would fit the model on a random 100k compounds and then
> just use the transform method on the rest.
> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
>
> Cheers,
> Samo
>
> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:
>
>> Hi,
>>
>> I want to do clustering on around 4 million structures.
>>
>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:
>>
>> "For large sets of molecules (more than 1000-2000), it's most efficient
>> to use the Butina clustering algorithm"
>>
>> However, it is quite a step up from a few thousand to several million,
>> and I wondered if anyone had used this algorithm on larger data sets?
>>
>> As far as I can tell it is not possible to define the number of clusters,
>> is this correct?
>>
>> Cheers,
>>
>> Chris
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdk...@li...
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
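
For anyone who wants to try the workflow Samo describes (fit PCA on a random subset, project the rest, then run K-means with a chosen number of clusters), here is a minimal sketch assuming NumPy and scikit-learn. The fingerprints are random bit vectors standing in for real ones, and the data sizes, number of components, and number of clusters are illustrative assumptions rather than recommendations. MiniBatchKMeans is swapped in for plain KMeans because it works through the data in batches; that substitution is not from the thread.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for real fingerprints: random 0/1 bit vectors (an assumption;
# real data would be e.g. RDKit fingerprints converted to arrays).
n_mols, n_bits = 5000, 256
fps = (rng.random((n_mols, n_bits)) < 0.1).astype(np.float32)

# 1. Fit PCA on a random subset only, to keep memory use and time down.
subset_idx = rng.choice(n_mols, size=1000, replace=False)
pca = PCA(n_components=20, random_state=0)
pca.fit(fps[subset_idx])

# 2. Project the full set in chunks with transform().
#    fit_transform() here would refit the model on each chunk.
chunk = 1000
reduced = np.vstack(
    [pca.transform(fps[i:i + chunk]) for i in range(0, n_mols, chunk)]
)

# 3. Cluster the reduced vectors with a fixed number of clusters.
km = MiniBatchKMeans(n_clusters=10, batch_size=1000, n_init=3, random_state=0)
labels = km.fit_predict(reduced)

print(reduced.shape, np.bincount(labels, minlength=10))
```

On a multi-million compound set the fingerprints would not be held in one array as they are here; each chunk would be read from disk, transformed, and passed to `km.partial_fit` instead. Note that K-means on PCA coordinates uses Euclidean distance, not the Tanimoto similarity typically used with Butina.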