Re: [Rdkit-discuss] Clustering
Open-Source Cheminformatics and Machine Learning
From: Gonzalo C. <col...@gm...> - 2017-06-05 08:19:27
Hi Chris,

As far as I know, Butina's sphere exclusion algorithm is the fastest for
very large datasets. But if you have 4 million compounds, using RDKit
directly can result in very long runs, even after parallelization. For
that number of molecules I think there are faster options, such as chemfp
(see for instance
https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering ).

Cheers,
Gonzalo

On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <ma...@wo...> wrote:

> Is there a big difference in the quality of the final dataset between
> K-means and random under-sampling of a big database (~20M)?
>
> ----
> Pozdrawiam, | Best regards,
> Maciek Wójcikowski
> ma...@wo...
>
> 2017-06-04 12:24 GMT+02:00 Samo Turk <sam...@gm...>:
>
>> Hi Chris,
>>
>> There are other options for clustering. According to this:
>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on
>> density and also allows for outliers, but finding the right
>> parameters can be fiddly. As with Butina, you cannot specify the
>> number of clusters. If you want to specify the number of clusters,
>> you can simply use K-means. The high dimensionality of fingerprints
>> might be a problem for memory consumption. In this case you can use
>> PCA to reduce the dimensions to something manageable. To avoid memory
>> issues with PCA and speed things up, I would fit the model on a
>> random 100k compounds and then just use the transform method on the
>> rest.
>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
>>
>> Cheers,
>> Samo
>>
>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:
>>
>>> Hi,
>>>
>>> I want to do clustering on around 4 million structures.
>>>
>>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:
>>>
>>> "For large sets of molecules (more than 1000-2000), it's most
>>> efficient to use the Butina clustering algorithm"
>>>
>>> However, it is quite a step up from a few thousand to several
>>> million, and I wondered if anyone had used this algorithm on larger
>>> data sets?
>>>
>>> As far as I can tell it is not possible to define the number of
>>> clusters. Is this correct?
>>>
>>> Cheers,
>>>
>>> Chris
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdk...@li...
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
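
On the question of whether the number of clusters can be set with Butina clustering: a minimal pure-Python sketch of the Taylor-Butina procedure may help show why it cannot. This is not RDKit's or chemfp's implementation. The names `tanimoto`, `butina_clusters` and `demo_fps` are made up for illustration, fingerprints are assumed to be sets of on-bit indices, and the cutoff is a similarity threshold (RDKit's own function takes a distance cutoff instead). Neighbour counts are computed once and not updated as clusters are removed, which is a simplification.

```python
from typing import FrozenSet, List, Sequence, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_clusters(
    fps: Sequence[FrozenSet[int]], threshold: float
) -> List[Tuple[int, ...]]:
    """Return clusters as tuples of indices, with the centroid listed first."""
    n = len(fps)
    # Build neighbour lists. This all-pairs step is O(n^2) and is what
    # dominates the run time for millions of compounds.
    neighbours: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbours[i].append(j)
                neighbours[j].append(i)

    # Visit candidates from most to fewest neighbours (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = [False] * n
    clusters: List[Tuple[int, ...]] = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # The centroid claims every neighbour that is still unassigned.
        members = [centroid] + [j for j in neighbours[centroid] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


# Hypothetical toy fingerprints, purely for illustration.
demo_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
loose = butina_clusters(demo_fps, threshold=0.5)
tight = butina_clusters(demo_fps, threshold=0.7)
```

The only tunable input is the threshold. The number of clusters falls out of it (here `loose` has fewer clusters than `tight`), so hitting a target count would mean scanning thresholds rather than setting the count directly.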
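
The PCA suggestion in the thread (fit on a random subset, then project everything else) could look roughly like the sketch below. It assumes NumPy and scikit-learn are installed and uses random 0/1 rows as a stand-in for real fingerprints. The sizes are shrunk so it runs quickly, and `MiniBatchKMeans` is swapped in for plain K-means so that the data can be processed in chunks. None of these choices come from the thread itself.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in data (assumption): random bit rows instead of real fingerprints.
n_compounds, n_bits = 5000, 256
fingerprints = rng.integers(0, 2, size=(n_compounds, n_bits)).astype(np.float32)

# Fit PCA on a random subset only (100k in the thread, 1000 here).
subset = rng.choice(n_compounds, size=1000, replace=False)
pca = PCA(n_components=20, random_state=0).fit(fingerprints[subset])

# K-means variant with a user-chosen number of clusters, trained per chunk.
kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)
chunk = 1000
for start in range(0, n_compounds, chunk):
    # transform() reuses the fitted projection; fit_transform() would refit it.
    kmeans.partial_fit(pca.transform(fingerprints[start:start + chunk]))

# Second pass: assign every compound to its nearest cluster centre.
labels = np.concatenate([
    kmeans.predict(pca.transform(fingerprints[start:start + chunk]))
    for start in range(0, n_compounds, chunk)
])
```

Only one chunk of reduced data is held in memory at a time, which is the point of fitting the projection once on a subset and then calling `transform` on the remainder.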