Re: [Rdkit-discuss] Clustering
From: Chris S. <sw...@ma...> - 2017-06-05 09:10:49
Hi,

I’m just starting, but I can add another example. I tried the clustering as described for Butina clustering in the RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) using a Jupyter Notebook. It worked fine on data sets of fewer than 10,000 molecules, but the kernel crashed when I tried 150,000 molecules.

I plan to try some other examples this week and will report back my findings.

Chris

> On 5 Jun 2017, at 10:02, Michał Nowotka <mm...@gm...> wrote:
>
> Hi,
>
> Is there anyone who has actually done this: clustered >2M compounds using
> any well-known clustering algorithm, and is willing to share the code and
> some performance statistics?
>
> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster routine, you won't get
> any results in a reasonable time.
>
> We also tried to extract the 10 most significant features from the latent
> representation described in this paper:
> https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL and
> then use this web-based tool to generate a visualization:
> https://github.com/tensorflow/embedding-projector-standalone but
> obviously we didn't get anything useful from this.
>
> My last attempt was to use the sfdp tool from the graphviz package to get
> some sort of primitive clustering. I allocated a lot of RAM to the
> process, but without any luck either.
>
> I would be interested in all kinds of hints related to clustering
> millions of compounds, especially using DBSCAN/OPTICS-based clustering
> algorithms.
>
> Regards,
>
> Michał Nowotka
>
> On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo
> <col...@gm...> wrote:
>> Hi Chris,
>>
>> as far as I know, Butina's sphere exclusion algorithm is the fastest for
>> very large datasets. But if you have 4 million compounds, using RDKit
>> directly can result in very long runs, even after parallelization.
>> For that number of molecules I think there are faster options, like
>> chemfp (see for instance
>> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>>
>> Cheers
>>
>> Gonzalo
>>
>> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <ma...@wo...>
>> wrote:
>>>
>>> Is there a big difference in the quality of the final dataset between
>>> K-means and random under-sampling of a big database (~20M)?
>>>
>>> ----
>>> Pozdrawiam, | Best regards,
>>> Maciek Wójcikowski
>>> ma...@wo...
>>>
>>> 2017-06-04 12:24 GMT+02:00 Samo Turk <sam...@gm...>:
>>>>
>>>> Hi Chris,
>>>>
>>>> There are other options for clustering. According to this:
>>>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>>>> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on
>>>> density and also allows for outliers, but finding the right parameters
>>>> can be fiddly. As with Butina, you cannot specify the number of clusters.
>>>> If you want to specify the number of clusters, you can simply use K-means.
>>>> The high dimensionality of fingerprints might be a problem for memory
>>>> consumption. In this case you can use PCA to reduce the dimensions to
>>>> something manageable. To avoid memory issues with PCA and speed things up,
>>>> I would fit the model on a random 100k compounds and then just use the
>>>> transform method on the rest.
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
>>>>
>>>> Cheers,
>>>> Samo
>>>>
>>>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to do clustering on around 4 million structures.
>>>>>
>>>>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:
>>>>>
>>>>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>>>>> to use the Butina clustering algorithm"
>>>>>
>>>>> However, it is quite a step up from a few thousand to several million,
>>>>> and I wondered if anyone had used this algorithm on larger data sets.
>>>>>
>>>>> As far as I can tell it is not possible to define the number of
>>>>> clusters. Is this correct?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdk...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
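On the question raised in this thread of whether Butina clustering lets you choose the number of clusters: the sketch below is a minimal, pure-Python illustration of the Taylor-Butina procedure, not RDKit's or chemfp's implementation. It assumes fingerprints are stored as Python integers used as bit sets, and the names tanimoto, butina_clusters, dist_cutoff and example_fps are made up for this example. It shows that the cluster count falls out of the distance cutoff rather than being an input, and that the all-pairs neighbour step is quadratic in the number of molecules.

```python
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def butina_clusters(fps, dist_cutoff):
    """Return clusters as tuples of indices into fps, centroid first."""
    n = len(fps)

    # Neighbour lists: all pairs whose Tanimoto distance is within the cutoff.
    # This all-pairs loop is the O(N^2) part of the procedure.
    neighbours = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if 1.0 - tanimoto(fps[i], fps[j]) <= dist_cutoff:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # Try candidate centroids from most to fewest neighbours
    # (ties broken by lower index, an arbitrary choice for this sketch).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # A cluster is the centroid plus its not-yet-assigned neighbours.
        members = [centroid] + sorted(
            j for j in neighbours[centroid] if not assigned[j]
        )
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


# Six toy 8-bit fingerprints (hypothetical data, for illustration only).
example_fps = [0b11110000, 0b11100000, 0b11110001,
               0b00001111, 0b00000111, 0b10101010]

print(butina_clusters(example_fps, 0.3))  # [(0, 1, 2), (3, 4), (5,)]
print(butina_clusters(example_fps, 0.7))  # [(0, 1, 2, 5), (3, 4)]
```

A tight cutoff gives three clusters here and a loose one gives two, so the only handle on the cluster count is the cutoff. With RDKit itself, the Cookbook recipe passes a precomputed distance list to rdkit.ML.Cluster.Butina.ClusterData; for 150,000 molecules that list would hold roughly 1.1 x 10^10 pairwise distances, which may explain the kernel crash reported at the top of this thread.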