Re: [Rdkit-discuss] Clustering

Hi,

I’m just starting out, but I can add another example.

I tried Butina clustering as described in the RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) in a Jupyter Notebook.

It worked fine on data sets of fewer than 10,000 molecules, but the kernel crashed when I tried 150,000 molecules.
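
A rough back-of-the-envelope check suggests why: the Cookbook recipe builds the full list of pairwise distances before clustering, and that list grows quadratically. The sketch below only does the arithmetic; the helper name is hypothetical and the 8 bytes per value is an assumption (a plain Python list of floats needs more than that).

```python
def condensed_matrix_size(n_molecules, bytes_per_value=8):
    """Return (number of pairwise distances, raw bytes) for n molecules.

    Assumes one stored value per unordered pair (lower triangle only)
    and 8 bytes per value; both are illustrative assumptions.
    """
    n_pairs = n_molecules * (n_molecules - 1) // 2
    return n_pairs, n_pairs * bytes_per_value


for n in (10_000, 150_000, 4_000_000):
    pairs, raw_bytes = condensed_matrix_size(n)
    print(f"{n:>9,} molecules: {pairs:,} pairs, ~{raw_bytes / 1e9:,.1f} GB")
```

At 150,000 molecules the raw values alone come to tens of gigabytes, so a kernel crash on a typical workstation is not surprising.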

I plan to try some other examples this week and will report back with my findings.

Chris

> On 5 Jun 2017, at 10:02, Michał Nowotka <mm...@gm...> wrote:
> 
> Hi,
> 
> Has anyone actually done this: clustered >2M compounds using
> any well-known clustering algorithm, and is willing to share code and
> some performance statistics?
> 
> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster routine, you won't get any
> results in a reasonable time.
> 
> We also tried to extract the 10 most significant features from the latent
> representation described in this paper:
> https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL and
> then use this web-based tool to generate a visualization:
> https://github.com/tensorflow/embedding-projector-standalone but
> we didn't get anything useful from this.
> 
> My last attempt was to use the sfdp tool from the Graphviz package to get some
> sort of primitive clustering. I allocated a lot of RAM to the
> process, but without any luck either.
> 
> I would be interested in all kinds of hints related to clustering
> millions of compounds, especially using DBSCAN/OPTICS-based clustering
> algorithms.
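
For reference, DBSCAN only needs to know which pairs lie within eps of each other, so in principle it can run directly on a sparse list of distances without ever building the dense matrix. The following is a minimal pure-Python sketch of that idea, not chemfp's or scikit-learn's implementation; the function name and the (i, j, distance) triple format are assumptions made for illustration, with distance taken as 1 - Tanimoto similarity.

```python
from collections import defaultdict, deque

NOISE = -1


def dbscan_sparse(n_points, sparse_pairs, eps, min_samples):
    """Label points with DBSCAN using only a sparse list of distances.

    sparse_pairs is an iterable of (i, j, distance) triples; any pair that
    is not listed is treated as farther apart than eps.
    """
    neighbors = defaultdict(set)
    for i, j, dist in sparse_pairs:
        if dist <= eps:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # A point counts as its own neighbor when testing the core condition.
    is_core = [len(neighbors[i]) + 1 >= min_samples for i in range(n_points)]

    labels = [NOISE] * n_points
    cluster_id = 0
    for seed in range(n_points):
        if labels[seed] != NOISE or not is_core[seed]:
            continue
        # Grow a new cluster outward from this unvisited core point.
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:
            point = queue.popleft()
            for nb in neighbors[point]:
                if labels[nb] == NOISE:
                    labels[nb] = cluster_id
                    # Border points join the cluster but are not expanded.
                    if is_core[nb]:
                        queue.append(nb)
        cluster_id += 1
    return labels


# Toy data: seven points, only the close pairs are stored.
pairs = [(0, 1, 0.2), (0, 2, 0.3), (1, 2, 0.25), (2, 3, 0.35),
         (4, 5, 0.1), (4, 6, 0.9)]
labels = dbscan_sparse(7, pairs, eps=0.4, min_samples=3)
print(labels)
```

Memory then scales with the number of stored pairs rather than with N squared, so the similarity threshold used when building the sparse matrix becomes the main tuning knob.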
> 
> Regards,
> 
> Michał Nowotka
> 
> On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo
> <col...@gm...> wrote:
>> Hi Chris,
>> 
>> as far as I know, Butina's sphere exclusion algorithm is the fastest for
>> very large datasets. But if you have 4 million compounds, using RDKit
>> directly can result in very long runs, even after parallelization. For that
>> number of molecules I think there are faster things, like chemfp (see for
>> instance
>> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
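
To make the algorithm concrete, here is a minimal pure-Python sketch of Taylor-Butina clustering on toy fingerprints. It is an illustration under stated assumptions, not the RDKit or chemfp implementation: fingerprints are represented as Python sets of on-bits, and the function names and the 0.5 similarity cutoff are chosen only for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def butina_clusters(fps, sim_cutoff):
    """Cluster fingerprints; returns lists of indices, centroid first."""
    n = len(fps)

    # Step 1: neighbor lists. This all-pairs loop is the quadratic part.
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= sim_cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Step 2: visit molecules from most to fewest neighbors.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)

    # Step 3: each unassigned molecule becomes a centroid and claims
    # all of its neighbors that are still unassigned.
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints: two related groups and one singleton.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = butina_clusters(fps, sim_cutoff=0.5)
print(clusters)
```

The number of clusters is an outcome of the similarity cutoff, not an input. The all-pairs loop in step 1 dominates the cost, and that neighbor search is the step a dedicated similarity-search tool is meant to speed up.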
>> 
>> Cheers
>> 
>> Gonzalo
>> 
>> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <ma...@wo...>
>> wrote:
>>> 
>>> Is there a big difference in the quality of the final dataset between
>>> K-means and random under-sampling of a big database (~20M)?
>>> 
>>> ----
>>> Pozdrawiam,  |  Best regards,
>>> Maciek Wójcikowski
>>> ma...@wo...
>>> 
>>> 2017-06-04 12:24 GMT+02:00 Samo Turk <sam...@gm...>:
>>>> 
>>>> Hi Chris,
>>>> 
>>>> There are other options for clustering. According to this:
>>>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>>>> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on
>>>> density and also allows for outliers, but finding the right parameters can be
>>>> fiddly. As with Butina, you cannot specify the number of clusters.
>>>> If you want to specify the number of clusters, you can simply use K-means.
>>>> The high dimensionality of fingerprints might be a problem for memory
>>>> consumption. In that case you can use PCA to reduce the dimensions to something
>>>> manageable. To avoid memory issues with PCA and to speed things up, I would fit
>>>> the model on a random 100k compounds and then just use the transform method on
>>>> the rest.
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
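
The fit-on-a-subsample idea can be sketched roughly as follows, assuming NumPy and scikit-learn are installed. The random binary matrix stands in for real fingerprints, and the sizes (2,000 rows, a 500-row sample, 10 components, 8 clusters) are scaled-down placeholders, not recommendations. MiniBatchKMeans is swapped in here as a memory-friendly K-means variant.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder data: random 256-bit "fingerprints" instead of real molecules.
n_molecules, n_bits = 2000, 256
X = rng.integers(0, 2, size=(n_molecules, n_bits)).astype(np.float32)

# Fit PCA on a random subsample only (100k compounds in the original advice).
sample_idx = rng.choice(n_molecules, size=500, replace=False)
pca = PCA(n_components=10, random_state=0).fit(X[sample_idx])

# Project everything with transform(), in chunks to bound memory use.
# Calling fit_transform() here would refit the model on each chunk.
chunk_size = 500
reduced = np.empty((n_molecules, 10), dtype=np.float64)
for start in range(0, n_molecules, chunk_size):
    stop = start + chunk_size
    reduced[start:stop] = pca.transform(X[start:stop])

# K-means on the reduced data, with the number of clusters chosen up front.
kmeans = MiniBatchKMeans(n_clusters=8, random_state=0, n_init=3).fit(reduced)
labels = kmeans.labels_
print(reduced.shape, np.bincount(labels))
```

Whether 10 components retain enough of the fingerprint variance is something to check on real data, for example via `pca.explained_variance_ratio_`.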
>>>> 
>>>> Cheers,
>>>> Samo
>>>> 
>>>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@ma...> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I want to do clustering on around 4 million structures.
>>>>> 
>>>>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:
>>>>> 
>>>>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>>>>> to use the Butina clustering algorithm"
>>>>> 
>>>>> However it is quite a step up from a few thousand to several million
>>>>> and I wondered if anyone had used this algorithm on larger data sets?
>>>>> 
>>>>> As far as I can tell it is not possible to define the number of
>>>>> clusters, is this correct?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Chris
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdk...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>> 
