Re: [Rdkit-discuss] Clustering
From: Patrick W. <wpw...@gm...> - 2022-05-01 16:18:49
Similarity search on a database of 4 million is pretty quick with chemfp or FPSim2. Do you need to do the clustering? Here are a couple of relevant blog posts:

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat

On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <tri...@um...> wrote:
> Thank you both for the feedback.
>
> My primary aim is to run an LBVS experiment (similarity search) using a
> set of actives and the dataset of cluster representatives.
>
> On Sun, 1 May 2022, 17:09 Patrick Walters <wpw...@gm...> wrote:
>
>> For me, a lot of this depends on what you intend to do with the
>> clustering. If you want to pick a "representative" subset from a larger
>> dataset, k-means may do the trick. As Rajarshi mentioned, Practical
>> Cheminformatics has a k-means implementation that runs with FAISS.
>> Depending on your goal, choosing a subset with a diversity picker may
>> fit the bill. One annoying aspect of diversity pickers is that the
>> initial selections tend to consist of strange molecules.
>>
>> @Tristan, can you provide more information on what you want to do with
>> the clustering results?
>>
>> Pat
>>
>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <raj...@gm...> wrote:
>>
>>> You could consider using FAISS. An example of clustering 2.1M compounds
>>> is described at
>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>
>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <tri...@um...> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am attempting to cluster a database of circa 4M small molecules and
>>>> I have hit several snags. Using BulkTanimotoSimilarity is not feasible
>>>> because of the resources it requires, so I am now working with FPSim2
>>>> and chemfp to get a (sparse) distance matrix. However, I am finding it
>>>> very challenging to identify an appropriate clustering algorithm. I
>>>> have considered both k-medoids and DBSCAN; each has its own
>>>> limitations: k-medoids requires the number of clusters to be specified
>>>> in advance, and DBSCAN does not provide centroids.
>>>>
>>>> I was wondering whether there is an implementation of the stochastic
>>>> clustering analysis described in https://doi.org/10.1021/ci970056l
>>>> that could be used for this purpose.
>>>>
>>>> Any suggestions on the best method for clustering large datasets, with
>>>> code examples, would be greatly appreciated. I am new to the subject
>>>> and would appreciate any help.
>>>>
>>>> Regards,
>>>> Tristan
>>>
>>> --
>>> Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>
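
For the similarity-search route suggested above, a minimal FPSim2 sketch could look like the following. The input file name, fingerprint parameters, and 0.7 threshold are illustrative only, and the exact create_db_file signature varies between FPSim2 releases, so check the version installed:

# Minimal FPSim2 similarity-search sketch (hypothetical file names and
# parameters; create_db_file arguments differ between FPSim2 versions).
from FPSim2 import FPSim2Engine
from FPSim2.io import create_db_file

# Build the fingerprint file once; 'molecules.smi' is a placeholder input.
create_db_file('molecules.smi', 'fp_db.h5', 'Morgan',
               {'radius': 2, 'nBits': 2048})

fpe = FPSim2Engine('fp_db.h5')

# Tanimoto search for a single query; loop over the set of actives as needed.
query = 'CC(=O)Oc1ccccc1C(=O)O'  # aspirin, purely as an example query
results = fpe.similarity(query, 0.7, n_workers=4)  # (mol_id, coefficient) hits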
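
If the goal is instead the FAISS k-means route from the blog post linked above, a rough, self-contained sketch is shown below; the fingerprint settings, input file, and cluster count are placeholders rather than recommendations:

import numpy as np
import faiss
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp_matrix(mols, radius=2, n_bits=1024):
    """Pack Morgan fingerprints into a float32 matrix for FAISS."""
    X = np.zeros((len(mols), n_bits), dtype=np.float32)
    tmp = np.zeros((n_bits,))
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        DataStructs.ConvertToNumpyArray(fp, tmp)
        X[i] = tmp
    return X

# 'molecules.smi' is a placeholder input file.
mols = [m for m in Chem.SmilesMolSupplier('molecules.smi') if m is not None]
X = fp_matrix(mols)

n_clusters = 1000  # illustrative; tune to the dataset and downstream use
kmeans = faiss.Kmeans(X.shape[1], n_clusters, niter=20, verbose=True)
kmeans.train(X)

# Nearest centroid for every molecule; kmeans.centroids holds the centers.
_, assignments = kmeans.index.search(X, 1)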
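
For the diversity-picker option, RDKit's MaxMinPicker works lazily on bit-vector fingerprints, so no full distance matrix is required. A small sketch follows, with the pick size and seed chosen arbitrarily:

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# 'molecules.smi' is again a placeholder input file.
mols = [m for m in Chem.SmilesMolSupplier('molecules.smi') if m is not None]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

picker = MaxMinPicker()
# LazyBitVectorPick computes Tanimoto distances on the fly as it picks.
pick_indices = picker.LazyBitVectorPick(fps, len(fps), 1000, seed=42)
diverse_subset = [mols[i] for i in pick_indices]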
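
On the DBSCAN side with a precomputed distance matrix, the mechanics look roughly like the toy example below; at the 4M scale the matrix would need to stay sparse (e.g. from a thresholded FPSim2/chemfp search), and the eps/min_samples values here are purely illustrative:

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import DBSCAN

# Toy-sized example only: a dense Tanimoto distance matrix is fine for a
# handful of molecules, but not for millions.
smiles = ['CCO', 'CCN', 'c1ccccc1', 'c1ccccc1O', 'CCCC', 'CCCCC']
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

n = len(fps)
dist = np.zeros((n, n))
for i in range(n):
    dist[i] = 1.0 - np.array(DataStructs.BulkTanimotoSimilarity(fps[i], fps))

# DBSCAN on precomputed distances; points labelled -1 are noise.
labels = DBSCAN(eps=0.6, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)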