Re: [Rdkit-discuss] Clustering
From: Patrick W. <wpw...@gm...> - 2022-05-01 15:09:57
For me, a lot of this depends on what you intend to do with the clustering. If you want to pick a "representative" subset from a larger dataset, k-means may do the trick. As Rajarshi mentioned, Practical Cheminformatics has a k-means implementation that runs with FAISS. Depending on your goal, choosing a subset with a diversity picker may fit the bill instead. One annoying aspect of diversity pickers is that the initial selections tend to consist of strange molecules.

@Tristan, can you provide more information on what you want to do with the clustering results?

Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <raj...@gm...> wrote:

> You could consider using FAISS. An example of clustering 2.1M cmpds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <tri...@um...> wrote:
>
>> Hi,
>>
>> I am attempting to cluster a database of circa 4M small molecules and I
>> have hit several snags.
>> Using BulkTanimoto is not possible because of the resources it requires. I
>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>> matrix). However, I am finding it very challenging to identify an
>> appropriate clustering algorithm. I have considered both k-medoids and
>> DBSCAN. Each of these has its own limitation: k-medoids requires the
>> number of clusters to be stated in advance, and DBSCAN does not yield
>> centroids.
>>
>> I was wondering whether there is an implementation of the stochastic
>> clustering analysis for clustering purposes, described in
>> https://doi.org/10.1021/ci970056l .
>>
>> Any suggestions on the best method for clustering large datasets, with
>> code suggestions, would be greatly appreciated. I am new to the subject and
>> would appreciate any help.
>>
>> Regards,
>> Tristan
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdk...@li...
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> --
> Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>
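
To illustrate the remark above that a diversity picker's first selections tend to be strange molecules, here is a minimal sketch of a greedy MaxMin picker in plain Python. It is not the RDKit implementation (RDKit ships its own MaxMinPicker in rdkit.SimDivFilters) and it is not tuned for 4M compounds. The function names, the representation of a fingerprint as a set of on-bit indices, and the toy data are all assumptions made up for this example.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return 1.0 - shared / (len(a) + len(b) - shared)


def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: repeatedly pick the item farthest from all picks so far.

    Hypothetical helper for illustration; `first` is the index of the seed pick.
    """
    picked = [first]
    picked_set = {first}
    # min_dist[i] = distance from item i to its nearest already-picked item
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked_set]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        picked_set.add(nxt)
        # Update each item's distance to its nearest pick.
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy data (made up): five closely related fingerprints plus one outlier.
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 6}),
    frozenset({1, 2, 4, 5}),
    frozenset({2, 3, 4, 5}),
    frozenset({90, 91, 92}),  # shares no bits with the others
]

picks = maxmin_pick(toy_fps, n_pick=3)
print(picks)  # the outlier (index 5) is chosen right after the seed
```

Because each step maximizes the distance to everything already chosen, outliers are selected before anything from the dense part of the dataset. If that is a problem for your use case, one option is to filter obvious oddballs before picking.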