Re: [Rdkit-discuss] Clustering
From: Michał N. <mm...@gm...> - 2017-06-12 14:07:04
Hi,

Thanks for all the answers, especially those pointing to code examples; very useful.

I should have been more specific when asking about clustering >2M compounds. An example I would like to see would use:

1. A clustering algorithm that does not require specifying the number of classes upfront (so not k-means).
2. An algorithm that is a bit more sophisticated than Taylor-Butina.
3. Preferably one from pyclustering (NOT from scipy.cluster, sorry for the mistake).

With those somewhat more sophisticated algorithms, running PCA first will not help. You can try to cluster >2M points on a 2D surface and you will find that this is not a trivial task. That being said, I don't expect any amazing results when clustering compounds with those algorithms, and I agree that clustering a random sample can give similar results. This question is more out of curiosity.

Michał

On Sun, Jun 11, 2017 at 7:58 PM, Samo Turk <sam...@gm...> wrote:
> Hi All,
>
> I have to admit I was commenting about PCA -> k-means without actually trying. Out of curiosity I implemented it here:
> https://github.com/samoturk/cheminf-notebooks/tree/master/Python#pca-k-meanspy
>
> It can process 4M compounds in ~60 minutes on a desktop i5 and it should work with 16GB of RAM. The clusters that come out make (some) sense, but in this regard Butina is better.
>
> PS. DataWarrior can easily load the results (it takes 30 min) but then it works smoothly.
>
> Cheers,
> Samo
>
> On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal <abh...@gm...> wrote:
>>
>> Hello all,
>>
>> How about doing some dimension reduction using PCA or t-SNE and then running the clustering on a few selected top components, say the top 20? I think the clustering would then be fast.
>>
>> Thanks,
>> Abhik
>>
>> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <dav...@gm...> wrote:
>>>
>>> Hi,
>>>
>>> I have used this algorithm for many years, clustering sets of several million compounds. Indeed, I am old enough to know it as the Taylor algorithm. It is slow but reliable. A crucial setting is the similarity threshold for the clusters, which dictates the size of the neighbour lists and hence the amount of RAM required. It also, of course, determines the quality of the clusters.
>>>
>>> My implementation is at https://github.com/OpenEye-Contrib/Flush.git. This repo has a number of programs of relevance; the one you want is called cluster. I have just confirmed that it compiles on Ubuntu 16. It needs the fingerprints as ASCII bitstrings. I don't have code for turning RDKit fingerprints into this format, but I would imagine it's quite straightforward.
>>>
>>> The program runs in parallel using OpenMPI. That's valuable for two reasons. One is speed, but the more important one is memory use: if you can spread the slave processes over several machines, you can cluster much larger sets of molecules, as you are effectively expanding the RAM of the machine. When I wrote the original, 64MB was a lot of RAM; it is less of an issue these days, but still matters when clustering millions of fingerprints. Note that the cluster program never stores the distance matrix, just the lists of neighbours for each molecule within the threshold. This reduces the memory footprint substantially if you have a tight enough cluster threshold.
>>>
>>> HTH,
>>> Dave
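Regarding the ASCII bitstrings mentioned above, a minimal RDKit sketch might look like the following. The "name bitstring" one-per-line layout, the input file name, and the 1024-bit Morgan fingerprint settings are illustrative assumptions; check the exact format the cluster program actually expects.

    # Sketch: dump RDKit Morgan fingerprints as ASCII bitstrings, one molecule per line.
    # The "name bitstring" layout and fingerprint settings are illustrative assumptions.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    # hypothetical input file with "SMILES name" per line, no title line
    suppl = Chem.SmilesMolSupplier("compounds.smi", titleLine=False)
    with open("fps.txt", "w") as out:
        for mol in suppl:
            if mol is None:
                continue
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
            out.write("%s %s\n" % (mol.GetProp("_Name"), fp.ToBitString()))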
>>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <nil...@gm...> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> I have done this a couple of times for compound sets of up to 10M+ using a simplified variant of the Taylor-Butina algorithm. The overall run time was in the range of hours to a few days (which could probably be optimized, but was fast enough for me).
>>>>
>>>> As you correctly mentioned, getting the (sparse) similarity matrix is fairly simple (and can be done in parallel on a cluster). Unfortunately, this matrix gets very large (even the sparse version). Most clustering algorithms require random access to the matrix, so you have to keep it in main memory (which then has to be huge) or calculate it on the fly (which takes forever).
>>>>
>>>> My implementation (in C++, not sure if I can share it) assumes that the similarity matrix has been pre-calculated and is stored in one (or multiple) files. It reads these files sequentially, and whenever a compound pair with a similarity beyond the threshold is found, it checks whether one of the compounds is already a centroid (in which case the other is assigned to it). Otherwise, one of the compounds is randomly chosen as the centroid and the other is assigned to it.
>>>>
>>>> This procedure is highly order-dependent and thus not optimal, but it has to read the whole similarity matrix only once and has limited memory consumption (you only need to keep a list of centroids). If you still run into memory issues, you can start by clustering with a high similarity threshold and then re-cluster the centroids and singletons at a lower threshold.
>>>>
>>>> I also played around with DBSCAN for large compound databases, but (as previously mentioned by Samo) found it difficult to find the right parameters, and in many cases ended up with a single huge cluster covering 90 percent of the database.
>>>>
>>>> Hope this helps,
>>>> Nils
>>>>
>>>> On 05.06.2017 at 11:02, Michał Nowotka wrote:
>>>> > Is there anyone who has actually done this: clustered >2M compounds using
>>>> > any well-known clustering algorithm and is willing to share code and
>>>> > some performance statistics?
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>
>> --
>> Cheers,
>> Abhik Seal Ph.D. (Cheminformatics)
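The single-pass assignment Nils describes above can be sketched roughly as follows. The edge-list file format (one "id_i id_j similarity" triple per line, with only pairs above the cut-off stored) and the threshold value are illustrative assumptions, not his actual implementation; compounds that never receive an assignment end up as singletons.

    # Rough sketch of a single-pass, order-dependent centroid assignment over a
    # pre-computed sparse similarity file (illustrative "id_i id_j sim" lines).
    import random

    def leader_cluster(sim_file, threshold=0.7):
        assignment = {}   # compound id -> centroid id
        centroids = set()
        with open(sim_file) as fh:
            for line in fh:
                i, j, sim = line.split()
                if float(sim) < threshold:
                    continue
                if i in centroids and j not in assignment:
                    assignment[j] = i
                elif j in centroids and i not in assignment:
                    assignment[i] = j
                elif i not in assignment and j not in assignment:
                    # neither compound has been seen: pick one at random as a new centroid
                    c, other = random.sample((i, j), 2)
                    centroids.add(c)
                    assignment[c] = c
                    assignment[other] = c
        return assignment, centroids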
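For completeness, here is a rough scikit-learn sketch of the PCA -> k-means route that Samo and Abhik describe above (not Samo's actual script; the fingerprint settings, number of components, and cluster count are illustrative assumptions). Note that, unlike the algorithms Michał asks about, k-means still requires choosing the number of clusters upfront.

    # Sketch: PCA -> mini-batch k-means on RDKit Morgan fingerprints (toy settings).
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.decomposition import PCA
    from sklearn.cluster import MiniBatchKMeans

    def fp_matrix(smiles_list, n_bits=1024):
        """Morgan fingerprints as a dense numpy array, one row per parseable molecule."""
        rows = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
            arr = np.zeros((n_bits,), dtype=np.int8)
            DataStructs.ConvertToNumpyArray(fp, arr)
            rows.append(arr)
        return np.array(rows)

    smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CC(=O)N"]  # toy input
    X = fp_matrix(smiles)
    X_red = PCA(n_components=2).fit_transform(X)         # ~20 components for a real set
    labels = MiniBatchKMeans(n_clusters=2, n_init=3).fit_predict(X_red)

For multi-million-compound sets, generating the fingerprints in chunks and using IncrementalPCA instead of PCA would keep the memory footprint closer to what Samo reports.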