Re: [Rdkit-discuss] Clustering
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
From: Abhik S. <abh...@gm...> - 2017-06-05 17:46:52
|
Hello all , How about doing some dimension reduction using pca or Tsne and then run clustering using some selected top components like top 20 and I think then the clustering would be fast . Thanks Abhik On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <dav...@gm...> wrote: > Hi, > I have used this algorithm for many years clustering sets of several > millions of compounds. Indeed, I am old enough to know it as the Taylor > algorithm. It is slow but reliable. A crucial setting is the similarity > threshold for the clusters, which dictates the size of the neighbour lists > and hence the amount of RAM required. It also, of course, determines the > quality of the clusters. My implementation is at > https://github.com/OpenEye-Contrib/Flush.git. This repo has a number of > programs of relevance, the one you want is called cluster. I have just > confirmed that it compiles on ubuntu 16. It needs the fingerprints as > ascii bitstrings, I don't have code for turning RDKit fingerprints into > this format, but I would imagine it's quite straightforward. The program > runs in parallel using OpenMPI. That's valuable for two reasons. One is > speed, but the more important one is memory use. If you can spread the > slave processes over several machines you can cluster much larger sets of > molecules as you are effectively expanding the RAM of the machine. When I > wrote the original, 64MB was a lot of RAM, it is less of an issue these > days but still matters if clustering millions of fingerprints. Note that > the program cluster doesn't ever store the distance matrix, just the lists > of neighbours for each molecule within the threshold. This reduces the > memory footprint substantially if you have a tight-enough cluster threshold. > HTH, > Dave > > > > On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <nil...@gm...> > wrote: > >> Hi Michal, >> >> I have done this a couple of times for compound sets up to 10M+ using a >> simplified variant of the Taylor-Butina algorithm. The overall run time >> was in the range of hours to a few days (which could probably be >> optimized, but was fast enough for me). >> >> As you correctly mentioned, getting the (sparse) similarity matrix is >> fairly simple (and can be done in parallel on a cluster). Unfortunately, >> this matrix gets very large (even the sparse version). Most clustering >> algorithms require random access to the matrix, so you have to keep it >> in main memory (which then has to be huge) or calculate it on-the-fly >> (takes forever). >> >> My implementation (in C++, not sure if I can share it) assumes that the >> similarity matrix has been pre-calculated and is stored in one (or >> multiple) files. It reads these files sequentially and whenever a >> compound pair with a similarity beyond the threshold is found, it checks >> whether one of the cpds. is already a centroid (in which case the other >> is assigned to it). Otherwise, one of the compounds is randomly chosen >> as centroid and the other is assigned to it. >> >> This procedure is highly order-dependent and thus not optimal, but has >> to read the whole similarity matrix only once and has limited memory >> consumption (you only need to keep a list of centroids). If you still >> run into memory issues, you can start by clustering with a high >> similarity threshold and then re-cluster centroids and singletons on a >> lower threshold level. >> >> I also played around with DBSCAN for large compound databases, but (as >> previously mentioned by Samo) found it difficult to find the right >> parameters and ended up with a single huge cluster covering 90 percent >> of the database in many cases. >> >> Hope this helps, >> Nils >> >> Am 05.06.2017 um 11:02 schrieb Michał Nowotka: >> > Is there anyone who actually done this: clustered >2M compounds using >> > any well-known clustering algorithm and is willing to share a code and >> > some performance statistics? >> >> >> >> ------------------------------------------------------------------------------ >> Check out the vibrant tech community on one of the world's most >> engaging tech sites, Slashdot.org! http://sdm.link/slashdot >> _______________________________________________ >> Rdkit-discuss mailing list >> Rdk...@li... >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > > > > -- > David Cosgrove > Freelance computational chemistry and chemoinformatics developer > http://cozchemix.co.uk > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > -- Cheers, Abhik Seal Ph.D. (Cheminformatics) |