Re: [Rdkit-discuss] Clustering

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hello all ,

How about doing some dimension reduction using  pca or Tsne and then run
clustering using some selected top components like top 20 and I think then
the clustering would be fast .

Thanks
Abhik

On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <dav...@gm...>
wrote:

> Hi,
> I have used this algorithm for many years clustering sets of several
> millions of compounds.  Indeed, I am old enough to know it as the Taylor
> algorithm.  It is slow but reliable.  A crucial setting is the similarity
> threshold for the clusters, which dictates the size of the neighbour lists
> and hence the amount of RAM required.  It also, of course, determines the
> quality of the clusters.  My implementation is at
> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
> programs of relevance, the one you want is called cluster.  I have just
> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
> ascii bitstrings, I don't have code for turning RDKit fingerprints into
> this format, but I would imagine it's quite straightforward.  The program
> runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
> speed, but the more important one is memory use.  If you can spread the
> slave processes over several machines you can cluster much larger sets of
> molecules as you are effectively expanding the RAM of the machine.  When I
> wrote the original, 64MB was a lot of RAM, it is less of an issue these
> days but still matters if clustering millions of fingerprints.  Note that
> the program cluster doesn't ever store the distance matrix, just the lists
> of neighbours for each molecule within the threshold.  This reduces the
> memory footprint substantially if you have a tight-enough cluster threshold.
> HTH,
> Dave
>
>
>
> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <nil...@gm...>
> wrote:
>
>> Hi Michal,
>>
>> I have done this a couple of times for compound sets up to 10M+ using a
>> simplified variant of the Taylor-Butina algorithm. The overall run time
>> was in the range of hours to a few days (which could probably be
>> optimized, but was fast enough for me).
>>
>> As you correctly mentioned, getting the (sparse) similarity matrix is
>> fairly simple (and can be done in parallel on a cluster). Unfortunately,
>> this matrix gets very large (even the sparse version). Most clustering
>> algorithms require random access to the matrix, so you have to keep it
>> in main memory (which then has to be huge) or calculate it on-the-fly
>> (takes forever).
>>
>> My implementation (in C++, not sure if I can share it) assumes that the
>> similarity matrix has been pre-calculated and is stored in one (or
>> multiple) files. It reads these files sequentially and whenever a
>> compound pair with a similarity beyond the threshold is found, it checks
>> whether one of the cpds. is already a centroid (in which case the other
>> is assigned to it). Otherwise, one of the compounds is randomly chosen
>> as centroid and the other is assigned to it.
>>
>> This procedure is highly order-dependent and thus not optimal, but has
>> to read the whole similarity matrix only once and has limited memory
>> consumption (you only need to keep a list of centroids). If you still
>> run into memory issues, you can start by clustering with a high
>> similarity threshold and then re-cluster centroids and singletons on a
>> lower threshold level.
>>
>> I also played around with DBSCAN for large compound databases, but (as
>> previously mentioned by Samo) found it difficult to find the right
>> parameters and ended up with a single huge cluster covering 90 percent
>> of the database in many cases.
>>
>> Hope this helps,
>> Nils
>>
>> Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
>> > Is there anyone who actually done this: clustered >2M compounds using
>> > any well-known clustering algorithm and is willing to share a code and
>> > some performance statistics?
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdk...@li...
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdk...@li...
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 

Cheers,
Abhik Seal  Ph.D. (Cheminformatics)

Re: [Rdkit-discuss] Clustering

Open-Source Cheminformatics and Machine Learning

Re: [Rdkit-discuss] Clustering