Re: [Rdkit-discuss] Clustering

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,
I have used this algorithm for many years clustering sets of several
millions of compounds.  Indeed, I am old enough to know it as the Taylor
algorithm.  It is slow but reliable.  A crucial setting is the similarity
threshold for the clusters, which dictates the size of the neighbour lists
and hence the amount of RAM required.  It also, of course, determines the
quality of the clusters.  My implementation is at
https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
programs of relevance, the one you want is called cluster.  I have just
confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
ascii bitstrings, I don't have code for turning RDKit fingerprints into
this format, but I would imagine it's quite straightforward.  The program
runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
speed, but the more important one is memory use.  If you can spread the
slave processes over several machines you can cluster much larger sets of
molecules as you are effectively expanding the RAM of the machine.  When I
wrote the original, 64MB was a lot of RAM, it is less of an issue these
days but still matters if clustering millions of fingerprints.  Note that
the program cluster doesn't ever store the distance matrix, just the lists
of neighbours for each molecule within the threshold.  This reduces the
memory footprint substantially if you have a tight-enough cluster threshold.
HTH,
Dave

On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <nil...@gm...>
wrote:

> Hi Michal,
>
> I have done this a couple of times for compound sets up to 10M+ using a
> simplified variant of the Taylor-Butina algorithm. The overall run time
> was in the range of hours to a few days (which could probably be
> optimized, but was fast enough for me).
>
> As you correctly mentioned, getting the (sparse) similarity matrix is
> fairly simple (and can be done in parallel on a cluster). Unfortunately,
> this matrix gets very large (even the sparse version). Most clustering
> algorithms require random access to the matrix, so you have to keep it
> in main memory (which then has to be huge) or calculate it on-the-fly
> (takes forever).
>
> My implementation (in C++, not sure if I can share it) assumes that the
> similarity matrix has been pre-calculated and is stored in one (or
> multiple) files. It reads these files sequentially and whenever a
> compound pair with a similarity beyond the threshold is found, it checks
> whether one of the cpds. is already a centroid (in which case the other
> is assigned to it). Otherwise, one of the compounds is randomly chosen
> as centroid and the other is assigned to it.
>
> This procedure is highly order-dependent and thus not optimal, but has
> to read the whole similarity matrix only once and has limited memory
> consumption (you only need to keep a list of centroids). If you still
> run into memory issues, you can start by clustering with a high
> similarity threshold and then re-cluster centroids and singletons on a
> lower threshold level.
>
> I also played around with DBSCAN for large compound databases, but (as
> previously mentioned by Samo) found it difficult to find the right
> parameters and ended up with a single huge cluster covering 90 percent
> of the database in many cases.
>
> Hope this helps,
> Nils
>
> Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
> > Is there anyone who actually done this: clustered >2M compounds using
> > any well-known clustering algorithm and is willing to share a code and
> > some performance statistics?
>
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdk...@li...
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

Re: [Rdkit-discuss] Clustering

Open-Source Cheminformatics and Machine Learning

Re: [Rdkit-discuss] Clustering