Re: [Rdkit-discuss] Clustering
From: Andrew D. <da...@da...> - 2017-06-06 02:00:57
On Jun 5, 2017, at 11:02, Michał Nowotka <mm...@gm...> wrote:
> Is there anyone who has actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share a code and
> some performance statistics?

Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M compound data sets using the Taylor-Butina algorithm.

> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster you won't get any
> results in a reasonable time.

On Jun 5, 2017, at 10:19, Gonzalo Colmenarejo <col...@gm...> wrote:
> I think there are faster things, like chemfp (see for instance
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).

I've fleshed out that algorithm into a command-line program that can be used for benchmarking. It's available from http://dalkescientific.com/writings/taylor_butina.py . If anyone uses it for benchmarking, or has improvements, let me know. If I get useful feedback, I'll include it in an upcoming chemfp 1.3 release.

On my 7-year-old desktop (3.2 GHz Intel Core i3), with 4 OpenMP threads (the default), I estimate it would take a bit over an hour to cluster 1M fingerprints with 2048 bits/fingerprint, using a threshold of 0.8.

I just ran it now with the following:

  % time python taylor_butina.py --profile --threshold 0.8 pubchem_million.fps -o pubchem_million.clusters

The profile report says:

  #fingerprints: 1000000  #bits/fp: 881  threshold: 0.8  #matches: 667808946
  Load time: 4.9 sec  memory: 288 MB
  Similarity time: 2037.3 sec  memory: 8.00 GB
  Clustering time: 17.5 sec  memory: -405549056 B
  Total time: 2061.3 sec
  7044.489u 323.528s 34:26.26 356.5% 0+0k 5+13io 233pf+0w

The "load" line says the data needed 288 MB for 1M 881-bit fingerprints. That scales linearly in the number of bits and the number of records. The load time is small compared to the similarity time.

The "similarity" search took about 8 GB for a sparse matrix with 667,808,946 terms, or about 13 bytes per term (each hit requires an int (4 bytes) and a double (8 bytes), plus overhead). This memory use does not depend on the number of bits in the fingerprint, only on the number of matches in the matrix.

The "similarity" time is linear in the number of bits, so 2048 bits would be about 2.3x slower. It's quadratic in the number of fingerprints, so 2M fingerprints would take about four times as long, roughly 2 hours.

It took 18 seconds to do the Taylor-Butina clustering, given the sparse matrix. On a multi-core machine, if you watch the CPU usage you can easily see when it switches from the multi-threaded similarity search code (in C using OpenMP) to the single-threaded clustering code.

The "clustering" line says it took -405549056 bytes, which is -387 MB. I expected this value to be positive. I'm not sure what's happening there to give a negative number, but that's not an important term.

Cheers,
  Andrew
  da...@da...
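
P.S. For anyone who would rather see the shape of the approach than read the full script, here is a minimal sketch modeled on the Taylor-Butina example in the chemfp documentation that Gonzalo linked to. It is not taylor_butina.py itself: the file name is a placeholder, the chemfp entry points used (load_fingerprints() and search.threshold_tanimoto_search_symmetric()) should be checked against your installed version, and the clustering loop is a deliberately plain, unoptimized Python rendering of the algorithm.

  import chemfp
  from chemfp import search

  # Load an FPS file into an in-memory fingerprint arena
  # ("pubchem_million.fps" is just a placeholder name).
  arena = chemfp.load_fingerprints("pubchem_million.fps")

  # Symmetric NxN threshold search. This is the multi-threaded step
  # that produces the big sparse "similarity matrix".
  results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.8)

  # Neighbor lists as plain Python sets, keyed by arena position.
  neighbors = {i: set(indices)
               for i, indices in enumerate(results.iter_indices())}

  # Taylor-Butina: rank fingerprints by neighbor count, largest first,
  # then sweep; each still-unassigned candidate becomes a centroid and
  # claims its still-unassigned neighbors.
  order = sorted(neighbors, key=lambda i: len(neighbors[i]), reverse=True)
  assigned = set()
  clusters = []
  for centroid in order:
      if centroid in assigned:
          continue
      members = (neighbors[centroid] - assigned) | {centroid}
      clusters.append((centroid, sorted(members)))
      assigned |= members

  print("clusters (including singletons):", len(clusters))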
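
P.P.S. To make the scaling comments concrete, here is the back-of-the-envelope arithmetic as a tiny standalone snippet. The constants are the ones from the profile report above; treating the reported 8.00 GB as 2**30-byte units is my assumption, and it reproduces the "~13 bytes per term", "a bit over an hour", and "about 2 hours" figures.

  # Measured: 1M fingerprints, 881 bits/fp, threshold 0.8.
  measured_sec = 2037.3
  measured_matches = 667_808_946

  # Sparse-matrix cost per stored match (int + double + overhead).
  bytes_per_match = 8.00 * 2**30 / measured_matches
  print(round(bytes_per_match, 1))  # ~12.9 bytes/term

  # Similarity time: linear in #bits, quadratic in #fingerprints.
  def estimate_hours(n_fps, n_bits):
      return measured_sec * (n_bits / 881) * (n_fps / 1_000_000) ** 2 / 3600

  print(round(estimate_hours(1_000_000, 2048), 2))  # ~1.3 h -> "a bit over an hour"
  print(round(estimate_hours(2_000_000, 881), 2))   # ~2.3 h -> "about 2 hours"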