In fact, I was also using a subset of the NCI dataset, and I see that
this is a general problem with the NCI dataset (cf. Chemoinformatics:
Concepts, Methods, and Tools for Drug Discovery, Bajorath, Jürgen,
2004): it contains a number of molecules with a large number of cycles.

Presumably, the number of cycles correlates with the M.W., so if I am
interested in drug-like molecules (which I was), I could just apply a
cutoff, maybe twice the rule-of-five value, i.e. 1000, which would be
pretty safe (just guessing here).

On the other hand, it is only necessary to calculate the fingerprint for
a given molecule once. Perhaps an SQL database of fingerprints for the
NCI dataset would be very useful. Better still (for me :), an n x n
matrix of Tanimoto values. Anybody interested in making this publicly
available somehow?

On Mon, 2005-04-18 at 12:33, Nina Jeliazkova wrote:
> Noel, all,
>
> I ran into the problem of slow fingerprints (and SMILES as well) some
> months ago, while playing with the NCI dataset. There are some molecules
> in this dataset which can run for two days.
>
> In fact the slow part is the AllRingsFinder class, and although the
> algorithm implemented for finding all rings is published, it is not very
> efficient in some cases. I could provide timing statistics for almost
> all of the NCI dataset if anybody is interested.
>
> A test I had developed is as follows:
>
> 1) calculate the spanning tree of the molecule (I would be glad to
> contribute the code to CDK; I couldn't find spanning tree functionality
> some months ago, haven't checked recently).
> This is a classic and fast algorithm, so no problems with timing.
>
> 2) identify the number of cyclic bonds (this is straightforward from a
> spanning tree)
>
> 3) identify the maximum bonds per atom
>
> 4) calling AllRingsFinder is safe for compounds with the number of
> cyclic bonds less than about 37 (this is heuristic!) and maximum bonds
> per atom <= 4 (yes, there are some exotic structures within the NCI
> dataset with more than 4 bonds per atom)
>
> This makes things safe (btw, some structures which could possibly go
> fast will be missed), but nevertheless it is just a workaround.
>
> The better solution is to have a flag inside the AllRingsFinder, so that
> if it is called in a thread, one just kills the thread if the allowed
> time is exhausted. If anybody is interested in code / statistics, please
> let me know.

Feature Requests item #1181323, was opened at 2005-04-12 09:26

Category: cdk.fingerprint
Group: None
Status: Open
Priority: 5
Submitted By: Noel O'Boyle (baoilleach)
Assigned to: Christoph Steinbeck (steinbeck)
Summary: Test for very slow fingerprints

Initial Comment:
I have been calculating fingerprints for 3000
'real-life' molecules, using the default settings for
the FingerPrinter class (which are not described in the
API JavaDoc - I think they probably should be). Most
molecules took a fraction of a second to calculate.
However, a couple of them took up to 8 hours.
This was due to a large number of subgraphs (I think).
Is there any way to guesstimate whether a particular
molecule will be very slow to FingerPrint, so that it
can be left out of a screen if desired? In the end, it
took around 4 days to calculate fingerprints for the
3000 molecules. To be fair to FingerPrinter, the slow
molecules did not look very drug-like, but I would have
preferred to leave 6 molecules out and complete the
calculation in one hour, rather than include them, and
take 4 days.
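
One way to leave the slow molecules out without knowing them in advance is to give each molecule a time budget and skip it when the budget runs out, along the lines of the time-limit idea in the reply quoted above. The following is only a minimal sketch in plain Python, not CDK code: the molecule is assumed to be a dict mapping each atom index to a list of neighbour indices, and `count_paths`, `screen` and `BudgetExceeded` are hypothetical names invented here. The path enumeration stands in for whatever expensive search the fingerprinter actually performs.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the enumeration runs past its time budget."""


def count_paths(adjacency, max_depth, budget_seconds):
    """Count simple paths of up to max_depth bonds (each direction and each
    single atom counted separately), giving up after budget_seconds."""
    deadline = time.monotonic() + budget_seconds
    count = 0

    def extend(path, visited):
        nonlocal count
        # Cooperative check: the search itself notices that time is up.
        if time.monotonic() >= deadline:
            raise BudgetExceeded
        count += 1
        if len(path) - 1 == max_depth:
            return
        for nbr in adjacency[path[-1]]:
            if nbr not in visited:
                visited.add(nbr)
                path.append(nbr)
                extend(path, visited)
                path.pop()
                visited.remove(nbr)

    for start in adjacency:
        extend([start], {start})
    return count


def screen(molecules, max_depth=6, budget_seconds=1.0):
    """Process every molecule, recording the ones that ran out of time."""
    done, skipped = {}, []
    for name, adjacency in molecules.items():
        try:
            done[name] = count_paths(adjacency, max_depth, budget_seconds)
        except BudgetExceeded:
            skipped.append(name)
    return done, skipped


# Three-atom chain (heavy atoms of propane), purely as a toy input.
propane = {0: [1], 1: [0, 2], 2: [1]}
done, skipped = screen({"propane": propane}, budget_seconds=10.0)
```

With a budget of a second or so per molecule, a 3000-molecule screen is bounded at under an hour in the worst case, and the `skipped` list records exactly which structures were left out.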
If you are interested, I have attached one of the slow
molecules.
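
As for guesstimating in advance, the four-step test described in the reply quoted above (spanning tree, number of cyclic bonds, maximum bonds per atom, thresholds) could be sketched roughly as follows. This is plain Python rather than CDK code, and an assumption-laden illustration, not the code offered in that reply: the molecule is assumed to be a dict mapping each atom index to its neighbour indices, bond orders are ignored, and `ring_bonds` and `probably_safe` are hypothetical names. The thresholds (fewer than 37 cyclic bonds, at most 4 bonds per atom) are the heuristic values quoted there.

```python
from collections import deque


def ring_bonds(adjacency):
    """Return the set of bonds (frozensets of two atoms) that lie in a ring."""
    parent, depth, non_tree = {}, {}, []
    # Step 1: breadth-first spanning forest; edges not in it close a ring.
    for root in adjacency:
        if root in parent:
            continue
        parent[root], depth[root] = None, 0
        queue = deque([root])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in parent:
                    parent[nbr], depth[nbr] = atom, depth[atom] + 1
                    queue.append(nbr)
                elif parent[atom] != nbr and parent[nbr] != atom:
                    non_tree.append((atom, nbr))
    # Step 2: each non-tree bond plus the tree path joining its two ends
    # forms a ring; the union over all such rings is every cyclic bond.
    cyclic = set()
    for u, v in non_tree:
        cyclic.add(frozenset((u, v)))
        while u != v:
            if depth[u] < depth[v]:
                u, v = v, u
            cyclic.add(frozenset((u, parent[u])))
            u = parent[u]
    return cyclic


def probably_safe(adjacency, max_ring_bonds=37, max_degree=4):
    """Steps 3 and 4: apply the heuristic thresholds from the thread."""
    degree = max((len(nbrs) for nbrs in adjacency.values()), default=0)
    return len(ring_bonds(adjacency)) < max_ring_bonds and degree <= max_degree


# Methylcyclohexane skeleton: ring atoms 0-5, methyl carbon 6 on atom 0.
methylcyclohexane = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4],
                     4: [3, 5], 5: [4, 0], 6: [0]}
# One atom with five neighbours, which the heuristic rejects.
five_coordinate = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
```

Both checks are linear-time graph walks, so they cost next to nothing compared with the fingerprint itself. As noted in the thread, the cutoff is conservative and will also reject some structures that would in fact have run quickly.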

Noel 