Re: [Rdkit-discuss] Count carbon atoms

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Oct 7, 2015, at 11:30 AM, Christos Kannas wrote:
> Yes there is an easier way, by using substructure search, i.e. do a substructure search for [C] and then get the number of matches.

> m = Chem.MolFromSmiles("CCCCCCCCc1ccccc1")

> patt= Chem.MolFromSmarts("[C]")

> pm = m.GetSubstructMatches(patt)
> print len(pm)
> 8

You'll notice that this returned 8 when Joos Kiener wanted the count of all the C-Atoms. I assume that means all 14 carbon atoms, and not only the 8 aliphatic carbon atoms, so the SMARTS should be tweaked somewhat:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("CCCCCCCCc1ccccc1")
>>> pat = Chem.MolFromSmarts("[#6]")
>>> len(mol.GetSubstructMatches(pat))
14

On the topic of counting carbons given a molecule, there are two general approaches - the SMARTS pattern, and atom iteration, though it's better to count the number of atomic number matches rather than use the symbol:

from rdkit import Chem
pat = Chem.MolFromSmarts("[#6]")

def count1(mol):
    return len(mol.GetSubstructMatches(pat))

def count2(mol):
    return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)

It's possible to time these two functions, to see which has the better performance:

import time

def go(count, mol, n=1000):
    _range = range(n)
    t1 = time.time()
    m = mol
    for x in _range:
        c = count(mol)
    t2 = time.time()
    return c, t2-t1

With that timing scaffold in place:

>>> print go(count1, mol)
(8, 0.020318031311035156)
>>> print go(count2, mol)
(8, 0.07634186744689941)

This makes it seem like the SMARTS match solution is the best. However, there are two lurking details.

First, when the structure has more than ~256 carbons, the iterative solution is faster:

>>> for i in range(10):
...     mol = Chem.MolFromSmiles("C" * (2**i))
...     c1, t1 = go(count1, mol)
...     c2, t2 = go(count2, mol)
...     assert c1 == c2
...     print 2**i, t1, t2
... 
1 0.00930690765381 0.0515530109406
2 0.00659918785095 0.0535910129547
4 0.00970101356506 0.0570240020752
8 0.0159358978271 0.0636560916901
16 0.0273370742798 0.0776889324188
32 0.0527520179749 0.103546857834
64 0.107499837875 0.159503221512
128 0.218550920486 0.266411066055
256 0.513395786285 0.4887611866
512 1.53533577919 0.901851892471

Second, by default there is a limit to the number of SMARTS matches:

>>> mol = Chem.MolFromSmiles("C" * 999)
>>> count1(mol)
999
>>> mol = Chem.MolFromSmiles("C" * 1000)
>>> count1(mol)
1000
>>> mol = Chem.MolFromSmiles("C" * 1001)
>>> count1(mol)
1000

Quoting from the documentation:

  In high-symmetry cases with medium-sized molecules, it is very
  easy to end up with a combinatorial explosion in the number of
  possible matches. This argument prevents that from having
  unintended consequences

Instead, this is a better general-purpose SMART matcher count function:

def count3(mol):
    return len(mol.GetSubstructMatches(pat, maxMatches=mol.GetNumAtoms()))

>>> count3(mol)
1001

				Andrew
				da...@da...

Re: [Rdkit-discuss] Count carbon atoms

Open-Source Cheminformatics and Machine Learning

Re: [Rdkit-discuss] Count carbon atoms