Re: [Rdkit-discuss] Count carbon atoms
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
|
From: Andrew D. <da...@da...> - 2015-10-07 10:55:54
|
On Oct 7, 2015, at 11:30 AM, Christos Kannas wrote:
> Yes there is an easier way, by using substructure search, i.e. do a substructure search for [C] and then get the number of matches.
> m = Chem.MolFromSmiles("CCCCCCCCc1ccccc1")
> patt= Chem.MolFromSmarts("[C]")
> pm = m.GetSubstructMatches(patt)
> print len(pm)
> 8
You'll notice that this returned 8 when Joos Kiener wanted the count of all the C-Atoms. I assume that means all 14 carbon atoms, and not only the 8 aliphatic carbon atoms, so the SMARTS should be tweaked somewhat:
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("CCCCCCCCc1ccccc1")
>>> pat = Chem.MolFromSmarts("[#6]")
>>> len(mol.GetSubstructMatches(pat))
14
On the topic of counting carbons given a molecule, there are two general approaches - the SMARTS pattern, and atom iteration, though it's better to count the number of atomic number matches rather than use the symbol:
from rdkit import Chem
pat = Chem.MolFromSmarts("[#6]")
def count1(mol):
return len(mol.GetSubstructMatches(pat))
def count2(mol):
return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
It's possible to time these two functions, to see which has the better performance:
import time
def go(count, mol, n=1000):
_range = range(n)
t1 = time.time()
m = mol
for x in _range:
c = count(mol)
t2 = time.time()
return c, t2-t1
With that timing scaffold in place:
>>> print go(count1, mol)
(8, 0.020318031311035156)
>>> print go(count2, mol)
(8, 0.07634186744689941)
This makes it seem like the SMARTS match solution is the best. However, there are two lurking details.
First, when the structure has more than ~256 carbons, the iterative solution is faster:
>>> for i in range(10):
... mol = Chem.MolFromSmiles("C" * (2**i))
... c1, t1 = go(count1, mol)
... c2, t2 = go(count2, mol)
... assert c1 == c2
... print 2**i, t1, t2
...
1 0.00930690765381 0.0515530109406
2 0.00659918785095 0.0535910129547
4 0.00970101356506 0.0570240020752
8 0.0159358978271 0.0636560916901
16 0.0273370742798 0.0776889324188
32 0.0527520179749 0.103546857834
64 0.107499837875 0.159503221512
128 0.218550920486 0.266411066055
256 0.513395786285 0.4887611866
512 1.53533577919 0.901851892471
Second, by default there is a limit to the number of SMARTS matches:
>>> mol = Chem.MolFromSmiles("C" * 999)
>>> count1(mol)
999
>>> mol = Chem.MolFromSmiles("C" * 1000)
>>> count1(mol)
1000
>>> mol = Chem.MolFromSmiles("C" * 1001)
>>> count1(mol)
1000
Quoting from the documentation:
In high-symmetry cases with medium-sized molecules, it is very
easy to end up with a combinatorial explosion in the number of
possible matches. This argument prevents that from having
unintended consequences
Instead, this is a better general-purpose SMART matcher count function:
def count3(mol):
return len(mol.GetSubstructMatches(pat, maxMatches=mol.GetNumAtoms()))
>>> count3(mol)
1001
Andrew
da...@da...
|