Thread: [Rdkit-discuss] identify isomers using canonical SMILES

[Rdkit-discuss] identify isomers using canonical SMILES

From: Cheng W. <che...@ho...> - 2009-02-21 22:54:38

Hi,

I am new to RDKit and have just started reading the Getting Started
document and the manual.  I want to provide a list of species as SMILES
and have RDKit identify all the isomers among them.  The species include
hydrocarbons and aromatics, to name a few.  I would like to know whether
this is doable in RDKit (my initial impression is yes).  If so, could
someone give me some hints on how to do this, or point me to the right
place in the user manual?

Thanks,
Cheng

Re: [Rdkit-discuss] identify isomers using canonical SMILES

From: Greg L. <gre...@gm...> - 2009-02-22 11:54:52

Dear Cheng,

On Sat, Feb 21, 2009 at 11:54 PM, Cheng Wang <che...@ho...> wrote:
>
> I am new to RDKit and have just started reading the Getting Started
> document and the manual.  I want to provide a list of species as SMILES
> and have RDKit identify all the isomers among them.  The species include
> hydrocarbons and aromatics, to name a few.  I would like to know whether
> this is doable in RDKit (my initial impression is yes).  If so, could
> someone give me some hints on how to do this, or point me to the right
> place in the user manual?

I'm not completely sure what you mean by "all the isomers". Can you
provide a little more specific information about what you'd like
to do?

-greg

Re: [Rdkit-discuss] identify isomers using canonical SMILES

From: Cheng W. <che...@ho...> - 2009-02-22 20:33:49

Dear Greg,

Sorry that I was not very clear in my first e-mail.  Here is what I want
to achieve.

Nowadays we use large detailed mechanisms to study combustion behavior.
These mechanisms normally involve hundreds (sometimes over 1000) of
species, including many large hydrocarbons (more than 6 carbons).
Because some of these mechanisms are generated semi-automatically, they
include reaction pathways for many isomers.  One way to make the
simulation run faster is to reduce the mechanism by creating
pseudo-species that each represent all isomers of the same species
family.  The reaction pathways involving these isomers are then combined
through a lumping process.  My plan is to use RDKit to identify the
isomers among the species.

Thanks,
Cheng

_________________________________________________________________
Access your email online and on the go with Windows Live Hotmail.
http://windowslive.com/online/hotmail?ocid=TXT_TAGLM_WL_HM_AE_Access_022009

Re: [Rdkit-discuss] identify isomers using canonical SMILES

From: Greg L. <gre...@gm...> - 2009-02-23 15:49:05

Dear Cheng,

On Sun, Feb 22, 2009 at 9:33 PM, Cheng Wang <che...@ho...> wrote:
>
> Nowadays we use large detailed mechanisms to study combustion behavior.
> These mechanisms normally involve hundreds (sometimes over 1000) of
> species, including many large hydrocarbons (more than 6 carbons).
> Because some of these mechanisms are generated semi-automatically, they
> include reaction pathways for many isomers.  One way to make the
> simulation run faster is to reduce the mechanism by creating
> pseudo-species that each represent all isomers of the same species
> family.  The reaction pathways involving these isomers are then combined
> through a lumping process.  My plan is to use RDKit to identify the
> isomers among the species.

Ok, I think I have it now. You have a set of molecules and you would
like to group together ones that have the same chemical formula.

Somehow it has happened that the RDKit does not have a function to
generate the chemical formula for a molecule, so one would need to
write it from scratch. Here's a simple (and relatively untested) way
of doing this:

#----------------------------
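import collections
from rdkit import Chem

def ChemicalFormula(mol):
  """ A molecule's chemical formula (elements in alphabetical order)

  >>> ChemicalFormula(Chem.MolFromSmiles('CC'))
  'C2H6'
  >>> ChemicalFormula(Chem.MolFromSmiles('C(=O)O'))
  'CH2O2'
  >>> ChemicalFormula(Chem.MolFromSmiles('C(=O)[O-]'))
  'CHO2'
  >>> ChemicalFormula(Chem.MolFromSmiles('C(=O)'))
  'CH2O'

  """
  # Tally each atom in the graph plus the hydrogens attached to it
  cnts=collections.defaultdict(int)
  for atom in mol.GetAtoms():
    symb = atom.GetSymbol()
    hs = atom.GetTotalNumHs()
    cnts[symb]+=1
    # Only touch 'H' when there is something to add; otherwise a
    # hydrogen-free molecule would pick up a spurious 'H' key
    if hs:
      cnts['H']+=hs
  # sorted() works whether keys() returns a list or a view
  ks = sorted(cnts)
  res=''
  for k in ks:
    res+=k
    # A count of 1 is left implicit
    if cnts[k]>1:
      res+=str(cnts[k])
  return res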
#----------------------------

For your purposes, this could be simplified a bit since you don't
really need the result as a string, but assuming I understood what you
want to do correctly, this should get you started.
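
As a rough sketch of that simplification: the element counts themselves
can serve as the grouping key, with no string formatting at all. The
example below is plain Python and makes no RDKit calls. The input
dictionary `species_counts` and the helper names `group_by_composition`
and `isomer_families` are made up for illustration; in practice the
counts would come from the atom loop in ChemicalFormula above.

```python
from collections import defaultdict

# Hypothetical input: species name -> element counts, e.g. as tallied
# by the atom loop in ChemicalFormula above.
species_counts = {
    "n-butane": {"C": 4, "H": 10},
    "isobutane": {"C": 4, "H": 10},
    "1-butene": {"C": 4, "H": 8},
    "cyclobutane": {"C": 4, "H": 8},
    "benzene": {"C": 6, "H": 6},
}

def group_by_composition(counts_by_name):
    """Map each composition key to the sorted names that share it."""
    groups = defaultdict(list)
    for name, counts in counts_by_name.items():
        # A sorted tuple of (element, count) pairs is hashable and does
        # not depend on the order in which the atoms were visited.
        key = tuple(sorted(counts.items()))
        groups[key].append(name)
    return {key: sorted(names) for key, names in groups.items()}

def isomer_families(counts_by_name):
    """Keep only the groups that contain more than one species."""
    grouped = group_by_composition(counts_by_name)
    return [names for names in grouped.values() if len(names) > 1]
```

Here `isomer_families(species_counts)` would return the two butane
isomers as one family and the two C4H8 species as another, while benzene
stays on its own. Grouping by formula alone lumps together everything
with the same composition, so whether that is the right notion of
"species family" for a given mechanism is a separate decision. Note also
that RDKit releases newer than this thread include
`rdMolDescriptors.CalcMolFormula`, which may be a simpler source for the
key.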

Regards,
-greg

Re: [Rdkit-discuss] identify isomers using canonical SMILES

From: Cheng W. <che...@ho...> - 2009-02-23 18:18:05

Dear Greg,

Thanks for the suggestions.  I will try them out.

Sincerely,
Cheng

_________________________________________________________________
It’s the same Hotmail®. If by “same” you mean up to 70% faster. 
http://windowslive.com/online/hotmail?ocid=TXT_TAGLM_WL_HM_AE_Same_022009

Re: [Rdkit-discuss] identify isomers using canonical SMILES

From: Andrew D. <da...@da...> - 2009-02-23 18:35:13

On Feb 23, 2009, at 4:48 PM, Greg Landrum wrote:
> Somehow it has happened that the RDKit does not have a function to
> generate the chemical formula for a molecule, so one would need to
> write it from scratch. Here's a simple (and relatively untested) way
> of doing this:

Ideally it would generate the Hill formula,

>   ks = cnts.keys()
>   ks.sort()
>   res=''
>   for k in ks:
>     res+=k
>     if cnts[k]>1:
>       res+=str(cnts[k])

would be more like:

   # Alphabetize everything (sorted() also works where keys()
   # returns a view that has no sort() method)
   ks = sorted(cnts)

   # Put into Hill order (C then H then everything else)
   if "C" in cnts:
     ks.remove("C")
     ks.insert(0, "C")
     if "H" in cnts:
       ks.remove("H")
       ks.insert(1, "H")

   ...

There are other solutions which are more efficient for large N, but
there are only about 100 elements in the world, and only a handful in
most compounds.
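
Put together, the reordering and the string building might look like the
sketch below. It takes a plain element -> count mapping (such as the
`cnts` dictionary above) rather than a molecule, so it needs no RDKit
calls; the function name `hill_formula` is made up for illustration.

```python
def hill_formula(cnts):
    """Format an element -> count mapping as a Hill-order formula."""
    # Alphabetize everything
    ks = sorted(cnts)

    # Put into Hill order: C, then H, then everything else.
    # Without carbon, hydrogen stays in its alphabetical position.
    if "C" in cnts:
        ks.remove("C")
        ks.insert(0, "C")
        if "H" in cnts:
            ks.remove("H")
            ks.insert(1, "H")

    # A count of 1 is left implicit
    return "".join(k + (str(cnts[k]) if cnts[k] > 1 else "") for k in ks)
```

For example, `hill_formula({"Br": 1, "C": 1, "H": 3})` gives `'CH3Br'`,
while a carbon-free input such as `{"H": 1, "Cl": 1}` stays alphabetical
and gives `'ClH'`.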


				Andrew
				da...@da...