Re: [Rdkit-discuss] calculating molecular properties on a Pandas dataframe Molecule


Hi,

Thanks for your response.

The problem is that I’d like to split a Pandas dataframe into chunks and send them to different processors and, as efficiently as possible, remove those rows which fail to be converted into RDKit Mols.  What I find, however, is that the entire process dies if PandasTools fails to convert a SMILES to a Mol.  Sending individual rows (chunk size = 1) would ensure that a failure in one row cannot affect the “good” molecules, since they would be in separate dataframes.  But this isn’t very efficient for Pool; I’d rather split the dataframe into chunks of 5-10%.

So the question is: how do I catch failed compounds within a dataframe and still write something out in the new fields (like adding None to ROMol and HAC)?

Does that make sense?  Sorry if this isn’t very clear.
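One way to approach the question above is to make the per-row conversion tolerant of failures, so that a bad SMILES yields None in ROMol and NaN in HAC instead of stopping the whole chunk. The sketch below is only an illustration under stated assumptions: the column name "Smiles", the helper names (safe_mol, safe_hac, split_frame, annotate_chunk, parallel_annotate) and the chunk and process counts are hypothetical and not from this thread. It uses Chem.MolFromSmiles directly rather than PandasTools.

```python
import math
from multiprocessing import Pool

import pandas as pd
from rdkit import Chem, RDLogger
from rdkit.Chem import Lipinski

# Optional: silence RDKit's parse-error messages for bad SMILES.
RDLogger.DisableLog("rdApp.error")


def safe_mol(smiles):
    """Return an RDKit Mol, or None if the input is missing or unparsable."""
    if not isinstance(smiles, str):
        return None
    return Chem.MolFromSmiles(smiles)  # returns None on a parse failure


def safe_hac(mol):
    """Heavy atom count, or NaN when there is no molecule."""
    return Lipinski.HeavyAtomCount(mol) if mol is not None else float("nan")


def annotate_chunk(chunk, smiles_col="Smiles"):
    """Add ROMol and HAC columns to one chunk without raising on bad rows."""
    out = chunk.copy()
    out["ROMol"] = [safe_mol(s) for s in out[smiles_col]]
    out["HAC"] = [safe_hac(m) for m in out["ROMol"]]
    return out


def split_frame(df, n_chunks):
    """Split df into at most n_chunks row-wise pieces of roughly equal size."""
    size = max(1, math.ceil(len(df) / n_chunks))
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallel_annotate(df, n_chunks=10, processes=4):
    """Annotate chunks in worker processes and reassemble in original order."""
    if df.empty:
        return annotate_chunk(df)
    with Pool(processes) as pool:
        parts = pool.map(annotate_chunk, split_frame(df, n_chunks))
    return pd.concat(parts)
```

parallel_annotate(df) should be called from under an if __name__ == "__main__": guard so that worker processes can start cleanly on every platform. Failed rows stay in the result with ROMol set to None, so they can be reported or dropped afterwards in a single step.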

Cheers,

mike

From: Greg Landrum <gre...@gm...> 
Sent: 01 November 2019 10:40
To: Mike Mazanetz <mi...@no...>; RDKit Discuss <rdk...@li...>
Subject: Re: [Rdkit-discuss] calculating molecular properties on a Pandas dataframe Molecule

What I'm failing to understand here is what you want to do.

Do you want the rows with molecules that failed to parse to remain in the DataFrame?

If not you can just remove them (there's probably a simpler way to do this, but Pandas never fails to surprise me):

filtered_df = df[df['ROMol'].astype(str).ne('None')]   
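A possibly simpler variant of the filter above, offered as a sketch rather than a recommendation: pandas treats None as a missing value, so notna() or dropna() can select the rows that parsed. This assumes failed parses are stored as None in the ROMol column; the small example frame and its "Smiles" column name are made up for illustration.

```python
import pandas as pd
from rdkit import Chem, RDLogger

# Optional: silence RDKit's parse-error messages for bad SMILES.
RDLogger.DisableLog("rdApp.error")

# Hypothetical input: the second SMILES is deliberately invalid.
df = pd.DataFrame({"Smiles": ["CCO", "not_a_smiles", "c1ccccc1"]})
df["ROMol"] = [Chem.MolFromSmiles(s) for s in df["Smiles"]]

# Keep only the rows whose molecule parsed (ROMol is not None).
filtered_df = df[df["ROMol"].notna()]

# Equivalent selection using dropna on the molecule column.
same_result = df.dropna(subset=["ROMol"])
```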

-greg

On Thu, Oct 31, 2019 at 11:32 AM Mike Mazanetz <mi...@no...> wrote:

Hi Taka and Jan,

Thanks for your help.

I worked out that I shouldn’t have added names=[] when I read in my csv file (whoops).

It fails if you have a mol which is None, I’ll have to add a line asking it to check that ROMol isn’t None first.  Annoying.

Thanks for your help,

mike

From: Taka Seri <ser...@gm...>
Sent: 31 October 2019 10:15
To: Jan Halborg Jensen <jhj...@ch...>
Cc: Mike Mazanetz <mi...@no...>; RDKit Discuss <rdk...@li...>
Subject: Re: [Rdkit-discuss] calculating molecular properties on a Pandas dataframe Molecule

Hi,

The Pandas apply function will work too.

First call AddMoleculeColumnToFrame(DF, "Smiles").

With the default settings, the RDKit Mol objects are added to a "ROMol" column in your dataframe.

https://www.rdkit.org/docs/source/rdkit.Chem.PandasTools.html

Then call the apply function on the ROMol column to run a calculation function on each molecule.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

 DF['HAC'] = DF["ROMol"].apply(Chem.Lipinski.HeavyAtomCount)
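Put together, the two steps above might look like the following self-contained sketch. The three example SMILES and the "Smiles" column name are assumptions made for illustration; all of them parse cleanly, so the case of a None molecule does not arise here.

```python
import pandas as pd
from rdkit.Chem import Lipinski, PandasTools

# Hypothetical input frame: ethanol, benzene, acetic acid.
DF = pd.DataFrame({"Smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})

# Step 1: build the molecule column (named "ROMol" by default).
PandasTools.AddMoleculeColumnToFrame(DF, "Smiles")

# Step 2: apply the calculation to each molecule in the column.
DF["HAC"] = DF["ROMol"].apply(Lipinski.HeavyAtomCount)
```

Importing Lipinski explicitly avoids relying on Chem.Lipinski having been loaded as a side effect of another import.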

Best regards,

Taka

On Thu, 31 Oct 2019 at 18:30, Jan Halborg Jensen <jhj...@ch...> wrote:

Hi Mike

This should work

DF['HAC'] = [Chem.Lipinski.HeavyAtomCount(mol) for mol in DF['Molecule']]

Best regards, Jan

On 31 Oct 2019, at 10.16, Mike Mazanetz <mi...@no...> wrote:

Hi RDKit Gurus,

I’ve followed the docs and created a molecule column in my Pandas dataframe.

However, I do not seem to be able to do molecular operations on the column.

For example, if you had a SMILES column, how would you calculate heavy atom count and append this result to a new column?

This doesn’t work:

DF['HAC'] = Chem.Lipinski.HeavyAtomCount(DF['Molecule'])

Where the Molecule column is generated by PandasTools.AddMoleculeColumnToFrame

Thanks,

mike

_______________________________________________
Rdkit-discuss mailing list
Rdk...@li...
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

