[Rdkit-discuss] Cookbook Contribution: Batch Fetch from PubChem + RDKit Visualization
From: ang Ho <hex...@gm...> - 2025-09-20 15:06:10
Hi RDKit maintainers,

I would like to contribute a new entry to the RDKit Cookbook that demonstrates a streamlined workflow for fetching chemical data from PubChem and visualizing it with RDKit.

The example shows how ChemInformant (a PubChem data acquisition library) and RDKit fit together to cover a common need: turning chemical identifiers into molecular depictions. ChemInformant handles the PubChem API interactions, identifier resolution, network retries, and data validation, while RDKit handles molecular processing and visualization.

Key benefits of this integration:

- Demonstrates a real-world data acquisition workflow
- Shows how to handle mixed identifier types (names, CIDs, SMILES)
- Illustrates error handling and batch processing
- Provides a complete pipeline from data fetching to visualization

The example requires ChemInformant as a dependency (pip install ChemInformant). I believe this adds value by showing users a practical approach to PubChem data integration.

Here is the content in .rst format. Please let me know if any changes are needed.

Thanks!

Best regards,
Zhiang He (HzaCode)

--- RST CONTENT BELOW ---

Batch Fetch from PubChem + RDKit Visualization
================================================

| Author: Zhiang He (HzaCode)
| Original Source: https://github.com/HzaCode/ChemInformant
| Index ID#: RDKitCB_41
| Summary: Demonstrates a streamlined workflow for fetching chemical data from PubChem and visualizing it with RDKit. Uses ChemInformant for data acquisition, then processes the molecules with RDKit to produce an annotated grid image.
| Dependencies: This example requires ChemInformant (``pip install ChemInformant``).

.. testcode:: RDKitCB_41

   from rdkit import Chem
   from rdkit.Chem import Draw, Descriptors
   from rdkit.Chem.Draw import IPythonConsole
   import ChemInformant as ci

   IPythonConsole.ipython_useSVG = True

   # Example compound identifiers (names, CIDs, or SMILES)
   identifiers = ["aspirin", "caffeine", "2244"]  # mixed identifier types

   # Fetch molecular data from PubChem using ChemInformant.
   # This handles identifier resolution, network retries, and caching automatically.
   df = ci.get_properties(identifiers, ["canonical_smiles", "molecular_weight", "iupac_name"])
   print("Fetched data:")
   print(df[["input_identifier", "canonical_smiles", "molecular_weight"]].head())

   # Convert to RDKit molecules, skipping rows that failed to resolve or parse
   molecules = []
   valid_names = []
   for _, row in df.iterrows():
       if row["status"] == "OK" and row["canonical_smiles"]:
           mol = Chem.MolFromSmiles(row["canonical_smiles"])
           if mol:
               # Add atom indices as atom map numbers for visualization
               for atom in mol.GetAtoms():
                   atom.SetAtomMapNum(atom.GetIdx())
               molecules.append(mol)
               valid_names.append(row["input_identifier"])

   # Create legends with molecular weight information (computed by RDKit)
   legends = []
   for name, mol in zip(valid_names, molecules):
       mw = Descriptors.MolWt(mol)
       legends.append(f"{name}: MW={mw:.1f}")

   # Generate the annotated molecular grid; the bare ``img`` displays it in a notebook
   img = Draw.MolsToGridImage(molecules, legends=legends, subImgSize=(250, 250))
   img

.. testoutput:: RDKitCB_41

   Fetched data:
     input_identifier              canonical_smiles  molecular_weight
   0          aspirin      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
   1         caffeine  CN1C=NC2=C1C(=O)N(C(=O)N2C)C            194.19
   2             2244      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
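In the example output above, ``aspirin`` and ``2244`` resolve to the same SMILES, so the grid would draw the same structure twice. One possible refinement is to merge such rows before drawing. The following is a minimal sketch of that idea, not part of ChemInformant or RDKit: it works on plain dicts shaped like the rows of the DataFrame above (for instance from ``df.to_dict("records")``), so it runs without a network connection. The helper names ``group_by_structure`` and ``build_legends``, the extra ``not-a-compound`` row, and the ``"NOT_FOUND"`` status value are hypothetical and used only for illustration. It also uses the PubChem-reported ``molecular_weight`` instead of recomputing the weight with RDKit.

```python
# Rows copied from the example output above. The last row and its
# "NOT_FOUND" status are hypothetical, added to exercise the skip path.
rows = [
    {"input_identifier": "aspirin", "status": "OK",
     "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O", "molecular_weight": 180.16},
    {"input_identifier": "caffeine", "status": "OK",
     "canonical_smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "molecular_weight": 194.19},
    {"input_identifier": "2244", "status": "OK",
     "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O", "molecular_weight": 180.16},
    {"input_identifier": "not-a-compound", "status": "NOT_FOUND",
     "canonical_smiles": None, "molecular_weight": None},
]


def group_by_structure(rows):
    """Group identifiers by SMILES; return (groups, skipped identifiers)."""
    groups = {}   # SMILES -> {"names": [...], "mw": float}, in first-seen order
    skipped = []  # identifiers that did not resolve to a structure
    for row in rows:
        smiles = row.get("canonical_smiles")
        if row.get("status") != "OK" or not smiles:
            skipped.append(row["input_identifier"])
            continue
        entry = groups.setdefault(
            smiles, {"names": [], "mw": row["molecular_weight"]}
        )
        entry["names"].append(row["input_identifier"])
    return groups, skipped


def build_legends(groups):
    """Build one legend per distinct structure, listing every matching name."""
    return [
        f"{' / '.join(entry['names'])}: MW={entry['mw']:.1f}"
        for entry in groups.values()
    ]


groups, skipped = group_by_structure(rows)
legends = build_legends(groups)
print(legends)
print("Skipped:", skipped)
```

The keys of ``groups`` could then be passed to ``Chem.MolFromSmiles`` in place of the per-row SMILES, with ``legends`` supplied to ``Draw.MolsToGridImage`` as in the example above, so that each distinct structure appears once.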