[Rdkit-discuss] Cookbook Contribution: Batch Fetch from PubChem + RDKit Visualization

Hi RDKit maintainers,

I would like to contribute a new entry to the RDKit Cookbook that
demonstrates a streamlined workflow for fetching chemical data from PubChem
and visualizing it with RDKit.

This example combines ChemInformant, a PubChem data acquisition library,
with RDKit to cover a common task: turning chemical identifiers into
molecular depictions. ChemInformant handles the PubChem API calls,
identifier resolution, network retries, and data validation; RDKit handles
molecule processing and drawing.

Key benefits of this integration:
- Demonstrates real-world data acquisition workflows
- Shows how to handle mixed identifier types (names, CIDs, SMILES)
- Illustrates robust error handling and batch processing
- Provides a complete pipeline from data fetching to visualization

The example requires ChemInformant as a dependency (pip install
ChemInformant), which I believe adds value by showing users a practical,
production-ready approach to PubChem data integration.

Here is the content in .rst format. Please let me know if any changes are
needed.

Thanks!

Best regards,
Zhiang He (HzaCode)

--- RST CONTENT BELOW ---

Batch Fetch from PubChem + RDKit Visualization
================================================

| Author: Zhiang He (HzaCode)
| Original Source: https://github.com/HzaCode/ChemInformant
| Index ID#: RDKitCB_41
| Summary: Demonstrates a streamlined workflow for fetching chemical data
  from PubChem and visualizing it with RDKit. Uses ChemInformant for data
  acquisition, then processes molecules with RDKit for annotated
  visualization.

Dependencies: This example requires ChemInformant (``pip install ChemInformant``).

.. testcode:: RDKitCB_41

   from rdkit import Chem
   from rdkit.Chem import Draw, Descriptors
   from rdkit.Chem.Draw import IPythonConsole
   import ChemInformant as ci

   IPythonConsole.ipython_useSVG = True

   # Example compound identifiers (names, CIDs, or SMILES)
   identifiers = ["aspirin", "caffeine", "2244"]  # mixed identifier types

   # Fetch molecular data from PubChem using ChemInformant
   # This handles identifier resolution, network retries, and caching
   # automatically
   df = ci.get_properties(
       identifiers, ["canonical_smiles", "molecular_weight", "iupac_name"]
   )

   print("Fetched data:")
   print(df[["input_identifier", "canonical_smiles", "molecular_weight"]].head())

   # Convert to RDKit molecules
   molecules = []
   valid_names = []

   for idx, row in df.iterrows():
       if row["status"] == "OK" and row["canonical_smiles"]:
           mol = Chem.MolFromSmiles(row["canonical_smiles"])
           if mol:
               # Add atom indices as atom map numbers for visualization
               for atom in mol.GetAtoms():
                   atom.SetAtomMapNum(atom.GetIdx())
               molecules.append(mol)
               valid_names.append(row["input_identifier"])

   # Create legends with molecular weights computed by RDKit
   legends = [
       f"{name}: MW={Descriptors.MolWt(mol):.1f}"
       for name, mol in zip(valid_names, molecules)
   ]

   # Generate annotated molecular grid
   img = Draw.MolsToGridImage(molecules, legends=legends, subImgSize=(250, 250))
   img

.. testoutput:: RDKitCB_41

   Fetched data:
     input_identifier              canonical_smiles  molecular_weight
   0          aspirin      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
   1         caffeine  CN1C=NC2=C1C(=O)N(C(=O)N2C)C            194.19
   2             2244      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
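
The loop above silently drops any row whose ``status`` is not ``"OK"``. A minimal sketch of one way to report those rows instead is shown here. It uses plain dictionaries shaped like ``df.to_dict("records")`` so that it runs without network access. The helper name ``split_by_status`` and the ``"NOT_FOUND"`` status string are hypothetical and used for illustration only; check the ChemInformant documentation for the status values it actually returns.

```python
def split_by_status(records):
    """Separate rows that can be drawn from rows that should be reported."""
    usable, skipped = [], []
    for row in records:
        # Same condition as the Cookbook loop: status OK and a non-empty SMILES
        if row.get("status") == "OK" and row.get("canonical_smiles"):
            usable.append(row)
        else:
            skipped.append(row)
    return usable, skipped


# Hypothetical rows; the failing status string is made up for illustration.
records = [
    {"input_identifier": "aspirin", "status": "OK",
     "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    {"input_identifier": "not-a-compound", "status": "NOT_FOUND",
     "canonical_smiles": None},
]

usable, skipped = split_by_status(records)
for row in skipped:
    print(f"Skipped {row['input_identifier']}: status={row['status']}")
```

Reporting the skipped identifiers makes it easier to see why the grid image contains fewer molecules than the input list.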
