rdkit-discuss Mailing List for RDKit
Open-Source Cheminformatics and Machine Learning
                
Brought to you by: glandrum
      
      
      From: Andrew D. <da...@da...> - 2025-09-24 14:04:46
      
     | 
| Hi RDKit users,
  I've released chemfp 5.0, my Python package for cheminformatics
fingerprint generation, search, and analysis. You can install it on
Linux-based OSes using:
    python -m pip install chemfp -i https://chemfp.com/packages/
(Append "--upgrade" if you have already installed it.)
For a description of the changes since 4.2 see
  https://chemfp.com/docs/whats_new_in_50.html .
The highlights are:
 • Update the FPB format to handle over 1 billion fingerprints.
 
 • New chemfp shardsearch command-line tool which does similarity
    search across multiple target files and merges the result.
   - Tested with the 977 million structures in GDB-13
 
 • New chemfp simhistogram / chemfp simhist command-line tool and
    corresponding chemfp.simhistogram() high-level API function
    to create a histogram of similarity scores.
 
 • Initial support for count fingerprints:
   - new text-based FPC format based on the FPS format
   - rdkit2fpc tool which uses RDKit's sparse fingerprint generators
   - fpc2fps tool with various methods to convert sparse count
       fingerprints to binary fingerprints
 
 • Fast implementations of the 4860-bit Klekota-Roth fingerprint
    for the OpenEye and RDKit toolkits.
Cheers,
 
				Andrew Dalke
				da...@da...
--
Have useful but old in-house cheminformatics software in need of refurbishment?
No one left knows how it works or has the time? Perhaps I can help. Contact me.
 | 
| 
      
      
      From: ang Ho <hex...@gm...> - 2025-09-20 15:06:10
      
     | 
| Hi RDKit maintainers,

I would like to contribute a new entry to the RDKit Cookbook that demonstrates a streamlined workflow for fetching chemical data from PubChem and visualizing it with RDKit.

This example showcases the integration between ChemInformant (a robust PubChem data acquisition library) and RDKit, addressing a common workflow need: efficiently converting chemical identifiers to molecular visualizations. ChemInformant handles the complexity of PubChem API interactions, identifier resolution, network reliability, and data validation, while RDKit provides the powerful molecular processing and visualization capabilities.

Key benefits of this integration:
- Demonstrates real-world data acquisition workflows
- Shows how to handle mixed identifier types (names, CIDs, SMILES)
- Illustrates robust error handling and batch processing
- Provides a complete pipeline from data fetching to visualization

The example requires ChemInformant as a dependency (pip install ChemInformant), which I believe adds value by showing users a practical, production-ready approach to PubChem data integration.

Here is the content in .rst format. Please let me know if any changes are needed.

Thanks!

Best regards,
Zhiang He (HzaCode)

--- RST CONTENT BELOW ---

Batch Fetch from PubChem + RDKit Visualization
================================================

Author: Zhiang He (HzaCode)
Original Source: https://github.com/HzaCode/ChemInformant
Index ID#: RDKitCB_41
Summary: Demonstrates a streamlined workflow for fetching chemical data from PubChem and visualizing it with RDKit. Uses ChemInformant for robust data acquisition, then processes molecules with RDKit for annotated visualization.
Dependencies: This example requires ChemInformant (``pip install ChemInformant``)

.. testcode:: RDKitCB_41

   from rdkit import Chem
   from rdkit.Chem import Draw, Descriptors
   from rdkit.Chem.Draw import IPythonConsole
   import ChemInformant as ci
   IPythonConsole.ipython_useSVG = True

   # Example compound identifiers (names, CIDs, or SMILES)
   identifiers = ["aspirin", "caffeine", "2244"]  # mixed identifier types

   # Fetch molecular data from PubChem using ChemInformant
   # This handles identifier resolution, network retries, and caching automatically
   df = ci.get_properties(identifiers, ["canonical_smiles", "molecular_weight", "iupac_name"])

   print("Fetched data:")
   print(df[["input_identifier", "canonical_smiles", "molecular_weight"]].head())

   # Convert to RDKit molecules
   molecules = []
   valid_names = []
   for idx, row in df.iterrows():
       if row["status"] == "OK" and row["canonical_smiles"]:
           mol = Chem.MolFromSmiles(row["canonical_smiles"])
           if mol:
               # Add atom indices as atom map numbers for visualization
               for atom in mol.GetAtoms():
                   atom.SetAtomMapNum(atom.GetIdx())
               molecules.append(mol)
               valid_names.append(row["input_identifier"])

   # Create legends with molecular weight information
   legends = []
   for i, name in enumerate(valid_names):
       mw = Descriptors.MolWt(molecules[i])
       legends.append(f"{name}: MW={mw:.1f}")

   # Generate annotated molecular grid
   img = Draw.MolsToGridImage(molecules, legends=legends, subImgSize=(250, 250))
   img

.. testoutput:: RDKitCB_41

   Fetched data:
     input_identifier              canonical_smiles  molecular_weight
   0          aspirin      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
   1         caffeine  CN1C=NC2=C1C(=O)N(C(=O)N2C)C            194.19
   2             2244      CC(=O)OC1=CC=CC=C1C(=O)O            180.16
 | 
| 
      
      
      From: Andrew D. <da...@da...> - 2025-09-04 16:40:13
      
     | 
| Hi all,

RDKit implements Tanimoto similarity for count fingerprints. Only last week did I realize there has been a change in what "Tanimoto similarity" means for count fingerprints, and RDKit seems to be the reason for the shift. I'm curious to know the history.

* Tanimoto #1 is Σaᵢbᵢ/(Σaᵢ²+Σbᵢ²-Σaᵢbᵢ), that is, it interprets count fingerprints as a vector.

The oldest citation I have is Bawden, "Browsing and Clustering of Chemical Structures" on p147 of "Chemical structures" (1988) from the first ICCS. A more accessible citation is Willett, "Chemical Similarity Searching" JCICS (1998) 38, 983-996 available at https://web.archive.org/web/20040218213916/http://www-personal.engin.umich.edu:80/~wildd/che697/willett98.pdf . See page 987, the "formula for continuous values" under "Tanimoto Coefficient".

My literature search shows it was the main definition for almost 30 years.

* Tanimoto #2 is Σmin(aᵢ,bᵢ)/Σmax(aᵢ,bᵢ), that is, what Wikipedia calls the "weighted Jaccard similarity."

This is what RDKit uses. It was committed to Code/DataStructs/SparseIntVect.h on 2009-Jun-18, as part of adding Tversky similarity, and a couple of years after adding Dice similarity.

I believe that as a result of RDKit's popularity, recent papers have taken to describing this as, for example, "the counted Tanimoto similarity", as in https://jcheminf.biomedcentral.com/articles/10.1186/s13321-025-01081-6 ("also known as the multiset coefficient calculation").

Does anyone here know how RDKit came to be the way it is?

In my literature search, I believe the similarity function for Tanimoto #2 was first proposed by Henry Allan Gleason, "Some Applications of the Quadrat Method", Bulletin of the Torrey Botanical Club, Vol. 47, No. 1 (Jan., 1920), pp. 21-33, starting on page 31 where he proposes adding species abundance to Jaccard's similarity. See https://archive.org/details/jstor-2480223/page/n11/mode/2up

Some people (and https://en.wikipedia.org/wiki/Jaccard_index) refer to this as Ruzicka similarity, from Ruzicka (1958), but on the Mastodon discussion at https://mstdn.science/@molecule/115142680945701031 you'll see that wim (@mol...@ms...) got a copy of the relevant part of Ruzicka's paper, and it appears to be identical to Gleason's extension to Jaccard similarity -- not even in the cool looking min/max formulation as attributed in, eg, https://archive.org/details/dictionaryofdist0000deza/mode/2up?q=Ruzicka .

The first paper which applied Tanimoto #2 to fingerprints appears to be Swamidass et al., "Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity", Bioinformatics, Volume 21, Issue suppl_1, June 2005, Pages i359–i368, https://doi.org/10.1093/bioinformatics/bti1055 where they call it the "MinMax" kernel and explicitly compare it to Tanimoto #1. Some papers since then refer to Tanimoto #2 as MinMax.

Now, I was able to find a use of (1-Tanimoto #2) as a similarity measure ("measure" used in its mathematical meaning) in Thomas Ott, Albert Kern, Ansgar Schuffenhauer, Maxim Popov, Pierre Acklin, Edgar Jacoby, and Ruedi Stoop, "Sequential Superparamagnetic Clustering for Unbiased Classification of High-Dimensional Chemical Data", J. Chem. Inf. Comput. Sci. 2004, 44, 1358-1364 available from https://tilde.ini.uzh.ch/users/tott/public_html/jcheminf.pdf but it is unnamed -- and a measure, not a similarity.

That makes me quite curious about how RDKit ended up the way it did.

To be clear, I prefer the similarity function given in #2 over that of #1, though I think having two "Tanimoto" definitions is going to be confusing. If only the Sheffield folks back in the 1980s had known. But hey, that's how we ended up with "Tanimoto" instead of "Jaccard". :)

Best regards,
Andrew
da...@da...

P.S. If anyone knows of an older citation, please let me know. There aren't good search tools for finding this formula, so it's a lot of tedious manual work.
 | 
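To make the difference between the two definitions in the message above concrete, here is a minimal pure-Python sketch. It assumes count fingerprints are stored as plain dicts mapping feature id to count; the function names and the choice to return 0.0 for two empty fingerprints are illustrative assumptions, not RDKit's API or behaviour.

```python
def tanimoto_vector(a, b):
    """Tanimoto #1: sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2) - sum(a_i*b_i))."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    denom = sum(c * c for c in a.values()) + sum(c * c for c in b.values()) - dot
    # Assumption: two empty fingerprints score 0.0
    return dot / denom if denom else 0.0


def tanimoto_minmax(a, b):
    """Tanimoto #2 (MinMax / weighted Jaccard): sum(min) / sum(max)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Assumption: two empty fingerprints score 0.0
    return num / den if den else 0.0


# Hypothetical count fingerprints: {feature id: count}
fp_a = {1: 2, 2: 1}
fp_b = {1: 1, 2: 1, 3: 3}

print(tanimoto_vector(fp_a, fp_b))  # 3/13
print(tanimoto_minmax(fp_a, fp_b))  # 2/6
```

When every count is 0 or 1 the two functions return the same value, which is why the distinction only shows up for count fingerprints.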
| 
      
      
      From: Andrew D. <da...@da...> - 2025-08-22 16:15:16
      
     | 
| Is anyone here interested in evaluating my new method to emulate count fingerprints using binary fingerprints? I've added that feature to chemfp5.0b2, released yesterday, but I don't have the expertise to evaluate its effectiveness.

In short, for most Linux-based OSes, install chemfp, generate count fingerprints, and convert count fingerprints to binary fingerprints using the following steps:

    python -m pip install chemfp==5.0b2 -i https://chemfp.com/packages/
    chemfp rdkit2fpc dataset.sdf.gz -o dataset.fpc
    chemfp fpc2fps dataset.fpc -o dataset.fps

then use chemfp's "simsearch" for similarity search of the FPS (or FPB) files, like:

    simsearch --query 'c1ccccc1O' -k 5 --out csv dataset.fps

The "--help" output for these commands is documented at https://chemfp.com/docs/tool_help.html .

The "FPC" format is my new text-based exchange format for count fingerprints, described at https://chemfp.com/fpc_format/ .

Here's some background.

RDKit supports several count fingerprints (Morgan, RDKit fingerprints, Atom Pair, and Torsion). These can be viewed as a list of (feature id, count) pairs.

By default RDKit converts these into binary fingerprints by folding the feature id, that is, setting the binary fingerprint bit i to 1, where i = (feature id) modulo fpSize. This method ignores the counts.

These fingerprint generators also implement a countSimulation method, which sets additional bits based on count thresholds. For example, if the countBounds is 1,3,9 then it sets 1 bit if the count is at least 1, two bits if the count is at least 3, and three bits if the count is at least 9. (The actual algorithm is a bit more complicated than this.)

I've come up with a new method which is a cross between Calvin Mooers' superimposed coding and the Daylight RNG approach.

It's based on the observation that Morgan fingerprints are typically quite sparse, eg, for Morgan3 count fingerprints from ChEMBL 33 the average fingerprint has 71 distinct features, with an average feature count of 1.5.
That means there are on average 107 distinct possible bits to set in the output binary fingerprint, assuming each count sets 1 bit, eg, that feature 2246728737 with count 2 can set 2 bits.

But how to choose those bits?

My new method uses the feature id to seed an RNG, which is then used to get `count` output bit positions, randomly chosen from the output fingerprint size.

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

There are a couple of tunable parameters: 1) the output fingerprint size, 2) the number of bits to set for each count, and 3) an upper bound for the feature count, so the full algorithm is a bit more complicated:

    output_fp = BinaryFingerprint(num_bits)
    for feature_id, count in features:
        rng = RNG(feature_id)
        for _ in range(min(count, max_count) * bits_per_count):
            bitno = rng.randrange(num_bits)
            output_fp.SetOnBit(bitno)

The reason for "bits_per_count" is to reduce the effect of collisions. Doubling the fingerprint size and doubling the count keeps the output density roughly unchanged, but should reduce the collision rate between two pairs of (feature id, specific count).

That's my hand-waving belief, but I don't have the specific experience in evaluating fingerprint effectiveness. I know other RDKit users do, and might be able to help.

What I know so far is it's a bit better than RDKit's count simulation at predicting MW. https://mstdn.science/@molecule/115063149386391787 :)

The "fpc2fps" command supports other methods, like "scaled", which is a cross between superimposed and the RDKit count simulation. Rather than use `count` random numbers, it takes a lookup table of count thresholds to get the actual repeat to use. See the fpc2fps --help-methods for more complete details, or contact me.
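The pseudocode above can be turned into a small runnable sketch. This is only an illustration under stated assumptions: Python's `random.Random` stands in for the unspecified RNG, the fingerprint is represented as a set of on-bit positions, and the function names `fold` and `superimpose` are hypothetical. It is not chemfp's implementation, and the second feature id below is made up.

```python
import random


def fold(features, num_bits):
    """Plain folding: one bit per feature id, counts are ignored."""
    return {feature_id % num_bits for feature_id, _count in features}


def superimpose(features, num_bits, max_count=None, bits_per_count=1):
    """Seed an RNG with each feature id and draw bit positions from it.

    Each unit of count contributes `bits_per_count` draws, so a higher
    count can only add bits to those set by a lower count.
    """
    on_bits = set()
    for feature_id, count in features:
        if max_count is not None:
            count = min(count, max_count)
        # Assumption: random.Random as a stand-in RNG; the same feature id
        # always yields the same sequence of bit positions.
        rng = random.Random(feature_id)
        for _ in range(count * bits_per_count):
            on_bits.add(rng.randrange(num_bits))
    return on_bits


# (feature id, count) pairs; the second id is a made-up example
features = [(2246728737, 2), (3217380708, 1)]
print(sorted(fold(features, 2048)))
print(sorted(superimpose(features, 2048)))
print(sorted(superimpose(features, 4096, bits_per_count=2)))
```

Because the bit positions for a feature come from one seeded sequence, the bits set for count 1 are always a subset of the bits set for count 2, which is what lets a binary Tanimoto on the output reflect shared counts.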
This 5.0b2 release also includes a "simhistogram" method to generate a histogram from all possible Tanimoto scores, a "shardsearch" method to search multiple target files ("shards") and merge the results, and it has a reasonably performant implementation of the 4860-bit Klekota-Roth fingerprint. See https://chemfp.com/docs/whats_new_in_50.html to learn more.

Best regards,
Andrew
da...@da...
 | 
| 
      
      
      From: <tho...@bo...> - 2025-08-13 09:37:32
      
     | 
| Dear all,

I am working with peptides and RNA and want to convert sequences into 2D molecules. As we use non-natural and proprietary monomers, I cannot apply the usual workflows like MolFromHELM, but have developed my own Python code to build the macromolecules from their building blocks (basically using Chem.CombineMols and then rdDepictor.Compute2DCoords, see https://github.com/Boehringer-Ingelheim/pyPept/blob/master/src/pyPept/molecule.py).

While this works fine even for large peptides (>40 monomers), when doing the same for RNA I run into a problem: after a certain size (about 12 or 13 nucleotides), the 2D embedding returns all coordinates as zeroes and all stereo information is lost.

I tried the same using MolFromHELM, and there I do not see the same issue: I get valid 2D coordinates up to hundreds of nucleotides (yes, contrary to what the documentation says, RNA and DNA work, too!). Only if I first generate the molecule and then pass it through either rdCoordGen.AddCoords or Chem.rdDepictor.Compute2DCoords do I end up with coordinates as zero.

So I suppose MolFromHELM knows something about the general structure of the building blocks and uses that information, whereas the all-purpose embedders cannot take that into account and subsequently fail. But then again, MolFromHELM is not an option as I need non-natural monomers (unless there is a way to teach RDKit about non-canonical monomers, but I haven't found anything on it).

Here is the relevant code snippet:

    from rdkit import Chem
    from rdkit.Chem import rdCoordGen

    n_nucleotides = 20
    polyA = ['R(A)P'] * n_nucleotides
    polyA = '.'.join(polyA)
    helm = f'RNA1{{{polyA}}}$$$$V2.0'
    romol = Chem.MolFromHELM(helm)
    #rdCoordGen.AddCoords(romol)
    mb = Chem.MolToMolBlock(romol)
    print(mb[1:300])

Now everything looks fine, but as soon as I uncomment the rdCoordGen line, the coordinates are zero.

Any ideas or suggestions on what I could do?

Thanks,
Th.

Thomas Fox
NCE
Boehringer Ingelheim Pharma GmbH & Co. KG
Birkendorfer Str. 65 | 88397 Biberach
T +49 (7351) 54-7585
E tho...@bo...
 | 
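Rather than eyeballing the first 300 characters of the mol block, the "all coordinates are zero" symptom described above could be checked programmatically. The following is a minimal sketch that parses the fixed-width atom block of a V2000 mol block string, such as the `mb` string in the snippet above. The helper name and the hand-written example block are hypothetical, and the sketch assumes V2000 output only (it raises an error for anything else).

```python
def molblock_coords_all_zero(molblock):
    """Return True if every atom in a V2000 mol block sits at (0, 0, 0)."""
    lines = molblock.splitlines()
    counts = lines[3]  # the counts line follows the three header lines
    if "V2000" not in counts:
        raise ValueError("only V2000 mol blocks are handled in this sketch")
    num_atoms = int(counts[0:3])
    for line in lines[4:4 + num_atoms]:
        # x, y, z occupy three fixed-width 10-character fields
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        if x != 0.0 or y != 0.0 or z != 0.0:
            return False
    return True


# Hand-written two-atom example in which the embedding "failed"
EXAMPLE_ZERO = "\n".join([
    "",
    "  hand-written example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    0.0000    0.0000    0.0000 C   0  0",
    "  1  2  1  0",
    "M  END",
])

print(molblock_coords_all_zero(EXAMPLE_ZERO))
```

A check like this makes it easy to scan over `n_nucleotides` and find the size at which the coordinates first collapse.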
| 
      
      
      From: Noel O'B. <bao...@gm...> - 2025-07-29 07:29:03
      
     | 
| Hi all, If you are familiar with RDKit and are finishing a PhD or postdoc, I encourage you to take a look at the call for applications for an ARISE2 postdoctoral fellowship on our blog ( https://chembl.blogspot.com/2025/07/invite-to-apply-for-arise2-postdoctoral.html). This is a chance to work in the Chemical Biology Services team at EMBL-EBI improving the resources that many in the community rely upon, such as ChEMBL and SureChEMBL. If you are interested, please get in touch. Regards, Noel | 
| 
      
      
      From: Pavel P. <pav...@uk...> - 2025-03-28 07:57:30
      
     | 
| Thank you, Wim. It works. An even simpler solution might be to remove all
atoms except the required ones. I had to guess :)
However, this is a bug in the recent RDKit versions. The function 
MolFragmentToSmiles works correctly in version 2023, but not in 2024.
 | 
| 
      
      
      From: Wim D. <wim...@gm...> - 2025-03-27 23:10:53
      
     | 
| Pavel,
this is a bit hacky, but you can try the below:
```
from rdkit import Chem

def get_frag_smi(mol, frag_atoms):
    frag_set = set(frag_atoms)  # fast membership tests
    if len(frag_set) > 1:
        b2b = []   # bonds to break
        fsmi = ""  # fragment SMILES
        # collect bonds with at least one end outside the fragment
        for b in mol.GetBonds():
            b_idx = b.GetBeginAtomIdx()
            e_idx = b.GetEndAtomIdx()
            if e_idx not in frag_set or b_idx not in frag_set:
                b2b.append(b.GetIdx())
        # break all bonds except those inside the fragment
        fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
        smis = Chem.MolToSmiles(fmol).split(".")
        # keep the only fragment with more than one atom
        while fsmi == "":
            smi = smis.pop(0)
            m = Chem.MolFromSmiles(smi, sanitize=False)
            if m.GetNumAtoms() > 1:
                fsmi = smi
    else:  # one atom, no canonicalization needed
        fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
    return fsmi
```
it is based on the observation/assumption that FragmentOnBonds() followed by
MolToSmiles() canonicalizes the fragments cleanly.
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
prints `cN(c)O` twice.
best wishes,
wim
 | 
| 
      
      
      From: Pavel P. <pav...@uk...> - 2025-03-27 11:19:34
      
     | 
| Hello,
   I encountered an issue with SMILES of fragments. Maybe someone can
suggest a workaround.
   I attached the notebook, but will also reproduce some code here.
   We have a structure with two Ns. We take an N atom and its adjacent
atoms to make a fragment SMILES, and we get different results even though
the SMILES represent the same pattern (only the order of atoms is
different). I guess this happens due to the canonicalization algorithm,
which takes into account some additional information missing in the
output SMILES (e.g. ring membership). For instance, if we break a
saturated cycle (bond 8-9), we get identical SMILES output.

    mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
    print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
    print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))

    cN(C)c
    cN(c)C

   So, the question is how to work around this issue? We already have
millions of such patterns, so it would work if we were able to
canonicalize them. However, standard canonicalization does not work,
because we have to disable sanitization during SMILES parsing. It returns
the same output as the input SMILES. Any ideas are appreciated.

    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))

    cN(C)c
    cN(c)C

   This issue actually came from code for the identification of
functional groups.
Kind regards,
Pavel | 
| 
      
      
      From: Andrew D. <da...@da...> - 2024-11-04 16:35:34
      
     | 
| Hi all,

I've spent the last while working on some techniques to improve the performance of SMARTS-based fingerprint generators. It's called "talus" and is available at https://hg.sr.ht/~dalke/talus . It's able to improve the performance of Klekota-Roth fingerprint generation by about a factor of 12.

These fingerprints have long been described as slow to generate, eg, "PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints" (2010) at https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21707 says "The slowest algorithms are the Klekota-Roth fingerprint and Klekota-Roth fingerprint count because they are matching 4860 SMARTS patterns for each molecule.", which they timed as taking 14x the time of MACCS-key generation.

The fastest version is available at https://hg.sr.ht/~dalke/talus/browse/KlekotaRoth/kr_filtered_atomtypes.py?rev=tip , which takes a SMILES file and generates the fingerprints in chemfp's FPS format, as a standalone file which depends only on RDKit.

## How does it work?

This effort comes from looking at the Klekota-Roth fingerprints (defined in the supplementary data for "Chemical substructures that enrich for biological activity", doi: 10.1093/bioinformatics/btn479, https://academic.oup.com/bioinformatics/article/24/21/2518/192573 and available with a few minor syntax changes in the CDK's Java sources), which contain 4,860 SMARTS strings including

    [!#1][CH2][CH]([!#1])[!#1]
      -and-
    OS(=O)(=O)c1ccc(NN=C(C=O)C=O)cc1

The direct translation into a set of if statements, conceptually like

    _pat123 = Chem.MolFromSmarts("[!#1][CH2][CH]([!#1])[!#1]")
    ...
    if mol.HasSubstructMatch(_pat123):
        fp.SetBit(123)

takes 417 seconds in my standard benchmark of about 30,000 SMILES strings, of which only 2.6 seconds is parsing the SMILES strings and the rest is the SMARTS matches.
I was able to speed this up by a factor of 12 using the following techniques: 1) create a filter based on atom types counts, something like: _at1_pat = Chem.MolFromSmarts("[!#1]") _at2_pat = Chem.MolFromSmarts("[CH2]") _at3_pat = Chem.MolFromSmarts("[CH]") ... num_at1 = len(mol.GetSubstructMatches(_at1_pat)) num_at2 = len(mol.GetSubstructMatches(_at2_pat)) num_at3 = len(mol.GetSubstructMatches(_at3_pat)) ... if (num_at1 >= 3 and num_at2 >= 1 and num_at3 >= 1 and mol.HasSubstructMatch(_pat123)): fp.SetBit(123) 2) Analyze the atom SMARTS to recognize that, for example, both "[CH2]" and "[CH]" will always match "[!#1]", so the minimum counts can be increased to: if (num_at1 >= 5 and num_at2 >= 1 and num_at3 >= 1 and mol.HasSubstructMatch(_pat123)): fp.SetBit(123) 3) Identify SMARTS prefixes, which provide a natural tree structure. For example, the last two SMARTS patterns in the Klekota-Roth keys are: SCCS SCCS(=O)=O There is no reason to test for "SCCS(=O)=O" if "SCCS" does not pass, in which case there's no need to repeat the check for "S" and "C" counts, resulting in something like: if (num_S >= 2 and num_C >= 2 and mol.HasSubstructMatch(_pat_4858)): fp.SetBit(4858) if (num_O >= 2 and mol.HasSubstructMatch(_pat_4859)): fp.SetBit(4859) 4) Improve the effectiveness of SMARTS prefixes The SMARTS patterns are generated by Daylight's canonicalization rules then sorted ASCII-betically, but the SMARTS prefix method works better if the SMARTS starts with a unlikely chain terminal. For example, bit 3994 (key 3995) is "COc1cccc(C=NNC(=O)CO)c1", but "OCC(NN=Cc1cccc(OC)c1)=O" is an equivalent SMARTS with a longer initial chain. 5) Identify SMARTS prefixes which can be inserted as a filter. 
Here's an example of what that looks like, with a bit number followed by the SMARTS pattern, where the "*" indicates that a pattern is only used for filtering:
    2949 Br
      3222 BrC
        * BrCC filter 7 patterns
          3332 BrC(C)(Br)Br
          3333 BrC(C)C(=O)N
          * BrCCC filter 3 patterns
            3430 BrCC(C)O
            2973 BrCCC=O
            4683 BrCCC(NC(O)=O)=O
          4227 BrC(C(N)NC=O)(Br)Br
          4692 BrC(C(O)NC=O)(Br)Br
This says that "Br" is one of the keys, so bits 3222, 3332, 3333, 3430, etc. will not be tested unless Br exists. It further notices that "BrCC" is a common prefix to 7 patterns, so on the assumption that the overhead of one rejection test (which should be the usual case) saves the time needed to do 7 additional tests, it adds that extra filter. The "BrCCC" filter provides a further refinement.
All told, this brought Klekota-Roth fingerprint generation down to 33.5 seconds, of which 2.3 seconds (7%) was for SMILES processing, so another 10x performance gain may be possible.
## These gains are not necessarily portable
These impressive performance gains are possible because of how the Klekota-Roth keys were generated. For the subset of the PubChem keys which can be handled by "HasSubstructMatch" to a SMARTS pattern, the overall performance gain is only 2x, not 12x.
## Possible future directions
A clear direction for future improvement would be to build a decision tree based on all reasonable SMARTS subgraphs, tuned by match statistics from a representative selection of molecules.
Another extension would be to handle minimum counts, like how "at least 2 rings of size 6" (expressed as "*~1~*~*~*~*~*~1" or "[R]@1@[R]@[R]@[R]@[R]@[R]@1") requires at least 7 ring atoms.
Anyone thinking further along these lines may be interested in "Efficient matching of multiple chemical subgraphs" at https://www.nextmovesoftware.com/talks/Sayle_MultipleSmarts_ICCS_201106.pdf . I wanted a system which could generate a Python module, rather than a C/C++/Java library, resulting in different trade-offs.
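The prefix idea in techniques 3 and 5 can be sketched in plain Python. This is only an illustration under simplifying assumptions, not talus's actual code: it looks only for literal string prefixes among the key patterns, whereas talus also rewrites SMARTS and inserts filter-only prefixes such as "BrCC". The function names are hypothetical.

```python
def is_atom_boundary(prefix, full):
    """True if `prefix` is a proper string prefix of `full` that ends on an
    atom boundary. Simplification: only guards against splitting Cl and Br."""
    if not full.startswith(prefix) or len(full) == len(prefix):
        return False
    return (prefix[-1], full[len(prefix)]) not in {("C", "l"), ("B", "r")}


def build_prefix_parents(smarts_list):
    """Map each SMARTS to the longest other SMARTS in the list that is a
    prefix of it, or None when there is no such pattern."""
    parents = {}
    for smarts in smarts_list:
        candidates = [p for p in smarts_list if is_atom_boundary(p, smarts)]
        parents[smarts] = max(candidates, key=len) if candidates else None
    return parents


patterns = ["Br", "BrC", "BrC(C)(Br)Br", "BrCCC=O",
            "SCCS", "SCCS(=O)=O", "C", "Cl"]
parents = build_prefix_parents(patterns)
for smarts, parent in parents.items():
    print(f"{smarts!r:18} tested only if {parent!r} matched")
```

A generator could then emit the test for each pattern nested inside the `if` block of its parent, so a failed parent skips every pattern beneath it.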
## Methods to analyze atom and bond SMARTS terms
Developing this package required building a parser for the atom and bond SMARTS terms so I could tell if one atom SMARTS is a subset of another atom SMARTS. (I let RDKit handle the full SMARTS parsing, then use QueryAtom.GetSmarts() or QueryBond.GetSmarts() to get the actual SMARTS terms.) I think it may be of broader interest for anyone working with SMARTS at a syntax level. For example, the test driver takes a SMARTS string and gives a breakdown of the different components, and where that information came from in the SMARTS term:
% python smarts_parse.py '[#6a]=@[PH+]'
Pattern SMARTS: [#6a]=@[PH+]
atoms[0]: [#6&a] -> [c;R;!X0]
           ^^ ^ elements: [c]
              ^ in_ring: [R]
           connectivities: [!X0] + from SMARTS topology
atoms[1]: [P&H1&+] -> [P;H1;h0,h1;+1;!X0]
           ^ elements: [P]
             ^^ total_hcount: [H1]
             ^^ implicit_hcount: [h0,h1]
                ^ charges: [+1]
           connectivities: [!X0] + from SMARTS topology
bonds[0]: =&@ (between atoms 0 and 1) -> '=;@'
          ^ bondtypes: [=]
            ^ in_ring: [@]
This is able to figure out that "[#6a]" means it must be an aromatic carbon, which means it must be in a ring. It also knows from the SMARTS topology that there must be at least one bond (hence [!X0]). Were it a bit more clever, the "R" should tell it there are at least two bonds, both ring bonds, but that's for the future to fix. It also adds some additional constraints (which I conjectured would be useful for atom typing), like how "H1" means the implicit hydrogen count must be only 0 or 1.
Some of this work dates back to a SMARTS regular-expression based tokenizer I contributed to Brian Kelly's FROWNS project back in 2001 or so! See https://frowns.sourceforge.net/ .
If you want to take this effort further, please contact me and I'll provide some help, thoughts, and advice!
Andrew
da...@da... | 
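To show why the subset relation between atom SMARTS matters, here is a minimal sketch of how the minimum atom-type counts from techniques 1 and 2 could be derived. It is not taken from talus: the `subset_of` table is written by hand here (talus derives such relations with its parser), and the function names are hypothetical.

```python
from collections import Counter


def minimum_counts(pattern_atoms, subset_of):
    """Return the minimum number of atoms of each type a molecule needs
    before the full SMARTS match is worth attempting.

    pattern_atoms: the atom SMARTS terms of one pattern, one per atom.
    subset_of: maps an atom type to the broader types it always matches.
    """
    counts = Counter()
    for term in pattern_atoms:
        counts[term] += 1
        # An atom matching `term` also counts toward every broader type.
        for broader in subset_of.get(term, ()):
            counts[broader] += 1
    return dict(counts)


def passes_filter(mol_counts, required):
    """Cheap rejection test: True if every required count is met."""
    return all(mol_counts.get(t, 0) >= n for t, n in required.items())


# Atoms of "[!#1][CH2][CH]([!#1])[!#1]" from the message above.
pattern_atoms = ["[!#1]", "[CH2]", "[CH]", "[!#1]", "[!#1]"]
# Hand-written assumption: [CH2] and [CH] always match [!#1].
subset_of = {"[CH2]": ["[!#1]"], "[CH]": ["[!#1]"]}

required = minimum_counts(pattern_atoms, subset_of)
print(required)
```

With the subset table the requirement for "[!#1]" rises from 3 to 5, which matches the tightened condition in technique 2.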
| 
      
      
      From: Pavel P. <pav...@uk...> - 2024-10-25 13:34:58
      
     | 
| Dear colleagues,
we are glad to invite you to the 8th Advanced In Silico Drug Design workshop, which will take place 27-31 January 2025 at Palacky University in Olomouc (Czech Republic).
This year we cover topics on:
- virtual screening
- machine learning and AI
- structure- and ligand-based drug design tools
- pharmacophore modeling
- molecular docking and dynamics
- de novo design
- chemical space visualization
and others.
Lectures and tutorials will be provided by experts in the field from Austria, France, Italy, Israel and Czech Republic, in particular Prof. Thierry Langer, Prof. Alexandre Varnek, Prof. Johannes Kirchmair, Prof. Hanoch Senderowitz, and Prof. Alexander Domling.
There is no fee.
Workshop website: https://www.kfc.upol.cz/8add
Kind regards,
Pavel | 
| 
      
      
      From: <ml...@li...> - 2024-09-27 07:10:03
      
     | 
| Hello,
Recently, I updated some Python code from using
from rdkit.Chem.MolStandardize import Standardizer # rdkit <= '2023.09.3'
to
from rdkit.Chem.MolStandardize import rdMolStandardize # rdkit >= '2024.03.5'
because a fresh install updated rdkit and my former code stopped working.
My old code was this:
---
standardizer = Standardizer()
def standardize(preserve_stereo, preserve_taut, mol):
    if preserve_stereo or preserve_taut:
        s_mol = standardizer.standardize(mol)
        # We don't need to get fragment parent, because the charge
        # parent is the largest fragment
        s_mol = standardizer.charge_parent(s_mol, skip_standardize=True)
        s_mol = standardizer.isotope_parent(s_mol, skip_standardize=True)
        if not preserve_stereo:
            s_mol = standardizer.stereo_parent(s_mol, skip_standardize=True)
        if not preserve_taut:
            s_mol = standardizer.tautomer_parent(s_mol, skip_standardize=True)
        return standardizer.standardize(s_mol)
    else:
        # standardizer.super_parent(mol): _NOT_ standardizer.standardize(mol)
        # which doesn't even unsalt the molecule...
        return standardizer.super_parent(mol)
---
And the new code is:
---
def standardize(preserve_stereo, preserve_taut, mol):
    if preserve_stereo or preserve_taut:
        # We don't need to get fragment parent, because the charge
        # parent is the largest fragment
        s_mol = rdMolStandardize.ChargeParent(mol, skipStandardize=False)
        s_mol = rdMolStandardize.IsotopeParent(s_mol, skipStandardize=True)
        if not preserve_stereo:
            s_mol = rdMolStandardize.StereoParent(s_mol, skipStandardize=False)
        if not preserve_taut:
            s_mol = rdMolStandardize.TautomerParent(s_mol, skipStandardize=False)
        return s_mol
    else:
        return rdMolStandardize.SuperParent(mol, skipStandardize=False)
---
Which I hope is isofunctional.
The old Standardizer module had a "standardize" method.
Is this method also present in rdMolStandardize?
Has it changed name (e.g. to rdMolStandardize.Cleanup)?
Regards,
Francois.
 | 
| 
      
      
      From: <dd...@wp...> - 2024-09-18 17:34:37
      
     | 
| Hi all,
The following survey aims to gather empirical data to better understand what users of data formats expect when comparing those formats. It should take no more than 10 minutes:
https://forms.gle/K9AR6gbyjCNCk4FL6
Your response would be greatly appreciated!
Best,
Dominik | 
| 
      
      
      From: Manish S. <ms...@sa...> - 2024-09-12 18:14:45
      
     | 
| Hi Kurt,
You might find the following scripts helpful for enumerating compounds and aligning them to a reference molecule:
o RDKitEnumerateCompoundLibrary.py <http://www.mayachemtools.org/docs/scripts/html/RDKitEnumerateCompoundLibrary.html>
o RDKitPerformPositionalAnalogueScan.py <http://www.mayachemtools.org/docs/scripts/html/RDKitPerformPositionalAnalogueScan.html>
o RDKitGenerateConstrainedConformers.py <http://www.mayachemtools.org/docs/scripts/html/RDKitGenerateConstrainedConformers.html>
o RDKitPerformConstrainedMinimization.py <http://www.mayachemtools.org/docs/scripts/html/RDKitPerformConstrainedMinimization.html>
Let me know of any further questions.
Thanks,
Manish
From: Kurt Thorn <kur...@ar...>
Sent: Thursday, September 12, 2024 8:50 AM
To: rdk...@li...
Subject: [Rdkit-discuss] Impose conformation of molecule substructure?
Hi All -
I would like to enumerate a virtual library of a compound family we have a crystal structure of, where I want to model structures of multiple substituents at a single site. What I would like to do is enforce that the constant part of the molecule assume the conformation in the crystal structure and enumerate just conformers for the new substituent added. Does anyone have a suggestion for how to achieve this in rdkit?
Thanks,
Kurt
Dr Kurt Thorn
Chief Technology Officer
+1.609.423.1571 (US Office)
+1.415.298.3495 (US Mobile)
kur...@ar...
www.arrepath.com
ArrePath Inc.
303A College Road East
Princeton, NJ 08540
U.S. | 
| 
      
      
      From: Kurt T. <kur...@ar...> - 2024-09-12 17:26:29
      
     | 
| Thanks Stephen! That code pointed me to the key "coordMap" parameter for fixing atom coordinates.
Kurt
________________________________
From: Stephen Roughley <s.d...@go...>
Sent: Thursday, September 12, 2024 9:40 AM
To: Kurt Thorn <kur...@ar...>
Cc: rdk...@li... <rdk...@li...>
Subject: Re: [Rdkit-discuss] Impose conformation of molecule substructure?
Hi Kurt,
The Vernalis KNIME community contribution has a node "Templated Conformer Generator (RDKit)" (see https://hub.knime.com/n/wK3RJiystQYq5M9w ) which will do exactly this.
If you don't want to do it in KNIME, then you can see the relevant bits of the Java source at:
https://github.com/vernalis/vernalis-knime-nodes/blob/d125b97ad2841133622150c168472168547c4ff3/com.vernalis.knime.chem.pmi/src/com/vernalis/knime/chem/pmi/nodes/confs/rdkitgenerate/RdkitConfgenNodeModel.java#L441-L465
and in particular at:
https://github.com/vernalis/vernalis-knime-nodes/blob/d125b97ad2841133622150c168472168547c4ff3/com.vernalis.knime.chem.pmi/src/com/vernalis/knime/chem/pmi/nodes/confs/rdkitgenerate/RdkitConfgenNodeModel.java#L688-L809
Steve
On Thu, 12 Sept 2024 at 17:13, Kurt Thorn <kur...@ar...> wrote:
Hi All -
I would like to enumerate a virtual library of a compound family we have a crystal structure of, where I want to model structures of multiple substituents at a single site. What I would like to do is enforce that the constant part of the molecule assume the conformation in the crystal structure and enumerate just conformers for the new substituent added. Does anyone have a suggestion for how to achieve this in rdkit?
Thanks,
Kurt
Dr Kurt Thorn
Chief Technology Officer
+1.609.423.1571 (US Office)
+1.415.298.3495 (US Mobile)
kur...@ar...
www.arrepath.com
ArrePath Inc.
303A College Road East
Princeton, NJ 08540
U.S.
_______________________________________________
Rdkit-discuss mailing list
Rdk...@li...
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss | 
| 
      
      
      From: Stephen R. <s.d...@go...> - 2024-09-12 16:40:53
      
     | 
| Hi Kurt,
The Vernalis KNIME community contribution has a node "Templated Conformer Generator (RDKit)" (see https://hub.knime.com/n/wK3RJiystQYq5M9w ) which will do exactly this.
If you don't want to do it in KNIME, then you can see the relevant bits of the Java source at:
https://github.com/vernalis/vernalis-knime-nodes/blob/d125b97ad2841133622150c168472168547c4ff3/com.vernalis.knime.chem.pmi/src/com/vernalis/knime/chem/pmi/nodes/confs/rdkitgenerate/RdkitConfgenNodeModel.java#L441-L465
and in particular at:
https://github.com/vernalis/vernalis-knime-nodes/blob/d125b97ad2841133622150c168472168547c4ff3/com.vernalis.knime.chem.pmi/src/com/vernalis/knime/chem/pmi/nodes/confs/rdkitgenerate/RdkitConfgenNodeModel.java#L688-L809
Steve
On Thu, 12 Sept 2024 at 17:13, Kurt Thorn <kur...@ar...> wrote:
> Hi All -
>
> I would like to enumerate a virtual library of a compound family we have a
> crystal structure of, where I want to model structures of multiple
> substituents at a single site. What I would like to do is enforce that the
> constant part of the molecule assume the conformation in the crystal
> structure and enumerate just conformers for the new substituent added. Does
> anyone have a suggestion for how to achieve this in rdkit?
>
> Thanks,
> Kurt
>
> *Dr Kurt Thorn*
> *Chief Technology Officer*
> +1.609.423.1571 (US Office)
> +1.415.298.3495 (US Mobile)
> kur...@ar...
> www.arrepath.com
>
> *ArrePath Inc.*
> 303A College Road East
> Princeton, NJ 08540
> U.S.
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdk...@li...
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss | 
| 
      
      
      From: Kurt T. <kur...@ar...> - 2024-09-12 16:10:33
      
     | 
| Hi All -
I would like to enumerate a virtual library of a compound family we have a crystal structure of, where I want to model structures of multiple substituents at a single site. What I would like to do is enforce that the constant part of the molecule assume the conformation in the crystal structure and enumerate just conformers for the new substituent added. Does anyone have a suggestion for how to achieve this in rdkit?
Thanks,
Kurt
Dr Kurt Thorn
Chief Technology Officer
+1.609.423.1571 (US Office)
+1.415.298.3495 (US Mobile)
kur...@ar...
www.arrepath.com
ArrePath Inc.
303A College Road East
Princeton, NJ 08540
U.S. | 
| 
      
      
      From: Andrew D. <da...@da...> - 2024-09-11 12:47:44
      
     | 
| Hi Srdjan, > On Sep 11, 2024, at 11:12, Srdjan Pusara <srd...@ho...> wrote: > I would like to ask is it possible to find source code how these [UFF] interaction terms were implemented? The RDKit source code is available at https://github.com/rdkit/rdkit/tree/master . Use the green button labeled "<> Code" to get the source code either through the git version control tool, or as a zip file. If you want to use the web interface, see https://github.com/rdkit/rdkit/tree/master/Code/ForceField/UFF Best regards, Andrew da...@da... | 
| 
      
      
      From: Srdjan P. <srd...@ho...> - 2024-09-11 09:12:54
      
     | 
| Hello,
I have seen that RDKit can return force field parameters between groups of atoms (bond_params = rdForceFieldHelpers.GetUFFBondStretchParams(mol, 6, 1), angle_params = rdForceFieldHelpers.GetUFFAngleBendParams(mol, 0, 1, 2), etc.). Is it possible to find the source code showing how these interaction terms were implemented? I understand that these equations can be implemented by reading the original paper, but it would be helpful to access the RDKit source code where these interaction terms are already implemented. In addition, I have noticed that the original UFF paper has some small errors or typos, so having the already implemented source code would help.
Thanks in advance for your help. | 
| 
      
      
      From: Joe B. <Joe...@Sc...> - 2024-08-28 13:55:51
      
     | 
| Hi all,
We are recruiting a full-time developer with Python, RDKit and chemistry experience to work on our Compliance Hub applications. These are used by many of the world's top pharmaceutical companies, CROs and specialist chemical suppliers to ensure compliance with complex and chemical regulations globally.
For more information and to apply please see https://blog.scitegrity.com/news/blog-post-1-0-5-1-0-1
It's a remote role, although you do need to be UK based.
Best regards
Joe Bradley
CEO, Scitegrity Limited
This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the sender. This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Scitegrity accepts no liability for any damage caused by any virus transmitted by this email. Scitegrity accepts no liability for any advice relating to controlled substances given in this e-mail. | 
| 
      
      
      From: Andrew D. <da...@da...> - 2024-08-27 13:57:15
      
     | 
| On Aug 27, 2024, at 14:44, Ingvar Lagerstedt <in...@ne...> wrote:
> To me it would make sense if RDKit removed the aromatic flags for any atom that is no longer in a ring when deleting an aromatic atom/bond.
> Alternatively remove the aromatic flag on any non-ring atom when attempting to kekulize the structure rather than throwing an exception.
As Noel commented, the toolkit can't make that assumption. Different people will have different reasons for removing an atom. Some will want to remove multiple atoms, for example.
Here is one function to remove an atom. It converts the molecule to Kekulé form, updates the hydrogen counts on the neighboring atoms, and then removes the specified atom:
from rdkit import Chem
m = Chem.MolFromSmiles('c1ccccc1')
mw = Chem.RWMol(m)
Chem.Kekulize(mw, clearAromaticFlags=True)
atom_idx = 0
atom = mw.GetAtomWithIdx(atom_idx)
for bond in atom.GetBonds():
    int_bondtype = int(bond.GetBondType())
    assert int_bondtype in (1, 2, 3), "unexpected bond type!"
    other_atom = bond.GetOtherAtom(atom)
    other_atom.SetNumExplicitHs(
        other_atom.GetNumExplicitHs() + int_bondtype)
mw.RemoveAtom(atom_idx)
Chem.SanitizeMol(mw)
print(Chem.MolToSmiles(mw))
Bear in mind that there can be multiple possible Kekulé assignments, while Chem.Kekulize only picks one. In some cases (not this one, of course) you may need to apply the above removal method to all distinct assignments (for the given ring system) in order to get all the valid transformed molecules.
A more correct function should also take care when int_bondtype == 1 and other_atom has a chiral tag and does not already have a hydrogen, because you may want to preserve the chiral indicator. Something like this may work (it's copied and pasted from another function, with a few tweaks to match the above naming scheme, but I haven't tested it).
            # Assumes bond_idx = bond.GetIdx() for the bond to the removed atom.
            num_hs = other_atom.GetTotalNumHs()
            if (not num_hs) and other_atom.GetChiralTag():
                for bond_i, b in enumerate(other_atom.GetBonds()):
                    if b.GetIdx() == bond_idx:
                        break
                else:
                    raise AssertionError("Could not find bond")
                want_invert = (bond_i % 2 == 0)
                if want_invert:
                    if other_atom.GetChiralTag() == Chem.ChiralType.CHI_TETRAHEDRAL_CCW:
                        other_atom.SetChiralTag(Chem.ChiralType.CHI_TETRAHEDRAL_CW)
                    else:
                        other_atom.SetChiralTag(Chem.ChiralType.CHI_TETRAHEDRAL_CCW)
Cheers,
				Andrew
				da...@da...
 | 
| 
      
      
      From: Noel O'B. <bao...@gm...> - 2024-08-27 13:27:05
      
     | 
| There are other more subtle changes that can affect the aromaticity, e.g.
changing a bond order, the charge, or the atomic number of an atom. IMO,
the user needs to take responsibility for knowing if aromaticity might be
invalidated, and perform the appropriate actions. The alternative is for
the toolkit to take the responsibility, trigger a check on every edit and
take a performance hit in the general case. Indeed, atom deletion could be
treated specially, but slippery slope and confusion here we come! :-)
Regards,
Noel
On Tue, 27 Aug 2024 at 13:47, Ingvar Lagerstedt <in...@ne...>
wrote:
> Hello,
>
> When deleting an aromatic atom or bond, the ring information is removed,
> while any remaining atom in the broken aromatic ring is still labelled
> aromatic.  When attempting to sanitize such a molecule I get an exception: "rdkit.Chem.rdchem.AtomKekulizeException:
> non-ring atom 0 marked aromatic"
>
> To recreate:
>
> >>> from rdkit import Chem
>
> >>> m = Chem.MolFromSmiles('c1ccccc1')
>
> >>> mw = Chem.RWMol(m)
>
> >>> mw.RemoveAtom(0)
>
> >>> Chem.SanitizeMol(mw)
>
> [10:40:50] non-ring atom 0 marked aromatic
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
> rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 0 marked aromatic
>
>
> The example is simplistic, but there are reactions where an aromatic
> system can be broken, such as Zincke-Koenig reaction or Djerassi-Rylander
> oxidation. The exception makes it harder to describe such reactions.
>
>
> I currently check if an atom/bond is aromatic before deleting it, and if
> so remove all aromatic flags in the molecule.
>
>
> To me it would make sense if RDKit removed the aromatic flags for any atom
> that is no longer in a ring when deleting an aromatic atom/bond.
>
> Alternatively remove the aromatic flag on any non-ring atom when
> attempting to kekulize the structure rather than throwing an exception.
>
>
> Compare with a Birch reduction where the ring stays intact, here the
> kekulization/the following aromatize step rightly fails to find an aromatic
> ring, no exception is thrown, and the atoms are marked as non-aromatic.
>
>
> Kind Regards,
>
> Ingvar
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdk...@li...
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
 | 
| 
      
      
      From: Ingvar L. <in...@ne...> - 2024-08-27 12:44:59
      
     | 
| Hello,
When deleting an aromatic atom or bond, the ring information is removed, while any remaining atom in the broken aromatic ring is still labelled aromatic.  When attempting to sanitize such a molecule I get an exception: "rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 0 marked aromatic"
To recreate:
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('c1ccccc1')
>>> mw = Chem.RWMol(m)
>>> mw.RemoveAtom(0)
>>> Chem.SanitizeMol(mw)
[10:40:50] non-ring atom 0 marked aromatic
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 0 marked aromatic
The example is simplistic, but there are reactions where an aromatic system can be broken, such as Zincke-Koenig reaction or Djerassi-Rylander oxidation. The exception makes it harder to describe such reactions.
I currently check if an atom/bond is aromatic before deleting it, and if so remove all aromatic flags in the molecule.
To me it would make sense if RDKit removed the aromatic flags for any atom that is no longer in a ring when deleting an aromatic atom/bond.
Alternatively remove the aromatic flag on any non-ring atom when attempting to kekulize the structure rather than throwing an exception.
Compare this with a Birch reduction, where the ring stays intact: here the kekulization and the following aromatize step rightly fail to find an aromatic ring, no exception is thrown, and the atoms are marked as non-aromatic.
Kind Regards,
Ingvar
 | 
| 
      
      
      From: Diogo M. <dio...@gm...> - 2024-08-22 19:16:02
      
     | 
| Hello, We are recruiting a programmer, primarily Python, to improve autodock (molecular docking) and integrate it with other software, such as RDKit and OpenMM. The location is Scripps Research in La Jolla, California. Goals are: - to support development in general, - improve user-friendliness of command line and graphical interfaces, - make autodock components more usable from Python For more details and to apply, see: https://recruiting2.ultipro.com/SCR1003TSRI/JobBoard/98759e7d-7ede-4c0b-ac7b-2c6293c7b522/OpportunityDetail?opportunityId=b92548d1-155c-4c8e-be0e-f59c5b2452e0 Best regards, Diogo | 
| 
      
      
      From: Andrew D. <da...@da...> - 2024-08-05 08:35:38
      
     | 
| Hi RDKit-ers, I have released chemfp 4.2. The new "simarray" functionality computes the full comparison matrix as a NumPy array, eg, for use in some clustering algorithms. It has built-in support for Tanimoto, Dice, cosine, and Hamming comparisons, plus an option to get the individual "a", "b", "c", and "d" components should you need a specialized metric. It processes 100M comparisons per second on my laptop, which means if you had 30 TB of free disk space you could generate the NxN comparisons for ChEMBL in about a day. (I'm curious if someone will do this!) I've also updated chemfp's RDKit-Fingerprint, RDKit-Morgan, RDKit-AtomPair, and RDKit-Torsion fingerprint types to use RDKit's fingerprint generator API, instead of the older function-based API. This includes support for count emulation. Some of the parameter names have changed to follow RDKit's newer convention, and the RDKit-Morgan fingerprints now default to r=3 (to match the RDKit default) rather than r=2. Chemfp still supports the older function-based API, which is used if you specify the older version number explicitly. For a full description of what's new in this release, see https://chemfp.com/docs/whats_new_in_42.html . Chemfp may be the package you’ve been looking for, if you work with binary cheminformatics fingerprints in Python. Chemfp is perhaps best known for its high-performance fingerprint similarity search. Its Taylor/Butina clustering, MaxMin diversity selection, and sphere exclusion, (including directed sphere exclusion) are equally world-class. Or, if you simply need a 100K by 100K distance array to pass into scikit-learn, chemfp’s simarray can generate that in less than a minute. The chemfp homepage is https://chemfp.com/ . To install a pre-compiled chemfp for Linux-based OSes: python -m pip install chemfp -i https://chemfp.com/packages/ The default installation limits or disables a few chemfp features as described in the base license agreement at https://chemfp.com/BaseLicense.txt . 
To request a license key, which is free for academic use, see https://chemfp.com/license/ . Best regards, Andrew Dalke da...@da... |
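To make the idea of a full NxN comparison matrix concrete, here is a minimal pure-Python sketch of a Tanimoto similarity matrix. It is not chemfp's API and is far slower than chemfp; the function names are hypothetical, fingerprints are represented as Python integers, and returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(fp1 | fp2)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(fp1 & fp2) / union


def similarity_matrix(fingerprints):
    """Full NxN matrix as a list of lists; matrix[i][j] compares i with j."""
    return [[tanimoto(p, q) for q in fingerprints] for p in fingerprints]


fingerprints = [0b1011, 0b0011, 0b1100]
matrix = similarity_matrix(fingerprints)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

The matrix grows as N squared, which is why the storage and time estimates in the announcement above become significant for a database the size of ChEMBL.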