rdkit-discuss Mailing List for RDKit (Page 2)
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
From: Greg L. <gre...@gm...> - 2024-07-26 15:07:08
|
Hi Joao, On Thu, Jul 25, 2024 at 8:03 PM J Sousa <jso...@gm...> wrote: > > In fingerprints calculated by RDKit with includeChirality=True, is the CIP > label (R/S) the atom property directly used to generate the integer > identifier of an atom's circular neighborhood? > > Is the CIP label used for Morgan, RDKit, Atom pairs and Torsions > fingerprints? > Yes, that is currently how the fingerprinting code handles chirality. This is not a great way to do it, but it's what we have right now.[1] -greg [1] I haven't put a lot of time into this, but I haven't come up with anything better yet. |
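A minimal sketch of what the answer above implies for Morgan fingerprints: two enantiomers should give identical bits without chirality and different bits with it. The alanine SMILES, the radius, and the fingerprint size are illustrative assumptions, not taken from the thread.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Two enantiomers of alanine (illustrative molecules, not from the thread).
mol_a = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
mol_b = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")


def morgan_bits(mol, include_chirality):
    """Return the set of on-bits of a radius-2 Morgan fingerprint."""
    gen = rdFingerprintGenerator.GetMorganGenerator(
        radius=2, fpSize=2048, includeChirality=include_chirality
    )
    return set(gen.GetFingerprint(mol).GetOnBits())


# Without chirality the two molecular graphs are indistinguishable.
same_without_chirality = morgan_bits(mol_a, False) == morgan_bits(mol_b, False)
# With chirality the stereocenter contributes its label to the atom invariant.
same_with_chirality = morgan_bits(mol_a, True) == morgan_bits(mol_b, True)

print("identical without chirality:", same_without_chirality)
print("identical with chirality:", same_with_chirality)
```

The same comparison could be repeated with the atom-pair and topological-torsion generators in `rdFingerprintGenerator` to check the other fingerprint types mentioned in the question.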
From: J S. <jso...@gm...> - 2024-07-25 18:00:50
|
Hi, In fingerprints calculated by RDKit with includeChirality=True, is the CIP label (R/S) the atom property directly used to generate the integer identifier of an atom's circular neighborhood? Is the CIP label used for Morgan, RDKit, Atom pairs and Torsions fingerprints? Thanks, Joao Sousa |
From: Ernst-Georg S. <pg...@tu...> - 2024-07-02 15:56:50
|
On 27.06.2024 at 11:03, Wim Dehaen wrote: > I would expect the problem here is kekulization. The SMARTS is pattern > matching using the kekule structure (i.e. double and single bonds, non > aromatic atoms) and is not sanitized whereas the SMILES after parsing > and sanitization has aromatic bonds and aromatic atoms. Try what happens > when you do a SMARTS match with the SMILES with aromatic atoms: > `[2H]c1cc([3H])cc(C2=N[C@](C)([37Cl])CC2)c1` That was it indeed. Thank you, Ernst-Georg |
From: Noel O'B. <bao...@gm...> - 2024-06-27 09:28:07
|
"Every valid SMILES is also a valid SMARTS": I think this is one of John May's lines, which I was never keen on as it makes people think that if you treat a SMILES as a SMARTS that it will match the original SMILES. It mostly will, but I think you have found the difference between the SMILES and SMARTS treatment of "[2H]" - one means deuterium, the other means an isotope of mass 2 with a single implicit hydrogen attached. It doesn't match because the deuterium doesn't have another hydrogen attached. [I think??] Regards, Noel On Thu, 27 Jun 2024 at 10:05, Wim Dehaen <wim...@gm...> wrote: > I would expect the problem here is kekulization. The SMARTS is pattern > matching using the kekule structure (i.e. double and single bonds, non > aromatic atoms) and is not sanitized whereas the SMILES after parsing and > sanitization has aromatic bonds and aromatic atoms. Try what happens when > you do a SMARTS match with the SMILES with aromatic atoms: > `[2H]c1cc([3H])cc(C2=N[C@](C)([37Cl])CC2)c1` > > best wishes > wim > > On Thu, Jun 27, 2024 at 10:56 AM pgchem pgchem <pg...@tu...> > wrote: > >> Hello all, >> >> if every valid SMILES is also a valid SMARTS, why does: >> >> select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, >> '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol) >> >> yield "True", but: >> >> select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, >> '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::qmol) >> >> is "False"? The same is observed when using the @> operator. >> >> RDKit 2024.03.3 built from source + PostgreSQL 16.3. >> >> best regards >> >> Ernst-Georg >> >> >> _______________________________________________ >> Rdkit-discuss mailing list >> Rdk...@li... >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |
From: Wim D. <wim...@gm...> - 2024-06-27 09:04:05
|
I would expect the problem here is kekulization. The SMARTS is pattern matching using the kekule structure (i.e. double and single bonds, non aromatic atoms) and is not sanitized whereas the SMILES after parsing and sanitization has aromatic bonds and aromatic atoms. Try what happens when you do a SMARTS match with the SMILES with aromatic atoms: `[2H]c1cc([3H])cc(C2=N[C@](C)([37Cl])CC2)c1` best wishes wim On Thu, Jun 27, 2024 at 10:56 AM pgchem pgchem <pg...@tu...> wrote: > Hello all, > > if every valid SMILES is also a valid SMARTS, why does: > > select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, > '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol) > > yield "True", but: > > select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, > '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::qmol) > > is "False"? The same is observed when using the @> operator. > > RDKit 2024.03.3 built from source + PostgreSQL 16.3. > > best regards > > Ernst-Georg > > > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |
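The kekulization point above can be reproduced in Python with a much smaller molecule. This is a sketch using benzene as a stand-in for the structure in the thread; `Chem.MolFromSmarts` plays the role of the cartridge's `qmol` cast and `Chem.MolFromSmiles` the role of `mol`, which is my reading of the thread rather than something it states explicitly.

```python
from rdkit import Chem

# Parsing as SMILES sanitizes the molecule, so the ring becomes aromatic.
mol = Chem.MolFromSmiles("C1=CC=CC=C1")

# Parsed as SMARTS, uppercase C means "aliphatic carbon" and no
# aromaticity perception is applied to the pattern.
kekule_query = Chem.MolFromSmarts("C1=CC=CC=C1")
# Lowercase c means "aromatic carbon", which is what the molecule contains.
aromatic_query = Chem.MolFromSmarts("c1ccccc1")
# The same Kekule string parsed as SMILES is sanitized like the target.
smiles_query = Chem.MolFromSmiles("C1=CC=CC=C1")

results = {
    "kekule_smarts": mol.HasSubstructMatch(kekule_query),
    "aromatic_smarts": mol.HasSubstructMatch(aromatic_query),
    "smiles_as_query": mol.HasSubstructMatch(smiles_query),
}
print(results)
```

So a string can be syntactically valid as both SMILES and SMARTS and still describe a different query in each role.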
From: pgchem p. <pg...@tu...> - 2024-06-27 08:53:29
|
Hello all, if every valid SMILES is also a valid SMARTS, why does: select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol) yield "True", but: select substruct('[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::mol, '[2H]C1=CC([3H])=CC(=C1)C1=N[C@](C)([37Cl])CC1'::qmol) is "False"? The same is observed when using the @> operator. RDKit 2024.03.3 built from source + PostgreSQL 16.3. best regards Ernst-Georg |
From: Eloy F. <elo...@gm...> - 2024-06-24 08:39:37
|
Dear RDKitters, We are recruiting for a Chemical Biology Team Leader. This is an exciting opportunity to lead the Chemical Biology resources, based at the Wellcome Genome Campus in Hinxton near Cambridge, UK. The chemical biology team <https://www.ebi.ac.uk/about/teams/chemical-biology-services/> at EMBL-EBI delivers world-leading databases and resources to the scientific community. Our flagship resource, ChEMBL <https://www.ebi.ac.uk/chembl/>, is a database of high-quality quantitative small-molecule bioactivity data curated from the scientific literature and direct data depositions. SureChEMBL <https://www.surechembl.org/> is a complementary patent resource containing chemical structures and biology/drug discovery annotations extracted daily from patents. UniChem <https://www.ebi.ac.uk/unichem/> links chemical structures across databases. ChEBI <https://www.ebi.ac.uk/chebi/> is a database and ontology of small molecules relevant to biology. Closing Date: 19th July 2024 More details here <https://www.embl.org/jobs/position/EBI02255> Kind regards, Eloy |
From: James W. <jea...@gm...> - 2024-05-14 15:08:58
|
This resolved itself after a refresh, so whether I had a bad download of the file I'm not sure. Anyway, I have the image now, so all's well On Tue, 14 May 2024 at 15:23, Greg Landrum <gre...@gm...> wrote: > Hi James, > > If that's pulling the inchi zip from rdkit.org then the MD5 shouldn't > have changed. > > The easiest thing is to just replace the MD5 in > $RDBASE/Code/cmake/Modules/FindInchi.cmake with what you're getting (after > making sure it is in fact the correct zip file of course). > > -greg > > > On Tue, May 14, 2024 at 1:04 PM James Wallace <jea...@gm...> > wrote: > >> I'm trying to compile RDKit 2023_03_3 into a Docker container, but the >> CMake MD5 check fails for the Inchi library. Is there a way of disabling >> this check, because my presumption is the library has changed going >> forward, but for compatibility reasons, I want to keep this version as >> close to stock as possible. I enclose the specific error below: >> >> #12 1177.0 The md5 checksum for /rdkit/External/INCHI-API/INCHI-1-SRC.zip >> is incorrect; expected: f2efa0c58cef32915686c04d7055b4e9, found: >> 4579f086463c76353a75ecc6193becb9 >> _______________________________________________ >> Rdkit-discuss mailing list >> Rdk...@li... >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > |
From: Greg L. <gre...@gm...> - 2024-05-14 14:23:28
|
Hi James, If that's pulling the inchi zip from rdkit.org then the MD5 shouldn't have changed. The easiest thing is to just replace the MD5 in $RDBASE/Code/cmake/Modules/FindInchi.cmake with what you're getting (after making sure it is in fact the correct zip file of course). -greg On Tue, May 14, 2024 at 1:04 PM James Wallace <jea...@gm...> wrote: > I'm trying to compile RDKit 2023_03_3 into a Docker container, but the > CMake MD5 check fails for the Inchi library. Is there a way of disabling > this check, because my presumption is the library has changed going > forward, but for compatibility reasons, I want to keep this version as > close to stock as possible. I enclose the specific error below: > > #12 1177.0 The md5 checksum for /rdkit/External/INCHI-API/INCHI-1-SRC.zip > is incorrect; expected: f2efa0c58cef32915686c04d7055b4e9, found: > 4579f086463c76353a75ecc6193becb9 > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |
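Before editing the expected checksum as suggested above, it may help to confirm what the downloaded archive actually hashes to. A small sketch using only the standard library; the function names are my own, and the path and hashes in the usage note are the ones quoted in the error message.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `checksum_matches("/rdkit/External/INCHI-API/INCHI-1-SRC.zip", "f2efa0c58cef32915686c04d7055b4e9")` returning `False` on a second, fresh download as well would suggest the file on the server changed rather than the download being corrupted.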
From: James W. <jea...@gm...> - 2024-05-14 11:01:51
|
I'm trying to compile RDKit 2023_03_3 into a Docker container, but the CMake MD5 check fails for the Inchi library. Is there a way of disabling this check, because my presumption is the library has changed going forward, but for compatibility reasons, I want to keep this version as close to stock as possible. I enclose the specific error below: #12 1177.0 The md5 checksum for /rdkit/External/INCHI-API/INCHI-1-SRC.zip is incorrect; expected: f2efa0c58cef32915686c04d7055b4e9, found: 4579f086463c76353a75ecc6193becb9 |
From: He, A. <he...@bu...> - 2024-05-13 21:25:50
|
Hi Pavel, Do you work with small rings (5, 6, 7) or large cyclic structures (like cyclic peptides)? To distinguish different conformations of small rings, I feel that the torsional angles or apex heights – such geometric values that are alignment-free and depend on the internal coordinates of the molecules - might be more useful than RMSD. You can run conformation generation then put conformers into categories, if you don’t have too many rings and the rings aren’t that big. To get started with a simple example in RDKit, I previously found this tutorial very helpful: https://sunhwan.github.io/blog/2021/02/24/RDKit-ETKDG-Piperazine.html Docking in AutoDock Vina (https://autodock-vina.readthedocs.io/en/latest/) or AutoDock-GPU (https://github.com/ccsb-scripps/AutoDock-GPU) supports sampling of ring conformations on the run. By default, attempts will be made during docking to sample alternate conformers of 7-membered and larger rings. Optionally, you could also turn on the sampling for 6-membered rings and smaller ones. Take a peek at this recent paper to learn about the method: https://www.cambridge.org/core/journals/qrb-discovery/article/performance-evaluation-of-flexible-macrocycle-docking-in-autodock/D8417BC284AEE198EC6AF25C7E677249 The Meeko project (https://github.com/forlilab/Meeko?tab=readme-ov-file#python-tutorial) provides a seamless workflow in Python to export your RDKit molecules into AutoDock-ready formats (and the docking outcomes can be retrieved back to RDKit, too!). The multiple docking outcomes with AutoDock Vina can give you at least some idea of what conformations might fit. You could refine the poses with more advanced methods. Hope this helps! Best regards, Amy H. From: Pavel Polishchuk <pav...@uk...> Date: Monday, May 13, 2024 at 4:43 AM To: rdk...@li... <rdk...@li...> Subject: [Rdkit-discuss] sampling of ring conformation for docking Hello, I use RDKit to embed initial conformations for docking. The issue is with saturated rings. 
I can use a single random conformer but its geometry may be unsuitable and the whole molecule will fail to dock. I can use several starting conformers for docking and to avoid docking of very similar conformers I can select a few diverse conformers based on RMSD between rings only. However, the issue occurs if a molecule has several such saturated rings. The current workaround is to compute RMSD between corresponding rings individually, then average RMSD values and select a diverse set of conformers. It may work to some extent. However, I'm curious whether a better solution is possible. Can we sample rings individually and embed a molecule using pre-generated conformers of some parts (rings)? I know about the restricted conformer enumeration function, but it will work only if we supply a single connected part as fixed. It should not work if we have two disconnected parts (rings) with 3D coordinates, because we do not know their relative position to generate 3D coordinates for the rest of the atoms in the molecule. Maybe someone will have some ideas/suggestions? Kind regards, Pavel _______________________________________________ Rdkit-discuss mailing list Rdk...@li... https://lists.sourceforge.net/lists/listinfo/rdkit-discuss |
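The suggestion above to classify ring conformations by internal torsions rather than RMSD could be sketched as follows. The molecule (4-methylpiperidine), the number of conformers, the 20-degree "flat" cutoff, and the sign-pattern signature are all assumptions made for illustration; they are not from the thread or the linked tutorial.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms


def ring_torsions(mol, conf_id, ring):
    """Dihedral angles (degrees) over each run of four consecutive ring atoms."""
    conf = mol.GetConformer(conf_id)
    n = len(ring)
    return [
        rdMolTransforms.GetDihedralDeg(
            conf, ring[i], ring[(i + 1) % n], ring[(i + 2) % n], ring[(i + 3) % n]
        )
        for i in range(n)
    ]


def torsion_signature(torsions, flat=20.0):
    """Coarse, alignment-free label: '+', '-' or '0' per ring torsion."""
    return "".join("+" if t > flat else "-" if t < -flat else "0" for t in torsions)


# Illustrative molecule with one saturated six-membered ring.
mol = Chem.AddHs(Chem.MolFromSmiles("CC1CCNCC1"))
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42))
ring = mol.GetRingInfo().AtomRings()[0]

# Group conformers that share the same ring puckering pattern.
groups = {}
for cid in conf_ids:
    signature = torsion_signature(ring_torsions(mol, cid, ring))
    groups.setdefault(signature, []).append(cid)

for signature, members in groups.items():
    print(signature, members)
```

For a molecule with several saturated rings, one signature per ring could be concatenated, and one conformer kept per distinct combination, which avoids averaging RMSD values across rings.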
From: Pavel P. <pav...@uk...> - 2024-05-13 08:43:01
|
Hello, I use RDKit to embed initial conformations for docking. The issue is with saturated rings. I can use a single random conformer but its geometry may be unsuitable and the whole molecule will fail to dock. I can use several starting conformers for docking and to avoid docking of very similar conformers I can select a few diverse conformers based on RMSD between rings only. However, the issue occurs if a molecule has several such saturated rings. The current workaround is to compute RMSD between corresponding rings individually, then average RMSD values and select a diverse set of conformers. It may work to some extent. However, I'm curious whether a better solution is possible. Can we sample rings individually and embed a molecule using pre-generated conformers of some parts (rings)? I know about the restricted conformer enumeration function, but it will work only if we supply a single connected part as fixed. It should not work if we have two disconnected parts (rings) with 3D coordinates, because we do not know their relative position to generate 3D coordinates for the rest of the atoms in the molecule. Maybe someone will have some ideas/suggestions? Kind regards, Pavel |
From: Ariadna L. P. <ari...@gm...> - 2024-05-02 08:18:31
|
Hello everyone, Thank you for all your helpful suggestions. I've taken careful note of them, and they have been extremely helpful in guiding my work. 3D-QSAR is also new for me and your insights and expertise have been incredibly valuable. Thank you once again for your generous assistance. Best Regards, Ariadna Llop Message from Andrew Dalke <da...@da...> on Tue, 30 Apr 2024 at 22:45: > Hi Ariadna, > > In general the MACCS keys are not that good for comparing similarity. > They exist still for historical reasons. Back in the 1970s the company > Molecular Design Limited developed a program called "Molecular Access > System" (MACCS) for structure registration, substructure search, and the > like. > > Substructure search is slow, so MACCS includes a set of keys which would > act as fast filters - if the query contained a key but the database entry > did not, then the query could not match that entry. > > In the 1980s when fingerprint similarity search first became popular - > this is before the term "fingerprint" was even coined - people used the > MACCS keys because they were already computed and sitting there, on the > computer system they were already using. > > Over time people developed other types of fingerprints, and different ways > to compare them, and a more complete understanding of how they are coupled > to the types of system being studied. > > For example, in "Comparing structural fingerprints using a > literature-based similarity benchmark" by Sayle and O'Boyle, > "Extended-connectivity fingerprints of diameter 4 and 6 are among the best > performing fingerprints when ranking diverse structures by similarity, as > is the topological torsion fingerprint. However, when ranking very close > analogues, the atom pair fingerprint outperforms the others tested." > > They found the MACCS fingerprints to be one of the worst performers, which > you might expect now that you know the happenstance which made them popular. 
> > Since you are doing 3D QSAR, you should familiarize yourself with the > fingerprints used in that area. I have no experience with 3D QSAR and > cannot provide advice on what is appropriate. > > The first paper I found using Google Scholar to search for "3d qsar > fingerprints" is "Docking, Interaction Fingerprint, and Three-Dimensional > Quantitative Structure–Activity Relationship (3D-QSAR) of Sigma1 Receptor > Ligands, Analogs of the Neuroprotective Agent RC-33" at > https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6637851/ which uses > Interaction fingerprints. > > The second is "Novel TOPP descriptors in 3D-QSAR analysis of apoptosis > inducing 4-aryl-4H-chromenes: Comparison versus other 2D- and > 3D-descriptors" at > https://www.sciencedirect.com/science/article/pii/S0968089607005834 which > I mention because it summarizes 7 different descriptor-based approaches, > and places the MACCS keys in last place, far below the second worst ("TOPP > > GRIND > BCI 4096 = ECFP > FCFP > GRID-GOLPE ≫ DRAGON ⋙ MDL 166"). > > No doubt there are many others for you to read through and try out. > > > > # Generate fingerprint descriptor database > > fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols] > > What I can suggest is you try my chemfp package, specifically the 4.2b1 I > just released (bear in mind that it is beta!) > > You can install it with: > > python -m pip install chemfp==4.2b1 -i https://chemfp.com/packages/ > > To generate Morgan fingerprints of radius 2, I suggest you compute them > once and store them in a file, like this command-line example: > > rdkit2fps --morgan2 dataset.smi -o dataset.fps > > (use "--maccs" to generate MACCS keys, "--pair" for atom pairs; and use > "--help" to see what other options are available.) 
> > To "Calculate pairwise Tanimoto similarity between fingerprints" as a > distance, you can use another command-line tool to generate the matrix as a > NumPy "npy" file, like this: > > chemfp simarray dataset.fps --as-distance -o dataset.npy > > To load this in Python: > > import numpy as np > dists = np.load("dataset.npy") > > If you also need the identifiers: > > with open("dataset.npy", "rb") as f: > dists = np.load(f) > metadata = np.load(f) > ids = np.load(f) > > This should make it easier to iterate over the different clustering > methods available, since you only generate the fingerprints and distance > matrix once. > > If you decide to use interaction fingerprints, or some other fingerprint > type that is not in the RDKit, you can still generate the fingerprints in > FPS format (a simple text format) and use chemfp to generate your matrix > for you, either on the command-line or through its Python API. > > > However, I'm not satisfied with the results and would like to experiment > with MACCS Keys to see if they yield better clustering outcomes. Does > anyone know how to cluster compounds using MACCS fingerprints? Any insights > on the best approach to calculate similarities and cluster using these > fingerprints would be highly appreciated. > > In case I was not clear enough before, MACCS keys make poor fingerprints. > There is no reason to expect they will yield better clustering outcomes, > and multiple papers which suggest they will make worse outcomes. > > Best regards, > > Andrew > da...@da... > > > |
From: Andrew D. <da...@da...> - 2024-04-30 21:10:38
|
Hi Ariadna, In general the MACCS keys are not that good for comparing similarity. They exist still for historical reasons. Back in the 1970s the company Molecular Design Limited developed a program called "Molecular Access System" (MACCS) for structure registration, substructure search, and the like. Substructure search is slow, so MACCS includes a set of keys which would act as fast filters - if the query contained a key but the database entry did not, then the query could not match that entry. In the 1980s when fingerprint similarity search first became popular - this is before the term "fingerprint" was even coined - people used the MACCS keys because they were already computed and sitting there, on the computer system they were already using. Over time people developed other types of fingerprints, and different ways to compare them, and a more complete understanding of how they are coupled to the types of system being studied. For example, in "Comparing structural fingerprints using a literature-based similarity benchmark" by Sayle and O'Boyle, "Extended-connectivity fingerprints of diameter 4 and 6 are among the best performing fingerprints when ranking diverse structures by similarity, as is the topological torsion fingerprint. However, when ranking very close analogues, the atom pair fingerprint outperforms the others tested." They found the MACCS fingerprints to be one of the worst performers, which you might expect now that you know the happenstance which made them popular. Since you are doing 3D QSAR, you should familiarize yourself with the fingerprints used in that area. I have no experience with 3D QSAR and cannot provide advice on what is appropriate. 
The first paper I found using Google Scholar to search for "3d qsar fingerprints" is "Docking, Interaction Fingerprint, and Three-Dimensional Quantitative Structure–Activity Relationship (3D-QSAR) of Sigma1 Receptor Ligands, Analogs of the Neuroprotective Agent RC-33" at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6637851/ which uses Interaction fingerprints. The second is "Novel TOPP descriptors in 3D-QSAR analysis of apoptosis inducing 4-aryl-4H-chromenes: Comparison versus other 2D- and 3D-descriptors" at https://www.sciencedirect.com/science/article/pii/S0968089607005834 which I mention because it summarizes 7 different descriptor-based approaches, and places the MACCS keys in last place, far below the second worst ("TOPP > GRIND > BCI 4096 = ECFP > FCFP > GRID-GOLPE ≫ DRAGON ⋙ MDL 166"). No doubt there are many others for you to read through and try out. > # Generate fingerprint descriptor database > fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols] What I can suggest is you try my chemfp package, specifically the 4.2b1 I just released (bear in mind that it is beta!) You can install it with: python -m pip install chemfp==4.2b1 -i https://chemfp.com/packages/ To generate Morgan fingerprints of radius 2, I suggest you compute them once and store them in a file, like this command-line example: rdkit2fps --morgan2 dataset.smi -o dataset.fps (use "--maccs" to generate MACCS keys, "--pair" for atom pairs; and use "--help" to see what other options are available.) 
To "Calculate pairwise Tanimoto similarity between fingerprints" as a distance, you can use another command-line tool to generate the matrix as a NumPy "npy" file, like this: chemfp simarray dataset.fps --as-distance -o dataset.npy To load this in Python: import numpy as np dists = np.load("dataset.npy") If you also need the identifiers: with open("dataset.npy", "rb") as f: dists = np.load(f) metadata = np.load(f) ids = np.load(f) This should make it easier to iterate over the different clustering methods available, since you only generate the fingerprints and distance matrix once. If you decide to use interaction fingerprints, or some other fingerprint type that is not in the RDKit, you can still generate the fingerprints in FPS format (a simple text format) and use chemfp to generate your matrix for you, either on the command-line or through its Python API. > However, I'm not satisfied with the results and would like to experiment with MACCS Keys to see if they yield better clustering outcomes. Does anyone know how to cluster compounds using MACCS fingerprints? Any insights on the best approach to calculate similarities and cluster using these fingerprints would be highly appreciated. In case I was not clear enough before, MACCS keys make poor fingerprints. There is no reason to expect they will yield better clustering outcomes, and multiple papers which suggest they will make worse outcomes. Best regards, Andrew da...@da... |
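The historical role of the MACCS keys described in this message, a fast screen that rules out impossible substructure matches, can be illustrated with plain integers as bit sets. This is a toy sketch: the bit patterns are made up and do not correspond to real MACCS key definitions.

```python
def may_contain(query_keys: int, target_keys: int) -> bool:
    """Screen-out test: every key set in the query must also be set in the target.

    A False result proves the query cannot be a substructure of the target;
    a True result only means the slower atom-by-atom search is still needed.
    """
    return query_keys & ~target_keys == 0


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit sets stored as Python integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Hypothetical 4-bit key sets (not real MACCS keys).
target = 0b1101
print(may_contain(0b0101, target))  # query keys are a subset of the target's
print(may_contain(0b0110, target))  # query has a key the target lacks
print(tanimoto(0b0101, target))
```

The same bits serve both purposes, which is how keys designed for screening ended up being reused for similarity.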
From: YUKTI D. <yuk...@st...> - 2024-04-24 19:14:59
|
Can anybody help me do bioactivity prediction for a batch of SMILES with RDKit? |
From: Greg L. <gre...@gm...> - 2024-04-23 14:20:24
|
Hi, Please do not duplicate questions/posts between the mailing list and github discussions. That's spamming the community. -greg On Tue, Apr 23, 2024 at 4:10 PM Ariadna Llop Peiró <ari...@gm...> wrote: > Hello everyone, > > I'm currently working with a dataset of chemical compounds, aiming to > cluster them into different series to create a 3D-QSAR model. Up to this > point, I've been using Morgan Fingerprints to generate the descriptors and > cluster the compounds based on their Tanimoto Similarity: > > ``` > # Generate fingerprint descriptor database > fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols] > > > # Calculate pairwise Tanimoto similarity between fingerprints > similarity_matrix = [] > for i in range(len(fps)): > similarities = [] > for j in range(len(fps)): > similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j])) > > similarity_matrix.append(similarities) > ``` > > > With the similarity matrix, I applied hierarchical clustering based on a > Tanimoto Similarity threshold to group similar compounds: > > ``` > # Cluster based on Tanimoto similarity > dists = 1 - np.array(similarity_matrix) > hc = hierarchy.linkage(squareform(dists), method='single') > > # Specify a distance threshold or number of clusters > threshold = 0.6 # Adjust this value based on your dendrogram and > similarity values > clusters = hierarchy.fcluster(hc, threshold, criterion='distance') > ``` > > However, I'm not satisfied with the results and would like to experiment > with MACCS Keys to see if they yield better clustering outcomes. Does > anyone know how to cluster compounds using MACCS fingerprints? Any insights > on the best approach to calculate similarities and cluster using these > fingerprints would be highly appreciated. > > Thank you in advance for your suggestions! > > Ariadna Llop > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |
From: Ariadna L. P. <ari...@gm...> - 2024-04-23 14:07:38
|
Hello everyone,

I'm currently working with a dataset of chemical compounds, aiming to cluster them into different series to create a 3D-QSAR model. Up to this point, I've been using Morgan Fingerprints to generate the descriptors and cluster the compounds based on their Tanimoto Similarity:

```
from rdkit import DataStructs
from rdkit.Chem import AllChem

# mols: list of RDKit molecules, built earlier (not shown in this post)

# Generate fingerprint descriptor database
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

# Calculate pairwise Tanimoto similarity between fingerprints
similarity_matrix = []
for i in range(len(fps)):
    similarities = []
    for j in range(len(fps)):
        similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j]))
    similarity_matrix.append(similarities)
```

With the similarity matrix, I applied hierarchical clustering based on a Tanimoto Similarity threshold to group similar compounds:

```
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Cluster based on Tanimoto similarity
dists = 1 - np.array(similarity_matrix)
hc = hierarchy.linkage(squareform(dists), method='single')

# Specify a distance threshold or number of clusters
threshold = 0.6  # Adjust this value based on your dendrogram and similarity values
clusters = hierarchy.fcluster(hc, threshold, criterion='distance')
```

However, I'm not satisfied with the results and would like to experiment with MACCS Keys to see if they yield better clustering outcomes. Does anyone know how to cluster compounds using MACCS fingerprints? Any insights on the best approach to calculate similarities and cluster using these fingerprints would be highly appreciated.

Thank you in advance for your suggestions!

Ariadna Llop
|
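For the literal question above, a minimal sketch of clustering on MACCS keys with RDKit alone. Note that it swaps the single-linkage hierarchical clustering from the post for Butina clustering, and that the six SMILES and the 0.4 distance cutoff are placeholder assumptions rather than recommended settings.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.ML.Cluster import Butina

# Placeholder dataset (not from the thread).
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# MACCS keys as bit vectors, one per molecule.
fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]

# Butina.ClusterData expects the lower triangle of the distance matrix
# as a flat list: d(1,0), d(2,0), d(2,1), d(3,0), ...
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

# Molecules closer than the cutoff (in Tanimoto distance) share a cluster.
cutoff = 0.4
clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

for cluster_id, members in enumerate(clusters):
    print(cluster_id, [smiles[i] for i in members])
```

Only the fingerprint line is specific to MACCS, so the same loop can be rerun with other fingerprint types for a side-by-side comparison.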
From: מיכל ר. <mic...@gm...> - 2024-03-27 10:28:22
|
Hi I'm trying to define the following reaction: '([A:1]\[A:2]=[A:3]\[A:4]=[A:5]/[A:6]=[A:7]\[A:8]=[A:9]\[A:10]) >> ([A:1]/[A:2]=[A:9]\[A:10].[A:4]1=[A:5][A:6]=[A:7][A:8]=[A:3]1)' I want the reaction to take place for the cis case specifically as written and not for the all-trans reactant. using rdchiral I manage to eliminate the all-trans reactant, but the product is given in its all-trans case and not in the cis case, as the reaction demands (between atoms 1,2,9,10): reactant: 'CCCC/[NH+]=C/C=C(C)\\C=C/C=C(C)/C=C/C1=C(C)CCCC1(C)C' product: 'CCCC/[NH+]=C(C)/C=C/C1=C(C)CCCC1(C)C' How can I resolve this issue? |
From: Greg L. <gre...@gm...> - 2024-03-20 16:36:08
|
For what it's worth, this one works too: m.GetSubstructMatches(Chem.MolFromSmarts('P1->[Zr+3]<-C1')) It looks like a problem in the way ring closure bonds are being handled in the SMARTS parser. Jan: would you mind creating an issue for this in GitHub? -greg On Wed, Mar 20, 2024 at 3:30 PM Jan Halborg Jensen <jhj...@ch...> wrote: > The following finds no matches: > > m = Chem.MolFromSmiles('C1P->[Zr+3]<-1') > m.GetSubstructMatches(Chem.MolFromSmarts('C1P->[Zr+3]<-1')) > > But all these work: > > m.GetSubstructMatches(Chem.MolFromSmiles('C1P->[Zr+3]<-1')) > > m.GetSubstructMatches(Chem.MolFromSmarts('[*]->[Zr+3]')) > > m = Chem.MolFromSmiles('C1P-[Zr+3]-1') > m.GetSubstructMatches(Chem.MolFromSmarts('C1P-[Zr+3]-1')) > > > Is this a bug, or is there something I'm missing with regard to the first > case? > > Best regards, Jan > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |
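A small script along these lines could be used to compare the two SMARTS spellings from this thread on a given RDKit installation. The labels are mine, and no expected output is shown because the result depends on whether the installed version still has the parser problem described above.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("C1P->[Zr+3]<-1")

# The same three-membered ring, written with the ring closure in two places.
queries = {
    "ring closure written on the dative bond": "C1P->[Zr+3]<-1",
    "ring closure written on carbon": "P1->[Zr+3]<-C1",
}

match_counts = {}
for label, smarts in queries.items():
    query = Chem.MolFromSmarts(smarts)
    # None marks a SMARTS string that failed to parse in this RDKit build.
    match_counts[label] = None if query is None else len(mol.GetSubstructMatches(query))
    print(f"{label}: {match_counts[label]}")
```

If the two counts differ, the installed version is affected and the second spelling can serve as a workaround.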
From: Jan H. J. <jhj...@ch...> - 2024-03-20 14:28:16
|
The following finds no matches:

    from rdkit import Chem

    m = Chem.MolFromSmiles('C1P->[Zr+3]<-1')
    m.GetSubstructMatches(Chem.MolFromSmarts('C1P->[Zr+3]<-1'))

But all these work:

    m.GetSubstructMatches(Chem.MolFromSmiles('C1P->[Zr+3]<-1'))

    m.GetSubstructMatches(Chem.MolFromSmarts('[*]->[Zr+3]'))

    m = Chem.MolFromSmiles('C1P-[Zr+3]-1')
    m.GetSubstructMatches(Chem.MolFromSmarts('C1P-[Zr+3]-1'))

Is this a bug, or is there something I'm missing with regard to the first case?

Best regards, Jan
|
From: Paolo T. <pao...@gm...> - 2024-03-19 11:16:35
|
Dear Jan, Definitely it is a bug. I’ll try and fix it for the next release which is due in ~2 weeks. Thanks for reporting, cheers Paolo > On 19 Mar 2024, at 11:20, Jan Halborg Jensen <jhj...@ch...> wrote: > > Why does ResonanceMolSupplier only give me one resonance structure for O[NH+]=[C-]NC when O[NH+]=[CH]NC gives me two structures? Is that a bug? > > Best regards, Jan > > > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss |
From: Jan H. J. <jhj...@ch...> - 2024-03-19 10:18:30
|
Why does ResonanceMolSupplier only give me one resonance structure for O[NH+]=[C-]NC when O[NH+]=[CH]NC gives me two structures? Is that a bug? Best regards, Jan |
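The comparison in this question can be reproduced with a few lines. This is a sketch with a helper name of my own; the counts are deliberately not shown, since the thread indicates they depend on a bug that may be fixed in later releases.

```python
from rdkit import Chem


def count_resonance_structures(smiles):
    """Number of resonance structures RDKit enumerates for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return len(Chem.ResonanceMolSupplier(mol))


# The two molecules from the question.
counts = {
    smi: count_resonance_structures(smi)
    for smi in ("O[NH+]=[C-]NC", "O[NH+]=[CH]NC")
}
for smi, n in counts.items():
    print(smi, n)
```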
From: 王昊 <hwa...@16...> - 2024-03-13 13:25:15
|
Hi,

I have two molecules, shown below. It seems to me that they should either have no common substructure or only a small one, but the MCS search returns the result below. Adding parameters has had no effect. How can I solve this problem?

code:

    from rdkit import Chem
    from rdkit.Chem import rdFMCS

    smi1 = 'CC(=O)OCCc1ccccc1'
    smi2 = 'CCCCCC'
    mol1 = Chem.MolFromSmiles(smi1)
    mol2 = Chem.MolFromSmiles(smi2)
    params = rdFMCS.MCSParameters()
    params.BondCompare = rdFMCS.BondCompare.CompareOrderExact
    params.AtomCompare = rdFMCS.AtomCompare.CompareAny
    params.MatchValences = True
    params.MatchChiralTag = True
    mcs = rdFMCS.FindMCS([mol1, mol2], params)
    mcs_smarts = mcs.smartsString
    mcs_smiles = Chem.MolToSmiles(Chem.MolFromSmarts(mcs_smarts))
    print(mcs_smarts)
    print(mcs_smiles)

result:

    [#6]-[#6]-[#6]-,:[#6]-,:[#6]-,:[#6]
    CCCCCC
|
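The thread has no reply to this question, so the following is only a possible direction, not a confirmed diagnosis. As far as I know, `MCSParameters` exposes its comparison settings through `AtomTyper`, `BondTyper`, `AtomCompareParameters` and `BondCompareParameters`, so assignments such as `params.BondCompare = ...` may be silently ignored, leaving the default bond comparison, under which single and aromatic bonds can match each other. The sketch below instead passes the options as keyword arguments to `FindMCS`; the choice of `ringMatchesRingOnly=True` is my assumption about what the poster wants.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

mol1 = Chem.MolFromSmiles("CC(=O)OCCc1ccccc1")
mol2 = Chem.MolFromSmiles("CCCCCC")

# Default settings: the hexane chain may be laid over the aromatic ring.
loose = rdFMCS.FindMCS([mol1, mol2])

# Stricter settings: aromatic bonds only match aromatic bonds, and
# ring bonds only match ring bonds.
strict = rdFMCS.FindMCS(
    [mol1, mol2],
    bondCompare=rdFMCS.BondCompare.CompareOrderExact,
    ringMatchesRingOnly=True,
)

print("default:", loose.numAtoms, loose.smartsString)
print("strict: ", strict.numAtoms, strict.smartsString)
```

With the stricter settings the common substructure should shrink to a short aliphatic fragment, which is closer to what the question expects.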
From: Greg L. <gre...@gm...> - 2024-03-13 05:23:20
|
Dear all, The (free) registration for the 2024 RDKit UGM, being held from 11-13 September at the ETH in Zurich, Switzerland, is now open: https://www.eventbrite.com/e/860637719587 You can submit proposals to do talks, tutorials, lightning talks, and posters here: https://forms.gle/5GK5ej7hCdPguwKz8 As in the past couple of years, we will stream the talks for people who cannot attend in person. Best regards, -greg |
From: Ádám B. <bar...@gm...> - 2024-02-23 09:50:33
|
Hello all, Is it possible to add a legend to individual reaction components if I use DrawReaction, like with MolsToGridImage or DrawMolecule? I'm trying to display a reaction that has identifiers (ex: A, B, P1) below each component. I'm currently drawing the reaction with DrawReaction. The reaction is generated from an RxnBlock. Thank you, -- ~Baróthi Ádám |