From: <raj...@us...> - 2008-10-17 20:17:34
Revision: 12728
          http://cdk.svn.sourceforge.net/cdk/?rev=12728&view=rev
Author:   rajarshi
Date:     2008-10-17 20:17:24 +0000 (Fri, 17 Oct 2008)

Log Message:
-----------
Various edits and bib entries

Modified Paths:
--------------
    cdk-fingerprint-paper/trunk/paper/bmc_article.bib
    cdk-fingerprint-paper/trunk/paper/bmc_article.tex

Modified: cdk-fingerprint-paper/trunk/paper/bmc_article.bib
===================================================================
--- cdk-fingerprint-paper/trunk/paper/bmc_article.bib	2008-10-17 16:14:59 UTC (rev 12727)
+++ cdk-fingerprint-paper/trunk/paper/bmc_article.bib	2008-10-17 20:17:24 UTC (rev 12728)
@@ -1,11 +1,310 @@
-%% Created for Rajarshi Guha at 2008-10-17 10:30:38 -0400
+%% Created for Rajarshi Guha at 2008-10-17 16:04:40 -0400
 %% Saved with string encoding Unicode (UTF-8)
+@article{Maggiora:2006aa,
+  Au = {Maggiora, GM},
+  Author = {Maggiora, Gerald M},
+  Da = {20060724},
+  Date-Added = {2008-10-17 16:04:37 -0400},
+  Date-Modified = {2008-10-17 16:04:37 -0400},
+  Dcom = {20060920},
+  Doi = {10.1021/ci060117s},
+  Edat = {2006/07/25 09:00},
+  Issn = {1549-9596 (Print)},
+  Jid = {101230060},
+  Journal = {J.~Chem.~Inf.~Model.},
+  Jt = {Journal of chemical information and modeling},
+  Keywords = {activity cliff; QSAR; model; landscape; sali},
+  Language = {eng},
+  Local-Url = {file://localhost/Users/rguha/Documents/articles/ci060117s.pdf},
+  Mhda = {2006/07/25 09:01},
+  Number = {4},
+  Own = {NLM},
+  Pages = {1535--1535},
+  Pl = {United States},
+  Pmid = {16859285},
+  Pst = {ppublish},
+  Pt = {Editorial},
+  Pubm = {Print},
+  Rating = {5},
+  Read = {Yes},
+  So = {J.~Chem.~Inf.~Model..
2006 Jul-Aug;46(4):1535.}, + Stat = {PubMed-not-MEDLINE}, + Title = {On Outliers and Activity Cliffs--Why {QSAR} Often Disappoints.}, + Volume = {46}, + Year = {2006}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9jaTA2MDExN3MucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1jaTA2MDE +xN3MucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AANgOfwxk5ZlBERiBwcnZ3AAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADDGXGmAAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmNpMDYwMTE3cy5wZGYAAA4AHAANAGMAaQAwADY +AMAAxADEANwBzAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2NpMDYwMTE3cy5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://dx.doi.org/10.1021/ci060117s}} + +@article{Martin:2002ab, + Author = {Martin, Yvonne C and Kofron, James L and Traphagen, Linda M}, + Date-Added = {2008-10-17 16:03:09 -0400}, + Date-Modified = {2008-10-17 16:03:09 -0400}, + Doi = {10.1021/jm020155c}, + Journal = {J.~Med.~Chem.}, + Keywords = {fingerprint; similarity; Combinatorial Chemistry Techniques; Data Interpretation, Statistical; Databases, Factual; Enzyme Inhibitors/chemistry; *Molecular Structure; Monoamine Oxidase Inhibitors/chemistry; qsar; mao;}, + Local-Url = {file://localhost/Users/rguha/Documents/articles/jm020155c.pdf}, + Number = {19}, + Pages = {4350--4358}, + Title = {Do Structurally Similar 
Molecules Have Similar Biological Activity?}, + Volume = {45}, + Year = {2002}, + Abstract = {To design diverse combinatorial libraries or to select diverse compounds to augment a screening collection, computational chemists frequently reject compounds that are > or =0.85 similar to one already chosen for the combinatorial library or in the screening set. Using Daylight fingerprints, this report shows that for IC(50) values determined as a follow-up to 115 high-throughput screening assays, there is only a 30% chance that a compound that is > or = 0.85 (Tanimoto) similar to an active is itself active. Although this enrichment is greater than that found with random screening and docking to three-dimensional structures, this low fraction of actives within similar compounds occurs not only because of deficiencies in the Daylight fingerprints and Tanimoto similarity calculations but also because similar compounds do not necessarily interact with the target macromolecule in similar ways. The current study emphasizes the statistical or probabilistic nature of library design and that perfect results cannot be expected.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9qbTAyMDE1NWMucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1qbTAyMDE +1NWMucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAYO0sxCWCeVBERiBwcnZ3AAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADEJbq5AAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmptMDIwMTU1Yy5wZGYAAA4AHAANAGoAbQAwADI +AMAAxADUANQBjAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2ptMDIwMTU1Yy5wZGYAEwABLwAAFQACAAz//wAAg 
+AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://dx.doi.org/10.1021/jm020155c}} + +@article{Durant:2002aa, + Author = {Durant, J.L. and Leland, B.A. and Henry, D.R. and Nourse, J.G.}, + Date-Added = {2008-10-17 15:47:06 -0400}, + Date-Modified = {2008-10-17 15:47:06 -0400}, + Doi = {DOI 10.1021/ci010132r}, + Journal = {J.~Chem.~Inf.~Comput.~Sci.}, + Keywords = {maccs; fingerprint}, + Number = {6}, + Pages = {1273--1280}, + Title = {Reoptimization of {MDL} Keys for Use in Drug Discovery}, + Volume = {42}, + Year = {2002}, + Abstract = {For a number of years MDL products have exposed both 166 bit and 960 bit keysets based on 2D descriptors. These keysets were originally constructed and optimized for substructure searching. We report on improvements in the performance of MDL keysets which are reoptimized for use in molecular similarity. Classification performance for a test data set of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208 bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits. We present an overview of the underlying technology supporting the definition of descriptors and the encoding of these descriptors into keysets. This technology allows definition of descriptors as combinations of atom properties, bond properties, and atomic neighborhoods at various topological separations as well as supporting a number of custom descriptors. These descriptors can then be used to set one or more bits in a keyset. We constructed various keysets and optimized their performance in clustering bioactive substances. Performance was measured using methodology developed by Briem and Lessel. 
"Directed pruning" was carried out by eliminating bits from the keysets on the basis of random selection, values of the surprisal of the bit, or values of the surprisal S/N ratio of the bit. The random pruning experiment highlighted the insensitivity of keyset performance for keyset lengths of more than 1000 bits. Contrary to initial expectations, pruning on the basis of the surprisal values of the various bits resulted in keysets which underperformed those resulting from random pruning. In contrast, pruning on the basis of the surprisal SIN ratio was found to yield keysets which performed better than those resulting from random pruning. We also explored the use of genetic algorithms in the selection of optimal keysets. Once more the performance was only a weak function of keyset size, and the optimizations failed to identify a single globally optimal keyset. Instead multiple, equally optimal keysets could be produced which had relatively low overlap of the descriptors they encoded.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9jaTAxMDEzMnIucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1jaTAxMDE +zMnIucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAflYZxMMeWgAAAAAAAAAAAAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADEw1aaAAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmNpMDEwMTMyci5wZGYAAA4AHAANAGMAaQAwADE +AMAAxADMAMgByAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2NpMDEwMTMyci5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i 
+amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=Alerting&SrcApp=Alerting&DestApp=WOS&DestLinkType=FullRecord;KeyUT=000179527600001}} + +@article{Barnard:1997aa, + Author = {Barnard, J.M. and Downs, G.M.}, + Date-Added = {2008-10-17 15:46:25 -0400}, + Date-Modified = {2008-10-17 15:46:25 -0400}, + Journal = {J.~Chem.~Inf.~Comput.~Sci.}, + Keywords = {bci; digital chemistry; fingerprint}, + Pages = {141--142}, + Title = {Chemical Fragment Generaion and Clustering Software}, + Volume = {37}, + Year = {1997}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9jaTk2MDA5MGsucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1jaTk2MDA +5MGsucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAi/jQxQZylQAAAAAAAAAAAAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADFBqrVAAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmNpOTYwMDkway5wZGYAAA4AHAANAGMAaQA5ADY +AMAAwADkAMABrAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2NpOTYwMDkway5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}} + +@article{Godden:2005aa, + Author = {Godden, J.W. and Stahura, F.L. 
and Bajorath, J.},
+  Date-Added = {2008-10-17 13:49:10 -0400},
+  Date-Modified = {2008-10-17 13:49:10 -0400},
+  Journal = {J.~Chem.~Inf.~Model.},
+  Keywords = {virtual screening; similarity; fingerprint; 2D; benchmark; dataset; metric},
+  Number = {6},
+  Pages = {1812--1819},
+  Title = {Anatomy of Fingerprint Search Calculations on Structurally Diverse Sets of Active Compounds},
+  Volume = {45},
+  Year = {2005},
+  Abstract = {Similarity searching using molecular fingerprints is a widely used approach for the identification of novel hits. A fingerprint search involves many pairwise comparisons of bit string representations of known active molecules with those precomputed for database compounds. Bit string overlap, as evaluated by various similarity metrics, is used as a measure of molecular similarity. Results of a number of studies focusing on fingerprints suggest that it is difficult, if not impossible, to develop generally applicable search parameters and strategies, irrespective of the compound classes under investigation. Rather, more or less, each individual search problem requires an adjustment of calculation conditions. Thus, there is a need for diagnostic tools to analyze fingerprint-based similarity searching. We report an analysis of fingerprint search calculations on different sets of structurally diverse active compounds. Calculations on five biological activity classes were carried out with two fingerprints in two compound source databases, and the results were analyzed in histograms. Tanimoto coefficient (Tc) value ranges where active compounds were detected were compared to the distribution of Tc values in the database. The analysis revealed that compound class-specific effects strongly influenced the outcome of these fingerprint calculations. Among the five diverse compound sets studied, very different search results were obtained. The analysis described here can be applied to determine Tc intervals where scaffold hopping occurs.
It can also be used to benchmark fingerprint calculations or estimate their probability of success.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9jaTA1MDI3NncucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1jaTA1MDI +3NncucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAkHOTxR5INAAAAAAAAAAAAAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADFHoB0AAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmNpMDUwMjc2dy5wZGYAAA4AHAANAGMAaQAwADU +AMAAyADcANgB3AC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2NpMDUwMjc2dy5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://dx.doi.org/10.1021/ci050276w}} + +@article{Jain:2008aa, + Author = {Jain, Ajay N. and Nicholls, Anthony}, + Date-Added = {2008-10-17 13:46:54 -0400}, + Date-Modified = {2008-10-17 13:49:25 -0400}, + Journal = {J.~Comp.~Aided Mol.~Des.}, + Keywords = {benchmark; docking; molecular similarity; benchmarking; statistical evaluation; virtual screening}, + Month = {MAR}, + Number = {3--4}, + Pages = {133--139}, + Title = {Recommendations for Evaluation of Computational Methods}, + Volume = {22}, + Year = {2008}, + Abstract = {The field of computational chemistry, particularly as applied to drug design, has become increasingly important in terms of the practical application of predictive modeling to pharmaceutical research and development. 
Tools for exploiting protein structures or sets of ligands known to bind particular targets can be used for binding-mode prediction, virtual screening, and prediction of activity. A serious weakness within the field is a lack of standards with respect to quantitative evaluation of methods, data set preparation, and data set sharing. Our goal should be to report new methods or comparative evaluations of methods in a manner that supports decision making for practical applications. Here we propose a modest beginning, with recommendations for requirements on statistical reporting, requirements for data sharing, and best practices for benchmark preparation and usage.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfEC8uLi8uLi8uLi8uLi9hcnRpY2xlcy9qYWluLWpjYW1kLTIwMDgtMjItMTMzLnBkZtIbDxwd +V05TLmRhdGFPEQG+AAAAAAG+AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAADB5qpCSCs +AAAAWwI8aamFpbi1qY2FtZC0yMDA4LTIyLTEzMy5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAJBzdsUeR60AAAAAAAAAAAAEAAIAAAkgAAAAAAAAAAAAAAAAAAAACGFyd +GljbGVzABAACAAAwebwkgAAABEACAAAxR5/7QAAAAEAEAAWwI8AB/BnAAfwWgAZYTYAAgBGTWFj +aW50b3NoIEhEOlVzZXJzOnJndWhhOkRvY3VtZW50czphcnRpY2xlczpqYWluLWpjYW1kLTIwMDg +tMjItMTMzLnBkZgAOADYAGgBqAGEAaQBuAC0AagBjAGEAbQBkAC0AMgAwADAAOAAtADIAMgAtAD +EAMwAzAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgA5VXNlcnMvcmd1a +GEvRG9jdW1lbnRzL2FydGljbGVzL2phaW4tamNhbWQtMjAwOC0yMi0xMzMucGRmAAATAAEvAAAV +AAIADP//AACABtIfICEiWCRjbGFzc2VzWiRjbGFzc25hbWWjIiMkXU5TTXV0YWJsZURhdGFWTlN +EYXRhWE5TT2JqZWN00h8gJieiJyRcTlNEaWN0aW9uYXJ5AAgAEQAbACQAKQAyAEQASQBMAFEAUw +BcAGIAaQB0AHwAgwCGAIgAigCNAI8AkQCTAKAAqgDcAOEA6QKrAq0CsgK7AsYCygLYAt8C6ALtA +vAAAAAAAAACAQAAAAAAAAAoAAAAAAAAAAAAAAAAAAAC/Q==}, + Bdsk-Url-1 = {http://dx.doi.org/10.1007/s10822-008-9196-5}} + +@article{Cherkasov:2008aa, + Author = 
{Cherkasov, Artem and Ban, Fuqiang and Santos-Filho, Osvaldo and Thorsteinson, Nels and Fallahi, Magid and Hammond, Geoffrey L.},
+  Date-Added = {2008-10-17 13:44:54 -0400},
+  Date-Modified = {2008-10-17 13:44:54 -0400},
+  Journal = {J.~Med.~Chem.},
+  Keywords = {virtual screening; docking; comfa; steroid; shbg; benchmark; metrics; comsia},
+  Number = {7},
+  Pages = {2047--2056},
+  Title = {An Updated Steroid Benchmark Set and Its Application in the Discovery of Novel Nanomolar Ligands of Sex Hormone-Binding Globulin},
+  Volume = {51},
+  Year = {2008},
+  Abstract = {A benchmark data set of steroids with known affinity for sex hormone-binding globulin (SHBG) has been widely used to validate popular molecular field-based QSAR techniques. We have expanded the data set by adding a number of nonsteroidal SHBG ligands identified both from the literature and in our previous experimental studies. This updated molecular set has been used herein to develop 4D QSAR models based on "inductive" descriptors and to gain insight into the molecular basis of protein-ligand interactions. Molecular alignment was generated by means of docking active compounds into the active site of the SHBG. Surprisingly, the alignment of the benchmark steroids contradicted the classical ligand-based alignment utilized in previous CoMFA and CoMSIA models yet afforded models with higher statistical significance and predictive power.
The resulting QSAR models combined with CoMFA and CoMSiA models as well as structure-based virtual screening allowed discovering several low-micromolar to nanomolar nonsteroidal inhibitors for human SHBG.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9qbTcwMTE0ODUucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1qbTcwMTE +0ODUucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAkHNKxR5HJQAAAAAAAAAAAAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADFHn9lAAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmptNzAxMTQ4NS5wZGYAAA4AHAANAGoAbQA3ADA +AMQAxADQAOAA1AC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2ptNzAxMTQ4NS5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://dx.doi.org/10.1021/jm7011485}} + +@article{Rohrer:2008ab, + Author = {Rohrer, S.G. and Baumann, K.}, + Date-Added = {2008-10-17 13:30:25 -0400}, + Date-Modified = {2008-10-17 13:30:25 -0400}, + Journal = {J.~Chem.~Inf.~Model.}, + Keywords = {virtual screening; benchmark; spatial; pubchem}, + Title = {Maximum Unbiased Validation ({MUV}) Datasets for Virtual Screening Based on PubChem Bioactivity Data}, + Volume = {in press}, + Year = {2008}} + +@article{Verdonk:2004aa, + Address = {Astex Technology Ltd., 436 Cambridge Science Park, Milton Road, CB4 0QA Cambridge, United Kingdom}, + Author = {Verdonk, M. L. and Berdini, V. 
and Hartshorn, M. J. and Mooij, W. T. M. and Murray, C. W. and Taylor, R. D. and Watson, P.}, + Date-Added = {2008-10-17 13:28:57 -0400}, + Date-Modified = {2008-10-17 13:28:57 -0400}, + Isbn = {1549-9596}, + Journal = {J.~Chem.~Inf.~Comput.~Sci.}, + Journal1 = {Journal of Chemical Information and Computer Sciences}, + Journal2 = {J. Chem. Inf. Comput. Sci.}, + Keywords = {virtual screening; benchmark; enrichment;}, + Number = {3}, + Pages = {793--806}, + Title = {Virtual Screening Using Protein-Ligand Docking: Avoiding Artificial Enrichment}, + Ty = {JOUR}, + Volume = {44}, + Year = {2004}, + Abstract = {Abstract: This study addresses a number of topical issues around the use of protein-ligand docking in virtual screening. We show that, for the validation of such methods, it is key to use focused libraries (containing compounds with one-dimensional properties, similar to the actives), rather than "random" or "drug-like" libraries to test the actives against. We also show that, to obtain good enrichments, the docking program needs to produce reliable binding modes. We demonstrate how pharmacophores can be used to guide the dockings and improve enrichments, and we compare the performance of three consensus-ranking protocols against ranking based on individual scoring functions. Finally, we show that protein-ligand docking can be an effective aid in the screening for weak, fragment-like binders, which has rapidly become a popular strategy for hit identification. 
All results presented are based on carefully constructed virtual screening experiments against four targets, using the protein-ligand docking program GOLD.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfECIuLi8uLi8uLi8uLi9hcnRpY2xlcy9jaTAzNDI4OXEucGRm0hsPHB1XTlMuZGF0YU8RAYoA +AAAAAYoAAgAADE1hY2ludG9zaCBIRAAAAAAAAAAAAAAAAAAAAMHmqkJIKwAAABbAjw1jaTAzNDI +4OXEucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAkHKtxR5DhgAAAAAAAAAAAAQAAgAACSAAAAAAAAAAAAAAAAAAAAAIYXJ0aWNsZXMAEAAIAADB5 +vCSAAAAEQAIAADFHnvGAAAAAQAQABbAjwAH8GcAB/BaABlhNgACADlNYWNpbnRvc2ggSEQ6VXNl +cnM6cmd1aGE6RG9jdW1lbnRzOmFydGljbGVzOmNpMDM0Mjg5cS5wZGYAAA4AHAANAGMAaQAwADM +ANAAyADgAOQBxAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgAsVXNlcn +Mvcmd1aGEvRG9jdW1lbnRzL2FydGljbGVzL2NpMDM0Mjg5cS5wZGYAEwABLwAAFQACAAz//wAAg +AbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09i +amVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB +8AIMAhgCIAIoAjQCPAJEAkwCgAKoAzwDUANwCagJsAnECegKFAokClwKeAqcCrAKvAAAAAAAAAg +EAAAAAAAAAKAAAAAAAAAAAAAAAAAAAArw=}, + Bdsk-Url-1 = {http://pubs3.acs.org/acs/journals/doilookup?in_doi=10.1021/ci034289q}} + +@article{Good:2008aa, + Author = {Good, A.C. and Oprea, T.I.}, + Date-Added = {2008-10-17 13:25:07 -0400}, + Date-Modified = {2008-10-17 13:25:07 -0400}, + Journal = {J.~Comp.~Aided Mol.~Des.}, + Keywords = {virtual screening; benchmark; metric; analog bias}, + Number = {3--4}, + Pages = {169--178}, + Title = {Optimization of {CAMD} Techniques 3. Virtual Screening Enrichment Studies: A Help or Hindrance in Tool Selection? }, + Volume = {22}, + Year = {2008}, + Abstract = {Over the last few years many articles have been published in an attempt to provide performance benchmarks for virtual screening tools. 
While this research has imparted useful insights, the myriad variables controlling said studies place significant limits on results interpretability. Here we investigate the effects of these variables, including analysis of calculation setup variation, the effect of target choice, active/decoy set selection (with particular emphasis on the effect of analogue bias) and enrichment data interpretation. In addition the optimization of the publicly available DUD benchmark sets through analogue bias removal is discussed, as is their augmentation through the addition of large diverse data sets collated using WOMBAT.}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfEC8uLi8uLi8uLi8uLi9hcnRpY2xlcy9nb29kLWpjYW1kLTIwMDgtMjItMTY5LnBkZtIbDxwd +V05TLmRhdGFPEQG+AAAAAAG+AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAADB5qpCSCs +AAAAWwI8aZ29vZC1qY2FtZC0yMDA4LTIyLTE2OS5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAAJByZsUeQnsAAAAAAAAAAAAEAAIAAAkgAAAAAAAAAAAAAAAAAAAACGFyd +GljbGVzABAACAAAwebwkgAAABEACAAAxR56uwAAAAEAEAAWwI8AB/BnAAfwWgAZYTYAAgBGTWFj +aW50b3NoIEhEOlVzZXJzOnJndWhhOkRvY3VtZW50czphcnRpY2xlczpnb29kLWpjYW1kLTIwMDg +tMjItMTY5LnBkZgAOADYAGgBnAG8AbwBkAC0AagBjAGEAbQBkAC0AMgAwADAAOAAtADIAMgAtAD +EANgA5AC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgA5VXNlcnMvcmd1a +GEvRG9jdW1lbnRzL2FydGljbGVzL2dvb2QtamNhbWQtMjAwOC0yMi0xNjkucGRmAAATAAEvAAAV +AAIADP//AACABtIfICEiWCRjbGFzc2VzWiRjbGFzc25hbWWjIiMkXU5TTXV0YWJsZURhdGFWTlN +EYXRhWE5TT2JqZWN00h8gJieiJyRcTlNEaWN0aW9uYXJ5AAgAEQAbACQAKQAyAEQASQBMAFEAUw +BcAGIAaQB0AHwAgwCGAIgAigCNAI8AkQCTAKAAqgDcAOEA6QKrAq0CsgK7AsYCygLYAt8C6ALtA +vAAAAAAAAACAQAAAAAAAAAoAAAAAAAAAAAAAAAAAAAC/Q==}, + Bdsk-Url-1 = {http://dx.doi.org/10.1007/s10822-007-9167-2}} + @article{Nicholls:2008aa, Author = {Nicholls, A.}, Date-Added = {2008-10-17 10:28:29 -0400}, 
@@ -16,8 +315,23 @@ Pages = {239--255}, Title = {What Do We Know and When Do We Know It?}, Volume = {22}, - Year = {2008}, -} + Year = {2008}, + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfEC0uLi8uLi8uLi8uLi9hcnRpY2xlcy9uaWNob2xscy1wZXJmLW1ldHJpYy5wZGbSGw8cHVdO +Uy5kYXRhTxEBtgAAAAABtgACAAAMTWFjaW50b3NoIEhEAAAAAAAAAAAAAAAAAAAAweaqQkgrAAA +AFsCPGG5pY2hvbGxzLXBlcmYtbWV0cmljLnBkZgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAABV0fDD4NFgAAAAAAAAAAAABAACAAAJIAAAAAAAAAAAAAAAAAAAAAhhcnRpY +2xlcwAQAAgAAMHm8JIAAAARAAgAAMPhF7AAAAABABAAFsCPAAfwZwAH8FoAGWE2AAIARE1hY2lu +dG9zaCBIRDpVc2VyczpyZ3VoYTpEb2N1bWVudHM6YXJ0aWNsZXM6bmljaG9sbHMtcGVyZi1tZXR +yaWMucGRmAA4AMgAYAG4AaQBjAGgAbwBsAGwAcwAtAHAAZQByAGYALQBtAGUAdAByAGkAYwAuAH +AAZABmAA8AGgAMAE0AYQBjAGkAbgB0AG8AcwBoACAASABEABIAN1VzZXJzL3JndWhhL0RvY3VtZ +W50cy9hcnRpY2xlcy9uaWNob2xscy1wZXJmLW1ldHJpYy5wZGYAABMAAS8AABUAAgAM//8AAIAG +0h8gISJYJGNsYXNzZXNaJGNsYXNzbmFtZaMiIyRdTlNNdXRhYmxlRGF0YVZOU0RhdGFYTlNPYmp +lY3TSHyAmJ6InJFxOU0RpY3Rpb25hcnkACAARABsAJAApADIARABJAEwAUQBTAFwAYgBpAHQAfA +CDAIYAiACKAI0AjwCRAJMAoACqANoA3wDnAqECowKoArECvALAAs4C1QLeAuMC5gAAAAAAAAIBA +AAAAAAAACgAAAAAAAAAAAAAAAAAAALz}} @article{Irwin:2005aa, Address = {Department of Pharmaceutical Chemistry, University of California San Francisco, Genentech Hall, 600 16th Street, San Francisco, California 94143, USA.}, @@ -54,7 +368,7 @@ Volume = {45}, Year = {2005}, Abstract = {A critical barrier to entry into structure-based virtual screening is the lack of a suitable, easy to access database of purchasable compounds. We have therefore prepared a library of 727,842 molecules, each with 3D structure, using catalogs of compounds from vendors (the size of this library continues to grow). 
The molecules have been assigned biologically relevant protonation states and are annotated with properties such as molecular weight, calculated LogP, and number of rotatable bonds. Each molecule in the library contains vendor and purchasing information and is ready for docking using a number of popular docking programs. Within certain limits, the molecules are prepared in multiple protonation states and multiple tautomeric forms. In one format, multiple conformations are available for the molecules. This database is available for free download (http://zinc.docking.org) in several common file formats including SMILES, mol2, 3D SDF, and DOCK flexibase format. A Web-based query tool incorporating a molecular drawing interface enables the database to be searched and browsed and subsets to be created. Users can process their own molecules by uploading them to a server. Our hope is that this database will bring virtual screening libraries to a wide community of structural biologists and medicinal chemists.}, -} + Bdsk-Url-1 = {http://dx.doi.org/10.1021/ci049714}} @article{Kontoyianni:2005aa, Author = {Kontoyianni, M. and Sokol, G.S. 
and McClellan, L.M.}, @@ -69,7 +383,22 @@ Title = {Evaluation of Library Ranking Efficiency in Virtual Screening}, Volume = {26}, Year = {2005}, -} + Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0t +leWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iam +VjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0Rhd +GFfEC8uLi8uLi8uLi8uLi9hcnRpY2xlcy9rb250b3lpYW5uaS1jb21wYXJpc29uLnBkZtIbDxwd +V05TLmRhdGFPEQG+AAAAAAG+AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAADB5qpCSCs +AAAAWwI8aa29udG95aWFubmktY29tcGFyaXNvbi5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAAAAAAAAADZMi8MZ5ZsAAAAAAAAAAAAEAAIAAAkgAAAAAAAAAAAAAAAAAAAACGFyd +GljbGVzABAACAAAwebwkgAAABEACAAAwxod2wAAAAEAEAAWwI8AB/BnAAfwWgAZYTYAAgBGTWFj +aW50b3NoIEhEOlVzZXJzOnJndWhhOkRvY3VtZW50czphcnRpY2xlczprb250b3lpYW5uaS1jb21 +wYXJpc29uLnBkZgAOADYAGgBrAG8AbgB0AG8AeQBpAGEAbgBuAGkALQBjAG8AbQBwAGEAcgBpAH +MAbwBuAC4AcABkAGYADwAaAAwATQBhAGMAaQBuAHQAbwBzAGgAIABIAEQAEgA5VXNlcnMvcmd1a +GEvRG9jdW1lbnRzL2FydGljbGVzL2tvbnRveWlhbm5pLWNvbXBhcmlzb24ucGRmAAATAAEvAAAV +AAIADP//AACABtIfICEiWCRjbGFzc2VzWiRjbGFzc25hbWWjIiMkXU5TTXV0YWJsZURhdGFWTlN +EYXRhWE5TT2JqZWN00h8gJieiJyRcTlNEaWN0aW9uYXJ5AAgAEQAbACQAKQAyAEQASQBMAFEAUw +BcAGIAaQB0AHwAgwCGAIgAigCNAI8AkQCTAKAAqgDcAOEA6QKrAq0CsgK7AsYCygLYAt8C6ALtA +vAAAAAAAAACAQAAAAAAAAAoAAAAAAAAAAAAAAAAAAAC/Q==}} @article{Bender:2005aa, Author = {Bender, A. and Glen, R.C.}, @@ -111,8 +440,7 @@ not being able to capture structural properties relevant to binding. This fact can partly be explained by highly nonlinear structure-activity relationships, which represent a severe limitation of the "similar - property principle" in the context of bioactivity.}, -} + property principle" in the context of bioactivity.}} @article{Bender:2004ab, Author = {Bender, A. and Glen, R.C.}, @@ -159,7 +487,7 @@ defines and limits its use. 
The key issues of solvation effects, heterogeneity of binding sites and the fundamental problem of the form of similarity measure to use are addressed.}, -} + Bdsk-Url-1 = {http://dx.doi.org/10.1039/B409813G}} @article{Stahura:2004aa, Author = {Stahura, F.L. and Bajorath,J.}, @@ -184,8 +512,7 @@ Title = {Managing Bias in {ROC} Curves}, Volume = {22}, Year = {2008}, - Abstract = {Two modifications to the standard use of receiver operating characteristic (ROC) curves for evaluating virtual screening methods are proposed. The first is to replace the linear plots usually used with semi-logarithmic ones (pROC plots), including when doing ``area under the curve'' (AUC) calculations. Doing so is a simple way to bias the statistic to favor identification of ``hits'' early in the recovery curve rather than late. A second suggested modification entails weighting each active based on the size of the lead series to which it belongs. Two weighting schemes are described: arithmetic, in which the weight for each active is inversely proportional to the size of the cluster from which it comes; and harmonic, in which weights are inversely proportional to the rank of each active within its class. Either scheme is able to distinguish biased from unbiased screening statistics, but the harmonically weighted AUC in particular emphasizes the ability to place representatives of each class of active early in the recovery curve.}, -} + Abstract = {Two modifications to the standard use of receiver operating characteristic (ROC) curves for evaluating virtual screening methods are proposed. The first is to replace the linear plots usually used with semi-logarithmic ones (pROC plots), including when doing ``area under the curve'' (AUC) calculations. Doing so is a simple way to bias the statistic to favor identification of ``hits'' early in the recovery curve rather than late. A second suggested modification entails weighting each active based on the size of the lead series to which it belongs. 
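Two weighting schemes are described: arithmetic, in which the weight for each active is inversely proportional to the size of the cluster from which it comes; and harmonic, in which weights are inversely proportional to the rank of each active within its class. Either scheme is able to distinguish biased from unbiased screening statistics, but the harmonically weighted AUC in particular emphasizes the ability to place representatives of each class of active early in the recovery curve.},
-}
+ Abstract = {Two modifications to the standard use of receiver operating characteristic (ROC) curves for evaluating virtual screening methods are proposed. The first is to replace the linear plots usually used with semi-logarithmic ones (pROC plots), including when doing ``area under the curve'' (AUC) calculations. Doing so is a simple way to bias the statistic to favor identification of ``hits'' early in the recovery curve rather than late. A second suggested modification entails weighting each active based on the size of the lead series to which it belongs. Two weighting schemes are described: arithmetic, in which the weight for each active is inversely proportional to the size of the cluster from which it comes; and harmonic, in which weights are inversely proportional to the rank of each active within its class. Either scheme is able to distinguish biased from unbiased screening statistics, but the harmonically weighted AUC in particular emphasizes the ability to place representatives of each class of active early in the recovery curve.}}
 @article{Truchon:2007aa,
 Author = {Truchon, J.-F. and Bayly, C. I.},
@@ -197,8 +524,7 @@
 Pages = {488--508},
 Title = {Evaluating Virtual Screening Methods: Good and Bad Metrics for the ``Early Recognition'' Problem },
 Volume = {47},
- Year = {2007},
-}
+ Year = {2007}}
 @article{Flower:1998aa,
 Author = {Flower, D.R.},
Modified: cdk-fingerprint-paper/trunk/paper/bmc_article.tex
===================================================================
--- cdk-fingerprint-paper/trunk/paper/bmc_article.tex 2008-10-17 16:14:59 UTC (rev 12727)
+++ cdk-fingerprint-paper/trunk/paper/bmc_article.tex 2008-10-17 20:17:24 UTC (rev 12728)
@@ -101,7 +101,7 @@
 \newenvironment{bmcformat}{\begin{raggedright}\baselineskip20pt\sloppy\setboolean{publ}{false}}{\end{raggedright}\baselineskip20pt\sloppy}
 % Publication style settings
-% \newenvironment{bmcformat}{\fussy\setboolean{publ}{true}}{\fussy}
+%\newenvironment{bmcformat}{\fussy\setboolean{publ}{true}}{\fussy}
@@ -132,8 +132,8 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- \author{Charles Darwin\correspondingauthor$^{1,2}$%
- \email{Charles A Darwin\correspondingauthor - ch...@lo...}%
+ \author{Rajarshi Guha\correspondingauthor$^{1}$%
+ \email{Rajarshi Guha\correspondingauthor - rg...@in...}%
 \and
 Jane E Doe\correspondingauthor$^2$%
 \email{Jane E Doe\correspondingauthor - jan...@ca...}
@@ -150,8 +150,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \address{%
- \iid(1)Life Sciences Department, Kings College London, Cornwall House,%
- Waterloo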
Road, London, UK\\
+ \iid(1)School of Informatics, Indiana University, Bloomington, IN 47408\\
 \iid(2)Department of Zoology, Cambridge, Waterloo Road, London, UK\\
 \iid(3)Marine Ecology Department, Institute of Marine Sciences Kiel, %
 D\"{u}sternbrooker Weg 20, 24105 Kiel, Germany
@@ -224,15 +223,45 @@
 %% Background %%
 %%
 \section*{Background}
+
+ Binary fingerprints are bit string representations of molecular
+ structures and come in a variety of types. In the most common
+ type, each bit of the fingerprint corresponds to a specific
+ substructural feature (say an aromatic ring or an aldehyde
+ group). Other forms of fingerprints include hashed fingerprints
+ and atom environment fingerprints. While these representations
+ were initially designed for similarity searching in databases,
+ they have been become an important component of virtual screening
+ pipelines. That this is possible, is due to the ``similarity
+ principle''\cite{Martin:2002ab}, underlying much of virtual
+ screening in drug discovery scenarios, which states that similar
+ molecules will have similar activities. While there have been many
+ counter-examples\cite{Maggiora:2006aa}, this approach has been
+ fruitful in a number of cases.
+ A number of fingerprint implementations are available from
+ commercial vendors and a few from academic groups. The Chemistry
+ Development Kit (CDK) is an Open Source Java
+ library\cite{Steinbeck:2003bh,Steinbeck:2006aa} for
+ cheminformatics and provides several fingerprint
+ implementations. More specifically, it provides two structural key
+ type fingerprints and two hashed fingerprints. While the library
+ has been used in a number of projects, there has been no formal
+ testing of how well the CDK fingerprints perform in a virtual
+ screening scenario. It should be noted that the two structural key
+ fingerprints are implementations of well studied schemes
+ (MACCS\cite{Durant:2002aa} and EState keys) and their performance
+ is well known.
However the two hashed fingerprints, while based on
+ the well known Daylight specification, have never been formally
+ benchmarked. The goal of this study is to compare the performance
+ of the CDK hashed fingerprints to other well known fingerprint
+ types.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %% Results and Discussion %%
 %%
 \section*{Results and Discussion}
- \subsection*{Rank correlations}
- \subsection*{Enrichment curves}
 \subsection*{Information content}
@@ -251,10 +280,26 @@
 %%%%%%%%%%%%%%%%%%
 \section*{Methods}
- \subsection*{Fingerprint design}
- Summary of how CDK FP's are calculated. Brief mention of MACCS,
- BCI and ECFP
+ \subsection*{Fingerprints}
+ Summary of how CDK FP's are calculated.
+ In addition to the path based fingerprints described above, we
+ also considered structural key type fingerprints. In these types,
+ each bit positions corresponds explicitly to a substructural
+ features. In this study we employed the MACCS 166 bit
+ keys\cite{Durant:2002aa} (implemented in the CDK) and the BCI 1052
+ bit keys\cite{Barnard:1997aa}.
+
+ Finally, we also considered atom environment fingerprints,
+ specifically the extended connectivity fingerprints (ECFP) as
+ implemented in Pipeline Pilot (Scitegic, Inc.). These types of
+ fingerprints characterize each atom in terms of the environment
+ around it, usually going up to 6 or 8 bonds from the atom in
+ question. The ECFP's characterize the atoms using features such as
+ hydrogen bonding donor capability, lipophilicity and so on. In
+ this study we considered the ECFP-6 type, which considers atoms up
+ to 6 bonds away from a central atom.
+ \subsection*{Measures of effectiveness}
 Use of enrichment curves, enrichment factors. Note that they are not the best of
@@ -266,8 +311,24 @@
 \subsection*{Benchmark Datasets}
- 17 PubChem derived datasets from MUV\cite{Rohrer:2008aa}, note that these are designed
- so that 2D fp's do not have an unfair advantage.
+ A number of datasets have been employed for benchmarking
+ fingerprint methods including ZINC\cite{Irwin:2005aa} and the MDL
+ Drug Discovery Report (MDDR). For the purposes of this study we
+ employed the 17 virtual screening benchmark datasets described by
+ Rohrer and Baumann\cite{Rohrer:2008ab}, collectively termed the
+ Maximum Unbiased Validation (MUV) datasets. These datasets are
+ derived from PubChem bioassays, each dataset corresponding to a
+ specific bioassay. Examples of the targets considered by these
+ datasets include FXIa inhibitors, FXIIa inhibitors, SF1 and HIV
+ RT-RNase inhibitors. More broadly, the datasets cover several
+ target classes including proteases, GPCR's, kinases and nuclear
+ receptors. These datasets were constructed to specifically avoid
+ the problem encountered with other datasets, namely, that many
+ datasets lend an unfair advantage for 2D methods over 3D
+ methods. More specifically, the actives in each of the datasets
+ exhibit a wide variety of scaffold classes, thus avoiding the
+ problems of analog bias\cite{Good:2008aa} and artificial
+ enrichment\cite{Verdonk:2004aa}.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%