From: Chris M. <c.m...@ga...> - 2011-05-28 13:49:08
On 27/05/2011 22:58, A. Heifets wrote:
> I'm trying to follow http://openbabel.org/wiki/Tutorial:Fingerprints
> and obabel -H fs on my data but I have some strange digressions from
> the tutorial.
>
> First, I'm not convinced that the index was built correctly (although
> there were no error messages to imply that it failed).

I suspect that the problem is the size of your datafile, which I guess is probably about 18GB. fs index files contain displacements into this file but they are only 32 bits, making the maximum file size 4GB. This means that the maximum number of molecules might be about 2 million for sdf files although much greater for SMILES files.

This limitation is not documented and needs to be. There also needs to be a warning when preparing the index when the datafile is found to be too large. And, of course, the deficiency needs to be eliminated by changing the structure of the index file, which will not happen in the next release, but maybe will later.

I don't understand your difficulty with the -s parameter being interpreted as SMILES instead of a file name. Possibly it could be because of a corrupted fs file. Can you try again with a nice small dataset?

Thanks for the detailed reporting.

Chris

> If you look at
> the first log below [1], you can see that OpenBabel found 8.2 million
> molecules and converted 7 thousand. Is there a way to tell why it
> didn't convert the rest. Invalid structures? Abort after the Expand
> Warning? I'm also unsure why OB reports taking 39 seconds when the
> date stamps report 20 minutes.
>
> I tested whether the entire database was converted by pulling the last
> molecule and querying the index for it. If the whole file was
> successfully converted, then the search would find it (or, at least,
> other molecules with Tanimoto coefficient = 1). So, I copy and pasted
> the last molecule into a file [2].
> My second problem (see log [3])
> was that OpenBabel interprets the '-s' parameter as a SMILES string,
> unlike the "obabel -H fs" help which says I can pass in a filename.
>
> Fortunately, I had a copy of the SMILES string, so I tried querying
> with that. As you can see in log [4], no Tanimoto coefficient 1
> molecules were pulled out, so I take that to confirm my initial
> suspicions that the index didn't get all of the molecules. This
> surprises me since, as you can see in log [2] below, the molecule
> seems fine to me; I'm not sure why it didn't get added to the index.
>
> My question: what am I doing wrong?
>
> Thanks!
>
> Cheers,
> Abe
>
>
> [1] The log of making the index. The SVN build is fresh:
> nohup bash -c "date&&
> /home/aheifets/opt/openbabel-svn/build/bin/obabel --errorlevel 5 -isdf
> DB.sdf -ofs -ODB_manual.fs&& date">index1.log 2>index1.err
> $ cat index1.*
> ==============================
> *** Open Babel Warning in Expand
> Alias CH3. was not chemically interpreted
>
> This will prepare an index of RXN_db.sdf and may take some time...
> It contains 8271991 molecules Estimated completion time 7e+02 minutes
>
> It took 39 seconds
> 7151 molecules converted
> Fri May 27 16:36:35 EDT 2011
> Fri May 27 16:57:04 EDT 2011
>
> [2] The contents of last_mol.sdf:
> $ cat last_mol.sdf
> 0028.mol#1
> OpenBabel05201115292D
>
> 21 21 0 0 0 0 0 0 0 0999 V2000
> -0.2826 -1.4437 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> 0.5424 -1.4437 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
> 0.5424 -0.6187 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
> -0.1721 -0.2062 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -0.1721 0.6188 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -0.8865 1.0313 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -0.8865 1.8562 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
> -1.6010 2.2687 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -2.3155 1.8562 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> 3.9796 -1.4437 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> 3.1546 -2.2687 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> 2.9149 -0.6187 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
> 1.3674 -1.4437 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
> 0.5424 -2.2687 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
> -0.6951 -2.1582 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -1.5201 -2.1582 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -1.9326 -1.4437 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -2.7576 -1.4437 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
> -3.5826 -1.4437 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -1.5201 -0.7293 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> -0.6951 -0.7293 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
> 1 21 2 0 0 0 0
> 1 2 1 0 0 0 0
> 2 14 2 0 0 0 0
> 2 13 2 0 0 0 0
> 2 3 1 0 0 0 0
> 3 4 1 0 0 0 0
> 4 5 1 0 0 0 0
> 5 6 1 0 0 0 0
> 6 7 1 0 0 0 0
> 7 8 1 0 0 0 0
> 8 9 1 0 0 0 0
> 10 9 2 0 0 0 0
> 11 9 1 0 0 0 0
> 12 8 2 0 0 0 0
> 15 1 1 0 0 0 0
> 16 15 2 0 0 0 0
> 17 16 1 0 0 0 0
> 17 18 1 0 0 0 0
> 18 19 1 0 0 0 0
> 20 17 2 0 0 0 0
> 21 20 1 0 0 0 0
> M END
>> <cansmi>
> CSc1ccc(cc1)S(=O)(=O)NCCCOC(=O)C(=C)C
>
>> <formula>
> C14H19NO4S2
>
>> <InChI>
> InChI=1S/C14H19NO4S2/c1-11(2)14(16)19-10-4-9-15-21(17,18)13-7-5-12(20-3)6-8-13/h5-8,15H,1,4,9-10H2,2-3H3
>
> $$$$
>
> [3] The log where OpenBabel interprets an SDF
> filename as a SMILES string:
> $ /home/aheifets/opt/openbabel-svn/build/bin/obabel ./DB_manual.fs
> -osdf -Ojunk.sdf -s last_mol.sdf -at5
> ==============================
> *** Open Babel Warning in ReadMolecule
> Either the file contains Atom Lists, which are not currently
> supported and are ignored
> or the atom or bond count is>999, which is not allowed in V2000 MDL files.
> ==============================
> *** Open Babel Error in ReadMolecule
> last_mol.sdf contained a character '_' which is invalid in SMILES
> ==============================
> *** Open Babel Error in ObtainTarget
> Cannot read the SMILES string
> 0 molecules converted
> $ cat last_mol.sdf | grep '_'
> $
>
> [4] The log where OpenBabel doesn't seem to find the original query molecule:
> $ /home/aheifets/opt/openbabel-svn/build/bin/obabel ./DB_manual.fs
> -osdf -Ojunk.sdf -s 'CSc1ccc(cc1)S(=O)(=O)NCCCOC(=O)C(=C)C' -at5 -ofpt
> 5 molecules converted
> $ cat junk.sdf
>> 0450.cdx
>> 00452001.cdx#3 Tanimoto from 0450.cdx = 0.221591
>> 0590.cdx#7 Tanimoto from 0450.cdx = 0.278302
>> 0260.cdx Tanimoto from 0450.cdx = 0.256039
>> 025001.cdx#2 Tanimoto from 0450.cdx = 0.238889
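
The 4GB limit described in the reply can be checked before building an index. What follows is only a rough Python sketch, assuming the index stores each displacement as an unsigned 32-bit integer (the reply says only "32 bits"). The function names are hypothetical and are not part of Open Babel. The 18GB figure used in the note after the code is the guess from the reply, not a measured file size.

```python
import os

# Assumption: displacements are unsigned 32-bit integers, so the largest
# datafile whose records can all be addressed is 2**32 bytes (4 GiB).
LIMIT_BYTES = 2**32


def offsets_fit(size_bytes):
    """Return True if a datafile of this size stays within the 32-bit limit."""
    return size_bytes <= LIMIT_BYTES


def max_indexable_molecules(avg_record_bytes):
    """Estimate how many records of a given average size fit below the limit."""
    return LIMIT_BYTES // avg_record_bytes


def check_datafile(path):
    """Return (size in bytes, whether that size fits in 32-bit displacements)."""
    size = os.path.getsize(path)
    return size, offsets_fit(size)
```

With the numbers from this thread, an 18GB file holding 8271991 molecules averages roughly 2.2 kB per record, so max_indexable_molecules(18 * 10**9 // 8271991) gives a little under 2 million, in line with the estimate in the reply.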