From: Shamshad A. <sha...@gm...> - 2014-07-03 14:19:32
Hi Nils,

Well, I missed something in the illustration. Currently, the following information is saved with each scaffold:

1. SCAFFOLD (boolean) - indicates that the record is a scaffold
2. NO_OF_MOLECULES (int) - number of molecules in the scaffold
3. RING_SIZE (int) - number of rings in the scaffold
4. PARENT_COUNT (int) - number of parents
5. PARENT_0, PARENT_1, ..., PARENT_N (string) - SMILES string of the nth parent

So, the tree/network structure is preserved in the file. Molecules are unaffected, i.e. they are saved without any additional property identifying which scaffold they belong to.

>> technical question: Does your....

This is really an issue. If we keep all the molecules in memory, it will consume a lot of memory; and if we open a file, write one molecule to it, and close it again for every molecule, that is not efficient either. I'm currently storing scaffolds and molecules in a map (Map<String, OutputScaffold[1]>), which of course stays in memory. In my opinion, after processing a certain number of molecules (say 1000) we should write the structure to temp files, and at the end of the process we should merge all those temp files. However, I'm not sure that would be efficient, and it is still a little vague to me. How can we handle it more efficiently? Do you have any other ideas?

[1] - OutputScaffold has the following fields:

private static class OutputScaffold {
    private String smiles;
    private List<String> parents;
    private List<String> children;
    private List<IMolecule> molecules;
    private int ringSize;
}

Thanks,
Shamshad

On Thu, Jul 3, 2014 at 1:22 PM, <nl...@us...> wrote:
> Hi Shamshad,
>
> thank you for the illustration. So currently the tree/network structure is not
> saved? I think this is a very important feature. One solution might be to
> assign a unique id to every scaffold. Then each molecule can have a property
> "SCAFFOLD_ID" that identifies the associated scaffold.
> A scaffold could have an additional property "PARENT_SCAFFOLD_IDS" that
> contains a single scaffold id in case of a tree and a list of parent
> scaffolds for networks.
>
> A technical question: Does your approach require keeping all molecules and
> scaffolds in memory? Avoiding this would be great and would allow the CLI to
> be used with very large data sets.
>
>
> Regards,
> Nils
>
> On Wednesday 02 July 2014 21:49:49 Shamshad Alam wrote:
> > Hi Nils,
> >
> > Nice suggestions!
> >
> > I'm currently saving scaffolds and associated molecules in a single sdf
> > file. The file structure can be better understood with the help of the
> > attached illustration.
> >
> > Thanks,
> > Shamshad
> >
> > On Wed, Jul 2, 2014 at 1:58 PM, <nl...@us...> wrote:
> > > Hi Shamshad,
> > >
> > > I think it would be good not only to write out the scaffolds, but also the
> > > molecules with an additional property that identifies the associated
> > > scaffold.
> > > How do you plan to save the tree/network structure?
> > > An additional parameter --min-ring-size might also be useful.
> > >
> > >
> > > Regards,
> > >
> > > Nils
> > >
> > > On Wednesday 25 June 2014 08:37:11 Shamshad Alam wrote:
> > > > Hi,
> > > >
> > > > I am working on the Command Line Interface (CLI), which is a GSoC-2014
> > > > project, and I would like to get some feedback on the commands and the
> > > > parameters used with them in the CLI.
> > > >
> > > > In the first phase, our aim is to implement the commands to generate a
> > > > scaffold tree from the molecules in an SDF file or in the database. We are
> > > > also letting the user specify the ring size of the scaffolds which are to
> > > > be included in the output. The generated scaffold is saved in an sdf file
> > > > that can be further used for analysis.
> > > > An SDF file is required to generate a scaffold tree from a file, and
> > > > connection data is needed to connect to the database in order to
> > > > generate a scaffold tree from the database.
> > > >
> > > > In the later stages we plan to implement filtering by structure and
> > > > substructure, e.g. only certain subtrees based on the structure /
> > > > substructure search will be saved in the output.
> > > >
> > > > These are the proposed commands and their parameters so far:
> > > >
> > > > (1) generate : This command is used to generate a scaffold tree / network.
> > > > It needs a source of molecules, such as an sdf file, and a destination
> > > > file to save the generated scaffolds. It is used in combination with the
> > > > following parameters -
> > > >
> > > > (a) To generate a scaffold tree / network by loading molecules from a file
> > > > -n | --network : Use this parameter to generate a scaffold network;
> > > >     absence of this parameter means you want to generate a scaffold tree
> > > > -i | --input-file <file_location> : Specify the location of the input file
> > > >     to read molecules from
> > > > -o | --output-file <file_location> : Specify the file in which the
> > > >     generated scaffold would be saved
> > > > -m | --max-ring-size <number> : Specify the maximum ring size of scaffold
> > > >     that should be included in the output
> > > >
> > > > (b) To generate a scaffold tree / network by loading molecules from the
> > > > Scaffold Hunter database
> > > > -c | --connection-name <connection-name> : Name of the connection that
> > > >     would be used to connect to the database to retrieve molecules
> > > > -d | --dataset <dataset_name> : Specify the name of the dataset to
> > > >     retrieve molecules from and generate the scaffold tree
> > > > -o | --output-file <file_location> : Specify the file in which the
> > > >     generated scaffold would be saved
> > > > -m | --max-ring-size <number> : Specify the maximum ring size of scaffold
> > > >     that should be included in the output
> > > > -n | --network : Use this parameter to generate a scaffold network;
> > > >     absence of this parameter means you want to generate a scaffold tree
> > > >
> > > > These are some examples of the 'generate' command :
> > > >
> > > > (a) Read molecules from file
> > > >
> > > > sh generate -i <input-file>
> > > > Read molecules from the input file and generate a scaffold tree. The
> > > > generated tree is saved in a file automatically.
> > > >
> > > > sh generate -i <input-file> -o scaffold.sdf
> > > > Read molecules from the input file and generate a scaffold tree. The
> > > > generated tree is saved in scaffold.sdf.
> > > >
> > > > sh generate -i <input-file> -o scaffold-2.sdf -n
> > > > Read molecules from the input file and generate a scaffold 'network'. The
> > > > generated network is saved in scaffold-2.sdf.
> > > >
> > > > sh generate -i <input-file> -o scaffold-3.sdf -m 5
> > > > Read molecules from the input file and generate a scaffold tree. The
> > > > generated tree is saved in scaffold-3.sdf. Scaffolds with more than 5
> > > > rings are not saved.
> > > >
> > > > (b) Read molecules from database
> > > >
> > > > sh generate -c <connection-name>
> > > > Read molecules from the database whose connection data is identified by
> > > > <connection-name> and generate a scaffold tree. The generated tree is
> > > > saved in a file automatically.
> > > >
> > > > sh generate -c <connection-name> -o scaffold.sdf
> > > > Read molecules from the database whose connection data is identified by
> > > > <connection-name> and generate a scaffold tree. The generated tree is
> > > > saved in scaffold.sdf.
> > > >
> > > > Similarly, you can use -n to generate a network and -m <number> to limit
> > > > the ring size, as in scaffold tree generation from a file.
> > > >
> > > > (2) connection [list | save | delete] : This command is used to manage the
> > > > connection data required to connect to the database. It is followed by a
> > > > list, save or delete action. These are the parameters supported by the
> > > > command :
> > > >
> > > > -c | --connection-name <name> : name of the connection
> > > > -t | --database-type : Type of database (mySql, HSQLDB)
> > > > -u | --url : Url of the mySql database server or file location of HSQLDB
> > > > -n | --database-name : Name of the database
> > > > -un | --user-name : user name to log in
> > > > -p | --password : password of the database (you can omit this parameter
> > > >     and specify it at runtime when the connection to the database is made)
> > > >
> > > > Here are some uses of the connection command -
> > > >
> > > > sh connection list
> > > > display the names of the available connections on screen
> > > >
> > > > sh connection list -c <name>
> > > > display the details of the particular connection data identified by name,
> > > > excluding the password
> > > >
> > > > sh connection delete -c <name>
> > > > delete the connection data identified by the name
> > > >
> > > > connection save -c <name> -t <hsqldb | mysql> -u <url> -n <database-name>
> > > > -un <user-name>
> > > > save new connection data under the specified name, which can be used later
> > > > to connect to the database
> > > >
> > > > Feedback -
> > > >
> > > > We are seeking your feedback on the implemented commands and proposed
> > > > parameters to make the command line interface more useful. So, please give
> > > > your input on the few questions given here :
> > > > 1. What are the features of Scaffold Hunter you want to use in the command
> > > >    line interface?
> > > > 2. Do you find the parameters provided for scaffold tree / network
> > > >    generation user friendly?
> > > > 3. Do you need some more parameters to control the generation strategy?
> > > > 4. Do you think some proposed parameters are redundant? If so, please
> > > >    specify the names of those parameters.
> > > >
> > > > You may also add any other suggestions regarding the command line
> > > > interface.
> > > >
> > > > Thanks,
> > > > Shamshad
>

--
-Shamshad Alam
+91 9911631198
LinkedIn : in.linkedin.com/pub/shamshad-alam/58/939/70
GitHub: @shamshad-npti
Facebook: @shamshad.saralindia
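
The batching idea raised at the top of this message (buffer a fixed number of molecules, spill them to temp files, merge at the end) could be sketched roughly as below. This is only an illustration under several assumptions: the class name ScaffoldSpillWriter and its methods are hypothetical, molecules are represented by plain string ids instead of IMolecule objects, and the output is a simple tab-separated text file rather than SDF. Each temp file is written sorted by scaffold SMILES, so the final step can merge them with a priority queue while holding only one line per temp file in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical helper: spills scaffold-to-molecule assignments to sorted temp files. */
final class ScaffoldSpillWriter {
    private final int batchSize;                       // assignments buffered before a spill
    private final List<Path> runs = new ArrayList<>(); // temp files written so far
    private final TreeMap<String, List<String>> buffer = new TreeMap<>();
    private int buffered = 0;

    ScaffoldSpillWriter(int batchSize) { this.batchSize = batchSize; }

    /** Records that moleculeId (assumed to contain no tab or comma) belongs to the scaffold. */
    void add(String scaffoldSmiles, String moleculeId) throws IOException {
        buffer.computeIfAbsent(scaffoldSmiles, k -> new ArrayList<>()).add(moleculeId);
        if (++buffered >= batchSize) spill();
    }

    /** Writes the buffer as one run sorted by SMILES: "smiles<TAB>moleculeId" per line. */
    private void spill() throws IOException {
        if (buffer.isEmpty()) return;
        Path run = Files.createTempFile("scaffold-run-", ".tsv");
        try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<String>> e : buffer.entrySet()) {
                for (String id : e.getValue()) {
                    w.write(e.getKey() + "\t" + id);
                    w.newLine();
                }
            }
        }
        runs.add(run);
        buffer.clear();
        buffered = 0;
    }

    /** One open run together with its current line. */
    private static final class Cursor {
        final BufferedReader reader;
        String line;
        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
        String key() { return line.substring(0, line.indexOf('\t')); }
    }

    /** K-way merges all runs so each scaffold appears once: "smiles<TAB>id1,id2,...". */
    void finish(Path output) throws IOException {
        spill(); // flush whatever is still buffered
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.key().compareTo(b.key()));
        try (BufferedWriter w = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                Cursor c = new Cursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
                if (c.line != null) heap.add(c); else c.reader.close();
            }
            String current = null;
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                String key = c.key();
                String id = c.line.substring(key.length() + 1);
                if (key.equals(current)) {
                    w.write("," + id);            // same scaffold, append the molecule
                } else {
                    if (current != null) w.newLine();
                    w.write(key + "\t" + id);     // first molecule of a new scaffold
                    current = key;
                }
                c.line = c.reader.readLine();
                if (c.line != null) heap.add(c); else c.reader.close();
            }
            if (current != null) w.newLine();
        } finally {
            for (Cursor c : heap) c.reader.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }
}
```

With a batch size of 1000, a caller would invoke add(...) once per molecule and finish(outputPath) at the end. Memory use is then bounded by the batch size plus one buffered line per temp file. The order of molecule ids within a scaffold is not guaranteed, and parent/child links would need the same treatment if they are also too large to keep in memory.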