From: Dave H. <doc...@gm...> - 2007-08-28 21:03:35
|
I would like to import a number of Affymetrix .CEL files into the GUS database, which was installed from the top of trunk of the GUS svn repository. Each CEL file has some text headers followed by binary data, so I suppose they are in CEL Version 4 format.

Searching previous posts, I came across this one:

http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.61.0512141526200.18143%40hera.pcbi.upenn.edu

It seems that at the time of that post (12/2005), the way to import these .CEL files was that the headers would go to one of the Affymetrix views (AffymetrixMAS4, AffymetrixMAS5 or AffymetrixCEL), the actual file would sit in the file system, and we'd insert a row into the RAD.Quantification table with a URI pointing to the location of the .CEL file.

Also, looking through the plugins in both the Supported and Community folders, it seems LoadBatchArrayResults supports the cel4 format. Is this the plugin I should use?

Any help would be much appreciated. Thanks.

Best regards,
Dave Hau |
From: Elisabetta M. <man...@pc...> - 2007-08-29 00:23:37
|
Hi Dave,
let me clarify GUS vs Affy.
Affymetrix quantified results are of two types, corresponding to two different levels of analysis:

(i) probe-cell level results (e.g. from .CEL files), which contain intensity values for each individual probe cell on the chip; and
(ii) probe-set level results (e.g. obtained from MAS4 or MAS5 and in the .CHP files, or from RMA or gcRMA), which contain *summarized* intensities for probe sets on the chip.

The GUS schema in principle supports storage of both:

(i) the probe cell results would go into a view of RAD.ElementResultImp (in fact there is a view to this end called RAD.AffymetrixCEL);
(ii) the probe set results would go into a view of RAD.CompositeElementResultImp. For the latter, we currently have views to accommodate MAS4 or MAS5 (RAD.AffymetrixMAS4 or RAD.AffymetrixMAS5) and RMA/gcRMA results (RAD.RMAExpress, which will actually be renamed RAD.RMA in the next GUS release).

Now, here at CBIL, we do not store or support loading of the .CEL file data in the database, because we really only use the probe-set level results in our applications, so we have no need to store .CEL data in the db.
So the way we do it is as follows:
* for every Affymetrix assay, we have TWO related quantifications, one corresponding to the .CEL quantification and the other corresponding to whatever summarization quantification was created (e.g. with MAS4, MAS5, RMA);
* we place 2 entries in RAD.Quantification, one pointing to the uri of the .CEL file (which we keep on our server) and one pointing to the uri of the probe-set level result file;
* we do not, however, store the data from the .CEL file in RAD.AffymetrixCEL;
* we only store the data from the probe-set level results in one of the RAD.CompositeElementResultImp views mentioned above.

The current plugin in GUS::Supported, as Junmin mentioned in the posting you are referring to, can be used to populate the data for the probe-set level results. As far as I know, we do not currently have a plugin to store the .CEL files in the db.
So the db allows for the latter, but you'd have to write your own plugin. We didn't find it useful to store .CEL results in GUS, but again this depends on the type of applications you might be interested in.
Hope this helps,
Elisabetta |
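The bookkeeping described above (two related quantifications per assay, with only the probe-set data actually loaded) can be sketched as follows. This is a minimal Python illustration, not GUS plugin code: the class, its field names, the function `register_affy_assay` and the file paths are all hypothetical stand-ins, and the real RAD.Quantification table has its own columns.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Quantification:
    """Illustrative stand-in for one RAD.Quantification row (fields are assumptions)."""
    assay: str
    protocol: str                      # e.g. "cel4", "mas5", "rma"
    uri: str                           # path to the data file kept on the file server
    result_view: Optional[str] = None  # view holding loaded data; None means "not loaded"


def register_affy_assay(assay: str, cel_uri: str, summary_protocol: str,
                        summary_uri: str, summary_view: str
                        ) -> Tuple[Quantification, Quantification]:
    """Create the two related quantifications for one Affymetrix assay."""
    # Probe-cell level: only the URI is recorded, the .CEL data stays on disk.
    cel = Quantification(assay, "cel4", cel_uri)
    # Probe-set level: URI recorded and data loaded into a result view.
    summary = Quantification(assay, summary_protocol, summary_uri, summary_view)
    return cel, summary


cel, summary = register_affy_assay(
    "assay-1", "/data/assay-1.CEL", "mas5", "/data/assay-1.txt", "RAD.AffymetrixMAS5"
)
print(cel.result_view, summary.result_view)
```

The point of the sketch is only that the .CEL quantification carries a URI and no loaded data, while the summarized quantification carries both.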
From: Dave H. <doc...@gm...> - 2007-09-12 23:13:25
|
Elisabetta,

Thanks for your and John Brestelli's (via personal email) very informative replies. They are very helpful indeed.

Regarding loading .CEL files (probe cell data, not probe set data), John mentioned the plugin GUS::Community::Plugin::LoadBatchArrayResults, which I had noticed too. The help page for this plugin lists a number of supported quantification protocols, including mas4/mas5 (Affymetrix MAS 4.0 and 5.0 Probe Set quantification protocol) and cel4/cel5 (Affymetrix MAS 4.0 and 5.0 Probe Cell quantification protocol). It seems that cel4/cel5 would correspond to the .CEL files I need to load (i.e. probe *cell* data). Is this correct? I ask because you mentioned in your reply that there's no plugin available for loading probe cell data.

Also, the Affymetrix file format description document ( http://www.affymetrix.com/support/developer/AffxFileFormats.ZIP ) describes two file formats: Version 3 files (text data) generated by the MAS software, and Version 4 files (binary data) generated by the GCOS software. So both cel4 and cel5 for the plugin would correspond to Version 3 files, right? That means the LoadBatchArrayResults plugin does not support the Version 4 (binary) file format, correct?

Thanks again for your help.

Best regards,
Dave Hau |
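For telling the two CEL formats mentioned above apart, a small check on the first bytes of a file may help. This is a hedged Python sketch, not part of GUS: it assumes, based on the Affymetrix format description, that a Version 3 file is text beginning with a `[CEL]` section and that a Version 4 file begins with two little-endian 32-bit integers, a magic number 64 followed by the version number 4. The function names are made up for illustration, and anything else (for example newer Command Console files) is reported as unknown.

```python
import struct


def detect_cel_format(head: bytes) -> str:
    """Classify a CEL file from its leading bytes (assumptions stated above)."""
    # Version 3: plain text, first section header is "[CEL]".
    if head.lstrip().startswith(b"[CEL]"):
        return "version 3 (text)"
    # Version 4: binary, starts with int32 magic 64 and int32 version 4.
    if len(head) >= 8:
        magic, version = struct.unpack("<ii", head[:8])
        if magic == 64 and version == 4:
            return "version 4 (binary)"
    return "unknown"


def detect_cel_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as handle:
        return detect_cel_format(handle.read(64))
```

A file reported as "version 4 (binary)" would not be usable as-is by a loader that only accepts text input.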
From: Elisabetta M. <man...@pc...> - 2007-09-12 23:52:53
|
Hi Dave,
LoadBatchArrayResults wants to know the cel protocols because it enters 2 entries in RAD.Quantification per assay: one for the .CEL quantification and one for the probe set quantification. What's entered in Quantification are just the protocol references (e.g. references to entries in RAD.Protocol describing the CEL 4, MAS 5, RMA protocols) and the uri with the path to the actual data files on the fileserver.
Then LoadBatchArrayResults calls LoadSimpleArrayResults, which actually takes care of entering the quantified data in views of RAD.ElementResultImp or RAD.CompositeElementResultImp.
Now, the latter plugin will definitely populate views such as AffymetrixMAS4, AffymetrixMAS5 and RMAExpress, which correspond to probe set quantified data. I believe from earlier correspondence that Junmin (cc-ed here), who wrote LoadSimpleArrayResult, said that it doesn't support loading of AffymetrixCEL. But I see this view mentioned in the code of that plugin, so I'm deferring to Junmin to double-check on that.
The plugin *only accepts text files* as data files. So files like the Metrics files (the .txt counterpart of the .CHP MAS4/5 files) will do, as well as RMA-like text files. I believe with GCOS it is possible to export the data as metrics (txt) files corresponding to quantifications using the MAS5 algorithm.
Elisabetta |
From: Elisabetta M. <man...@pc...> - 2007-09-13 00:14:19
|
Hi Dave,
I just went back and quickly looked over the LoadBatchArrayResults code, which refreshed my memory...
First, one correction: in my previous message I said that this plugin enters 2 quantifications. Actually the quantifications are assumed to have already been entered, corresponding to the protocols provided (for cel and probe set); what LoadBatchArrayResults does is relate the .cel and probe set quantifications (populating RAD.RelatedQuantification).
Second: independently of whether or not LoadSimpleArrayResult can load .cel data, LoadBatchArrayResults only calls it, in the case of Affy data, to load *probe set* data. In fact, from the code, the list of possible views that this plugin will populate is given in:

    $globalRef->{'resultSubclassView'} = {
        'mas4'        => 'AffymetrixMAS4',
        'mas5'        => 'AffymetrixMAS5',
        'genepix'     => 'GenePixElementResult',
        'arrayvision' => 'ArrayVisionElementResult',
        'rmaexpress'  => 'RMAExpress',
        'moid'        => 'MOIDResult',
    };

So for Affy data, the relevant ones are those corresponding to the keys mas4, mas5, rmaexpress and moid. All of these are for probe set results. Thus, as is, this plugin won't load .cel data. It could be that the auxiliary Community plugin LoadSimpleArrayResult is able to load .cel data, but this is what I'm deferring to Junmin for.
Elisabetta

--
Elisabetta Manduchi
Computational Biology and Informatics Laboratory
Center for Bioinformatics
University of Pennsylvania
1428 Blockley Hall
423 Guardian Drive
Philadelphia, PA 19104-6021
phone: 215-573-4408
fax: 215 573-3111
email: man...@pc...
web: http://www.cbil.upenn.edu/~manduchi |
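The consequence of that mapping can be illustrated with a short lookup. The following is a Python mirror of the Perl hash quoted in the message, written only to show that the cel4/cel5 keys have no result view; the constant and function names are made up, and how the real plugin reacts to a key without a view is not stated in the thread, so the sketch simply returns `None`.

```python
from typing import Optional

# Python copy of the 'resultSubclassView' hash quoted from LoadBatchArrayResults.
RESULT_SUBCLASS_VIEW = {
    "mas4": "AffymetrixMAS4",
    "mas5": "AffymetrixMAS5",
    "genepix": "GenePixElementResult",
    "arrayvision": "ArrayVisionElementResult",
    "rmaexpress": "RMAExpress",
    "moid": "MOIDResult",
}


def view_for_protocol(protocol: str) -> Optional[str]:
    """Return the result view for a protocol key, or None if no view is mapped."""
    return RESULT_SUBCLASS_VIEW.get(protocol.lower())


for key in ("mas5", "rmaexpress", "cel4", "cel5"):
    print(key, "->", view_for_protocol(key))
```

The probe-set keys resolve to a view, while `cel4` and `cel5` do not, which matches the conclusion that the plugin as written does not load .cel data.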
From: Junmin L. <ju...@pc...> - 2007-09-13 14:09:23
|
Hi Dave,
I had a couple of discussions with people from ArrayExpress and with Joe White from Harvard about raw data loading at previous MGED workshops.

The consensus is that people don't load raw data, CEL files especially, into the database unless they have a convincing use case or a strong need to do so.

So give it a second thought before you proceed.
---junmin |
From: Dave H. <doc...@gm...> - 2007-09-13 23:34:19
|
Thanks Junmin and Elisabetta for your helpful comments. The consensus not to load CEL files into the database - is it because we only query for probe set data based on the gene, but not for probe cell data? If I store the CEL file in the filesystem and only store a file URI in the database, does RAD provide a way to run summarization algorithms (e.g. RMA, Plier) on those files? Can I load multiple sets of probe set data for a single set of probe cell data (e.g. one for RMA, one for Plier)? Also, according to the instructions in the RAD website on how to load a complete microarray study into the GUS database, the first step mentions "Further array annotation can be loaded via GUS::Community::Plugin::InsertArray2DbRefAndNaSeq. I tried to run this plugin, but got this error: FATAL: Can't locate GUS/Model/RAD/CompositeElementDbRef.pm in @INC Do you know where I can find this CompositeElementDbRef.pm file? I would like to load the annotation file I obtained from the Affymetrix website for the HG-U133_Plus_2 array into the GUS database. What's the best way to go about this? Thanks very much for your help. Best regards, Dave Junmin Liu wrote: > Hi, Dave, > I had couple discussion with other people in ArrayExpress and Joe > white from Harvard in terms of raw data loading in previous MGED > workshops. > > The consensus is that especially for the CEL file, people don't load > them into database, unless you got some convincing use cases or strong > needs to load cel file into database. > > So give it a second thought before you even proceed. > ---junmin > > > > On Wed, 12 Sep 2007, Dave Hau wrote: > >> Elisabetta, >> >> Thanks for your and John Brestelli's (via personal email) very >> informative replies. They are very helpful indeed. >> >> Regarding loading .CEL files (probe cell data, not probe set data), John >> mentioned the plugin GUS::Community::Plugin::LoadBatchArrayResults which >> I had noticed too. 
>> The help page for this plugin mentions a number of
>> quantification protocols supported including mas4/mas5 (Affymetrix MAS
>> 4.0 and 5.0 Probe Set quantification protocol) and cel4/cel5 (Affymetrix
>> MAS 4.0 and 5.0 Probe Cell quantification protocol). It seems that
>> cel4/cel5 would correspond to the .CEL files I need to load (i.e. probe
>> *cell* data). Is this correct? I was wondering because you mentioned in
>> your reply that there's no plugin available for loading probe cell data.
>>
>> Also, in the Affymetrix file format description document (
>> http://www.affymetrix.com/support/developer/AffxFileFormats.ZIP ), two
>> file formats are described: Version 3 files (text data) generated by the
>> MAS software, and version 4 files (binary data) generated by the GCOS
>> software. So both cel4 and cel5 for the plugin would correspond to
>> Version 3 files, right? That means the LoadBatchArrayResults plugin does
>> not support the Version 4 (binary) file format, correct?
>>
>> Thanks again for your help.
>>
>> Best regards,
>> Dave Hau |
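
On the Version 3 versus Version 4 question raised above, a minimal Python sketch that guesses the CEL flavor from the first bytes of a file. It assumes that Version 3 (text) files begin with a `[CEL]` section header and that Version 4 (binary) files begin with two little-endian 32-bit integers, a magic number 64 followed by the version number 4. Both assumptions should be checked against the Affymetrix file format description before relying on them. The function names are illustrative and are not part of GUS or of any Affymetrix tool.

```python
import struct


def guess_cel_version(header: bytes) -> str:
    """Guess the CEL format from the leading bytes of a file.

    Assumption: text (Version 3) files start with "[CEL]";
    binary (Version 4) files start with int32 magic 64, then int32 version 4.
    """
    if header.lstrip().startswith(b"[CEL]"):
        return "version 3 (text)"
    if len(header) >= 8:
        magic, version = struct.unpack("<ii", header[:8])
        if magic == 64 and version == 4:
            return "version 4 (binary)"
    return "unknown"


def guess_cel_version_of_file(path: str) -> str:
    """Read only the first few bytes of the file at `path` and classify it."""
    with open(path, "rb") as handle:
        return guess_cel_version(handle.read(16))
```

Anything reported as "unknown" (for example a newer container format) would need to be inspected by hand.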
From: Elisabetta M. <man...@pc...> - 2007-09-14 14:30:20
|
Hi Dave,
in line:

> Thanks Junmin and Elisabetta for your helpful comments.
>
> The consensus not to load CEL files into the database - is it because we only
> query for probe set data based on the gene, but not for probe cell data?

Yes, typically people query the summarized results at the probe set level.

> If I store the CEL file in the filesystem and only store a file URI in the
> database, does RAD provide a way to run summarization algorithms (e.g. RMA,
> Plier) on those files?

Not currently. RAD provides the database where the results of such algorithms can be stored. One could certainly write a plugin that goes to the .CEL file indicated by the uri and then uses it to run the summarization algorithms of choice. However, we do not currently have any such plugin in Supported or Community.

> Can I load multiple sets of probe set data for a
> single set of probe cell data (e.g. one for RMA, one for Plier)?

Certainly. You would create as many entries in RAD.Quantification as the number of summarization protocols you run (e.g. MAS 5, RMA, Plier) on the same .CEL file; each such entry will point to the appropriate summarization protocol. You would additionally have a quantification referring to the .CEL file. In RAD.RelatedQuantification you can connect each of the summarization quantifications that used that .CEL file to the .CEL quantification. Then you can load the results of the summarization algorithms into the corresponding views of RAD.CompositeElementResultImp. Currently we have views for MAS4, MAS5, RMAExpress (which will simply be renamed RMA in the next release, and which accommodates RMA, gcRMA, etc.) and MOID. But it's easy to create additional views of the same table in your own instance to accommodate other summarization programs.

> Also, according to the instructions in the RAD website on how to load a
> complete microarray study into the GUS database, the first step mentions
> "Further array annotation can be loaded via
> GUS::Community::Plugin::InsertArray2DbRefAndNaSeq". I tried to run this
> plugin, but got this error:
>
> FATAL: Can't locate GUS/Model/RAD/CompositeElementDbRef.pm in @INC
>
> Do you know where I can find this CompositeElementDbRef.pm file?

I think this is because the tables RAD.(Composite)ElementDbRef and RAD.(Composite)ElementNASequence were added after the last official GUS release. They are scheduled for the next GUS release (which probably won't occur in the near future). We have added them to our own instance of GUS at CBIL. So, if you want to use these tables, you first need to add those 4 tables to your db instance (you can find the latest sql for GUS in the GusSchema svn at https://www.cbil.upenn.edu/svn/gus/GusSchema/trunk/Definition/config/gus_schema.xml). (Note that this also contains other modifications made to tables after the 3.5 GUS release.) Then you need to populate Core.TableInfo with entries for these new tables. Then you need to rebuild GUS, forcing a rebuild of the objects. This way the code generator will see the new tables and create the corresponding objects, including the one you are referring to above.

> I would like to load the annotation file I obtained from the Affymetrix
> website for the HG-U133_Plus_2 array into the GUS database. What's the best
> way to go about this?

There are multiple choices for where to store array annotation at the moment.
1. RAD.CompositeElementDbRef and RAD.CompositeElementNASequence have been added to more quickly annotate Affy data with Entrez Genes and RefSeq info respectively.
2. Another possibility is to use the external_database_release_id and source_id pair in RAD.ShortOligoFamily to point to one preferred annotation for each probe set (but you would have to choose one).
3. Another, less structured possibility is to use RAD.CompositeElementAnnotation, where you use the attribute 'name' to denote the type of annotation (e.g. "Entrez Gene", "RefSeq", etc.) and the attribute 'value' for the annotation itself (e.g. entrez gene id, or refseq id, etc.). This is less structured but it will allow you to load as many annotations as you like.

Elisabetta |
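
The quantification layout described in the reply above (one quantification for the .CEL file, one per summarization protocol, and a link from each summarization back to the .CEL quantification) can be sketched in Python as follows. This is only an illustration of the bookkeeping, not GUS code: the class and field names are hypothetical simplifications, the real RAD.Quantification and RAD.RelatedQuantification tables have more attributes, and in practice the rows would be created through GUS plugins rather than by hand.

```python
from dataclasses import dataclass


@dataclass
class Quantification:
    # Hypothetical, simplified stand-in for a RAD.Quantification row.
    quantification_id: int
    protocol: str
    uri: str


@dataclass
class RelatedQuantification:
    # Hypothetical stand-in for a RAD.RelatedQuantification row:
    # links a summarization quantification to the .CEL quantification it used.
    quantification_id: int
    associated_quantification_id: int


def plan_quantifications(cel_uri, summaries, first_id=1):
    """Return (quantifications, links) for one assay.

    `summaries` is a list of (protocol_name, result_file_uri) pairs,
    one per summarization run on the same .CEL file.
    """
    cel = Quantification(first_id, "CEL", cel_uri)
    quantifications = [cel]
    links = []
    for offset, (protocol, uri) in enumerate(summaries, start=1):
        summary = Quantification(first_id + offset, protocol, uri)
        quantifications.append(summary)
        links.append(
            RelatedQuantification(summary.quantification_id, cel.quantification_id)
        )
    return quantifications, links
```

For one .CEL file summarized with RMA and Plier this yields three quantification records and two links, which matches the "one entry per summarization protocol plus one for the .CEL file" description.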
From: Junmin L. <ju...@pc...> - 2007-09-14 16:36:23
|
Hi, Dave,
Again in line:

>> The consensus not to load CEL files into the database - is it because we only
>> query for probe set data based on the gene, but not for probe cell data? If I
>
> yes typically people query the summarized results at the probe set
> level.

Generally speaking, schema design and data management have to be considered in the context of the contract or any requirements you are obligated to meet.

Ask yourself what comes next once you have loaded the CEL files, or what comes next once you have loaded the array data, and so on.

GUS and its app stack will certainly allow you to do those things, but it is critical that you make some judgement calls. And the cost of loading raw data and then querying it back out is pretty high.

> There are multiple choices for where to store array annotation at the
> moment.
> 1. RAD.CompositeElementDbRef and RAD.CompositeElementNASequence have been
> added to more quickly annotate Affy data with Entrez Genes and RefSeq info
> respectively.
> 2. Another possibility is to use the external_database_release_id and
> source_id pair in RAD.ShortOligoFamily to point to one preferred
> annotation for each probe set (but you would have to choose one).
> 3. Another, less structured possibility, is to use
> RAD.CompositeElementAnnotation, where you use the attribute 'name' to
> denote the annotation (e.g. "Entrez Gene", "RefSeq", etc.) and the
> attribute 'value' for the annotation (e.g. entrez gene id, or refseq id,
> etc.) itself. This has less structured but it will allow you to load as
> many annotations as you like.

I normally favor a consistent data management policy. That means you don't need documentation somewhere saying "case 1, load data into table a, b, c; case 2, load data into table d, e, f; case 3, load data into table g, h, i", which not only makes your data loading tough but also makes the app code built on top of the db stink.

We didn't manage our own db perfectly either. But hopefully our experiences could prove useful to you.

I strongly suggest you look at the MAGE-TAB spec for raw/processed data and the ADF spec for array data on the ArrayExpress site, since MAGE-TAB and ADF have proved to be very effective for a large db like AE. If you can align your app/db with the standards, as we are also trying to do, it will certainly give you a safe edge.

---junmin |
From: Dave H. <doc...@gm...> - 2007-09-14 19:02:13
|
Junmin and Elisabetta, thanks again for your helpful comments.

Couple of questions.

1. The HG-U133_Plus_2 array annotation file I downloaded from Affymetrix is an xml file in MAGE-ML format. On the RAD download page ( http://www.cbil.upenn.edu/downloads/RAD/ ), I see a tool called mage2tab-v0.9, which I assume would be able to convert the annotation file to MAGE-TAB format. Then in order to load this MAGE-TAB file into GUS, I noticed on the CBIL Lab Meetings web page, for Thursday March 15, 2007, Junmin gave a talk on MR-Ti, and the description mentions the loadMageDoc GUS plugin. I notice (and have downloaded) a file on the RAD download page called "MR_T_ForGUS35.tar.gz" but the loadMageDoc plugin is not in there. Is there a way for me to obtain this plugin?

2. I ran "apt-probeset-summarize" in the Affymetrix Power Tools (APT) package ( http://www.affymetrix.com/support/developer/powertools/index.affx ) and obtained probe set data for my .CEL files, one set for RMA and another set for PLIER. Is there a plugin that will readily load these APT output files into GUS as probe set data?

3. The GUS installation I'm using is top of trunk from the CBIL svn repository. This is because I'm using postgresql on the back end, and the 3.5 GUS package gave me a lot of problems. These seem to have been fixed in the top of trunk. However, in order to use existing plugins, would it be advisable to use top of trunk (including the new schema changes for new features that Elisabetta mentioned)? If not, is there, or do you plan on releasing a bug-fix version of 3.5 that contains bug fixes back-ported to 3.5, but does not contain any of the new features not yet released?

4. Is there any way in RAD or GUS to load pathological images (e.g. associated with biosamples used for hybridization) into the GUS database?
Thanks very much,
Dave |
From: Elisabetta M. <man...@pc...> - 2007-09-14 19:43:31
|
Hi Dave,
I'll respond to 2 and 4. For (1) I defer to Junmin.

For (3) all I can say is that it is in our lab's plans to release bug-fixes and new releases of GUS, however this keeps being postponed due to other priorities. In the meantime, for PostgreSQL questions re GUS, John Iodice might be able to help you.

Getting back to your question (2): first of all, as mentioned in my previous email, we currently have a view for RMA results, but we do not have a view for Plier results. If you need a view for Plier in your instance of the DB, though, you can simply create such a view with the attributes you need in your own instance. It would be a view of RAD.CompositeElementResultImp. Once it is created, remember to update Core.TableInfo and rebuild GUS, so that the objects for the new view are in place.

The currently available plugins to load data into RAD.CompositeElementResultImp views are LoadArrayResults (in Supported), which loads the results of one assay at a time, and LoadBatchArrayResults, which we have already discussed. The documentation of these plugins, available from svn, illustrates what the input format should be. The idea guiding the design of these plugins was that they would be *generic*, i.e. they would be able to take data from a wide variety of quantification software and load them into RAD. So we opted for one generic code base at the expense of some work to put the input into the appropriate format.

If a project/lab typically gets files in a particular data format, then it might be worthwhile for them to write a plugin which is specific to that format rather than using the generic plugin. This way they can use the output as spit out by the software they use. It is fairly simple to write a plugin specific to one's needs using the Plugin package. So if you expect to deal most of the time with a particular type of output (e.g. from APT) you might consider writing a specific plugin.

Regarding your question (4), the answer is no. We do not store images in GUS.
For certain types of images, like microarray images (e.g. files resulting from scanning, like .TIF or .DAT) we store in the db their uri to the fileserver (in RAD.Acquisition.uri).

Hope this helps,
Elisabetta |
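
Since LoadArrayResults loads the results of one assay at a time, output that combines several assays in one table would first have to be split per assay. A minimal Python sketch of that pre-processing step follows. It is not a GUS plugin, and it rests on an assumed file layout that should be checked against the actual APT output: tab-delimited text, optional comment lines starting with `#`, a header row whose first column holds the probe set identifier and whose remaining columns are one per .CEL file. The function name is hypothetical.

```python
import csv
import io


def split_by_assay(text):
    """Split a probe-set-by-assay table into one list of results per assay.

    Assumed layout (not confirmed by the thread): tab-delimited, '#' comment
    lines, first column = probe set id, other columns = one per .CEL file.
    Returns {assay_name: [(probeset_id, value), ...]}.
    """
    data_lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    assays = header[1:]
    per_assay = {name: [] for name in assays}
    for row in rows:
        if not row:
            continue  # skip blank lines
        probeset_id = row[0]
        for name, value in zip(assays, row[1:]):
            per_assay[name].append((probeset_id, float(value)))
    return per_assay
```

Each per-assay list could then be written out in whatever data_file format the chosen loading plugin documents.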
From: Elisabetta M. <man...@pc...> - 2007-09-15 10:09:06
|
A clarification regarding point (2) and my response to it in my previous message.

LoadBatchArrayResults is more flexible regarding input format than LoadArrayResults. In fact, LoadArrayResults requires the data_file provided to be in the format specified in the documentation:
https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Supported/doc/LoadArrayResults.html
so it typically requires some parsing of the original software output before that output can be fed to this plugin.

In LoadBatchArrayResults the software output is assumed to be tab-delimited text. Output from programs like MAS4, MAS5, RMAExpress, MOID, GenePix or ArrayVision can typically be used as is, but the user needs to provide an xml_file which tells the plugin how this output should be reformatted before the plugin calls LoadSimpleArrayResults (a simplified version of LoadArrayResults that requires a similar data_file input format) to load it into RAD. The files LoadBatchArrayResults*.xml in
https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Community/config/
are examples of such specifications. Basically they tell the plugin how to map the columns of the software output to columns whose headers are acceptable as data_file input for LoadSimpleArrayResults, i.e. columns compatible with the fields of the view to be populated. It is also possible in this xml file (see GenePix example) to specify how to transform a subset of these columns through a function (e.g. see coordGenePix2RAD).

Thus for your APT files for RMA, if they are tab-delimited text, you can load them through this plugin by providing the correct xml_file, which tells LoadBatchArrayResults how to read them and map their columns to fields in the RAD.RMAExpress view (note the xml file will be similar to the RMAExpress.xml example at the location above, but you might have to adjust the input header names according to those in your file).

For Plier, as mentioned, there is no view yet.
Once the view is created, if the APT output is tab-delimited, you could use LoadBatchArrayResults, but first you need to extend its code to accommodate the plier protocol (the list of protocols currently accepted by LoadBatchArrayResults is at https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Community/doc/LoadBatchArrayResults.html) and second you will need to create the appropriate xml_file which describes how the APT output should be mapped.

As mentioned in my previous message though, if in your situation you expect to always use only a couple of software packages for summarization, always with the same type of format, it might be more efficient to write a specific (and simpler) plugin that deals directly only with those.

Elisabetta

--
Elisabetta Manduchi
Computational Biology and Informatics Laboratory
Center for Bioinformatics
University of Pennsylvania
1428 Blockley Hall
423 Guardian Drive
Philadelphia, PA 19104-6021
phone: 215-573-4408
fax: 215 573-3111
email: man...@pc...
web: http://www.cbil.upenn.edu/~manduchi
--- |
From: Dave H. <doc...@gm...> - 2007-09-14 19:51:14
|
My bad... I just noticed the LoadMageDoc plugin is in the community plugin directory. Thanks Elisabetta for your prompt reply. - Dave Elisabetta Manduchi wrote: > > Hi Dave, > I'll respond to 2 and 4. For (1) I defer to Junmin. > For (3) all I can say is that it is in our lab's plans to release > bug-fixes and new releases of GUS; however, this keeps being postponed > due to other priorities. In the meantime for PostgreSQL questions re > GUS, John Iodice might be able to help you. > Getting back to your question (2), first of all, as mentioned in my > previous email we currently have a view for RMA results, but we do not > have a view for Plier results. If you need a view for Plier in your > instance of the DB though, you can simply create such a view with the > attributes you need in your own instance. It would be a view of > RAD.CompositeElementResultImp. Once created, remember to update > Core.TableInfo and rebuild GUS, so that the objects for the new view > are in place. > The currently available plugins to load data into > RAD.CompositeElementResultImp views are: LoadArrayResult (in > Supported) which loads the results of one assay at a time, and > LoadBatchResult which we have already discussed. The documentation of > these plugins, available from svn, illustrates what the input format > should be. The idea guiding the design of these plugins we made > available was that they would be *generic*, i.e. they would be able to > take data from a wide variety of quantification software and load them > into RAD. So we opted for one generic code at the expense of some work > to put the input into the appropriate format. > If a project/lab typically gets files in a particular data format, > then it might be worthwhile for them to write a plugin which is specific to > that rather than using the generic plugin. This way they can use the > output as spit out by the software they use. It is fairly simple to > write a plugin specific to one's needs using the Plugin package.
So if > you expect to deal most of the time with a particular type of output > (e.g. from APT) you might consider writing a specific plugin. > > Regarding your question (4), the answer is no. We do not store images > in GUS. For certain types of images, like microarray images (e.g. > files resulting from scanning, like .TIF or .DAT) we store in the db > their uri to the fileserver (in RAD.Acquisition.uri). > Hope this helps, > Elisabetta > > --- > > On Fri, 14 Sep 2007, Dave Hau wrote: > >> Junmin and Elisabetta, thanks again for your helpful comments. >> >> Couple of questions. >> >> 1. The HG-U133_Plus_2 array annotation file I downloaded from >> Affymetrix is an xml file in MAGE-ML format. On the RAD download >> page ( http://www.cbil.upenn.edu/downloads/RAD/ ), I see a tool >> called mage2tab-v0.9, which I assume would be able to convert the >> annotation file to MAGE-TAB format. Then in order to load this >> MAGE-TAB file into GUS, I noticed on the CBIL Lab Meetings web page, >> for Thursday March 15, 2007, Junmin gave a talk on MR-Ti, and the >> description mentions the loadMageDoc GUS plugin. I notice (and have >> downloaded) a file on the RAD download page called >> "MR_T_ForGUS35.tar.gz" but the loadMageDoc plugin is not in there. >> Is there a way for me to obtain this plugin? >> >> 2. I ran "apt-probeset-summarize" in the Affymetrix Power Tools >> (APT) package ( >> http://www.affymetrix.com/support/developer/powertools/index.affx ) >> and obtained probe set data for my .CEL files, one set for RMA and >> another set for PLIER. Is there a plugin that will readily load >> these APT output files into GUS as probe set data? >> >> 3. The GUS installation I'm using is top of trunk from the CBIL svn >> repository. This is because I'm using postgresql on the back end, >> and the 3.5 GUS package gave me a lot of problems. These seem to >> have been fixed in the top of trunk.
However, in order to use >> existing plugins, would it be advisable to use top of trunk >> (including the new schema changes for new features that Elisabetta >> mentioned)? If not, is there, or do you plan on releasing a bug-fix >> version of 3.5 that contains bug fixes back-ported to 3.5, but does >> not contain any of the new features not yet released? >> >> 4. Is there any way in RAD or GUS to load pathological images (e.g. >> associated with biosamples used for hybridization) into the GUS >> database? >> >> Thanks very much, >> Dave |
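Elisabetta's reply quoted above leaves two routes for APT output: reshape it into whatever input the generic plugins expect, or write a loader specific to it. As a rough illustration of the reshaping step only, here is a minimal Python sketch. It assumes the summary file is tab-delimited, that metadata lines start with "#", and that the header row is "probeset_id" followed by one column per CEL file; please check this against your own files. The function names and the two-column output layout are hypothetical and are not the documented input format of LoadArrayResult or any other GUS plugin.

```python
import csv
import io


def read_apt_summary(handle):
    """Parse an APT-style summary table into {cel_name: {probeset_id: value}}.

    Assumed layout (verify against your own files): metadata lines start
    with '#', followed by one tab-delimited header row whose first column
    is 'probeset_id' and whose remaining columns are CEL file names.
    """
    rows = csv.reader(
        (line for line in handle if line.strip() and not line.startswith("#")),
        delimiter="\t",
    )
    header = next(rows)
    if header[0] != "probeset_id":
        raise ValueError("unexpected first column: %r" % header[0])
    cel_names = header[1:]
    per_assay = {name: {} for name in cel_names}
    for row in rows:
        probeset_id, values = row[0], row[1:]
        if len(values) != len(cel_names):
            raise ValueError("wrong number of values for %s" % probeset_id)
        for name, value in zip(cel_names, values):
            per_assay[name][probeset_id] = float(value)
    return per_assay


def write_assay_table(probeset_values, handle, value_column="signal"):
    """Write one assay as a two-column tab-delimited table.

    The column names are hypothetical; rename them to whatever the
    loading plugin's documentation actually asks for.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["probeset_id", value_column])
    for probeset_id, value in probeset_values.items():
        writer.writerow([probeset_id, value])


if __name__ == "__main__":
    # Made-up sample data, not real APT output.
    sample = (
        "#%example-key=example-value\n"
        "probeset_id\ta.CEL\tb.CEL\n"
        "1007_s_at\t8.25\t8.5\n"
        "1053_at\t5.0\t4.75\n"
    )
    tables = read_apt_summary(io.StringIO(sample))
    out = io.StringIO()
    write_assay_table(tables["a.CEL"], out)
    print(out.getvalue(), end="")
```

Splitting by CEL file matches the one-assay-at-a-time loading that Elisabetta describes for LoadArrayResult; a batch loader would presumably want the columns kept together instead.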
From: Junmin L. <ju...@pc...> - 2007-09-14 21:08:05
|
Hi, Dave, The LoadMageDoc plugin is the replacement for the RAD study annotator, if you have heard of it. It is for loading MAGE-ML or MAGE-TAB, and the metadata part only. By metadata I mean information about protocols, samples, assays, acquisitions, quantifications, study design, study factors, etc., excluding the array annotation, raw data and processed data. MAGE-ML can contain array design/reporter/feature info, but normally people separate that out into a single tab-delimited file called an ADF. Try to find, or ask ArrayExpress for, the ADF file for that Affy chip. The MR_TforGUS35 package only contains code for exporting data from RAD to MAGE-ML and its associated data files (raw/processed data). The RAD/MR_T/lib/perl/MageImport trunk contains all of the Perl packages which the LoadMageDoc plugin depends on. MR_Ti itself, as a toolkit, can be downloaded here: https://www.cbil.upenn.edu/magewiki/index.php/mage2tab Sorry that we have poor documentation on these things, as this plugin is purely in-house for now. We only expose the toolkit to the community, including non-GUS users. ---junmin On Fri, 14 Sep 2007, Dave Hau wrote: > My bad... I just noticed the LoadMageDoc plugin is in the community > plugin directory. > > Thanks Elisabetta for your prompt reply. > > - Dave |