Re: [Gusdev-gusdev] RAD3 Schema Questions

Paul,
Since I referred you here, I'll take point and answer these questions.
See the comments below.

We ask that, if you take or adapt any code, you pay attention to the
Apache-inspired license and add references back to gusdb.org. Thanks!

Angel

Paul Boutros wrote:

>Hi all,
>
>Hopefully this is the right place for these questions.  If not, please let me 
>know where I can ask.
>
>I have a moderately extensive Oracle DB for cDNA microarray data.  As 
>requirements have increased it is now considered desirable to store Affy data 
>as well as enhanced sample Annotation.  I looked into MAGE-ML, and was referred 
>from a list there to GUS DB.
>
>I've started implementing a portion of your schema into my own DB, in 
>particular the annotation portion (e.g. ExternalDatabaseRelease, BioMaterialImp 
>& associated views, LabelMethod, etc.).
>
>I'm thinking about using an even larger fraction -- including the 
>CompositeElementImp/CompositeElementResultImp and ElementImp/ElementResultImp 
>tables -- because I really like the schema design you've done.
>
>But here is my problem.  I'm having some difficulty interpreting the meanings 
>of those tables.  My main questions:
>1. What are the differences between the xxxElementImp and xxxElementResultImp 
>tables?  What goes into each?  My understanding is that the xxxElementImp store 
>details about the array *layout* while the xxxElementResultImp store details 
>about the data from specific arrays.
>
This is correct. For example, the Array table would contain "Affy array
U74A", the ShortOligo view on the ElementImp table would contain the
probe pair information, and the ShortOligoFamily view on
CompositeElementImp would contain information on the probe sets. For
microarray data, Array = "MicroArray X" and the Spot view on ElementImp
would contain the Features (e.g. physical locations) and the sequence
that is spotted there. For MAGE purposes, we decided to put Reporter and
CompositeSequence information in the SpotFamily view on the
CompositeElementImp table.

The Results go into the various views on xxxElementResultImp, such as
ArrayVisionResult or AffymetrixMAS4.
Check out the documentation on the ArrayVision view from the GUS schema 
browser (look for the tables with the RAD prefix):

http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30

http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30&table=RAD3::ArrayVisionElementResult&path=RAD3::ArrayVisionElementResult
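
As a rough illustration of the "view on an Imp table" idea described
above, here is a minimal sketch in Python using an in-memory SQLite
database. The column names (subclass_view, detail, location, probe_pair)
are assumptions made for this example and are not the actual RAD3
columns; check the schema browser for the real definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ElementImp (
    element_id    INTEGER PRIMARY KEY,
    array_name    TEXT NOT NULL,
    subclass_view TEXT NOT NULL,  -- hypothetical discriminator column
    detail        TEXT            -- hypothetical generic payload column
);
-- Each view exposes only the rows (and column names) for one element type.
CREATE VIEW Spot AS
    SELECT element_id, array_name, detail AS location
    FROM ElementImp WHERE subclass_view = 'Spot';
CREATE VIEW ShortOligo AS
    SELECT element_id, array_name, detail AS probe_pair
    FROM ElementImp WHERE subclass_view = 'ShortOligo';
""")

# Both kinds of layout rows live in the same underlying table.
conn.executemany(
    "INSERT INTO ElementImp (array_name, subclass_view, detail) VALUES (?, ?, ?)",
    [
        ("MicroArray X", "Spot", "row 1, col 1"),
        ("Affy array U74A", "ShortOligo", "probe pair 1"),
    ],
)

spots = conn.execute("SELECT array_name, location FROM Spot").fetchall()
oligos = conn.execute("SELECT array_name, probe_pair FROM ShortOligo").fetchall()
```

Queries against Spot or ShortOligo then see only their own rows, while
the storage stays in one table.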

>2. If the above description is right, would that mean that for each physical 
>array (each "chip") there are records in all four tables?  Is that necessary 
>for cases with repeated chip "layouts"?
>
Well, it depends on what data you have and what you are going to use
this DB for. But let me first state that this is actually the most
space-efficient way of storing array layouts and results. We separated
the array layout information (as you noted) from the results in order to
reuse the layouts for multiple analyses on the same chips.
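
To make the space argument concrete, here is a minimal Python sketch of
one layout being shared by several sets of results. The class and field
names are simplified assumptions for illustration, not the real RAD3
columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Layout row: stored once per position on an array design."""
    element_id: int
    array_name: str
    position: str


@dataclass(frozen=True)
class ElementResult:
    """Result row: one per element per quantification."""
    element_id: int
    quantification_id: int
    signal: float


# The layout is loaded a single time...
layout = [
    Element(1, "MicroArray X", "1,1"),
    Element(2, "MicroArray X", "1,2"),
]

# ...and three quantifications (hypothetical ids 1-3) only add result rows
# that point back at the existing layout rows.
results = [
    ElementResult(e.element_id, q, signal=100.0 * q + e.element_id)
    for q in (1, 2, 3)
    for e in layout
]

layout_rows = len(layout)
result_rows = len(results)
```

The layout row count stays fixed no matter how many chips of the same
design are hybridized; only the result rows grow.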

So to answer your question:
If your intent is to provide a DB to keep track of LIMS information for
something like a microarray core facility (i.e. you are never going to
work with the data from within the database), then you do not need the
xxxElementResultImp tables at all. You can just store the Array
definitions and the Hybridization information in the Assay ->
Acquisition -> Quantification tables:
Assay = Hybridization,
Acquisition = Scanning information and the location of the image file,
Quantification = Feature extraction / quantification software parameters
and the location of the result file.
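
A minimal Python sketch of that Assay -> Acquisition -> Quantification
chain follows. The field names, file paths, and the trace_to_assay
helper are hypothetical and only meant to show how each record points
back to the previous one.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    """One hybridization."""
    assay_id: int
    array_name: str


@dataclass
class Acquisition:
    """One scan of an assay, with the location of the image file."""
    acquisition_id: int
    assay_id: int
    image_uri: str


@dataclass
class Quantification:
    """One feature-extraction run on a scan, with the result file location."""
    quantification_id: int
    acquisition_id: int
    software: str
    result_uri: str


assays = {1: Assay(1, "Affy array U74A")}
acquisitions = {10: Acquisition(10, 1, "images/hyb1.tif")}
quant = Quantification(100, 10, "example-software", "results/hyb1.txt")


def trace_to_assay(q, acquisitions, assays):
    """Follow the references from a quantification back to its assay."""
    acquisition = acquisitions[q.acquisition_id]
    return assays[acquisition.assay_id]


source_assay = trace_to_assay(quant, acquisitions, assays)
```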

But for our purposes let's assume that you need to store the data in the 
DB and work with it there:

For Affy, if you only produce or receive MAS* files, then you do not
need the Element*Imp branch, since you will not need to store the
individual probe pairs or the CEL file results on these probe pairs.

For microarray data, if you do not want to group elements into some 
bigger concept, like a gene, or group the individual elements by source 
plate information, then you do not need the CompositeElement*Imp branch.

All other cases require you to fill in all four tables.
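
The rules above, for the case where the data is stored and used in the
DB, can be sketched as a small Python function. The function name, its
parameters, and the "affy" / "spotted" platform labels are assumptions
made for this example.

```python
ELEMENT_BRANCH = {"ElementImp", "ElementResultImp"}
COMPOSITE_BRANCH = {"CompositeElementImp", "CompositeElementResultImp"}


def required_tables(platform, mas_only=False, group_elements=True):
    """Return the Imp tables to populate when data is kept in the DB.

    platform       -- "affy" or "spotted" (hypothetical labels)
    mas_only       -- Affy only: True if only MAS* files are available
    group_elements -- spotted only: True if elements are grouped
                      (e.g. by gene or by source plate)
    """
    if platform == "affy" and mas_only:
        # No probe-pair or CEL-level data, so no Element*Imp branch.
        return set(COMPOSITE_BRANCH)
    if platform == "spotted" and not group_elements:
        # No grouping concept, so no CompositeElement*Imp branch.
        return set(ELEMENT_BRANCH)
    # All other cases: fill in all four tables.
    return ELEMENT_BRANCH | COMPOSITE_BRANCH


affy_mas = required_tables("affy", mas_only=True)
spotted_flat = required_tables("spotted", group_elements=False)
affy_cel = required_tables("affy", mas_only=False)
```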

>3. Are the Ontologies used in RAD3::OntologyTerm publicly available?  I 
>couldn't find them in the 3.0-Beta release tar, but perhaps I just missed them?
>  
>
No, they are not, for a variety of reasons. You raise a good point,
though, and we will put this on our to-do list.

>Any help or suggested reading would be very much appreciated!
>Paul
>
Here are two references for the previous version of the schema that
cover the major concepts and conventions used in RAD. Most of it still
applies, modulo some schema details. If you can't get these, email me
(off the list) and I'll try to get copies sent. A new manuscript is in
preparation.

Stoeckert, C., Pizarro, A., Manduchi, E., Gibson, M., Brunk, B.,
Crabtree, J., Schug, S., Shen-Orr, S., Overton, G.C. (2001) A relational
schema for both array-based and SAGE gene expression experiments.
Bioinformatics 17(4), 300-308.

Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance
Database): an infrastructure for array data analysis. Proc. SPIE, vol.
4266, pp. 68-78.

Angel