Re: [GUSDEV] Affymetrix .CEL files

Hi Dave, my replies are inline:

> Thanks Junmin and Elisabetta for your helpful comments.
>
> The consensus not to load CEL files into the database - is it because we only 
> query for probe set data based on the gene, but not for probe cell data? If I

Yes, people typically query the summarized results at the probe set 
level.

> store the CEL file in the filesystem and only store a file URI in the 
> database, does RAD provide a way to run summarization algorithms (e.g. RMA, 
> Plier) on those files?

Not currently. RAD provides the database where the results of such 
algorithms can be stored. One could certainly write a plugin that goes 
to the .CEL file indicated by the URI and runs the summarization 
algorithms of one's choice on it. However, we do not currently have any 
such plugin in Supported or Community.
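
As a rough illustration of the first step such a plugin would need, here 
is a small Python sketch that turns a stored file URI into a local path. 
This is not an existing GUS plugin (GUS plugins are written in Perl); the 
function name and the example URI are made up for illustration, and I 
assume the database holds plain file: URIs.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def cel_path_from_uri(uri: str) -> Path:
    """Hypothetical helper: map a stored file: URI to a local .CEL path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        # Assumption: only local files are referenced from the database.
        raise ValueError(f"expected a file: URI, got {uri!r}")
    # Percent-decoding restores characters such as spaces in file names.
    return Path(unquote(parsed.path))
```

The resulting path could then be handed to whatever summarization program 
you prefer, and its output loaded into RAD in the usual way.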

> Can I load multiple sets of probe set data for a 
> single set of probe cell data (e.g. one for RMA, one for Plier)?

Certainly. You would create one entry in RAD.Quantification for each 
summarization protocol you run (e.g. MAS 5, RMA, Plier) on the same .CEL 
file, with each entry pointing to the appropriate summarization 
protocol. You would additionally have a quantification referring to the 
.CEL file itself. In RAD.RelatedQuantification you can then link each 
summarization quantification to the .CEL quantification it was derived 
from.
You can then load the results of the summarization algorithms into the 
corresponding views of RAD.CompositeElementResultImp. Currently we have 
views for MAS4, MAS5, RMAExpress (which will simply be renamed RMA in 
the next release, and which accommodates RMA, gcRMA, etc.) and MOID. It 
is easy to create additional views of the same table in your own 
instance to accommodate other summarization programs.
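
To make the layout concrete, here is a minimal Python sketch of the 
bookkeeping described above: one quantification for the .CEL file, one 
per summarization protocol, and one link per summarization. The class 
and field names are simplified stand-ins, not the actual RAD columns, 
and in a real instance the ids would be assigned by the database.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Quantification:
    """Illustrative stand-in for one RAD.Quantification row (fields are assumptions)."""
    quantification_id: int
    protocol: str
    uri: Optional[str] = None


def plan_quantifications(
    cel_uri: str, protocols: List[str], first_id: int = 1
) -> Tuple[Quantification, List[Quantification], List[Tuple[int, int]]]:
    """Return the .CEL quantification, one quantification per protocol, and the links."""
    cel = Quantification(first_id, "probe cell (.CEL)", cel_uri)
    summaries = [
        Quantification(first_id + offset, protocol)
        for offset, protocol in enumerate(protocols, start=1)
    ]
    # Each pair mimics one RAD.RelatedQuantification entry: (.CEL id, summary id).
    links = [(cel.quantification_id, s.quantification_id) for s in summaries]
    return cel, summaries, links
```

Calling plan_quantifications("file:///data/cel/sample1.CEL", ["MAS 5", 
"RMA", "Plier"]) gives one .CEL quantification, three summarization 
quantifications, and the links (1, 2), (1, 3) and (1, 4).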

> Also, according to the instructions in the RAD website on how to load a 
> complete microarray study into the GUS database, the first step mentions 
> "Further array annotation can be loaded via 
> GUS::Community::Plugin::InsertArray2DbRefAndNaSeq. I tried to run this 
> plugin, but got this error:
>
> FATAL: Can't locate GUS/Model/RAD/CompositeElementDbRef.pm in @INC
>
> Do you know where I can find this CompositeElementDbRef.pm file?

I think this is because the tables RAD.(Composite)ElementDbRef and 
RAD.(Composite)ElementNASequence were added after the last official GUS 
release. They are scheduled for the next GUS release (which probably 
won't occur in the near future). We have added them to our own 
instance of GUS at CBIL.
So, if you want to use these tables, you first need to add those 4 
tables to your db instance (you can find the latest sql for GUS in the 
GusSchema svn at
https://www.cbil.upenn.edu/svn/gus/GusSchema/trunk/Definition/config/gus_schema.xml).
(Note that this also contains other modifications made to tables 
after the 3.5 GUS release.)
Then you need to populate Core.TableInfo with entries for these new 
tables.
Finally, you need to rebuild GUS, forcing a rebuild of the objects. This 
way the code generator will see the new tables and create the 
corresponding objects, including the one you are referring to above.

> I would like to load the annotation file I obtained from the Affymetrix 
> website for the HG-U133_Plus_2 array into the GUS database. What's the best 
> way to go about this?

There are multiple choices for where to store array annotation at the 
moment.
1. RAD.CompositeElementDbRef and RAD.CompositeElementNASequence have been 
added to more quickly annotate Affy data with Entrez Genes and RefSeq info 
respectively.
2. Another possibility is to use the external_database_release_id and 
source_id pair in RAD.ShortOligoFamily to point to one preferred 
annotation for each probe set (but you would have to choose one).
3. Another, less structured possibility is to use 
RAD.CompositeElementAnnotation, where you use the attribute 'name' to 
denote the kind of annotation (e.g. "Entrez Gene", "RefSeq", etc.) and 
the attribute 'value' for the annotation itself (e.g. the Entrez Gene 
id, the RefSeq id, etc.). This is less structured, but it will allow you 
to load as many annotations as you like.
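
For the third option, here is a hedged Python sketch of how rows of the 
Affymetrix annotation file could be flattened into (probe set, name, 
value) triples before loading. The column names, the "///" separator, 
the "---" placeholder and the "#" comment lines reflect my understanding 
of the Affymetrix CSV format, so please check them against your file. 
The function name and the name mapping are made up for illustration.

```python
import csv
import io

# Assumed CSV column -> value to store in the 'name' attribute.
ANNOTATION_COLUMNS = {"Entrez Gene": "Entrez Gene", "RefSeq Transcript ID": "RefSeq"}


def annotation_rows(csv_text):
    """Yield (probe_set_id, name, value) triples, one per annotation value."""
    # Skip header comment lines, assumed to start with '#'.
    lines = (line for line in io.StringIO(csv_text) if not line.startswith("#"))
    for record in csv.DictReader(lines):
        for column, name in ANNOTATION_COLUMNS.items():
            # A cell may hold several ids separated by '///' (assumption).
            for value in (record.get(column) or "").split("///"):
                value = value.strip()
                if value and value != "---":  # '---' assumed to mean "no annotation"
                    yield record["Probe Set ID"], name, value
```

Each triple would correspond to one RAD.CompositeElementAnnotation entry 
for the composite element matching that probe set.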

Elisabetta