Re: [GUSDEV] Affymetrix .CEL files


A clarification regarding point (2) and my response to that below.
LoadBatchArrayResults is more flexible regarding input format than 
LoadArrayResults. In fact, LoadArrayResults requires the data_file 
provided to be in the format specified in the documentation:
https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Supported/doc/LoadArrayResults.html
so typically requires some parsing of the original software output prior 
to being input into this plugin.
LoadBatchArrayResults assumes that the software output is tab-delimited 
text. Output from programs like MAS4, MAS5, RMAExpress, MOID, GenePix or 
ArrayVision can typically be used as is, but the user needs to provide 
an xml_file which tells the plugin how this output should be reformatted 
before the plugin calls LoadSimpleArrayResults (a simplified version of 
LoadArrayResults that requires a similar data_file input format) to load 
it into RAD.
The files LoadBatchArrayResults*.xml in
https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Community/config/
are examples of such specifications.
Basically they tell the plugin how to map the columns of the 
software output to columns whose headers are acceptable as data_file input 
for LoadSimpleArrayResults, i.e. columns compatible with the fields of 
the view to be populated. It is also possible in this xml file (see 
GenePix example) to specify how to transform a subset of these columns 
through a function (e.g. see coordGenePix2RAD).
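
To make the mapping idea concrete, here is a minimal Python sketch. It is 
not the plugin's code (the GUS plugins are written in Perl and driven by 
the xml_file), and the header names "Probe Set Name", "Signal", 
"composite_element_name" and "signal" are made up for illustration only. 
It just shows the two steps described above: renaming input columns to 
the headers the loader expects, and optionally passing a column through 
a transforming function.

```python
import csv
import io

# Hypothetical mapping: input header -> header expected in the data_file.
# These names are illustrative assumptions, not the real RAD view fields.
COLUMN_MAP = {
    "Probe Set Name": "composite_element_name",
    "Signal": "signal",
}


def remap_columns(src, dst, column_map, transforms=None):
    """Copy tab-delimited rows from src to dst, keeping only mapped columns.

    column_map: input header -> output header.
    transforms: optional dict of output header -> function applied to the value.
    """
    transforms = transforms or {}
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst,
        fieldnames=list(column_map.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        out = {}
        for in_name, out_name in column_map.items():
            value = row[in_name]
            if out_name in transforms:
                value = transforms[out_name](value)
            out[out_name] = value
        writer.writerow(out)


if __name__ == "__main__":
    software_output = io.StringIO(
        "Probe Set Name\tSignal\tDetection\n1007_s_at\t123.4\tP\n"
    )
    data_file = io.StringIO()
    remap_columns(software_output, data_file, COLUMN_MAP)
    print(data_file.getvalue(), end="")
```

Columns that are not in the mapping (here "Detection") are simply dropped, 
which mirrors the idea that only columns compatible with the target view 
are passed on.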
Thus for your APT files for RMA, if they are tab-delimited text, by 
providing the correct xml_file which tells LoadBatchArrayResults how to 
read them and map their columns to fields in the RAD.RMAExpress view, you 
can load them through this plugin (note the xml file will be similar to 
the RMAExpress.xml example at the URL above, but you might have to 
adjust the input header names according to those in your file).
For Plier, as mentioned, there is no view yet. Once the view is created, 
if the APT output is tab-delimited, you could use LoadBatchArrayResults, 
but first you need to extend its code to accommodate the Plier protocol 
(the list of protocols currently accepted by LoadBatchArrayResults is at
https://www.cbil.upenn.edu/svn/gus/GusAppFramework/trunk/Community/doc/LoadBatchArrayResults.html) 
and second you will need to create the appropriate xml_file which 
describes how the APT output should be mapped.
As mentioned below though, if in your situation you expect to always use 
only a couple of software packages for summarization, always with the 
same type of format, it might be more efficient to write a specific (and 
simpler) plugin that deals directly only with those.
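
As a rough illustration of what the parsing half of such a specific 
loader might look like, here is a small Python sketch (a real GUS plugin 
would be written in Perl using the Plugin package, so this only shows 
the logic). It rests on assumptions about the APT summary file that you 
should check against your own output: that lines starting with "#" are 
metadata, that the first remaining line is a tab-delimited header whose 
first column identifies the probe set, and that each other column holds 
the values for one .CEL file. The function name read_summary is made up.

```python
import csv
import io


def read_summary(handle):
    """Yield (assay, probeset, value) from a tab-delimited summary file.

    Assumed layout (verify against your own files): '#' lines are
    metadata, then a header row with the probe set id in the first
    column and one column per assay, then one row per probe set.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)
    assays = header[1:]
    for row in reader:
        if not row:  # skip blank lines
            continue
        probeset = row[0]
        for assay, value in zip(assays, row[1:]):
            yield assay, probeset, float(value)


if __name__ == "__main__":
    example = io.StringIO(
        "#%some_metadata=value\n"
        "probeset_id\tsample1.CEL\tsample2.CEL\n"
        "1007_s_at\t8.5\t9.25\n"
    )
    for record in read_summary(example):
        print(record)
```

Each yielded record corresponds to one probe set result for one assay, 
which is the granularity at which results would be loaded into a 
RAD.CompositeElementResultImp view.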
Elisabetta

On Fri, 14 Sep 2007, Elisabetta Manduchi wrote:

>
> Hi Dave,
> I'll respond to 2 and 4. For (1) I defer to Junmin.
> For (3) all I can say is that it is in our lab's plans to release bug-fixes 
> and new releases of GUS, however this keeps being postponed due to other 
> priorities. In the meantime for postgresql questions re GUS, John Iodice might 
> be able to help you.
> Getting back to your question (2), first of all, as mentioned in my previous 
> email we currently have a view for RMA results, but we do not have a view for 
> Plier results. If you need a view for Plier in your instance of the DB 
> though, you can simply create such a view with the attributes you need in 
> your own instance. It would be a view of RAD.CompositeElementResultImp. Once 
> created, remember to update Core.TableInfo and rebuild GUS, so that the 
> objects for the new view are in place.
> The current available plugins to load data into RAD.CompositeElementResultImp 
> views are: LoadArrayResults (in Supported), which loads the results of one 
> assay at a time, and LoadBatchArrayResults, which we have already discussed. The 
> documentation of these plugins, available from svn, illustrates what the 
> input format should be. The idea guiding the design of these plugins we made 
> available was that they would be *generic*, i.e. they would be able to take 
> data from a wide variety of quantification software and load them into RAD. 
> So we opted for one generic code at the expense of some work to put the input 
> into the appropriate format.
> If a project/lab typically gets files in a particular data format, then it 
> might be worth it for them to write a plugin which is specific to that rather 
> than using the generic plugin. This way they can use the output as spit out 
> by the software they use. It is fairly simple to write a plugin specific to 
> one's needs using the Plugin package. So if you expect to deal most of the 
> time with a particular type of output (e.g. from APT) you might consider 
> writing a specific plugin.
>
> Regarding your question (4), the answer is no. We do not store images in GUS. 
> For certain types of images, like microarray images (e.g. files resulting 
> from scanning, like .TIF or .DAT) we store in the db their uri to the 
> fileserver (in RAD.Acquisition.uri).
> Hope this helps,
> Elisabetta
>
> ---
>
> On Fri, 14 Sep 2007, Dave Hau wrote:
>
>>  Junmin and Elisabetta, thanks again for your helpful comments.
>>
>>  Couple of questions.
>>
>>  1.  The HG-U133_Plus_2 array annotation file I downloaded from Affymetrix
>>  is an xml file in MAGE-ML format.  On the RAD download page (
>>  http://www.cbil.upenn.edu/downloads/RAD/ ), I see a tool called
>>  mage2tab-v0.9, which I assume would be able to convert the annotation file
>>  to MAGE-TAB format.  Then in order to load this MAGE-TAB file into GUS, I
>>  noticed on the CBIL Lab Meetings web page, for Thursday March 15, 2007,
>>  Junmin gave a talk on MR-Ti, and the description mentions the loadMageDoc
>>  GUS plugin.  I notice (and have downloaded) a file on the RAD download
>>  page called "MR_T_ForGUS35.tar.gz" but the loadMageDoc plugin is not in
>>  there.  Is there a way for me to obtain this plugin?
>>
>>  2.  I ran "apt-probeset-summarize" in the Affymetrix Power Tools (APT)
>>  package (
>>  http://www.affymetrix.com/support/developer/powertools/index.affx ) and
>>  obtained probe set data for my .CEL files, one set for RMA and another set
>>  for PLIER.  Is there a plugin that will readily load these APT output
>>  files into GUS as probe set data?
>>
>>  3.  The GUS installation I'm using is top of trunk from the CBIL svn
>>  repository.  This is because I'm using postgresql on the back end, and the
>>  3.5 GUS package gave me a lot of problems.  These seem to have been fixed
>>  in the top of trunk.  However, in order to use existing plugins, would it
>>  be advisable to use top of trunk (including the new schema changes for new
>>  features  that Elisabetta mentioned)?  If not, is there, or do you plan on
>>  releasing a bug-fix version of 3.5 that contains bug fixes back-ported to
>>  3.5, but does not contain any of the new features not yet released?
>>
>>  4.  Is there any way in RAD or GUS to load pathological images (e.g.
>>  associated with biosamples used for hybridization) into the GUS database?
>>
>>  Thanks very much,
>>  Dave
>> 
>> 
>>
>>  Junmin Liu wrote:
>> >  Hi, Dave,
>> >  Again in line:
>> > 
>> > > >  The consensus not to load CEL files into the database - is it 
>> > > >  because we only
>> > > >  query for probe set data based on the gene, but not for probe cell 
>> > > >  data? If I
>> > > 
>> > >  yes typically people query the summarized results at the probe set
>> > >  level.
>> > 
>> >  Generally speaking, schema design and data management have to be in the 
>> >  context of contract or any requirements you are obligated to.
>> > 
>> >  Ask the question what is the next if you load CEL? or what is the next 
>> >  if you load array data and etc?
>> > 
>> >  GUS and its app stacks certainly will allow you to do those things, but it 
>> >  is critical that you make some judgement calls. And the cost of loading raw 
>> >  data then querying them out is pretty expensive.
>> > 
>> > >  There are multiple choices for where to store array annotation at the
>> > >  moment.
>> > >  1. RAD.CompositeElementDbRef and RAD.CompositeElementNASequence have 
>> > >  been
>> > >  added to more quickly annotate Affy data with Entrez Genes and RefSeq 
>> > >  info
>> > >  respectively.
>> > >  2. Another possibility is to use the external_database_release_id and
>> > >  source_id pair in RAD.ShortOligoFamily to point to one preferred
>> > >  annotation for each probe set (but you would have to choose one).
>> > >  3. Another, less structured possibility, is to use
>> > >  RAD.CompositeElementAnnotation, where you use the attribute 'name' to
>> > >  denote the annotation (e.g. "Entrez Gene", "RefSeq", etc.) and the
>> > >  attribute 'value' for the annotation (e.g. entrez gene id, or refseq 
>> > >  id,
>> > >  etc.) itself. This is less structured but it will allow you to load 
>> > >  as
>> > >  many annotations as you like.
>> > 
>> >  I normally favor a consistent data management policy; that means you 
>> >  don't need documentation somewhere saying "case 1, load data into table 
>> >  a, b, c; case 2, load data into table d, e, f; case 3, load data into 
>> >  table g, h, i", which not only makes your data loading tough, but also 
>> >  will make your app code built on top of the db stink.
>> > 
>> >  We didn't manage our own db perfectly either. But hopefully our 
>> >  experiences could prove useful to you.
>> > 
>> >  I strongly suggest you look at the MAGE-Tab spec for raw/processed data 
>> >  and the ADF spec for array data on the ArrayExpress site, for MAGE-Tab and 
>> >  ADF have proved to be very effective for large dbs like AE. If you can make 
>> >  your app/db align to the standards, as we are trying to do also, it 
>> >  will certainly give you a safe edge.
>> > 
>> >  ---junmin
>> > 
>> 
>

-- 
Elisabetta Manduchi

Computational Biology and Informatics Laboratory
Center for Bioinformatics
University of Pennsylvania
1428 Blockley Hall
423 Guardian Drive
Philadelphia, PA 19104-6021

phone: 215-573-4408
fax: 215 573-3111
email: man...@pc...
web: http://www.cbil.upenn.edu/~manduchi

---