Re: [GeneX-dev] DataLoader

Hey Hilmar!

Interesting to see that Novartis has taken an interest.

"Hilmar Lapp" <la...@gn...> writes:

> Quoting "Jason E. Stewart" <ja...@op...>:
> 
> > There needs to be some significant
> > database changes to make it work properly, and this will be released
> > in the Genex-2 branch under development.
> 
> May I ask how this will in general work, and what the required DB
> changes were about?

Mind you, this is for the upcoming Genex-2 version, not the code that
is already available. Genex-2 was underway before the MAGE model was
finalized. The primary change between Genex-1 and Genex-2 was a far
more useful security model. In Genex-1 you can only protect
ExperimentSets, ArrayMeasurements, and AM_Spots. All the rest is world
viewable. 

Genex-2 enables you to protect *all* data: protocols, samples,
contacts, etc. It also introduces audit information so you can track
what was changed and by whom. And it introduces a generic
authentication mechanism used by all CGI scripts -- so you have to
log in to the system before viewing data, making queries, or
manipulating data.

Because MAGE is now (mostly) finalized, a good deal of the work
planned for Genex-2 is renaming objects/tables to fit MAGE
nomenclature, plus adding a number of tables/objects specified by
MAGE. Genex-2 will *not* be fully MAGE compliant, but it will have
major pieces.

> Some background as to why I'm interested in the details: With my
> previous employer we actually together with a consultant developed a
> high-throughput general database loader, which would take any
> record-oriented input file and load it to any relational
> database. The limitation is obviously SQL on the DB end; i.e.,
> anything you cannot load through SQL cannot be loaded with that
> tool.  That limits you to a) insert into 1 table at a time, or b)
> insert into 1 view at a time, provided you can attach insert
> triggers to the view (which you can in Oracle), or c) call a stored
> procedure.  We used b) and c), with all the relational logic
> (LU,PK,FK etc) staying within the DB. I'm actually trying to get
> them to release the code (Java), not sure how successful this is
> going to be.

Sounds pretty cool. If you wanted to, that could easily be hosted at
the MAGEstk site (mged.sf.net), the GeneX site, or at OpenInformatics
(www.openinformatics.org). 

The Genex-2 data loader will *not* be a general purpose solution; it
will strictly handle microarray data. You will need to specify two
templates in order to use the loader:

1) the ArrayLayout (or ArrayDesign if you speak MAGE)
2) the QuantitationDimension (from MAGE) that is defined by the
   combination of array technology and feature extraction software you
   used. This is a mapping that describes how many columns are in the
   output file, what their data types are, and what the semantic
   meaning of each column is
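
To make 2) a bit more concrete, here is a rough Python sketch of what
such a column mapping could look like. This is only an illustration,
not the Genex-2 format: the names QuantColumn, EXAMPLE_DIMENSION and
parse_row, the three sample columns, and the tab-delimited input are
all assumptions of mine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QuantColumn:
    """One column of the feature extraction output (hypothetical)."""
    name: str                      # column name
    cast: Callable[[str], object]  # data type, as a conversion function
    meaning: str                   # semantic meaning of the value


# Hypothetical mapping for one array technology + extraction software.
EXAMPLE_DIMENSION: List[QuantColumn] = [
    QuantColumn("spot_id", int, "identifier of the spot/Feature"),
    QuantColumn("signal", float, "measured intensity"),
    QuantColumn("background", float, "local background intensity"),
]


def parse_row(line: str, columns: List[QuantColumn]) -> Dict[str, object]:
    """Split one tab-delimited row and cast each field per the mapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(
            f"expected {len(columns)} columns, got {len(fields)}"
        )
    return {col.name: col.cast(field) for col, field in zip(columns, fields)}
```

The mapping carries the three things listed above: the number of
columns (the length of the list), the type of each, and what each
one means.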

Once they are specified it is just a matter of slurping in rows of
data from the array files and entering them into the appropriate table
in the DB. A major change in Genex-2 will be how the AM_Spots table is
handled. In Genex-1 there is a single table into which all data is
smashed. This works, and it is very general, but it creates too many
problems. The solution that we've decided to pursue in Genex-2 is to
use a different AM_Spots table for each new
QuantitationDimension. That way if your data generates an array of 80
floats for each spot (or Feature in MAGE speak), all of those 80
numbers will go into a single row in the AM_Spots table for that
technology. Genex-1 would force you to create 80 ArrayMeasurements
each with a single value/spot in the AM_Spots table (yuck!).
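
Here is a small Python sketch of the one-row-per-spot idea, using an
in-memory SQLite DB only so that it runs standalone. The table name
am_spots_example and the columns am_fk, spot_fk and v0..v79 are
invented for the example; the real Genex-2 schema may well look
different.

```python
import sqlite3

N_VALUES = 80  # floats per spot, as in the example above


def create_spots_table(conn, table, n_values):
    """Create one AM_Spots-style table with a column per measured value.

    The table name is interpolated directly into the SQL, so it must
    come from trusted code, never from user input.
    """
    value_cols = ", ".join(f"v{i} REAL" for i in range(n_values))
    conn.execute(
        f"CREATE TABLE {table} "
        f"(am_fk INTEGER, spot_fk INTEGER, {value_cols})"
    )


def insert_spot(conn, table, am_fk, spot_fk, values):
    """Store all values measured for one spot as a single row."""
    placeholders = ", ".join("?" for _ in range(len(values) + 2))
    conn.execute(
        f"INSERT INTO {table} VALUES ({placeholders})",
        (am_fk, spot_fk, *values),
    )


conn = sqlite3.connect(":memory:")
create_spots_table(conn, "am_spots_example", N_VALUES)
insert_spot(conn, "am_spots_example", 1, 42,
            [float(i) for i in range(N_VALUES)])
row_count = conn.execute(
    "SELECT COUNT(*) FROM am_spots_example"
).fetchone()[0]
```

All 80 values for the spot end up in one row of the
technology-specific table, instead of 80 rows spread over 80
ArrayMeasurements.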

> > In the mean time, I took code
> > that was graciously donated by Michael Pear, and got a data loader
> > working for Genex-1.
> > 
> > In the meantime, if you want to help pre-test the code, let me know.
> 
> Sure. Especially if it helps me migrate a couple of thousand chip data
> to our local GeneX in order to test its performance.

You can check the code out from CVS. You'll want to use the
'Rel-1_0_1-branch' branch. Info on how to get the code from CVS is at:

  https://sourceforge.net/cvs/?group_id=16453

Once you've logged in you'll want to do the following:

  cvs -d:pserver:ano...@cv...:/cvsroot/genex \
     co -r Rel-1_0_1-branch genex-server

except of course you want it all on one line without the backslash...

That will give you a working copy of GeneX-Server-1.0.5. The
dataloader is in the affyloader/ directory.

  !!! WARNING !!!

There isn't a huge amount of documentation available on the code. I've
added a USAGE to each and a --help flag that *should* print out useful
info, but YMMV. Please write to the list if you need help.

You'll want to run a complete install even if you already have a
working GeneX installation, because there were two changes to the DB:
one fixes a bug in AL_Spots (the primary key was not being
auto-generated), and the other adds a view on the AM_Spots table. So
you want to make sure that the DB installer runs and downloads the new
DB init file (1.0.5) from the internet.

BTW, GeneX has a nice feature for updating an existing
installation. Check out the section on 'Updating an installation' in
the INSTALL file.

jas.