Re: [GeneX-dev] Schema, Genex-2

Quoting "Jason E. Stewart" <ja...@op...>:

> > Is there an ERD for the new schema already? (BTW which tool did you use
> > to create the current one which is available on the website?)
> 
> No, sorry. That is one of the priority one tasks that I indicated for
> the virginia consortium. 

I'm happy to see that you're going to use ArgoUML instead of ERwin. 

> 
> Ooops. I forgot to send the agenda in my last email. It has my
> proposal for Genex-2. I'm sending it with this mail.
> 

So the Virginia collaborators agreed and committed to do all of that?
Did you agree on a timeline?

Does agenda point A) mean that you'd better not try to squeeze Affy data
into GeneX right now?

> > > QuantitationDimension. That way if your data generates an array of 80
> > > floats for each spot (or Feature in MAGE speak), all of those 80
> > > numbers will go into a single row in the AM_Spots table for that
> > > technology. Genex-1 would force you to create 80 ArrayMeasurements
> > > each with a single value/spot in the AM_Spots table (yuck!).
> > 
> > So you're going to denormalize. Did you run into performance problems,
> > and if so, on which end, or in which situations? (Trying to learn from
> > your experience.)
> 
> Sorry, not sure which case you mean when you say that we're going to
> denormalize, Genex-2 or Genex-1? If Genex-2 I'm not sure that it is
> really denormalizing, is it? Every array which produces output using a
> given QuantitationDimension will have a separate AM_Spots table.

Well, I'll probably have to acquaint myself really well with what
QuantitationDimension really is about before I continue to make
dumb statements. Anyway, in the first place my gut feeling from all
the DB work I did before tells me that there is something wrong if
you need to create tables on the fly for a particular new dataset
coming in.

> In
> Genex-1 we broke apart data that should never have been split in the
> first place, e.g. creating separate ArrayMeasurements for:
> 
> * Channel 1 background
> * Channel 1 intensity
> * Channel 1 background subtracted intensity
> * Channel 2 background
> * Channel 2 intensity
> * Channel 2 background subtracted intensity
> * Channel 1/Channel 2 ratio
> 
> when they all should have been a single ArrayMeasurement with 7
> columns in the AM_Spots table. Genex-2 will fix that.

Again, without the ERD in front of me the following statements may be
very dumb. But if you really are going to create a 7-column row
for every feature on every chip, this will get you into big trouble
with Affy chips, which with the current technology have 408k
features on a single chip (and this is about to go up).
I.e., every unused float adds 8 bytes x 408k = about 3.3 MB to the storage for
every chip (i.e., gigabytes for 1000s of chips), unless Postgres is as smart as Oracle,
which doesn't physically store NULLs. But then you have the block length in
Oracle, which this row is not going to exceed anyway (in fact, in Oracle you
would use a CLUSTER for this).
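
As a rough back-of-the-envelope check of the numbers above, here is a small
sketch. It assumes 8-byte floats, 408,000 features per chip, and that unused
values are physically stored (no NULL compression). The function name and the
"six unused columns" case are made up for illustration only.

```python
BYTES_PER_FLOAT = 8          # assumption: double-precision float column
FEATURES_PER_CHIP = 408_000  # "408k" features on one Affy chip, as stated above


def unused_column_bytes(n_chips, n_unused_columns=1):
    """Bytes spent on unused float columns if they are physically stored."""
    return BYTES_PER_FLOAT * FEATURES_PER_CHIP * n_unused_columns * n_chips


# One unused column on a single chip, then scaled up to 1000 chips.
print(f"1 column, 1 chip:     {unused_column_bytes(1) / 1e6:.2f} MB")
print(f"1 column, 1000 chips: {unused_column_bytes(1000) / 1e9:.2f} GB")
# Hypothetical worst case: only one of the seven columns is actually used.
print(f"6 columns, 1000 chips: {unused_column_bytes(1000, 6) / 1e9:.2f} GB")
```

This ignores per-row overhead, the PK, and index storage, so the real
footprint would be larger.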

Regarding AM_Spots, I'm also not sure whether you really need the PK
there. You could merge in AM_SuspectSpots (0..n), and I'm not sure why
the relationship to AL_Spots has to be n..n (I may easily be missing
something). As mentioned before, tossing the PK potentially saves you
GBs of storage, let alone the index storage, and it can save considerable
time on import.

BTW storing the ratio too seems redundant unless you have different
methods to compute it, in which case it would be a 0..n relationship.
The same goes for the background-subtracted intensity.
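
To illustrate the redundancy point: if only one computation method is in use,
the derived values follow from the four raw ones and could be computed on
read instead of being stored. A minimal sketch, assuming the simplest
definitions (intensity minus background, and a plain ratio of the subtracted
intensities). The class, function, and field names are hypothetical and not
part of the GeneX schema.

```python
from typing import NamedTuple, Optional


class RawSpot(NamedTuple):
    """The four raw per-spot values (hypothetical field names)."""
    ch1_background: float
    ch1_intensity: float
    ch2_background: float
    ch2_intensity: float


def background_subtracted(intensity: float, background: float) -> float:
    """Assumed definition: plain subtraction of background from intensity."""
    return intensity - background


def channel_ratio(spot: RawSpot) -> Optional[float]:
    """Channel 1 / Channel 2 ratio of the background-subtracted intensities.

    Returns None when the channel 2 value is zero, since the ratio is
    undefined in that case.
    """
    ch1 = background_subtracted(spot.ch1_intensity, spot.ch1_background)
    ch2 = background_subtracted(spot.ch2_intensity, spot.ch2_background)
    if ch2 == 0:
        return None
    return ch1 / ch2


spot = RawSpot(ch1_background=10.0, ch1_intensity=110.0,
               ch2_background=20.0, ch2_intensity=70.0)
print(channel_ratio(spot))
```

If several methods exist (e.g. different background corrections), each result
would be a separate row tied to its method, which is the 0..n case.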

     -hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: la...@gn...
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------