From: <ja...@op...> - 2001-11-28 02:43:21
Hey Hilmar,

Thanks for taking the time to voice your ideas/concerns.

"Hilmar Lapp" <la...@gn...> writes:

> Quoting "Jason E. Stewart" <ja...@op...>:
>
> So the Virginia collaborators agreed and committed to do all of that?
> Did you agree on a timeline?

They have not yet committed to the proposal. They are currently
discussing priorities and deciding what they really need and when. The
only work I've actually done is what I just talked about for the Genex-1
branch.

> Does agenda point A) mean that you better not try to squeeze Affy data
> into GeneX right now?

Sorry, I didn't mean to give you that impression. You most certainly
can; the real issue is that a decision needs to be made about how you
plan to represent replicate spots (i.e., the same nucleic acid
spotted/synthesized in different locations on the array). We thought we
had made a clever decision in forcing users to break replicate spots
into separate data columns (i.e., stored as separate ArrayMeasurements
in the DB). This looked appealing, but it screws too many things up.
You'll notice that the AM_Spots table has a FK to the UserSequenceTable
but not to the AL_Spots table. This wants to be the other way around.

> Well, I'll probably have to acquaint myself really well with what
> QuantitationDimension really is about before I continue to make dumb
> statements.

A QuantitationDimension is an ordered list of QuantitationTypes, which
in turn specify the data type of a piece of data as well as its semantic
meaning (intensity, background, etc.). Every data matrix has two
dimensions: the DesignElement dimension (the number of spots or genes on
the array) and the QuantitationDimension (the number of columns and
their data types).

> Anyway, in the first place my gut feeling from all the DB work I did
> before tells me that there is something wrong if you need to create
> tables on the fly for a particular new dataset coming in.

I agree.

> Again, without the ERD in front of me the following statements may be
> very dumb.
> But if you really are going to create a 7-column row for every feature
> on every chip, this will put you into big trouble for Affy chips, for
> which with the current technology you have 408k features on one single
> chip (and this is about to go up). I.e., every unused float adds
> 8x408k=3.2M to the storage for every chip (i.e., gigabytes for 1000s
> of chips),

Perhaps I wasn't clear. I don't want a single spot table that forces 7
columns for every spot; I want different tables, each with a different
number of columns, so that your data can be dovetailed to exactly the
correct table for the data matrix. If the feature extraction software
you use produces 40 columns of output (ScanAlyze), I don't want to force
all arrays to have 40 floats. I suspect that after a while most arrays
in the DB will be derived calculations based on the raw data. They will
likely have only a single column of data (the value), or at most a few
(average, std. dev., variance).

In MAGE, each array has an ArrayDesign (ArrayLayout in GeneX) and a
QuantitationDimension. The QuantitationDimension determines which spot
table is used for the data. I hope this is clearer.

In terms of gigabytes of data, that's the price. If you want to keep
all 40 values that ScanAlyze gives you, then you wind up with a DB like
SMD, with 300M spots running on a 64-processor E10k. I'm hoping that
the QuantDim solution will mean that most of the data can be kept in a
table with only a single data column. Much leaner.

The only other solution that was presented to me was a completely
generic one that meant doing a three-table join to get back all the
data for each spot. It looked horribly inefficient and incredibly
obtuse. I figure I've already wasted 3 months of programming time
trying to explain all the bad, complicated design decisions in genex-1;
I vote for simple. If you have any better ideas *PLEASE* suggest them.
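To make the comparison concrete, here is a minimal sketch in Python/SQLite
contrasting the two layouts. All table and column names (am_spots_ratio,
spot_data, quant_types, als_fk, etc.) are hypothetical stand-ins, not the
actual GeneX or MAGE schema — the point is only the shape of the queries:
a table dovetailed to one QuantitationDimension is a single-table scan,
while the fully generic layout needs a three-table join.

```python
# Sketch (hypothetical names, NOT GeneX code): a spot table whose
# columns match the array's QuantitationDimension exactly, versus a
# generic (spot, type, value) layout that must be joined back together.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dovetailed approach: a derived array with a single-column
# QuantitationDimension gets a table with exactly one data column.
cur.execute("CREATE TABLE am_spots_ratio (als_fk INTEGER, ratio REAL)")
cur.execute("INSERT INTO am_spots_ratio VALUES (1, 2.5), (2, 0.8)")

# Generic alternative: every datum is one (spot, type, value) row,
# so recovering a spot's full row means a three-table join plus pivot.
cur.execute("CREATE TABLE spots (spot_pk INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE quant_types (qt_pk INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE spot_data (spot_fk INTEGER, qt_fk INTEGER, value REAL)")
cur.executemany("INSERT INTO spots VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO quant_types VALUES (?, ?)",
                [(1, "intensity"), (2, "background")])
cur.executemany("INSERT INTO spot_data VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 40.0),
                 (2, 1, 900.0), (2, 2, 35.0)])

# Narrow table: one row per spot, no join.
narrow = cur.execute("SELECT als_fk, ratio FROM am_spots_ratio").fetchall()

# Generic layout: three tables joined just to see the same data.
eav = cur.execute("""
    SELECT s.spot_pk, qt.name, d.value
    FROM spots s
    JOIN spot_data d ON d.spot_fk = s.spot_pk
    JOIN quant_types qt ON qt.qt_pk = d.qt_fk
    ORDER BY s.spot_pk, qt.qt_pk
""").fetchall()

print(narrow)
print(eav)
```

Note how the generic layout also triples the row count (one row per
value rather than per spot), which is where the storage and join cost
in the 408k-features-per-chip arithmetic comes from.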
The ideas I have using the QuantDim approach are *NOT* set in stone, by
any means. From the sound of things you have a great deal more
experience than I do in building databases.

> unless Postgres is as smart as Oracle which doesn't physically store
> NULLs. But then you have the block length in Oracle, which this row
> is not going to exceed anyway (in fact, in Oracle you would use a
> CLUSTER for this).

I honestly don't know.

> Regarding AM_Spots, I'm also not sure whether you really need the PK
> there.

Nope. We don't. Bad design decision. Especially since Postgres gives us
oid's in every row anyway.

> You could merge in AM_SuspectSpots (0..n), and I'm not sure why
> the relationship to AL_Spots has to be n..n (I may easily be missing
> something). As mentioned before, tossing the PK saves you potentially
> GBs of storage, let alone the index-storage, and it can save
> considerable time on import.

All correct. Actually, the SpotLink table was not there because we
thought the relationship was n..n; it was for efficiency. Because we
thought that every spot would be indexed via the usf_fk, the only spots
that needed to know which AL_Spot they came from were those without
sequence features, i.e. blanks and controls. Since there are only a few
of those on every array, we didn't want an extra als_fk column in the
AM_Spots table. The same reasoning was used for AM_SuspectSpots: only a
few spots will be bad, so put them in a separate table. As I say, these
turned out to be poor design choices.

> BTW storing the ratio seems to be redundant unless you have different
> methods to compute it, and in that case it would be a 0..n
> relationship. Same goes for background-subtracted intensity.

Sometimes that is the data we will get. The database has to be able to
handle it if users want to store it. I'm hoping the QuantDim idea makes
things both flexible and efficient.

Genex-1 was originally designed to only house an NCGR repository.
It was later changed to be more lab-centric. Its current goals are
almost completely lab-centric, because the biologists are the people
who really need this technology.

Cheers,
jas.