From: <ja...@op...> - 2001-11-28 02:43:21
Hey Hilmar,

Thanks for taking the time to voice your ideas/concerns.

"Hilmar Lapp" <la...@gn...> writes:

> Quoting "Jason E. Stewart" <ja...@op...>:
>
> So the Virginia collaborators agreed and committed to do all of that?
> Did you agree on a timeline?

They have not yet committed to the proposal. They are currently
discussing priorities and deciding what they really need and when. The
only work I've actually done is what I just talked about for the Genex-1
branch.

> Does agenda point A) mean that you better not try to squeeze Affy data
> into GeneX right now?

Sorry, I didn't mean to give you that impression. You most certainly
can; the real issue is that a decision needs to be made about how you
plan to represent replicate spots (i.e., the same nucleic acid
spotted/synthesized in different locations on the array). We thought we
had made a clever decision in forcing users to break replicate spots
into separate data columns (i.e., stored as separate ArrayMeasurements
in the DB). This looked appealing, but it screws too many things up.
You'll notice that the AM_Spots table has a FK to the UserSequenceTable
but not to the AL_Spots table. This wants to be the other way around.

> Well, I'll probably have to acquaint myself really well with what
> QuantitationDimension really is about before I continue to make dumb
> statements.

A QuantitationDimension is an ordered list of QuantitationTypes, which
in turn specify the data type of a piece of data as well as its semantic
meaning (intensity, background, etc.). Every data matrix has two
dimensions: the DesignElement dimension (the number of spots or genes on
the array) and the QuantitationDimension (the number of columns and
their data types).

> Anyway, in the first place my gut feeling from all the DB work I did
> before tells me that there is something wrong if you need to create
> tables on the fly for a particular new dataset coming in.

I agree.

> Again, without the ERD in front of me the following statements may be
> very dumb.
> But if you really are going to create a 7-column row for every feature
> on every chip, this will put you into big trouble for Affy chips, for
> which with the current technology you have 408k features on one single
> chip (and this is about to go up). I.e., every unused float adds
> 8x408k=3.2M to the storage for every chip (i.e., gigabytes for 1000s
> of chips),

Perhaps I wasn't clear. I don't want a single spot table that forces 7
columns for every spot; I want different tables, each with a different
number of columns, so that your data can be dovetailed to exactly the
correct table for the data matrix. If the feature extraction software
you use produces 40 columns of output (ScanAlyze), I don't want to force
all arrays to have 40 floats. I suspect that after a while most arrays
in the DB will be derived calculations based on the raw data. They will
likely have only a single column of data (the value), or at most a few
(average, std. dev., variance).

In MAGE, each array has an ArrayDesign (ArrayLayout in GeneX) and a
QuantitationDimension. The QuantitationDimension determines which spot
table is used for the data. I hope this is clearer.

In terms of gigabytes of data, that's the price. If you want to keep
all 40 values that ScanAlyze gives you, then you wind up with a DB like
SMD, with 300M spots running on a 64-processor E10k. I'm hoping that
the QuantDim solution will mean that most of the data can be kept in a
table with only a single data column. Much leaner.

The only other solution that was presented to me was a completely
generic one that meant doing a three-table join to get back all the
data for each spot. It looked horribly inefficient and incredibly
obtuse. I figure I've already wasted 3 months of programming time
trying to explain all the bad, complicated design decisions in genex-1;
I vote for simple. If you have any better ideas *PLEASE* suggest them.
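To make the comparison concrete, here is a minimal sketch in Python/SQLite
contrasting the two layouts. All table and column names (am_spots_ratio,
spot_data, quant_types, als_fk, etc.) are hypothetical stand-ins, not the
actual GeneX or MAGE schema — the point is only the shape of the queries:
a table dovetailed to one QuantitationDimension is a single-table scan,
while the fully generic layout needs a three-table join.

```python
# Sketch (hypothetical names, NOT GeneX code): a spot table whose
# columns match the array's QuantitationDimension exactly, versus a
# generic (spot, type, value) layout that must be joined back together.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dovetailed approach: a derived array with a single-column
# QuantitationDimension gets a table with exactly one data column.
cur.execute("CREATE TABLE am_spots_ratio (als_fk INTEGER, ratio REAL)")
cur.execute("INSERT INTO am_spots_ratio VALUES (1, 2.5), (2, 0.8)")

# Generic alternative: every datum is one (spot, type, value) row,
# so recovering a spot's full row means a three-table join plus pivot.
cur.execute("CREATE TABLE spots (spot_pk INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE quant_types (qt_pk INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE spot_data (spot_fk INTEGER, qt_fk INTEGER, value REAL)")
cur.executemany("INSERT INTO spots VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO quant_types VALUES (?, ?)",
                [(1, "intensity"), (2, "background")])
cur.executemany("INSERT INTO spot_data VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 40.0),
                 (2, 1, 900.0), (2, 2, 35.0)])

# Narrow table: one row per spot, no join.
narrow = cur.execute("SELECT als_fk, ratio FROM am_spots_ratio").fetchall()

# Generic layout: three tables joined just to see the same data.
eav = cur.execute("""
    SELECT s.spot_pk, qt.name, d.value
    FROM spots s
    JOIN spot_data d ON d.spot_fk = s.spot_pk
    JOIN quant_types qt ON qt.qt_pk = d.qt_fk
    ORDER BY s.spot_pk, qt.qt_pk
""").fetchall()

print(narrow)
print(eav)
```

Note how the generic layout also triples the row count (one row per
value rather than per spot), which is where the storage and join cost
in the 408k-features-per-chip arithmetic comes from.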
The ideas I have using the QuantDim approach are *NOT* set in stone, by
any means. From the sound of things you have a great deal more
experience than I do in building databases.

> unless Postgres is as smart as Oracle which doesn't physically store
> NULLs. But then you have the block length in Oracle, which this row
> is not going to exceed anyway (in fact, in Oracle you would use a
> CLUSTER for this).

I honestly don't know.

> Regarding AM_Spots, I'm also not sure whether you really need the PK
> there.

Nope. We don't. Bad design decision. Especially since Postgres gives us
oid's in every row anyway.

> You could merge in AM_SuspectSpots (0..n), and I'm not sure why
> the relationship to AL_Spots has to be n..n (I may easily be missing
> something). As mentioned before, tossing the PK saves you potentially
> GBs of storage, let alone the index-storage, and it can save
> considerable time on import.

All correct. Actually, the SpotLink table was not there because we
thought the relationship was n..n; it was for efficiency. Because we
thought that every spot would be indexed via the usf_fk, the only spots
that needed to know which AL_Spot they came from were those without
sequence features, i.e. blanks and controls. Since there are only a few
of those on every array, we didn't want an extra als_fk column in the
AM_Spots table. The same reasoning was used for AM_SuspectSpots: only a
few spots will be bad, so put them in a separate table. As I say, these
turned out to be poor design choices.

> BTW storing the ratio seems to be redundant unless you have different
> methods to compute it, and in that case it would be a 0..n
> relationship. Same goes for background-subtracted intensity.

Sometimes that is the data we will get. The database has to be able to
handle it if users want to store it. I'm hoping the QuantDim idea makes
things both flexible and efficient.

Genex-1 was originally designed to only house an NCGR repository.
It was later changed to be more lab-centric. Its current goals are
almost completely lab-centric, because the biologists are the people
who really need this technology.

Cheers,
jas.