|
From: Todd F. P. <tf...@nc...> - 2001-12-16 05:51:23
|
like i said in a previous msg, the original schema can be mimicked by
using views. what i'm doing is attempting to optimize the underlying
db. the genex XML, MAGE XML and links to the other tools *should*
behave the same when I get this right. i'm only affecting a small
portion of the schema that should be transparent in the long run. i'm
not reinventing the wheel, just implementing it in a unique manner.

todd

On 14 Dec 2001, Jason E. Stewart wrote:

> "Todd F. Peterson" <tf...@nc...> writes:
>
> > oh yeah, it's fast and easy. that was not the case with the
> > curation tool/xml mechanism. we figure, first get the data in the
> > db and then annotate as necessary. this will be good for small
> > labs. hopefully, the design will allow scalability to larger
> > environments.
>
> I'm all for lightweight, fast, and easy, as well as annotation after
> the data is in the DB. None of that impacts what schema you use.
>
> If you invent a whole new schema you'll have to invent completely
> new tools that can work with that schema. If you store the data in a
> subset of the GeneX schema, you can use all the tools that work for
> GeneX.
>
> What is the advantage of the new schema?
>
> jas.
>
> PS. I'm on the list, you don't have to reply to me personally.
|
From: Michael P. <mic...@ho...> - 2001-12-16 19:40:04
|
Using views for compatibility while changing the underlying schema is
a great idea. But be aware that you can't directly insert, update or
delete from views. There is a mechanism for creating rules to handle
updating the tables behind a view, but I've never experimented with
that personally.

Regards,
Michael Pear

----- Original Message -----
From: "Todd F. Peterson" <tf...@nc...>
To: <gen...@li...>
Sent: Saturday, December 15, 2001 9:51 PM
Subject: Re: [GeneX-dev] Re: ERWin File on website

> like i said in a previous msg, the original schema can be mimicked
> by using views. what i'm doing is attempting to optimize the
> underlying db. the genex XML, MAGE XML and links to the other tools
> *should* behave the same when I get this right. i'm only affecting a
> small portion of the schema that should be transparent in the long
> run. i'm not reinventing the wheel, just implementing it in a unique
> manner.
>
> todd
>
> [rest of quoted thread snipped]
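A minimal sketch of that rule mechanism in PostgreSQL might look like
this (hypothetical table and column names, untested):

    -- Keep the old-schema "table" alive as a view over the new layout
    CREATE VIEW am_spots_old AS
        SELECT am_fk, spot_id, intensity
        FROM am_spots_new;

    -- Redirect inserts on the view into the real underlying table
    CREATE RULE am_spots_old_ins AS
        ON INSERT TO am_spots_old DO INSTEAD
        INSERT INTO am_spots_new (am_fk, spot_id, intensity)
        VALUES (NEW.am_fk, NEW.spot_id, NEW.intensity);

Similar ON UPDATE and ON DELETE rules would be needed to make the view
fully writable.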
|
From: Hilmar L. <la...@gn...> - 2001-11-28 01:38:21
|
Quoting "Jason E. Stewart" <ja...@op...>:
> > Is there an ERD for the new schema already? (BTW which tool did
> > you use to create the current one which is available on the
> > website?)
>
> No, sorry. That is one of the priority one tasks that I indicated for
> the virginia consortium.
I'm happy to see that you're going to use ArgoUML instead of ERwin.
>
> Ooops. I forgot to send the agenda in my last email. It has my
> proposal for Genex-2. I'm sending it with this mail.
>
So the Virginia collaborators agreed and committed to do all of that?
Did you agree on a timeline?
Does agenda point A) mean that you better not try to squeeze Affy data
into GeneX right now?
> > > QuantitationDimension. That way if your data generates an array
> > > of 80 floats for each spot (or Feature in MAGE speak), all of
> > > those 80 numbers will go into a single row in the AM_Spots table
> > > for that technology. Genex-1 would force you to create 80
> > > ArrayMeasurements each with a single value/spot in the AM_Spots
> > > table (yuck!).
> >
> > So you're going to denormalize. Did you run into performance
> > problems, and if so, on which end, or in which situations? (Trying
> > to learn from your experience.)
>
> Sorry, not sure which case you mean when you say that we're going to
> denormalize, Genex-2 or Genex-1? If Genex-2 I'm not sure that it is
> really denormalizing, is it? Every array which produces output using a
> given QuantitationDimension will have a separate AM_Spots table.
Well, I'll probably have to acquaint myself really well with what
QuantitationDimension really is about before I continue to make
dumb statements. Anyway, my gut feeling from all the DB work I did
before tells me that there is something wrong if you need to create
tables on the fly for each new dataset coming in.
> In
> Genex-1 we broke apart data that should never have been split in the
> first place, e.g. creating separate ArrayMeasurements for:
>
> * Channel 1 background
> * Channel 1 intensity
> * Channel 1 background subtracted intensity
> * Channel 2 background
> * Channel 2 intensity
> * Channel 2 background subtracted intensity
> * Channel 1/Channel 2 ratio
>
> when they all should have been a single ArrayMeasurement with 7
> columns in the AM_Spots table. Genex-2 will fix that.
Again, without the ERD in front of me the following statements may be
very dumb. But if you really are going to create a 7-column row
for every feature on every chip, this will put you into big trouble
for Affy chips, for which, with current technology, you have 408k
features on a single chip (and this number is about to go up).
I.e., every unused float adds 8 bytes x 408k = ~3.2MB to the storage
for every chip (i.e., gigabytes for 1000s of chips), unless Postgres
is as smart as Oracle,
which doesn't physically store NULLs. But then you have the block length in
Oracle, which this row is not going to exceed anyway (in fact, in Oracle you
would use a CLUSTER for this).
Regarding AM_Spots, I'm also not sure whether you really need the PK
there. You could merge in AM_SuspectSpots (0..n), and I'm not sure why
the relationship to AL_Spots has to be n..n (I may easily be missing
something). As mentioned before, tossing the PK saves you potentially
GBs of storage, let alone the index storage, and it can save
considerable time on import.
BTW storing the ratio too seems redundant unless you have different
methods to compute it, and in that case it would be a 0..n
relationship. Same goes for background-subtracted intensity.
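For concreteness, a slimmed-down AM_Spots along these lines might look
something like the following (hypothetical names, just a sketch of the
idea):

    CREATE TABLE am_spots (
        am_fk   INTEGER NOT NULL,       -- FK to ArrayMeasurement
        als_fk  INTEGER NOT NULL,       -- direct 1..n link to AL_Spots
        suspect BOOLEAN DEFAULT FALSE,  -- merged-in AM_SuspectSpots flag
        value   FLOAT
    );

i.e. no surrogate PK (and no PK index), no SpotLink table, and no
separate suspect-spot table to join against on import.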
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: la...@gn...
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
|
|
From: <ja...@op...> - 2001-11-28 02:43:21
|
Hey Hilmar,

Thanks for taking the time to voice your ideas/concerns.

"Hilmar Lapp" <la...@gn...> writes:

> Quoting "Jason E. Stewart" <ja...@op...>:
>
> So the Virginia collaborators agreed and committed to do all of that?
> Did you agree on a timeline?

They have not yet committed to the proposal. They are currently
discussing priorities and deciding what they really need and when.
The only work I've actually done is what I just talked about for the
Genex-1 branch.

> Does agenda point A) mean that you better not try to squeeze Affy
> data into GeneX right now?

Sorry, didn't mean to give you that impression. You most certainly
can; the real issue is that a decision needs to be made about how you
plan on representing replicate spots (i.e. the same nucleic acid
spotted/synthesized in different locations on the array). We thought
we had made a clever decision to force users to break replicate spots
into separate data columns (i.e. stored as separate ArrayMeasurements
in the DB). This looked appealing, but it screws too many things up.
You'll notice that the AM_Spots table has an FK to the
UserSequenceTable but not to the AL_Spots table. This wants to be the
other way around.

> Well, I'll probably have to acquaint myself really well with what
> QuantitationDimension really is about before I continue to make dumb
> statements.

A QuantitationDimension is an ordered list of QuantitationTypes, which
in turn specify the data type of a piece of data as well as its
semantic meaning (intensity, background, etc). Every data matrix has
two dimensions: the DesignElement dimension (the number of spots or
genes on the array) and the QuantitationDimension (the number of
columns and their data types).

> Anyway, my gut feeling from all the DB work I did before tells me
> that there is something wrong if you need to create tables on the
> fly for each new dataset coming in.

I agree.

> Again, without the ERD in front of me the following statements may
> be very dumb. But if you really are going to create a 7-column row
> for every feature on every chip, this will put you into big trouble
> for Affy chips, for which, with current technology, you have 408k
> features on a single chip. I.e., every unused float adds
> 8 bytes x 408k = ~3.2MB to the storage for every chip (i.e.,
> gigabytes for 1000s of chips),

Perhaps I wasn't clear. I don't want to have a single spot table that
forces 7 columns for every spot; I want to have different tables that
each have a different number of columns, so that your data can be
dovetailed to exactly the correct table for the data matrix. If the
feature extraction software you use produces 40 columns of output
(ScanAlyze), I don't want to force all arrays to have 40 floats. I
suspect that after a while, most arrays in the DB will be derived
calculations based on the raw data. They will likely only have a
single column of data (the value), or at most a few (average, std.
dev., variance). In MAGE, each array has an ArrayDesign (ArrayLayout
in GeneX) and a QuantitationDimension. The QuantitationDimension will
determine what spot table is used for the data. I hope this is more
clear.

In terms of gigabytes of data, that's the price. If you want to keep
all 40 values that ScanAlyze gives you, then you wind up with a DB
like SMD, with 300M spots running on a 64-processor E10k. I'm hoping
that the QuantDim solution will mean that most of the data can be kept
in a table with only a single data column. Much leaner.
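To make the QuantDim idea concrete, here is a rough sketch of what two
such spot tables might look like (hypothetical table and column names,
*not* a final design):

    -- A raw two-channel scan whose QuantitationDimension has 7
    -- QuantitationTypes gets a wide table ...
    CREATE TABLE am_spots_2ch_raw (
        am_fk     INTEGER NOT NULL,  -- FK to ArrayMeasurement
        als_fk    INTEGER NOT NULL,  -- FK to AL_Spots
        ch1_bg    FLOAT,             -- channel 1 background
        ch1_int   FLOAT,             -- channel 1 intensity
        ch1_bgsub FLOAT,             -- channel 1 bg-subtracted intensity
        ch2_bg    FLOAT,
        ch2_int   FLOAT,
        ch2_bgsub FLOAT,
        ratio     FLOAT              -- channel 1 / channel 2 ratio
    );

    -- ... while a derived measurement with a single value per spot
    -- gets a lean one-column table: no unused floats, no NULLs.
    CREATE TABLE am_spots_value (
        am_fk  INTEGER NOT NULL,
        als_fk INTEGER NOT NULL,
        value  FLOAT
    );

The QuantitationDimension attached to each ArrayMeasurement determines
which of these tables holds its data.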
The only other solution that was presented to me was a completely
generic one that meant doing a three-table join to get back all the
data for each spot. It looked horribly inefficient, and incredibly
obtuse. I figure I've already wasted 3 months of programming time
trying to explain all the bad, complicated design decisions in
Genex-1; I vote for simple. If you have any better ideas *PLEASE*
suggest them. The ideas I have using the QuantDim approach are *NOT*
set in stone, by any means. From the sound of things you have a great
deal more experience than I do in building databases.

> unless Postgres is as smart as Oracle, which doesn't physically
> store NULLs. But then you have the block length in Oracle, which
> this row is not going to exceed anyway (in fact, in Oracle you would
> use a CLUSTER for this).

I honestly don't know.

> Regarding AM_Spots, I'm also not sure whether you really need the PK
> there.

Nope. We don't. Bad design decision. Especially since Postgres gives
us oids in every row anyway.

> You could merge in AM_SuspectSpots (0..n), and I'm not sure why the
> relationship to AL_Spots has to be n..n (I may easily be missing
> something). As mentioned before, tossing the PK saves you
> potentially GBs of storage, let alone the index storage, and it can
> save considerable time on import.

All correct. Actually the SpotLink table was not there because we
thought the relationship was n..n; it was for efficiency. Because we
thought that every spot would be indexed versus the usf_fk, the only
spots that needed to know what AL_Spot they came from were those
without sequence features, i.e. blanks and controls. Since there are
only a few of those on every array, we didn't want to have an extra
als_fk column in the AM_Spots table. The same reasoning was used for
AM_SuspectSpots: only a few spots will be bad, so put them in a
separate table. As I say, these turned out to be poor design choices.

> BTW storing the ratio too seems redundant unless you have different
> methods to compute it, and in that case it would be a 0..n
> relationship. Same goes for background-subtracted intensity.

Sometimes that is the data that we will get. The database has to be
able to handle it if users want to store it. I'm hoping the QuantDim
idea makes things both flexible and efficient. Genex-1 was originally
designed only to house an NCGR repository. It was later changed to be
more lab-centric. Its current goals are almost completely
lab-centric, because the biologists are the people who really need
this technology.

Cheers,
jas.
|
From: <ja...@op...> - 2001-11-28 02:52:40
|
"Hilmar Lapp" <la...@gn...> writes: > Quoting "Jason E. Stewart" <ja...@op...>: > > > > No, sorry. That is one of the priority one tasks that I indicated for > > the virginia consortium. > > I'm happy to see that you're going to use ArgoUML instead of ERwin. ;-) Actually, I'm most likely to use the community edition of Poseidon as long as the output is compatible with ArgoUML. I need to get gcj working to speed it up. Currently, Java's too slow on my linux laptop. jas. |