From: Harry J M. <hj...@ta...> - 2004-05-10 21:08:01
Hi All,

This doc is synthesized from a GAIM chat that Jason and I had a while back. Please read through it and see if it reflects your vision of reality, and if not, let me know.

A brief explanation of how the scratch table works: the scratch table is completely generic; it has columns like INT1, INT2, FLOAT1, FLOAT2, etc. In order for a user to use it properly, a genex admin (a member of the genex_admin group) must create a VIEW onto the table that maps the generic columns to specific names, like 'ch1_intensity => float1, ch2_intensity => float2', etc. This is how the RAD DB at UPenn works. They use it for all their data; we just use it for Derived BioAssay Data. (I'll have to give a short writeup of how to use the Mason app I created to do the mapping - I think it will have to be a *much* simpler process for DB admins to understand.) This mapping can be made permanent by writing it to the DB - it's a view in the DB, so it is permanent until you drop the DB. It is also written into the TableDef table as a scratch view, so all the Mason apps know about it. Further, since it involves only the scratch table and views, it doesn't impinge on or affect the actual DB schema.
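To make that concrete, here's a minimal sketch of what creating such a scratch view might look like from Perl/DBI. The view name, column mapping, scratch table name, and connection details are all assumptions for illustration - in real life the Mason app does this for you:

    use strict;
    use DBI;

    # Connect as a member of the genex_admin group (DSN and credentials
    # here are hypothetical).
    my $dbh = DBI->connect('dbi:Pg:dbname=genex', 'genex_admin', '',
                           { RaiseError => 1 });

    # Map the generic scratch columns to meaningful, app-specific names.
    $dbh->do(q{
        CREATE VIEW us_ratio_view AS
        SELECT int1   AS spot_id,
               float1 AS ch1_intensity,
               float2 AS ch2_intensity
        FROM   scratch
    });

    $dbh->disconnect;

The point is that it's ordinary DDL on an ordinary view - nothing magic.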
The real trick to the scratch table - and maybe the thing that will make it too much of a pain to use - is that when you hook an analysis app up to the DB, you have to define a *destination view*. This destination view is where the analysis app will write its output. It is a view on the scratch table that an admin has already created; it is a formal DB view, as in 'CREATE VIEW ... AS ...'. All scratch table views are formal DB views that you can see under psql using '\dv'. The person who hooks up the app has to choose a view whose columns match the output columns of the app, so if the app produces two columns of data, he must choose a view that can hold those two columns. The app does not have to actually connect to the DB and write its output INTO the scratch table; it just has to create its output in a format that is compatible with the VIEW that is going to hold it.

Hooking up analysis tools to GeneX: 101

1) We want to make it really easy for people to hook up external tools for analysis.

2) Therefore we can't expect them to learn and understand the Genex Perl API before they can use a tool with Genex - the API is great, but it's complicated.

3) We want to encourage users to keep all data stored in the DB - for archival purposes, so data is traced and not lost, and so it is verifiable (we're in science, after all).

4) Therefore we want to discourage Schlauchism - exporting data as tab files and littering tens of zip disks with data.

5) To this end we need to create a smoke-and-mirrors illusion that apps which know nothing about the DB can actually pull data from the DB, do their analysis, and write it back to the DB.

When I was working with the Avestha people I tested out some ideas, and it turned out to be really trivial to accomplish this, at least in the limited efforts I had time to make. There are two classes of apps:

1) DB-aware apps - they can already pull data from the DB directly and therefore don't need any help at all. These are easy - we just patch them to talk to Genex and voila!

2) DB-stupid apps - they need disk files as input sources, not DB connectivity. To make this class of apps work, we need to subclass:

   subclass 1) the app lets you name both the input and the output file, e.g. via --input and --output;
   subclass 2) the app doesn't let you name both, e.g. maybe the output is written to STDOUT or something.

This isn't really so hard; we just have to run the app within a wrapper that understands '--input' and '--output', runs the application, and moves files around where needed. So subclass 1 and subclass 2 are really the same, except that subclass 2 needs to be run inside a (probably thin) wrapper that remaps the input/output - and that wrapper needs to be written separately for each app.
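For example, here's what such a thin wrapper might look like for a hypothetical subclass 2 app called 'my_cluster_app' that takes its input file as a plain argument and writes its results to STDOUT (the app name and calling convention are made up; every real wrapper will differ):

    #!/usr/bin/perl
    use strict;
    use Getopt::Long;

    # Present the standard --input/--output interface to the protocol
    # machinery, regardless of what the wrapped app actually supports.
    my ($input, $output);
    GetOptions('input=s' => \$input, 'output=s' => \$output)
        or die "usage: $0 --input FILE --output FILE\n";
    die "both --input and --output are required\n"
        unless defined $input && defined $output;

    # Run the app on the input file and redirect its STDOUT into the
    # output file the protocol expects to find.
    open my $out, '>', $output or die "can't write '$output': $!";
    open my $app, '-|', 'my_cluster_app', $input
        or die "can't run my_cluster_app: $!";
    print {$out} $_ while <$app>;
    close $app;
    close $out or die "error writing '$output': $!";

From the outside, the wrapped app now looks exactly like a subclass 1 app.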
To make this work we use our brand-new Protocol method in the DB. This was the bit that Michael Pear wanted to do for the grant, and the piece I stole the tables from ESTAP to implement. I've added a bunch of Mason code to make it all work: you define a PROTOCOL that says what table (or query) to take the data from, which app to run on the input, and what scratch view to put the data into once it's finished (a concrete sketch follows the processing steps below).

When the user wants to run an analysis, he does the following:

1) choose an experiment to analyze;
2) choose the BioAssays to be analyzed from that experiment;
3) choose the analysis protocol to execute on the BioAssays;
4) go and have coffee.

The DB code reads the protocol info from the DB and does the following:

1) exports the BioAssay data to a tab-delimited file using an I/O filter (see below);
2) starts the app, using --input to tell it the input file name and --output to tell it where to write the output;
3) when the app is finished, the wrapping script transfers the data from the output file into the chosen destination view;
4) alerts the user that her data is waiting;
5) smokes a cigarette.

This is reasonably straightforward - not too different from the approach we used in GeneX1 with rcluster, cybert, etc.
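As that concrete sketch, here's roughly what one protocol ties together, written as a Perl hash rather than the actual ESTAP-derived tables (every name here - the views, the wrapper path, the filter behavior - is hypothetical):

    use strict;

    # One analysis protocol: where the input comes from, what runs on it,
    # and which scratch view the results land in.
    my %protocol = (
        name        => 'log-ratio clustering',
        source_view => 'us_ratio_view',    # view (or query) the input is exported from
        app         => '/usr/local/bin/cluster_wrapper',  # invoked with --input/--output
        dest_view   => 'us_cluster_view',  # scratch view whose columns match the app's output
        # I/O filter applied to each value on export, e.g. for an app
        # that wants its NULLs spelled 'NA':
        out_filter  => sub {
            my ($val) = @_;
            return defined $val && $val ne '' ? $val : 'NA';
        },
    );

Choosing a protocol thus fixes the input view, the app (or its wrapper), the destination view, and any filtering in one shot - the user never has to see any of it.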
The I/O filter has to formalize what we were thinking about for GeneX1 - a general way to provide standardized inputs and outputs for any wild weasel of an app that wants to chew on GeneX data. It should enable anyone to write an I/O filter to massage the data on output from the DB, or on input to the DB. So if a particular app wants its NULL values in some heinous fashion, it can be done. The filter idea is incorporated into the PROTOCOL approach - *PROTOCOLS* are the core of the genex analysis approach. The trick to using the protocols is that each one defines the input and output tables (views) as part of the protocol. You want to store your input data in a scratch view that is useful for your protocols, and you want to have scratch views that can handle the output of the protocols. The source view and destination view can be the same or different; the only real issue is whether they have the correct number of columns. We don't necessarily want to design the Ginsu knife of filters, but rather provide a way for users to simply add their own wrapping filter code so that they can run the apps they want to. So can you see why scratch views are so important? They have to match both the input and the output of the analysis apps.

The trick is to change the user's mind as to what they expect the data to look like. For example, people familiar with spreadsheet files are always thinking in terms of data matrices, which is fine, but that's not how the data is stored in the DB. In a data matrix, your columns are different BioAssays (one column of intensity data per BioAssay) and the rows are the gene names, with an expression (intensity) value for each BioAssay. But in the DB, the values for the separate BioAssays are stored in different rows, so when a user wants to do an analysis they can't think "I want you to run a new analysis on my favorite data matrix". Instead they have to think "I want you to run an analysis on this list of BioAssays" (unless you have your favorite 'conversion to a data matrix' stored as an input filter - essentially a reasonably complex query).

So the idea would be that a user could define a set of BioAssays upon which to do other operations. That wouldn't be a filter, though; it would just be a user preference that we could track, and we could provide some Mason app that lets users define BioAssay collections for later processing. In the way I've described it, the filter is just a mapping tool that takes a column of data about to be written to a text file, applies a Perl regex to that column of numbers, and, based on the regex match, does something to the data. In contrast, the BioAssay set is a collection of favorite data that the user wants to analyze a number of different times.

--
Cheers, Harry
Harry J Mangalam - 949 856 2847 (v&f) - hj...@ta...
<<plain text preferred>>