From: Harry J M. <hj...@ta...> - 2004-05-10 21:08:01
Hi All,

This doc is synthesized from a GAIM chat that Jason and I had a while back. Please read through it and see if it reflects your vision of reality, and if not, let me know.

A brief explanation of how the scratch table works: the scratch table is completely generic; it has columns like INT1, INT2, FLOAT1, FLOAT2, etc. In order for a user to use it properly, a genex admin (a member of the genex_admin group) must create a VIEW onto the table that maps the generic columns to specific names, like 'ch1_intensity => float1, ch2_intensity => float2', etc. This is how the RAD DB at UPenn works. They use it for all their data; we just use it for Derived BioAssay Data. (I'll have to give a short writeup of how to use the Mason app I created to do the mapping - I think it will have to be a *much* simpler process for DB admins to understand.) This mapping can be made permanent by writing it to the DB - it's a view in the DB, so it is permanent until you drop the DB. It is also written into the TableDef table as a scratch view, so all the Mason apps know about it. Further, since it involves only the scratch table and views, it doesn't impinge on or affect the actual DB schema.
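To make that concrete, here's a minimal sketch of what creating such a scratch view might look like from Perl/DBI. The view name, column mapping, scratch table name, and connection details are all assumptions for illustration - in real life the Mason app does this for you:

    use strict;
    use DBI;

    # Connect as a member of the genex_admin group (DSN and credentials
    # here are hypothetical).
    my $dbh = DBI->connect('dbi:Pg:dbname=genex', 'genex_admin', '',
                           { RaiseError => 1 });

    # Map the generic scratch columns to meaningful, app-specific names.
    $dbh->do(q{
        CREATE VIEW us_ratio_view AS
        SELECT int1   AS spot_id,
               float1 AS ch1_intensity,
               float2 AS ch2_intensity
        FROM   scratch
    });

    $dbh->disconnect;

The point is that it's ordinary DDL on an ordinary view - nothing magic.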
The real trick to the scratch table - and maybe the thing that will make it too much of a pain to use - is that when you hook an analysis app up to the DB, you have to define a *destination view*. This destination view is where the analysis app will write its output. It is a view on the scratch table that an admin has already created; it is a formal DB view, as in 'CREATE VIEW ... AS ...'. All scratch table views are formal DB views that you can see under psql using '\dv'. The person who hooks up the app has to choose a view whose columns match the output columns of the app, so if the app produces two columns of data, he must choose a view that can hold those two columns. The app does not have to actually connect to the DB and write its output INTO the scratch table; it just has to create its output in a format that is compatible with the VIEW that is going to hold it.

Hooking up analysis tools to GeneX: 101

1) We want to make it really easy for people to hook up external tools for analysis.

2) Therefore we can't expect them to learn and understand the Genex Perl API before they can use a tool with Genex - the API is great, but it's complicated.

3) We want to encourage users to keep all data stored in the DB - for archival purposes, so data is traced and not lost, and so it is verifiable (we're in science, after all).

4) Therefore we want to discourage Schlauchism - exporting data as tab files and littering tens of zip disks with data.

5) To this end we need to create a smoke-and-mirrors illusion that apps which know nothing about the DB can actually pull data from the DB, do their analysis, and write it back to the DB.

When I was working with the Avestha people I tested out some ideas, and it turned out to be really trivial to accomplish this, at least in the limited efforts I had time to make. There are two classes of apps:

1) DB-aware apps - they can already pull data from the DB directly and therefore don't need any help at all. These are easy - we just patch them to talk to Genex and voila!

2) DB-stupid apps - they need disk files as input sources, not DB connectivity. To make this class of apps work, we need to subclass:

   subclass 1) the app lets you name both the input and the output file, e.g. via --input and --output;
   subclass 2) the app doesn't let you name both, e.g. maybe the output is written to STDOUT or something.

This isn't really so hard; we just have to run the app within a wrapper that understands '--input' and '--output', runs the application, and moves files around where needed. So subclass 1 and subclass 2 are really the same, except that subclass 2 needs to be run inside a (probably thin) wrapper that remaps the input/output - and that wrapper needs to be written separately for each app.
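For example, here's what such a thin wrapper might look like for a hypothetical subclass 2 app called 'my_cluster_app' that takes its input file as a plain argument and writes its results to STDOUT (the app name and calling convention are made up; every real wrapper will differ):

    #!/usr/bin/perl
    use strict;
    use Getopt::Long;

    # Present the standard --input/--output interface to the protocol
    # machinery, regardless of what the wrapped app actually supports.
    my ($input, $output);
    GetOptions('input=s' => \$input, 'output=s' => \$output)
        or die "usage: $0 --input FILE --output FILE\n";
    die "both --input and --output are required\n"
        unless defined $input && defined $output;

    # Run the app on the input file and redirect its STDOUT into the
    # output file the protocol expects to find.
    open my $out, '>', $output or die "can't write '$output': $!";
    open my $app, '-|', 'my_cluster_app', $input
        or die "can't run my_cluster_app: $!";
    print {$out} $_ while <$app>;
    close $app;
    close $out or die "error writing '$output': $!";

From the outside, the wrapped app now looks exactly like a subclass 1 app.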
To make this work we use our brand-new Protocol method in the DB. This was the bit that Michael Pear wanted to do for the grant, and the piece I stole the tables from ESTAP to implement. I've added a bunch of Mason code to make it all work: you define a PROTOCOL that says what table (or query) to take the data from, which app to run on the input, and what scratch view to put the data into once it's finished (a concrete sketch follows the processing steps below).

When the user wants to run an analysis, he does the following:

1) choose an experiment to analyze;
2) choose the BioAssays to be analyzed from that experiment;
3) choose the analysis protocol to execute on the BioAssays;
4) go and have coffee.

The DB code reads the protocol info from the DB and does the following:

1) exports the BioAssay data to a tab-delimited file using an I/O filter (see below);
2) starts the app, using --input to tell it the input file name and --output to tell it where to write the output;
3) when the app is finished, the wrapping script transfers the data from the output file into the chosen destination view;
4) alerts the user that her data is waiting;
5) smokes a cigarette.

This is reasonably straightforward - not too different from the approach we used in GeneX1 with rcluster, cybert, etc.
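As that concrete sketch, here's roughly what one protocol ties together, written as a Perl hash rather than the actual ESTAP-derived tables (every name here - the views, the wrapper path, the filter behavior - is hypothetical):

    use strict;

    # One analysis protocol: where the input comes from, what runs on it,
    # and which scratch view the results land in.
    my %protocol = (
        name        => 'log-ratio clustering',
        source_view => 'us_ratio_view',    # view (or query) the input is exported from
        app         => '/usr/local/bin/cluster_wrapper',  # invoked with --input/--output
        dest_view   => 'us_cluster_view',  # scratch view whose columns match the app's output
        # I/O filter applied to each value on export, e.g. for an app
        # that wants its NULLs spelled 'NA':
        out_filter  => sub {
            my ($val) = @_;
            return defined $val && $val ne '' ? $val : 'NA';
        },
    );

Choosing a protocol thus fixes the input view, the app (or its wrapper), the destination view, and any filtering in one shot - the user never has to see any of it.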
The I/O filter has to formalize what we were thinking about for GeneX1 - a general way to provide standardized inputs and outputs for any wild weasel of an app that wants to chew on GeneX data. It should enable anyone to write an I/O filter to massage the data on output from the DB, or on input to the DB. So if a particular app wants its NULL values in some heinous fashion, it can be done. The filter idea is incorporated into the PROTOCOL approach - *PROTOCOLS* are the core of the genex analysis approach. The trick to using the protocols is that each one defines the input and output tables (views) as part of the protocol. You want to store your input data in a scratch view that is useful for your protocols, and you want to have scratch views that can handle the output of the protocols. The source view and destination view can be the same or different; the only real issue is whether they have the correct number of columns. We don't necessarily want to design the Ginsu knife of filters, but rather provide a way for users to simply add their own wrapping filter code so that they can run the apps they want to. So can you see why scratch views are so important? They have to match both the input and the output of the analysis apps.

The trick is to change the user's mind as to what they expect the data to look like. For example, people familiar with spreadsheet files are always thinking in terms of data matrices, which is fine, but that's not how the data is stored in the DB. In a data matrix, your columns are different BioAssays (one column of intensity data per BioAssay) and the rows are the gene names, with an expression (intensity) value for each BioAssay. But in the DB, the values for the separate BioAssays are stored in different rows, so when a user wants to do an analysis they can't think "I want you to run a new analysis on my favorite data matrix". Instead they have to think "I want you to run an analysis on this list of BioAssays" (unless you have your favorite 'conversion to a data matrix' stored as an input filter - essentially a reasonably complex query).

So the idea would be that a user could define a set of BioAssays upon which to do other operations. That wouldn't be a filter, though; it would just be a user preference that we could track, and we could provide some Mason app that lets users define BioAssay collections for later processing. In the way I've described it, the filter is just a mapping tool that takes a column of data about to be written to a text file, applies a Perl regex to that column of numbers, and, based on the regex match, does something to the data. In contrast, the BioAssay set is a collection of favorite data that the user wants to analyze a number of different times.

--
Cheers, Harry
Harry J Mangalam - 949 856 2847 (v&f) - hj...@ta...
<<plain text preferred>>