Re: [Gmod-schema] Re: Chado Questions

Hi Ben

Your description makes sense. I think we can have a fairly well-defined 
schema mapping. I'm thinking something like this:

map = feature [where feature_id in select srcfeature_id from featureloc]
mapfeature = feature JOIN featureloc
mapset = map JOIN cvterm [using type_id] LEFT JOIN dbxref JOIN db

Under these definitions, you can have maps within maps (a feature can 
have an identity as both a mapfeature and a map).
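
A rough sketch of those three definitions as views, in case it helps. It
uses Python's sqlite3 with cut-down stand-ins for the Chado tables (only
the columns the joins need). The view column names, the sample rows, and
the 'demo_db'/'v1' labels are made up for illustration, and I'm assuming
"using type_id" means feature.type_id = cvterm.cvterm_id:

```python
import sqlite3

# Simplified stand-ins for the Chado tables; only the columns used below.
SCHEMA = """
CREATE TABLE db      (db_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dbxref  (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER,
                      accession TEXT, version TEXT);
CREATE TABLE cvterm  (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT,
                      type_id INTEGER, dbxref_id INTEGER);
CREATE TABLE featureloc (featureloc_id INTEGER PRIMARY KEY,
                         feature_id INTEGER, srcfeature_id INTEGER,
                         fmin INTEGER, fmax INTEGER, strand INTEGER);

-- map: any feature that some other feature is located on
CREATE VIEW map AS
  SELECT * FROM feature
  WHERE feature_id IN (SELECT srcfeature_id FROM featureloc);

-- mapfeature: a feature together with its location on a map
CREATE VIEW mapfeature AS
  SELECT f.feature_id, f.uniquename, fl.srcfeature_id AS map_id,
         fl.fmin, fl.fmax, fl.strand
  FROM feature f JOIN featureloc fl ON fl.feature_id = f.feature_id;

-- mapset: maps labelled by type and, where present, by source db/version
CREATE VIEW mapset AS
  SELECT m.feature_id AS map_id, c.name AS map_type,
         d.name AS db_name, x.version
  FROM map m
  JOIN cvterm c ON c.cvterm_id = m.type_id
  LEFT JOIN dbxref x ON x.dbxref_id = m.dbxref_id
  LEFT JOIN db d ON d.db_id = x.db_id;
"""

def build_demo():
    """Create an in-memory database with the views and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript("""
    INSERT INTO db VALUES (1, 'demo_db');
    INSERT INTO dbxref VALUES (1, 1, 'chr1', 'v1');
    INSERT INTO cvterm VALUES (1, 'chromosome'), (2, 'gene'), (3, 'exon');
    INSERT INTO feature VALUES (1, 'chr1', 1, 1),
                               (2, 'geneA', 2, NULL),
                               (3, 'exonA1', 3, NULL);
    INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand) VALUES
      (2, 1, 100, 500, 1),  -- geneA located on chr1
      (3, 2, 0, 50, 1);     -- exonA1 located on geneA
    """)
    return conn

if __name__ == "__main__":
    conn = build_demo()
    for view in ("map", "mapfeature", "mapset"):
        print(view, conn.execute(f"SELECT * FROM {view}").fetchall())
```

In the sample data geneA shows up in both map and mapfeature, which is
the "maps within maps" case.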

I'd say don't worry too much about the chado map module just now. AFAIK 
no one has any data in it. It may be in need of some tweaking, and that 
tweaking may end up being influenced by cmap.

On Nov 23, 2005, at 12:21 PM, Ben Faga wrote:

> Wow, I'm impressed by how quickly everyone responded.  Thank you.
>
> Backing up and explaining what I'm doing sounds like a good
> suggestion.
>
> The idea is to write a script that pulls data from Chado and puts it
> into the CMap schema (for use with CMap).  Then, install postgres
> triggers to keep the CMap schema updated.  So, basically I'm looking
> for all the places in Chado where data that can be displayed in CMap
> is stored.
>
> I know that this is going to get complicated since there isn't a direct
> table to table correspondence between the two schemas.
>
> Here's a little background:
>
> Map:
> In CMap, a map is anything that can be represented by a line.  A map
> can be a sequence contig, an FPC, a genetic map.  For that matter, it
> can be a gene, or a transcript or a protein (but no one currently uses
> it that way).
>
> So in that respect, I'll eventually have to dip my fingers into all
> of the corners of Chado to grab data.  Right now, though, I'm focusing
> on the sequence module since that is the most mature (I think) and I
> know how to use the loaders for it.
>
> Map Set:
> In CMap, a map set is made up of "maps" with a common type (sequence,
> fpc, etc) and some reason to group them such as being from the same
> assembly or "experiment" (as Andrew put it).
>
> It may well be that this will be defined in different ways for
> different things.  For instance, Allen's mention of the
> dbxref.version would work for many things but it might not work for
> data in the maps module, so I would have to come up with a different
> way to create map sets from that module.
>
> Put another way, I could just put the "maps" from the same module
> and with the same type in their own set.  So, all rows in the feature
> table with a cvterm of "contig" would get grouped in a set, unless they
> have a dbxref with the appropriate data or unless they meet some other
> criteria.
>
> Features:
> CMap features are a completely different concept from Chado's.  A
> CMap feature is basically a range on a "map" with some info.  So, in
> the context of the sequence module, anything with a featureloc can be
> a CMap feature.
>
> This actually brings some interesting possible uses for cmap.  If you
> had protein info, such as domains, you could make the proteins into
> "maps" and then set the domains as features and then display
> corresponding proteins based on domain.  I know there are better
> programs for this but this is just a thought.
>
> Correspondences:
> I believe Andrew has a correct view of correspondences in CMap.
> Basically, it's a "this feature corresponds with this feature" format.
>
> That said, I'm not looking for a table to hold correspondences for me.
> I'm looking for any place in Chado where correspondence-like
> information is stored.
>
> Well that's that on background.  That was long.  I'm surprised you are
> still reading this.
>
> Thanks,
>
> Ben
>
>
> On Wed, 2005-11-23 at 14:15, chris mungall wrote:
>> On Nov 23, 2005, at 8:59 AM, Andrew D. Farmer wrote:
>>
>>> Hello-
>>> I'm not an expert in Chado, but I've dabbled a little with it and
>>> more with CMap, so I'm going to throw my 2c in for what they are
>>> worth.
>>>
>>> 1) On the "correspondence" question, I think that the primary
>>>     difference between the two systems is that Chado sees the
>>>     "feature" as a primary entity that may have many different
>>>     locations in different coordinate systems (maps); that is, the
>>>     featureloc contains the relationship of one conceptual entity to
>>>     one or more different coordinate systems (srcfeature). So, for
>>>     example, a pairwise alignment is represented as a single feature
>>>     with one location on the query sequence and one location on the
>>>     background sequence.
>>>     CMap, on the other hand, sees a feature as "belonging" to one and
>>>     only one map (it has no true identity independent of the map);
>>>     features on different maps are related by correspondences, as
>>>     opposed to a normalization of the feature into a single entity
>>>     with multiple locations. To represent a pairwise alignment in
>>>     CMap you'd probably use one feature on the query, one feature on
>>>     the background with a correspondence to link them.
>>
>> I think they are actually in quite close correspondence here. All
>> chado features have zero or one primary featurelocs (with locgroup=0;
>> multiple featurelocs differing only in rank are for alignments).
>>
>> You *could* add secondary featurelocs to other assemblies but this
>> isn't recommended.
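
To make the alignment case concrete, here is a small sketch of how
featurelocs that differ only in rank could be turned into CMap-style
features plus correspondences. The dict keys follow the featureloc
column names, but the output shape (cmap_id, map_id, start, stop) is
hypothetical and not taken from the actual CMap schema:

```python
from collections import defaultdict
from itertools import combinations

def alignment_correspondences(featurelocs):
    """Turn Chado-style featurelocs into CMap-style features and links.

    Each featureloc becomes one feature on the map given by its
    srcfeature_id. Featurelocs that share feature_id and locgroup but
    differ in rank (the alignment case) are linked pairwise.
    """
    cmap_features = []
    groups = defaultdict(list)
    for loc in featurelocs:
        cmap_id = len(cmap_features)  # hypothetical CMap-side identifier
        cmap_features.append({
            "cmap_id": cmap_id,
            "map_id": loc["srcfeature_id"],
            "start": loc["fmin"],
            "stop": loc["fmax"],
        })
        groups[(loc["feature_id"], loc["locgroup"])].append((loc["rank"], cmap_id))

    correspondences = []
    for members in groups.values():
        ids = [cmap_id for _rank, cmap_id in sorted(members)]
        correspondences.extend(combinations(ids, 2))
    return cmap_features, correspondences

if __name__ == "__main__":
    locs = [
        # one alignment feature (10) with a location on each of two sequences
        {"feature_id": 10, "srcfeature_id": 1, "fmin": 0, "fmax": 90, "locgroup": 0, "rank": 0},
        {"feature_id": 10, "srcfeature_id": 2, "fmin": 300, "fmax": 390, "locgroup": 0, "rank": 1},
        # an ordinary feature (11) with a single location
        {"feature_id": 11, "srcfeature_id": 1, "fmin": 500, "fmax": 800, "locgroup": 0, "rank": 0},
    ]
    features, links = alignment_correspondences(locs)
    print(features)
    print(links)
```

A feature with only one featureloc produces a CMap feature and no
correspondence, so ordinary genes and exons pass through unchanged.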
>>
>>>     My sense is that the Chado approach is appropriate when the data
>>>     manager knows a priori about the identity of the features and can
>>>     control their normalization, whereas the CMap approach probably
>>>     makes more sense when the assertion of correspondences is a post
>>>     hoc conjecture based on something like name-matching.
>>>
>>> 2) I'm not sure CMap's "Map set" concept has a clear analog in Chado;
>>> the "Map set" is essentially just a grouping of all the maps that
>>> have been produced from a single "experiment" (e.g. the linkage
>>> groups from a genetic map or contigs from an FPC physical map). So
>>> they don't ultimately resolve to a single coordinate system; they are
>>> distinct coordinate systems defined within the context of a single
>>> application of a mapping protocol. If you wanted to list all
>>> chromosomes from the UCSC assembly in Chado, how would you do it?
>>> Maybe through all UCSC top level features (chromosomes) having a
>>> common relationship to a single "publication"?
>>
>> If people really want to store multiple versions of an assembly or
>> different assemblies of the same genome in one chado db, we need to
>> come up with some Best Practices for the various scenarios that will
>> arise. AFAIK nobody has needed this so far.
>>
>> The main difference between chado and CMap is that map doesn't really
>> correspond to anything outside the map module in Chado, so it follows
>> that there will be no correspondence with map set. Chado featurelocs
>> shouldn't be overloaded with non-sequence based localizations - Chado
>> is only generic up to a point!
>>
>> Can we back up here and provide some context - is the goal here to
>> interoperate between cmap and chado?
>>
>>> Does this help at all, muddy the waters further, or expose my total
>>> misunderstanding of things??
>>>
>>> Andrew Farmer
>>>
>>> On Wed, 23 Nov 2005, Scott Cain wrote:
>>>
>>>> Ben,
>>>>
>>>> I cc'ed this to the schema mailing list, both because I want my
>>>> response archived and because I want other people to look at it to
>>>> make sure I am describing the use of feature_relationship and
>>>> featureloc correctly.
>>>>
>>>> Scott
>>>>
>>>>
>>>> On Wed, 2005-11-23 at 10:41 -0500, Ben Faga wrote:
>>>>> On Wed, 2005-11-23 at 10:13, Scott Cain wrote:
>>>>>> Hi Ben,
>>>>>>
>>>>>> I hope I can help, because you're not likely to get much response
>>>>>> from the schema mailing list until after the holiday.
>>>>>>
>>>>>> I'm not sure how to answer the map question.  The most obvious
>>>>>> thing is via the feature_relationship table, but since the actual
>>>>>> meaning of the word 'map' is not clear to me, I'm not sure f_r
>>>>>> would work.  The relationships in f_r are typically 'part_of' (for
>>>>>> gene/mRNA/exon), but could easily be something else.  For
>>>>>> instance, could a map set be a feature like a chromosome?
>>>>>>
>>>>>> For chromosomes, denoting containment is a little different: you
>>>>>> don't use f_r, but give the feature_id of the chromosome in the
>>>>>> featureloc table as featureloc.srcfeature_id.  There is no reason
>>>>>> that a given feature can't have more than one featureloc entry to
>>>>>> different srcfeatures.  You just give a different featureloc.rank
>>>>>> to the different locations (with a rank of 0 being the 'standard'
>>>>>> location, i.e., on the chromosome).
>>>>> You've confused me.  It seems a little backwards to me.  I would
>>>>> think that you would use the featureloc table for things like
>>>>> gene/mRNA/exon to place them on the sequence and the feature
>>>>> relationship table for correspondences.
>>>>
>>>> The featureloc table does describe how features are mapped to a
>>>> chromosome, but merely having overlapping coordinates is not
>>>> sufficient for one feature to be related to another.  For example,
>>>> an exon could lie within the boundaries of a gene and not belong to
>>>> that gene (because it is part of another gene).  The f_r table is
>>>> used for defining what is part of what.
>>>>
>>>> 'Correspondences' is a CMap concept that ties common features on
>>>> separate maps together, right?  I don't think you would use
>>>> feature_relationship for that at all--I think it would be
>>>> encapsulated in featureloc (see below).
>>>>>
>>>>> Maybe because I confused you first.  I'll start over.
>>>>>
>>>>> A chromosome is a map (or an assembly is a map); basically, the
>>>>> base sequence is a map (in this case).  There must be a way in
>>>>> Chado to group assemblies (or chromosomes) by their origin.
>>>>>
>>>>> For instance, let's say I have the human genome sequence from NCBI
>>>>> but I also have an old version of the UCSC genome from years ago.
>>>>> How do I query Chado to give me only UCSC assemblies?
>>>>
>>>> Assuming your database is populated correctly (for whatever
>>>> definition of 'correctly' applies), this would be distinguished in
>>>> the featureloc table.  One of those mappings would be the 'default'
>>>> (let's say it is the NCBI mapping), and so the featureloc.rank would
>>>> be 0 and the srcfeature_id would be the feature_id for the NCBI
>>>> chromosome.  Then, the UCSC assembly would be some other rank and
>>>> the srcfeature_id would be the feature_id of the UCSC assembly.  To
>>>> get only UCSC assemblies, you would make sure that the srcfeature_id
>>>> is in the set of feature_ids that are UCSC assembly feature_ids.
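
A minimal sketch of that kind of query, using Python's sqlite3 and
cut-down versions of the Chado tables. How the "UCSC assembly
feature_ids" get identified is not settled in this thread; the sketch
assumes, purely for illustration, that each assembly feature carries a
dbxref whose db.name names the source. The sample rows are made up:

```python
import sqlite3

def demo_conn():
    """In-memory database with one gene placed on two assemblies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER,
                         accession TEXT);
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT,
                          dbxref_id INTEGER);
    CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER,
                             fmin INTEGER, fmax INTEGER, rank INTEGER);
    INSERT INTO db VALUES (1, 'NCBI'), (2, 'UCSC');
    INSERT INTO dbxref VALUES (1, 1, 'chr1'), (2, 2, 'chr1');
    INSERT INTO feature VALUES (1, 'ncbi_chr1', 1),
                               (2, 'ucsc_chr1', 2),
                               (3, 'geneA', NULL);
    INSERT INTO featureloc VALUES (3, 1, 100, 500, 0),  -- default location
                                  (3, 2, 120, 520, 1);  -- secondary location
    """)
    return conn

# Locations whose srcfeature_id is in the set of assembly features
# belonging to the named source database (assumed tagging via dbxref/db).
LOCATIONS_ON_SOURCE = """
SELECT f.uniquename, src.uniquename, fl.fmin, fl.fmax, fl.rank
FROM featureloc fl
JOIN feature f   ON f.feature_id = fl.feature_id
JOIN feature src ON src.feature_id = fl.srcfeature_id
WHERE fl.srcfeature_id IN (
    SELECT a.feature_id
    FROM feature a
    JOIN dbxref x ON x.dbxref_id = a.dbxref_id
    JOIN db d     ON d.db_id = x.db_id
    WHERE d.name = ?)
ORDER BY f.uniquename, fl.rank
"""

def locations_on(conn, db_name):
    """Return the featurelocs that sit on assemblies from db_name."""
    return conn.execute(LOCATIONS_ON_SOURCE, (db_name,)).fetchall()

if __name__ == "__main__":
    print(locations_on(demo_conn(), "UCSC"))
```
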
>>>>
>>>> Is this any better?
>>>>
>>>>>
>>>>>> Finally, in featureloc, fmin is never greater than fmax.
>>>>>> featureloc.strand (-1,0,1) indicates direction.  Yes, even with
>>>>>> polypeptides (though you could easily leave strand as 0 for a
>>>>>> polypeptide that isn't mapped to a chromosome).
>>>>> That answers my question, thanks.
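
For a loader, the fmin/fmax/strand rule might look something like the
sketch below. The helper names are hypothetical, and the choice of
strand 1 when start equals stop is an assumption. It only handles
ordering and direction, not any conversion between coordinate
conventions:

```python
def to_featureloc(start, stop):
    """Normalise a possibly reversed (start, stop) pair.

    Returns (fmin, fmax, strand) with fmin <= fmax; the direction that
    was implied by the ordering is carried in strand instead.
    """
    if start <= stop:
        return start, stop, 1
    return stop, start, -1

def from_featureloc(fmin, fmax, strand):
    """Recover a directed (start, stop) pair from fmin/fmax/strand."""
    if fmin > fmax:
        raise ValueError("fmin must not be greater than fmax")
    if strand == -1:
        return fmax, fmin
    return fmin, fmax

if __name__ == "__main__":
    print(to_featureloc(500, 100))
    print(from_featureloc(100, 500, -1))
```
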
>>>>>
>>>>>> I hope that helps a little bit.
>>>>>
>>>>> It does.  Thank you.
>>>>>
>>>>> mwz
>>>>>>
>>>>>>
>>>>>> On Wed, 2005-11-23 at 01:48 -0500, Ben Faga wrote:
>>>>>>> Hey Scott,
>>>>>>>
>>>>>>> I'm hoping that you can help me with a couple questions (or point
>>>>>>> me to the mailing list).
>>>>>>>
>>>>>>> In CMap we organize "maps" into "map sets".  Is there anything
>>>>>>> similar in Chado?  How do you distinguish something like
>>>>>>> chromosomes from different assemblies?
>>>>>>>
>>>>>>> Also, in featureloc, can fmin be greater than fmax?  Is "strand"
>>>>>>> how you store the direction (even with proteins)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> mwz
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>> -- 
>>>
>>> Andrew Farmer
>>> ad...@nc...
>>> (505) 995-4464
>>> Database Administrator/Software Developer
>>> National Center for Genome Resources
>>>
>>> ---
>>> "To live in the presence of great truths and eternal laws,
>>> to be led by permanent ideals-
>>> that is what keeps a man patient when the world ignores him,
>>> and calm and unspoiled when the world praises him."
>>> -Balzac
>>> ---
>>>
>>>
>>>
>>>
>>
>