From: Don G. <gil...@in...> - 2004-05-03 00:05:08
Toward a Unified Gene Page
GMOD Gene Page Working Group
2 May 2004
------------------------------------------------

This group will discuss and propose a common gene page that model
organism/genome database members can agree to produce in some form.
All interested parties can contribute.

Mail list: gmo...@li...
Web home:  http://eugenes.org/all/gene-report-examples/
           (add gmod gene-page project)

Participants:
  (participants from 12+ MODs to be added after confirmation ..)

INTRODUCTION
------------------------------------------------

There may be a long-standing desire among genome data consumers and
producers to unify the documents describing organism genes (gene pages)
that are provided by many model organism/genome databases (MODs). It
remains an open question whether this desire outweighs the cost of the
effort needed to make unified gene pages.

Discussion of common gene report web pages and software should build on
the existing expertise of MODs. These projects have years of experience
working with life scientists to produce gene pages that capture the
essence of knowledge from the databases, and make it understandable and
useful to scientists.

For most MODs, the gene page is probably the most highly used reference
document that people come to MOD web sites for. At FlyBase.net, these
account for over a third of all calls, far surpassing any other single
use category.

PLAN
------------------------------------------------

One outcome of the April 04 GMOD meeting is the organization of a
working group to decide how to proceed to reach some consensus on this
topic. Members are drawn from several existing and new MODs, and others
interested in unified gene pages.

These points need discussion:
 * What are common parts of MOD gene pages?
 * What could/should be unified?
 * Who will benefit? Costs?
 * Web Reports AND/OR XML ?

Suggested starting points:

- Focus on biology now, leave computing to later.
  The major need is to distill biological knowledge about genes to say
  what MODs should be representing in a common way.

- Look over example MOD gene pages. The approach suggested of removing
  HTML to look at content, labels and organization struck a chord in
  meeting discussion. XML-izing gene pages is not necessary, but is one
  useful way to distill common information and structure.

- Create a few sample unified pages, and show them around for comment.
  We should ask for input from gene page consumers: scientists who
  study a few or many genes across organisms; data miners who use gene
  pages in bulk (academic, govt., industry, other databases).

- Speak up if you want to actively help. I've been working on this
  subject since 1999 -- someone else may have better luck in a new
  approach at consensus. If you are, or know of, someone with a strong
  biology background who uses different MOD gene pages and is
  interested in organizing this group, please say so.

Even if we decide this topic should be shelved, we should at least
gather evidence on why the cost to unify gene pages is not worth the
effort.

As a target goal, a proposal for unified gene pages could be ready for
the GMOD meeting in Fall 2004.

There is a GMOD mailing list for discussion, and any documents, sample
pages, and software can be deposited at gmod.sourceforge.net. The
euGenes.org gene page service is also available for such, and for test
cases.

WORKING DOCUMENTS
------------------------------------------------

We likely should create a 'gene-page' CVS project at
gmod.sourceforge.net for documents and samples. There are example gene
pages (see below, above), some sample extracted content (HTML -> XML).

DGG has a preliminary 'Gene Page Scraper' that automates in Perl the
by-hand methods I used for removing HTML styles and extracting the
common content of MOD gene pages. It is useful to gene data miners even
if not otherwise, though it likely has a short lifespan, as web page
design changes will break it.
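
The 'Gene Page Scraper' itself is written in Perl and is not shown in
this message. As a rough illustration only of the "remove HTML to look
at content and labels" step, here is a minimal Python sketch using the
standard library. The class and function names (TextExtractor,
page_text) and the sample markup are hypothetical, and nothing here
assumes the layout of any real MOD gene page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, dropping tags and styling."""

    SKIP = {"script", "style"}  # contents of these elements are never shown

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a <script> or <style>
        self.chunks = []       # visible text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def page_text(html):
    """Return the visible text chunks of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    # Hypothetical markup, not copied from any MOD site.
    sample = (
        "<html><head><style>b {color: red}</style></head><body>"
        "<table><tr><td><b>Symbol</b></td><td>Cam</td></tr>"
        "<tr><td><b>Full name</b></td><td>Calmodulin</td></tr></table>"
        "</body></html>"
    )
    print(page_text(sample))
```

A sketch like this only separates text from markup. Deciding which
chunks are labels and which are values still depends on each site's
page design, which is why such scrapers tend to break when pages change.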

BACKGROUND (MEOW/euGenes effort)
------------------------------------------------

In 1999, attendees of the Model Eukaryote Organism Workshop proposed to
develop a common set of summary gene information. A test website grew
out of this, produced primarily by Bill Gelbart and Don Gilbert. This
was produced by extracting common gene information from existing MODs'
public data.

Don has continued the MEOW effort at common summary gene information as
euGenes.org, though it never achieved the desired goal of having each
MOD contribute common summary data. The euGenes effort, without MOD
contributions or external funding, has had only middling success at
trying to maintain current gene summary pages in the face of the effort
needed to match changing genome data.

The Generic Model Organism Database project (GMOD) arose from related
thoughts (and grant agency spurs for cost-effectiveness) that there
should be common effort at building databases, software tools, and
common practice methods for developing new organism databases, and
updating existing ones. This has been an NIH-funded project with many
MOD participants, and as of 2004 is beginning to bear fruit in terms of
commonly usable MOD components.

With many more organism genome databases coming into being in the
2000s, the usefulness of having common gene information and documents
is growing, both among database providers and the many customers of
these (from individuals to academic, government and industrial R&D
labs, and other bioinformatics database developers).

A hopeful new goal of GMOD is to tackle the 'Unified Gene Page'
question again, with fresh perspective and see if a new consensus on
utility

Along with MODs, there are numerous bioinformatics web/database
services that have gene-related reports, and can offer useful insights
to this topic. Many of these are found as External_Links on MOD gene
pages.
Often these draw the gene summary data from MODs, in ways similar to
the data mining MEOW/euGenes was constructed with, which is another
cost of having non-unified gene information.

FIRST SUMMARY GENE PAGE PROPOSAL
(extract of Model Eukaryote Organism Workshop, W. Gelbart, Feb. 1999
 http://eugenes.org/docs/meow-startup.txt )
------------------------------------------------

... to establish a common interface for the major model eukaryotic
organism databases. This interface would be offered as the default
homepage for nonspecialists wishing to access any of the participating
databases.
...
This home page would provide gene/gene product query access based on
gene symbol, gene product characteristics, chromosome position and
perhaps homology (if we can agree on criteria to be used). Our view is
that this will satisfy the needs of many nonspecialist users without
forcing them to learn the intricacies and idiosyncrasies of each of our
sites.
...
We had discussed a design principle that each gene "page" would contain
no more than one screenful of information. Further, the gene pages
should as much as possible focus on coin-of-the-realm molecular
biological terminology rather than species-specific jargon.

I suggest we try to flesh out what a typical page would contain.
Here are some suggestions for relevant fields on the "gene report
page", in order to get a discussion started:

*Valid terms:
   -Gene symbol
   -Gene full name
   -Gene identifier number
*Synonyms:
   -Symbol synonyms
   -Full name synonyms
   -Secondary gene identifier numbers
*Map location information:
   -Chromosome and genetic map position
   -Molecular map information (simple graphic of DNA length,
    encoded transcript(s) and CDS(s) if available)
*Gene product information:
   -Functional information (from function ontology if available)
   -Structural information (from InterProt)
   -Homology information with other "MEOW" organisms (we would need
    to agree on some computable criteria for this field, which is of
    course not trivial)
*Links to extended gene information in the specialty model organism
 DBs.
*Brief free text gene summary (a few lines at most).

APRIL 2004 OUTLINE
------------------------------------------------

Example work is at http://eugenes.org/all/gene-report-examples/

Common gene attributes (drawn from existing MOD pages)
 * Names, symbols/IDs, synonyms
 * Map locations
 * Sequences
 * Reagents
 * Gene ontology
 * Similar Genes
 * Database cross-refs, External links
 * Alleles, Transcripts
 * Proteins, Structure and Domains
 * Expression and Mutant Phenotypes
 * Gene Interactions
 * Literature references
 * Summary Text

Common to gene pages?
 * Labels - are these the same things?
   -- Gene / locus / orf
   -- Homolog / ortholog / relationship / similarity
   -- Citation / publication / reference
 * Organization of document
   -- Section headers
   -- Important at top, common ordering?
 * Structure and size of default document
   -- Tabular, text, document-like, ...
   -- One screen or long report
 * Graphics (maps, icons, ...)
 * Further Detail options
 * Layout and Design (colors, formatting, fonts ..)

What is customizable?
 * MOD customizations
   -- Look and Feel
   -- Details & Extensions
 * Customer choices
   -- Best for organism community (org. standard)
   -- Best for general reader (general standard)
   -- Best for beginners or experts (simple, complex)

Example Gene Pages

Common Gene XML?
 * Computable text of gene page ?
   -- "what you see (web page) is also what your computer can read"
   -- simple and human-readable, or complex and detailed
 * XML variants, tabular, other?
   -- Ace2XML, NCBI XML, others
   -- Samples (Web -> XML)

COMPARE EXISTING MOD GENE PAGES
-----------------------------------------------------------

For a start at gathering this common experience into a set of common
values and practice in gene page presentation, I picked a highly
homologous eukaryote gene, Calmodulin, and pulled out gene reports on
it from these organism databases:

  yeast       - SGD
  arabidopsis - TAIR
  zebrafish   - ZFIN
  worm        - WormBase
  rice        - Gramene
  human       - LocusLink
  mouse       - MGI
  rat         - RGD
  mosquito    - Ensembl
  fly         - FlyBase, euGenes, LocusLink, Ensembl
                (see same gene from different viewpoints)

There are several Calmodulin(-like) genes per organism; I just used one
from each source.

Find these at http://eugenes.org/all/gene-report-examples/
or ftp://eugenes.org/eugenes/gene-report-examples/

- Don Gilbert

24 Apr 2004 - These have been updated to current versions. As well,
four gene pages are translated from HTML to XML, for cut/paste/edit to
design common gene pages. The XML should (theoretically, with the right
software/templates) allow regeneration of the original web pages.

5 Sep 2003 - Example gene web page reports from several model organism
databases.
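
To make the "Common Gene XML" idea in the outline above more concrete,
here is a minimal Python sketch of one possible shape for a summary
record, using a few of the fields suggested in the 1999 proposal. It is
an illustration under stated assumptions, not a proposed standard: the
GeneSummary class, the to_xml function, and every XML element and
attribute name are hypothetical. The only data values used (Cam,
Calmodulin, FBgn0000253 and its FlyBase URL) come from this message.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class GeneSummary:
    """A few of the 1999 proposal's fields; all names are hypothetical."""
    symbol: str
    full_name: str
    gene_id: str
    organism: str
    synonyms: list = field(default_factory=list)
    map_location: str = ""
    summary: str = ""
    links: list = field(default_factory=list)  # URLs to extended MOD reports


def to_xml(gene):
    """Serialize a GeneSummary to a small XML string (hypothetical schema)."""
    root = ET.Element("gene", {"id": gene.gene_id, "organism": gene.organism})
    ET.SubElement(root, "symbol").text = gene.symbol
    ET.SubElement(root, "full_name").text = gene.full_name
    for syn in gene.synonyms:
        ET.SubElement(root, "synonym").text = syn
    # Optional fields are written only when they have a value.
    if gene.map_location:
        ET.SubElement(root, "map_location").text = gene.map_location
    if gene.summary:
        ET.SubElement(root, "summary").text = gene.summary
    for url in gene.links:
        ET.SubElement(root, "link", {"href": url})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    cam = GeneSummary(
        symbol="Cam",
        full_name="Calmodulin",
        gene_id="FBgn0000253",
        organism="fly",
        links=["http://flybase.net/cgi-bin/fbidq.html?FBgn0000253"],
    )
    print(to_xml(cam))
```

A flat record like this sits at the "simple and human-readable" end of
the choice listed in the outline; a complex and detailed variant would
need nested elements for transcripts, alleles, and similar data.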

URLs for the Cam gene pages
(Sept 2003; updated Apr 2004 - alternate reports removed to focus on
common summary pages)
---------------------------------------------------------
http://flybase.net/cgi-bin/fbidq.html?FBgn0000253
http://db.yeastgenome.org/cgi-bin/SGD/locus.pl?locus=S0000313
http://www.informatics.jax.org/searches/accession_report.cgi?id=MGI:88251
http://www.arabidopsis.org/servlets/TairObject?type=locus&id=29764
http://www.wormbase.org/db/gene/gene?name=cmd-1;class=Locus
http://rgd.mcw.edu/query/query.cgi?id=2257
http://www.gramene.org/perl/protein_search?acc=P29612
## newly inferred zebrafish calm1a gene
http://zfin.org/cgi-bin/ZFIN_jump?record=ZDB-GENE-030131-8308

-------- summary services -----
http://www.ensembl.org/Anopheles_gambiae/geneview?gene=ENSANGG00000010211
http://www.ensembl.org/Drosophila_melanogaster/geneview?gene=CG8472
http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=801
http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=24242
http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=36329
http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=12313
http://bioinfo.weizmann.ac.il/cards-bin/carddisp?CALM1
http://eugenes.org/cgi-bin/moidq.html?FBgn0000253
http://eugenes.org/cgi-bin/moidq.html?AGgn0010211
http://eugenes.org/cgi-bin/moidq.html?HUgn0000801
http://eugenes.org/cgi-bin/moidq.html?MGgn0000995
http://eugenes.org/cgi-bin/moidq.html?CEgn0016585
http://eugenes.org/cgi-bin/moidq.html?ATgn0005396
http://eugenes.org/cgi-bin/fbidq.html?SGgn0000313
http://eugenes.org/cgi-bin/moidq.html?ZFgn0000878

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gil...@in... -- http://marmot.bio.indiana.edu/