From: Vision, T. J <tj...@bi...> - 2012-02-20 21:14:36
I think there's an opportunity here to find common cause with other active developers interested in interoperability of comparative observational data. Has there been any interaction with the BIOM group already?

Todd

********

Date: Mon, 20 Feb 2012 12:00:51 -0700
From: Greg Caporaso <gre...@gm...>
To: dev...@li...

GSC developers,

Attached please find a proposal for recognition of the Biological Observation Matrix (BIOM, pronounced 'biome') format as a GSC Core Project. A brief description of the motivation for this file format is below (this is also presented as the abstract in the attached PDF). You can find additional details on the BIOM format at:

http://www.qiime.org/svn_documentation/documentation/biom_format.html

I look forward to feedback on this proposal from the GSC community, and would be very interested in joining a call at some point soon to discuss the next steps in moving this toward GSC recognition.

Thanks!

Greg Caporaso, on behalf of the Biological Observation Matrix (BIOM) project team

Project Abstract:

A central data type in 'comparative -omics' analyses (e.g., metagenomes, comparative genomics, marker-gene-based community surveys, and metabolomics) is a sample by observation matrix. In marker gene surveys, this would contain counts of OTUs on a per-sample basis; in metagenome analyses, this might contain counts of orthologous groups of genes, taxa, or enzymatic activity on a per-metagenome basis; in comparative genomics, this would contain counts of genes or orthologous groups on a per-genome basis.

Many tools have been developed to analyze this data, but are generally focused on a specific type of study (e.g., QIIME for marker gene analysis; MG-RAST for metagenome analysis; VAMPS for taxonomic analysis). Many of the techniques, however, generalize across data types (e.g., rarefaction analysis/collector curves are generally applicable to all of these data types).
A standard format for the sample by observation matrix will support interoperability of these tools, and facilitate development of future analysis tools. Additionally the incorporation of sample and observation metadata in this file allows for convenient sharing and archiving of these data within a single file.

The BIOM file format has been developed with input from the QIIME, MG-RAST, and VAMPS development groups. BIOM format is based on JSON, a human-readable, open standard for data exchange. In addition to consolidating data and metadata in a single standard file format, BIOM supports sparse and dense matrix representations to efficiently store these data on disk. Sparse representations of QIIME OTU tables in BIOM format, for example, can be more than 3X smaller than the same data represented in tab-delimited text.

To support the use of this file format a new open-source software package will be available at http://biom-format.sourceforge.net. This will include a format validator, and new Python objects to support working with this data. This software package will additionally serve as a repository where other developers can submit implementations of these objects in other languages. Full format and API documentation (for the Python objects) will be available to coincide with submission of an article describing the BIOM format (target submission date of late Feb 2012). Draft documentation is currently available at http://biom-format.org.
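
For readers unfamiliar with the sparse versus dense distinction mentioned in the abstract, the following is a minimal Python sketch of the general idea. It is not the BIOM schema or the project's Python API: the key names (`matrix_type`, `shape`, `data`), the helper functions, and the example counts are all assumptions made for illustration. Please refer to the format documentation linked above for the actual field definitions.

```python
import json

# NOTE: key names and helpers below are hypothetical, chosen for illustration only.
# They are not taken from the BIOM proposal or its software package.

def to_dense(counts):
    """Wrap a list-of-lists count matrix as a dense, JSON-ready dict."""
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    return {"matrix_type": "dense", "shape": [n_rows, n_cols], "data": counts}

def to_sparse(counts):
    """Keep only the non-zero cells as [row, column, value] triples."""
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    triples = [
        [i, j, value]
        for i, row in enumerate(counts)
        for j, value in enumerate(row)
        if value != 0
    ]
    return {"matrix_type": "sparse", "shape": [n_rows, n_cols], "data": triples}

def from_sparse(table):
    """Rebuild the full matrix from a sparse dict; missing cells are zero."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in table["data"]:
        dense[i][j] = value
    return dense

# A made-up table where most cells are zero, as is common when most
# observations are absent from most samples.
counts = [
    [0, 0, 5, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]

dense_json = json.dumps(to_dense(counts))
sparse_json = json.dumps(to_sparse(counts))
```

The sparse form stores only the non-zero cells plus the matrix shape, so the saving grows with the fraction of zeros in the table. For a matrix with few zeros, the dense form would be the more compact of the two.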