[Gmod-tripal] storing large scale phenotype and genotype data

Hello,

Since there is a discussion about whether or not to use the ND module for
storing phenotype and genotype data in Chado, we wanted to share what we do
at GDR, CottonGEN and our other databases.
Using the nd_experiment table to link stock and genotype (and project, etc.)
gave us (and others) severe performance problems. Lacey came up with the
idea of using a genotype_call table to link genotype and stock (and project,
etc.):
https://github.com/UofS-Pulse-Binfo/nd_genotypes/wiki/How-to-Store-your-Data

We adopted this and it has been working very well. We decided to make a
similar table to link the stock and phenotype tables for large-scale
phenotype data. We have the BIMS Tripal module, which breeders use to store
and manage their private data, and we anticipate that the number of users
(and so their data) will grow extensively. Some of them will have phenomics
data, so the volume of phenotype data will become very large.

The phenotype_call table we created is as follows:

  phenotype_call_id    INT          PK
  phenotype_id         INT          FK
  project_id           INT          FK
  stock_id             INT          FK
  nd_geolocation_id    INT          FK
  time                 TIMESTAMP WITHOUT TIME ZONE

This table basically links the 'sample' stored in the stock table (a
specific tree or plot of plants that is being phenotyped) to a specific
phenotypic value stored in the 'phenotype' table. The time field is needed
because the exact phenotyping time can be different for each phenotype of
the same sample (this data comes from the FieldBook App that many breeders
use). nd_geolocation_id is included because we don't go through
nd_experiment and we still need to store the location of the plant (or
animal/insect).
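
To make the structure concrete, here is a minimal sketch of the table and a
typical lookup. It is only an illustration, not our production schema: it
uses Python's built-in sqlite3 as a stand-in for PostgreSQL, the parent
tables are cut down to a couple of columns each, and the stock, project and
phenotype records are hypothetical.

```python
import sqlite3

# Sketch only: SQLite stands in for PostgreSQL, and the parent tables carry
# just enough columns for the example (the real Chado tables have more).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE phenotype (phenotype_id INTEGER PRIMARY KEY, uniquename TEXT, value TEXT);
CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nd_geolocation (nd_geolocation_id INTEGER PRIMARY KEY, description TEXT);

CREATE TABLE phenotype_call (
    phenotype_call_id INTEGER PRIMARY KEY,
    phenotype_id      INTEGER REFERENCES phenotype (phenotype_id),
    project_id        INTEGER REFERENCES project (project_id),
    stock_id          INTEGER REFERENCES stock (stock_id),
    nd_geolocation_id INTEGER REFERENCES nd_geolocation (nd_geolocation_id),
    time              TEXT  -- TIMESTAMP WITHOUT TIME ZONE in PostgreSQL
);

-- Hypothetical sample data: one plot, two traits measured at different times.
INSERT INTO stock VALUES (1, 'plot-101');
INSERT INTO project VALUES (1, 'trial-2019');
INSERT INTO nd_geolocation VALUES (1, 'field A');
INSERT INTO phenotype VALUES (1, 'plot-101_fruit_weight', '12.5');
INSERT INTO phenotype VALUES (2, 'plot-101_firmness', '3');
INSERT INTO phenotype_call VALUES (1, 1, 1, 1, 1, '2019-06-01 09:15:00');
INSERT INTO phenotype_call VALUES (2, 2, 1, 1, 1, '2019-06-03 14:40:00');
""")

# All phenotype values for one sample, reached without nd_experiment.
rows = conn.execute(
    """
    SELECT s.uniquename, p.value, pc.time
    FROM phenotype_call pc
    JOIN stock s ON s.stock_id = pc.stock_id
    JOIN phenotype p ON p.phenotype_id = pc.phenotype_id
    WHERE s.uniquename = ?
    ORDER BY pc.time
    """,
    ("plot-101",),
).fetchall()

for row in rows:
    print(row)
```

The point of the sketch is that the sample, the value, the project and the
location are all reached through a single linker row, instead of through
nd_experiment and its own linker tables.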

I don't think this means that we can get rid of the ND module. The
nd_experiment table and the other associated nd tables are used for
experiments on stocks other than phenotyping and genotyping. Examples are
crosses, field collections, etc. There are also databases that store various
protocols and reagents associated with experiments (hence nd_protocol,
nd_reagent, etc.). We need the nd_experiment table to store cross data, but
we don't have data for nd_protocol and nd_reagent.

One of the important things that came out of the ND discussions was using
the stock table to store 'samples' and using the stock_relationship table to
link each sample to its stock. There were thoughts of creating a separate
table such as 'observation_unit'. The same principle applies to the project
table and the project_relationship table for storing hierarchical datasets.
I think we can treat genotyping and phenotyping as specific kinds of
'experiment' that generate a HUGE quantity of data, so we use dedicated
linker tables without going through nd_experiment.
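
As a rough illustration of the sample-as-stock idea, the sketch below stores
two samples and their parent stock in the same stock table and links them
through stock_relationship. Again this is only a sketch under assumptions:
sqlite3 stands in for PostgreSQL, the tables are reduced to a few columns,
and the term name 'sample_of' and all records are hypothetical, not terms we
are proposing.

```python
import sqlite3

# Sketch only: reduced versions of the Chado cvterm, stock and
# stock_relationship tables, with SQLite standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stock (
    stock_id   INTEGER PRIMARY KEY,
    uniquename TEXT,
    type_id    INTEGER REFERENCES cvterm (cvterm_id)
);
CREATE TABLE stock_relationship (
    stock_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER REFERENCES stock (stock_id),
    object_id  INTEGER REFERENCES stock (stock_id),
    type_id    INTEGER REFERENCES cvterm (cvterm_id)
);

-- Hypothetical terms and records; 'sample_of' is an illustrative name.
INSERT INTO cvterm VALUES (1, 'cultivar'), (2, 'sample'), (3, 'sample_of');
INSERT INTO stock VALUES (1, 'cultivar-A', 1);
INSERT INTO stock VALUES (2, 'tree-1', 2);
INSERT INTO stock VALUES (3, 'tree-2', 2);
-- subject (the sample) --sample_of--> object (the parent stock)
INSERT INTO stock_relationship VALUES (1, 2, 1, 3);
INSERT INTO stock_relationship VALUES (2, 3, 1, 3);
""")

# Find every sample that belongs to a given parent stock.
samples = [
    name
    for (name,) in conn.execute(
        """
        SELECT sample.uniquename
        FROM stock_relationship sr
        JOIN stock sample ON sample.stock_id = sr.subject_id
        JOIN stock parent ON parent.stock_id = sr.object_id
        JOIN cvterm t ON t.cvterm_id = sr.type_id
        WHERE parent.uniquename = ? AND t.name = 'sample_of'
        ORDER BY sample.uniquename
        """,
        ("cultivar-A",),
    )
]
print(samples)
```

A linker table such as phenotype_call would then point at the sample rows
(tree-1, tree-2), and the parent stock stays reachable through
stock_relationship.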

It would be nice to open up the conversation and reach a consensus, but for
now we have to use these two tables to speed up loading of our data.

Hope it helps!

Thanks!

Sook and Taein