Aaron Gussman

Current Status

Tickets

#1: Integrate Harmogene w/ Sybil
#2: Change cvterm for TAG match feature featureprops type

Also be aware of:

Databases

Both databases are on hannibal.

  • harmogene_test - tested with update_assertions_in_organism.pl. Need to re-run metrics.
  • harmogene - gene_groups

Code and Files

The working directory is /local/projects/entropa/harmogene and contains all the scripts, which are kept in svn.

Most object code lives in the app/Coati/Coati.pm and app/Coati/Coati/ChadoCoatiDB.pm although the scripts actually query the harmogene coati object in /local/projects/entropa/harmogene/Harmogene/.

Procedure

Instructions on how to use Harmogene. The working directory is /local/projects/entropa/harmogene. Many of the documented commands use manually-set environment variables for common database parameters: --user $HUSER --password $HPASS --database $HDB. This keeps credentials out of the documentation and makes the commands cut-and-paste friendly.
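
If you haven't already, set these environment variables once per session. The values below are placeholders, not real credentials:

```shell
# Placeholder values; substitute your own credentials and database name.
export HUSER=chado_user
export HPASS='example_password'
export HDB=harmogene_test
```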

Sequence data for multiple species, each annotated by multiple sources, are converted into BSML and loaded into a Chado database. Data from one of the input sources is run through the JOCs pipeline to identify orthologous genes in different organisms. The same genes from identical organisms, but annotated by different input sources, are identified by shared 3' coordinates. These 3' gene sets are joined with the JOC ortholog groups to create cross-taxon, cross-source transitive annotation groups (TAGs). A TAG is a group of genes that could all receive the same annotation(s).

Assertions for all TAG member genes are retrieved from the database, and rules are applied to create an aggregate annotation set: a set of assertions to apply to each TAG. The aggregate annotation set is used to update annotations for an individual species in the database. Annotation evaluation metrics can be run before and after updating the database to assess the level of improvement. Metrics may use bins from an HMM search, as in the original VAMs procedure, or they may use the 3' gene groups as bins. The latter requires no additional computation and is useful for iteratively refining a set of rules, but outside of that is meaningless.

Convert Sequence Data to BSML

Current options for raw data input are Genbank (gbk) and gff format. Todd's gff2bsml component handles the latter, and he keeps example output in /local/projects/entropa/gff2bsml_test/. An example genbank2bsml run can be found in pipeline 3170.

It's important that the different input data stores share the same taxon ids to identify organisms; if they don't, none of the downstream processing will function correctly. You can do a quick-and-dirty grep of taxon ids from the output BSML with:

for i in $(cat /usr/local/projects/entropa/output_repository/genbank2bsml/3170_rerun_angiuoli_1901/genbank2bsml.bsml.list); do zgrep -o -P '"taxon" identifier="\d*"' $i.gz; done | sort | uniq > 3170.taxon_ids

(Note the .gz appended to each filename; if your BSML files are not compressed you won't need it.) Diff the two taxon-id lists and make sure there's good overlap. In this example there are 10 more taxon ids in 3170.taxon_ids (42) than in gff2bsml.taxon_ids (32).
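
Since both lists come out of sort | uniq, `comm` gives a cleaner view of the differences than a raw diff (filenames assumed from the example above):

```shell
# Column 1: ids only in 3170.taxon_ids; column 2: ids only in gff2bsml.taxon_ids.
# -3 suppresses the ids common to both files.
comm -3 3170.taxon_ids gff2bsml.taxon_ids
```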

Add genome_instance identifiers

In this test example, we're going to create a new "target" set of sequences to receive the aggregate annotations. We'll differentiate it from the original by using insert_genome_instance.pl, which inserts a provided genome instance value into the BSML. When the BSML is loaded, the attribute is appended to the organism.species and organism.common_name fields. (Note: the current encoding of genome_instance should be changed in the future; see old ticket 5.) The script also attaches a prefix to Feature@id and Feature-group-member@featref; without it, identically-named features in "different" organisms were not loaded. It does not change the Sequence-related parts because those are expected to correspond with the header of the fasta file.

See pipeline 3200 for the insert_genome_instance run of refseq, IGS, and gff2bsml.

Run the JOCs Pipeline

TAGs will be created from the union of two sets of genes. One set of genes, across SOPs, is formed from genes that share the same 3' coordinate in sequences from the same taxon_id (note: we currently don't do any comparison of the underlying sequence, also there's the potential for incorrect matches discussed in old ticket 6). These are created by generating gene groups.

The other set of genes, spanning different taxon ids from the same SOP, are generated by the JOC pipeline. An example of running the JOCs_comparative_pipeline template on the IGS BSML can be found in pipeline 3201 with the following changes:

bsml2fasta.fastamulti:
  $;USE_FEATURE_IDS_IN_FASTA$; = 1
  $;INPUT_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list

xdformat:
  $;OTHER_OPTS$; = -I -k

jaccard:
  $;USE_FEATURE_IDS_IN_FASTA$; = 1
  $;INPUT_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list

j_ortholog_clusters:
  $;USE_FEATURE_IDS_IN_FASTA$; = 1
  $;QUERY_BSML_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list

Because the gene sets are defined by the genes' uniquenames, this step must be redone any time genbank2bsml is rerun: it creates new globally unique names for each feature, even when run on the same set of genbank files.

Load Sequence and JOCs Data

Create a pipeline with initdb followed by bsml2chado steps for the data you're going to insert (in this test, the refseq, IGS, and gff2bsml sequence data and the JOC analysis features). See pipeline 3225; the JOCs data was loaded separately in pipeline 3234.

QC and Counts

Check counts and confirm the data was loaded. Details moved to old ticket 4 because they take up too much room.

Generate 3' gene groups

Generate lists of features sharing the same 3' coordinates. Output is tab-delimited text. First column is <taxon_id>|<coordinate>. Coordinates ending in "n" are on the reverse strand. Remaining columns are uniquenames of features in group/bin.

perl -I Harmogene/ -I Harmogene/shared/ gene_groups_by_taxon.pl \
  --user $HUSER --password $HPASS --database $HDB \
  --log clean_test/gene_groups_by_taxon.log \
  --min_size 3 IGS refseq gff2bsml > clean_test/gene_groups_by_taxon.out
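
As a quick sanity check on the output, you can histogram the group sizes (this assumes the tab-delimited layout described above, with the group key in column 1 and member uniquenames in the remaining columns):

```shell
# For each line, NF-1 is the number of member features in the group;
# the pipeline prints "count size" pairs, smallest groups first.
awk -F'\t' '{ print NF - 1 }' clean_test/gene_groups_by_taxon.out \
  | sort -n | uniq -c
```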

Create TAGs

Generate merged gene groups between the JOCs feature groups and the gene groups sharing 3' coordinates. For each member of a JOC analysis feature group (as identified by the provided --analysis_id), identify all other genes sharing its 3' coordinate from the gene_groups_by_taxon.pl output.

Output is a 3-column tab-delimited text file. The first column is a bin/group identifier (the match_feature_id). The second column is the uniquename of the feature in the bin. For genes that were part of a bin/group in gene_groups_by_taxon.pl, the third column is their bin/group identifier from that file. For genes that were included only because they share a JOC match feature with such a gene, it is the uniquename of the feature that appears in both the JOC and the gene_groups_by_taxon.pl output.

perl -I Harmogene/ -I Harmogene/shared/ analysis2bin.pl \
  --user $HUSER --password $HPASS --database harmogene \
  --log clean_test/analysis2bin.log \
  --gene_groups clean_test/gene_groups_by_taxon.out \
  --analysis_id 2 > clean_test/analysis2bin.out
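
A quick way to count how many TAGs were produced (assuming the 3-column format above, with the bin identifier in column 1):

```shell
# Count distinct bin/group identifiers in the first column.
cut -f 1 clean_test/analysis2bin.out | sort -u | wc -l
```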

Store TAGs in Chado

In order to update the assertions in the database, the TAGs must be stored in Chado as match feature groups. This is done by transforming the tab-delimited output of analysis2bin.pl into BSML that is then loaded with bsml2chado. The bin identifier column (mfi:3223858) is stored as a featureprop of type tag (this is a hack; see #2). The newly created match features are located on the member features via the featureloc table.

perl -I Harmogene/ -I Harmogene/shared/ -I /local/projects/ergatis/package-devel/lib/perl5 tag2bsml.pl \
  --user $HUSER --password $HPASS --database $HDB \
  --log clean_test/tag2bsml.log \
  --input_tags clean_test/analysis2bin.out \
  --output clean_test/tag2bsml.bsml > clean_test/tag2bsml.out

The BSML was loaded in pipeline 3244.

Retrieve Assertions

Dump all of the annotations (GO, EC, gene_symbol, gene_product_name) in the database for the genes in TAGs. This will produce the raw set of assertions to which rules will be applied to create the aggregate annotation set.

Output is tab-delimited with columns bin, organism name, feature uniquename, assertion type, assertion value, evidence code (currently null).

perl -I Harmogene/ -I Harmogene/shared/ chado2assertions.pl \
  --user $HUSER --password $HPASS --database harmogene \
  --log clean_test/chado2assertions.log \
  --input_bins clean_test/analysis2bin.out > clean_test/chado2assertions.out
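
To eyeball the dump before applying rules, a per-type tally is handy (column 4 is the assertion type, per the layout above):

```shell
# Tally assertions by type (GO, EC, gene_symbol, gene_product_name),
# most common first.
cut -f 4 clean_test/chado2assertions.out | sort | uniq -c | sort -rn
```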

Non-HMM Metrics

Note: this is of limited use; it's really only valuable for checking the effectiveness of a rules file. We can perform an initial run of the VAMs metrics on the data at this point. Instead of grouping genes by matches to an equivalog-level HMM (as is done in the real VAMs analysis), we bin them on their gene_group_by_taxon_id. Although analysis2bin.out contains more input, we use gene_groups_by_taxon.out because it contains the taxon ids. First we merge the gene groups and the annotation information into a single tab-delimited file used for the metrics:

./assertions2metrics.pl --log clean_test/pre_rules/assertions2metrics.log \
  --gene_groups clean_test/gene_groups_by_taxon.out \
  --assertions clean_test/chado2assertions.out > clean_test/pre_rules/assertion.annotab

then we calculate the metrics (script is from BRC):

metrics/run_aem_metrics.pl clean_test/pre_rules/assertion.annotab clean_test/pre_rules/

Apply rules to assertions

A rules file is an ordered list of commands that select the annotations to include in an aggregate annotation set from the set of all assertions. The tab-delimited rules file has three columns: datatype, rule action, and source. Valid datatypes are anything found in the annotation dump; right now rules are created for gene_product_name, EC, GO, and gene_symbol. Currently, the only supported rule action is USE. A USE action specifies an ordered list of sources from which to select the annotation of the given datatype. Finally, the source column corresponds to the organism.uniquename from which the annotation is to be taken.
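
For example, a rules file that prefers refseq annotations and falls back to gff2bsml for the product name might look like this (columns are tab-separated; the sources and ordering here are purely illustrative, not a recommendation):

```
gene_product_name	USE	refseq
gene_product_name	USE	gff2bsml
EC	USE	refseq
GO	USE	refseq
gene_symbol	USE	refseq
```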

To create a dummy list of rules files to be used on the IGS data:

cut -f 2 clean_test/chado2assertions.out | grep -v IGS | sort | uniq > clean_test/source.list

for i in gene_symbol gene_product_name GO EC; do perl -p -e "print \"$i\tUSE\t\";" clean_test/source.list; done > clean_test/dummy.rules

Documentation on creating an intelligent, ordered list of rules can be found in old ticket 9.

Then apply the rules to the (sorted) input assertions to create the set of assertions to be applied to the target data set:

sort clean_test/chado2assertions.out > clean_test/chado2assertions.out.sort

./apply_rules_to_assertions.pl --log clean_test/apply_rules_to_assertions.log \
  --rules_file clean_test/ordered.rules \
  --input_assertions clean_test/chado2assertions.out.sort > clean_test/apply_rules_to_assertions.out

Check how the application of the rules changed the metrics (note: this comparison is of limited value; it's really only useful when comparing ordered rules to dummy rules, and even then see old ticket 9 comment 6):

./assertions2metrics.pl --log clean_test/post_rules/assertions2metrics.log \
  --gene_groups clean_test/gene_groups_by_taxon.out \
  --assertions clean_test/apply_rules_to_assertions.out --source TEST > clean_test/post_rules/assertion.annotab

metrics/run_aem_metrics.pl clean_test/post_rules/assertion.annotab clean_test/post_rules

Update the database

Apply the aggregated annotation set to an organism in the database.

You might want to backup the database first:

pg_dump -h hannibal -U $HUSER $HDB > clean_test/$HDB.dump.sql

Update the target organism, selected with --organism_id, according to the aggregate annotation set (warning: there is no undo for this!):

perl -I Harmogene/ -I Harmogene/shared/ ./update_assertions_in_organism.pl \
  --user $HUSER --password $HPASS --database $HDB \
  --log clean_test/update_assertions_in_organism.log \
  --assertions clean_test/apply_rules_to_assertions.out \
  --analysis_id 3 \
  --organism_id 90 > clean_test/update_assertions_in_organism.out

HMM Metrics (VAMs)

Steps in the VAMs analysis are also documented in old ticket 3 and the VAMs Notes. The annotab file is created by the chado2annotab.pl script, and its output goes through some hack-ish postprocessing before metrics can be run.

In this example, the metrics were run on the annotations before and after updating with rules. As a table (the metric names are not recorded here; columns are numbered positionally as reported by run_aem_metrics.pl):

run     1       2       3       4       5       6       7       8
before  0.9753  0.3430  0.0000  1.0000  0.5680  0.8256  0.8887  0.1512
after   0.9670  0.6163  0.8918  0.7907  0.6000  0.8605  0.9186  0.4826