Home modified by Aaron Gussman

Aaron Gussman — Tue, 20 May 2014 13:28:27 -0000
--- v1
+++ v2
@@ -1,8 +1,209 @@
-Welcome to your wiki!
-
-This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].
-
-The wiki uses [Markdown](/p/harmogene/new-wiki/markdown_syntax/) syntax.
-
-[[members limit=20]]
-[[download_button]]
+# Current Status
+
+## Tickets
+
+[#1](/apps/trac/harmogene/ticket/1)
+    Integrate Harmogene w/ Sybil
+[#2](/apps/trac/harmogene/ticket/2)
+    Change cvterm for TAG match feature featureprops type
+
+Also be aware of: 
+
+  * [produlus 2404](http://produlus.igs.umaryland.edu/cgi/view_task.cgi?id=2404) sybil instance for entropa 
+  * [tickets from the original Trac instance](https://sourceforge.net/userapps/trac/agussman/wiki/Harmogene%20Notes)
+
+## Databases
+
+Both databases are on `hannibal`. 
+
+  * `harmogene_test` \- tested with update_assertions_in_organism.pl. Need to re-run metrics. 
+  * `harmogene` \- `gene_groups`
+
+## Code and Files
+
+The working directory is `/local/projects/entropa/harmogene` and contains all the scripts in svn. 
+
+Most object code lives in the `app/Coati/Coati.pm` and `app/Coati/Coati/ChadoCoatiDB.pm` although the scripts actually query the harmogene coati object in `/local/projects/entropa/harmogene/Harmogene/`. 
+
+# Procedure
+
+Instructions on how to use Harmogene. Working directory is `/local/projects/entropa/harmogene`. Many of the commands documented use manually-set environmental variables for common database parameters: `--user $HUSER --password $HPASS --database $HDB `. This is for security and to make the documentation cut&paste. 
+
+Sequence data for multiple species, each annotated by multiple sources, are converted into BSML and loaded into a Chado database. Data from one of the input sources is run through the JOCs pipeline to identify orthologous genes in different organisms. The same genes from identical organisms, but annotated by different input sources, are identified by shared 3' coordinates. These 3' gene sets are joined with the JOC ortholog groups to create cross-taxon, cross-source **transitive annotation groups (TAGs)**. A TAG is a group of genes that could all receive the same annotation(s). Assertions for all TAG member genes are retrieved from the database and rules are applied to create an ```aggregate annotation set```, a set of assertions to apply to each TAG. The aggregate annotation set is used to update annotations for an individual species in the database. Annotation evaluation metrics can be run before and after updating the database to assess the level of improvement. Metrics may utilize bins from an HMM search, as in the original VAMs procedure, or they may use the 3' gene groups as bins. The latter does not require additional computation and is useful for iterative refinement of set of rules, but outside of that is meaningless. 
+
+## Convert Sequence Data to BSML
+
+Current options for raw data inputs are Genbank/gbk and gff format. Todd's `gff2bsml` component will handle the later, and he keeps example output in `/local/projects/entropa/gff2bsml_test/`. An example `genbank2bsml` run can be found in [Pipeline 3170](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3170/pipeline.xml). 
+
+It's important that the different input data stores share the same taxon ids to identify organisms; if they don't none of the downstream processing will function correctly. You can perform a quick and dirty grep of taxon ids from the output BSML with: 
+    
+    for i in  cat /usr/local/projects/entropa/output_repository/genbank2bsml/3170_rerun_angiuoli_1901/genbank2bsml.bsml.list; do zgrep -o -P '"taxon" identifier="\d*"' $i.gz; done | sort | uniq > 3170.taxon_ids
+    
+
+(note appending of .gz, you might not actually need to do that) Look at a `diff` of the two and make sure there's pretty good overlap. In this example there are 10 more taxon ids in `3170.taxon_ids` (42) than `gff2bsml.taxon_ids` (32). 
+
+## Add `genome_instance` identifiers
+
+In this test example, we're going to create a new "target" set of sequences to receive the aggregate annotations. We'll differentiate that from original by using `insert_genome_instance.pl`. This will insert a provided genome instance value into the BSML. When the BSML is loaded, the attribute will be appended to the `organism.species` and `organism.common_name` fields. (note: current encoding of `genome_instance` should be changed in the future, see [old ticket 5](https://sourceforge.net/userapps/trac/agussman/ticket/5)). The script also attaches a prefix to `Feature@id` and `Feature-group-member@featref`, without it identically-named features in "different" organisms weren't loaded. It doesn't make changes to the `Sequence`-related parts because those are expected to correspond with the header of the fasta file. 
+
+See [pipeline 3200](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3200/pipeline.xml) for the `insert_genome_instance` run of refseq, IGS, and gff2bsml. 
+
+## Run the JOCs Pipeline
+
+TAGs will be created from the union of two sets of genes. One set of genes, across SOPs, is formed from genes that share the same 3' coordinate in sequences from the same taxon_id (note: we currently don't do any comparison of the underlying sequence, also there's the potential for incorrect matches discussed in [old ticket 6](https://sourceforge.net/userapps/trac/agussman/ticket/6)). These are created by [generating gene groups](WikiStart#Generate3genegroups). 
+
+The other set of genes, spanning different taxon ids from the same SOP, are generated by the JOC pipeline. An example of running the `JOCs_comparative_pipeline` template on the `IGS` BSML can be found in [pipeline 3201](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3201/pipeline.xml) with the following changes: 
+    
+    bsml2fasta.fastamulti:
+      $;USE_FEATURE_IDS_IN_FASTA$; = 1
+      $;INPUT_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list
+    
+    xdformat:
+      $;OTHER_OPTS$; = -I -k
+    
+    jaccard:
+      $;USE_FEATURE_IDS_IN_FASTA$; = 1
+      $;INPUT_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list
+    
+    j_ortholog_clusters:
+      $;USE_FEATURE_IDS_IN_FASTA$; = 1
+      $;QUERY_BSML_FILE_LIST$; = /usr/local/projects/entropa/output_repository/tabula_rasa/3200_insert_genome_instance.IGS/tabula_rasa.insert_genome_instance.IGS.bsml.list
+    
+
+Because the gene sets are defined by the gene's uniquename, this step must be redone anytime `genbank2bsml` is rerun as it creates new globally unique names for each feature, even if it was run on the same set of genbank files. 
+
+## Load Sequence and JOCs Data
+
+Create a pipeline with `initdb` followed by `bsml2chado` steps for the data you're going to insert (in this test, it's the `refseq`, `IGS`, `gff2bsml` sequence data and the JOC analysis features). See [pipeline 3225](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3225/pipeline.xml) and the JOCs data was loaded separately in [pipeline 3234](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3234/pipeline.xml). 
+
+## QC and Counts
+
+Check counts and confirm data was loaded. Moved to [old ticket 4](https://sourceforge.net/userapps/trac/agussman/ticket/4) because it takes up too much room. 
+
+## Generate 3' gene groups
+
+Generate lists of features sharing the same 3' coordinates. Output is tab-delimited text. First column is <taxon_id>|<coordinate>. Coordinates ending in "n" are on the reverse strand. Remaining columns are uniquenames of features in group/bin. 
+    
+    perl -I Harmogene/ -I Harmogene/shared/ gene_groups_by_taxon.pl \
+      --user $HUSER --password $HPASS --database $HDB \
+      --log clean_test/gene_groups_by_taxon.log \
+      --min_size 3 IGS refseq gff2bsml > clean_test/gene_groups_by_taxon.out
+    
+
+## Create TAGs
+
+Generate merged gene groups between the JOCs feature groups and the gene groups sharing 3' coordinates. For each member of a JOC analysis feature group (as identified by the provided `--analysis_id`), identify all other genes sharing its 3' coordinate from the `gene_groups_by_taxon.pl` output. 
+
+Output is a 3-column tab-delimited text file. The first column is a bin/group identifier (the `match_feature_id`). The second column is the `uniquename` of the feature in the bin. For genes that were part of a bin/group in `gene_groups_by_taxon.pl`, the third column is their bin/group identifier from that file. For genes that were included because they are match feature members with such a gene in a JOC, it is the `uniquename` of the feature that is in both the JOC and `gene_groups_by_taxon.pl`. 
+    
+    perl -I Harmogene/ -I Harmogene/shared/ analysis2bin.pl \
+      --user $HUSER --password $HPASS --database harmogene \
+      --log clean_test/analysis2bin.log \
+      --gene_groups clean_test/gene_groups_by_taxon.out \
+      --analysis_id 2 > clean_test/analysis2bin.out
+    
+
+## Store TAGs in Chado
+
+In order to update the assertions in the database, the TAGs must be stored in Chado as match feature groups. This is done by transforming the tab delimited output of `analysis2bin.pl` into BSML that is then loaded with `bsml2chado`. The bin identifier column (`mfi:3223858`) is stored as a `featureprop` of type `tag` (this is a hack, see [#2](/apps/trac/harmogene/ticket/2)). The newly created match features are located on the member features via the `featureloc` table. 
+    
+    perl -I Harmogene/ -I Harmogene/shared/ -I /local/projects/ergatis/package-devel/lib/perl5 tag2bsml.pl \
+      --user $HUSER --password $HPASS --database $HDB \
+      --log clean_test/tag2bsml.log \
+      --input_tags clean_test/analysis2bin.out \
+      --output clean_test/tag2bsml.bsml > clean_test/tag2bsml.out
+    
+
+The bsml was loaded in [pipeline 3244](http://ergatis.igs.umaryland.edu/cgi/view_pipeline.cgi?instance=/usr/local/projects/entropa/workflow/runtime/pipeline/3244/pipeline.xml). 
+
+## Retrieve Assertions
+
+Dump all of the annotations (GO, EC, gene_symbol, gene_product_name) in the database for the genes in TAGs. This will produce the raw set of assertions to which rules will be applied to create the aggregate annotation set. 
+
+Output is tab-delimited with columns bin, organism name, feature uniquename, assertion type, assertion value, evidence code (currently null). 
+    
+    perl -I Harmogene/ -I Harmogene/shared/ chado2assertions.pl \
+      --user $HUSER --password $HPASS --database harmogene \
+      --log clean_test/chado2assertions.log \
+      --input_bins clean_test/analysis2bin.out > clean_test/chado2assertions.out
+    
+
+## Non-HMM Metrics
+
+Note: This is kind of useless. It's really only valuable for checking the effectiveness of a rules file. We can perform an initial run of the VAMs metrics on the data at this point. Instead of grouping genes by matching of an equivalog level HMM (as is done in the real VAMs analysis), we will bin them on their gene_group_by_taxon_id. While there is more input in analysis2bin.out, we use gene_groups_by_taxon.out because it contains the taxon ids. First we merge the gene groups and the annotation information into a single tab-delimited file used for the metrics: 
+    
+    ./assertions2metrics.pl --log clean_test/pre_rules/assertions2metrics.log \
+      --gene_groups clean_test/gene_groups_by_taxon.out \
+      --assertions clean_test/chado2assertions.out > clean_test/pre_rules/assertion.annotab
+    
+
+then we calculate the metrics (script is from [BRC](http://brc-central.svn.sourceforge.net/viewvc/brc-central/trunk/brc-central/util/aem/)): 
+    
+    metrics/run_aem_metrics.pl clean_test/pre_rules/assertion.annotab clean_test/pre_rules/
+    
+
+## Apply rules to assertions
+
+A rules file is an ordered-list of commands that select the annotations to include in an aggregate annotation set from the set of all assertions. The tab-delimited rules file has three columns: datatype, rule action, and source. Valid datatypes are anything found in the annotation dump. Right now rules are created for `gene_product_name`, `EC`, `GO`, and `gene_symbol`. Currently, the only supported rule action is `USE`. A `USE` action specifies an ordered list of sources from which to select the annotation of the given datatype. Finally, the source columns corresponds to the `organism.uniquename` from which the annotation is to be taken. 
+
+To create a dummy list of rules files to be used on the `IGS` data: 
+    
+    cut -f 2 clean_test/chado2assertions.out | grep -v IGS | sort | uniq > clean_test/source.list
+    
+    for i in gene gene_product_name GO EC; do perl -p -e "print \"$i\tUSE\t\";" clean_test/source.list; done > clean_test/dummy.rules
+    
+
+Documentation on creating an intelligent, ordered list of rules can be found in [old ticket 9](https://sourceforge.net/userapps/trac/agussman/ticket/9). 
+
+Then apply the rules to the (sorted) input assertions to create the set of assertions to be applied to the target data set: 
+    
+    sort clean_test/chado2assertions.out > clean_test/chado2assertions.out.sort
+    
+    ./apply_rules_to_assertions.pl --log clean_test/apply_rules_to_assertions.log \
+      --rules_file clean_test/ordered.rules \
+      --input_assertions clean_test/chado2assertions.out.sort > clean_test/apply_rules_to_assertions.out
+    
+
+Check how the application of the rules changed the metrics (note: this is kind of meaningless. It's only useful when comparing ordered rules to dummy rules, and not even then [old ticket 9 comment 6](https://sourceforge.net/userapps/trac/agussman/ticket/9#comment:6)): 
+    
+    ./assertions2metrics.pl --log clean_test/post_rules/assertions2metrics.log \
+    --gene_groups clean_test/gene_groups_by_taxon.out \
+    --assertions clean_test/apply_rules_to_assertions.out --source TEST > clean_test/post_rules/assertion.annotab
+    
+    metrics/run_aem_metrics.pl clean_test/post_rules/assertion.annotab clean_test/post_rules
+    
+
+## Update the database
+
+Apply the aggregated annotation set to an organism in the database. 
+
+You might want to backup the database first: 
+    
+    pg_dump -h hannibal -U $HUSER $HDB > clean_test/$HDB.dump.sql
+    
+
+Update `organism_id = 43` according to the aggregate annotation set (warning: no undo from this!): 
+    
+    perl -I Harmogene/ -I Harmogene/shared/ ./update_assertions_in_organism.pl \
+      --user $HUSER --password $HPASS --database $HDB \
+      --log clean_test/update_assertions_in_organism.log \
+      --assertions clean_test/apply_rules_to_assertions.out \
+      --analysis_id 3 \
+      --organism_id 90 > clean_test/update_assertions_in_organism.out
+    
+
+## HMM Metrics (VAMs)
+
+Steps in the VAMs analysis are also documented in [old ticket 3](https://sourceforge.net/userapps/trac/agussman/ticket/3) and [VAMs Notes](https://sourceforge.net/userapps/trac/agussman/wiki/VAMs%20Notes). Annotab is created by the `chado2annotab.pl` script and the output goes through some hack-ish postprocessing before metrics can be run. 
+
+In this example, metrics run on the annotations before updating with rules: 
+    
+    source  0.9753  0.3430  0.0000  1.0000  0.5680  0.8256  0.8887  0.1512
+    
+
+and after: 
+    
+    source  0.9670  0.6163  0.8918  0.7907  0.6000  0.8605  0.9186  0.4826
+    
+
+As a table:
Home modified by Aaron Gussman

Aaron Gussman — Tue, 20 May 2014 13:28:24 -0000
Welcome to your wiki!
This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].
The wiki uses Markdown syntax.
Project Members:

Aaron Gussman (admin)
daveriley
Recent changes to Home

Home modified by Aaron Gussman

Home modified by Aaron Gussman

Project Members: