Recent changes to Data preparation

Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 08 May 2014 08:50:52 -0000

--- v7
+++ v8
@@ -46,6 +46,7 @@
 __GI map:__ If your reference sequences contain GI numbers, we recommend using this option. The GI numbers must occur in the sequence identifier, start with `gi|`, and end with a `TAB` or `|`. This option requires a file that maps the GI numbers to NCBI taxid numbers. MicrobeGPS comes with a built-in GI map that comprises mappings for the NCBI bacterial reference genomes. You can also download a more comprehensive GI mapping file for all known bacterial sequences from the download section. The GI mapping files are excerpts from the extremely large `gi_taxid_nucl.dmp` file from the [NCBI Taxonomy FTP site](ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). With this option, all sequences belonging to the same organism (represented by the taxid) are grouped and the correct scientific name is fetched from the taxonomy (see below).

 __Mapping file:__ We recommend this approach only if the reference sequences do not have the `gi|` tag. You need to provide a tab-separated file that maps sequence IDs to reference names. Note that multiple IDs may map to one reference name. Example:
+
     contig_1.0001[tab]Organism One
     contig_1.0002[tab]Organism One
     contig_2.0001[tab]Organism Two
@@ -57,4 +58,4 @@
 Taxonomic information
 ---------------------

-
+Taxonomic information can only be used if the __GI map__ option is applied. MicrobeGPS then automatically assigns the scientific names to the corresponding taxids and lets you place the reference genomes in a taxonomic tree. MicrobeGPS builds on the NCBI taxonomy and already comes with all necessary information. If you want to update the taxonomy information, you have to replace the files `names.dmp` and `nodes.dmp` in the directory `microbegps/data/taxonomy/` manually. In the future, we plan to provide an update script for this task.

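To illustrate the taxonomy files mentioned above, here is a minimal Python sketch that collects scientific names from the lines of an NCBI `names.dmp` file. It assumes the usual NCBI dump layout (fields separated by `<tab>|<tab>`, each line ending in `<tab>|`). The function name `read_scientific_names` is hypothetical and not part of MicrobeGPS.

```python
def read_scientific_names(lines):
    """Map taxid -> scientific name from names.dmp-style lines.

    Assumed layout per line (NCBI taxonomy dump):
    tax_id <tab>|<tab> name <tab>|<tab> unique name <tab>|<tab> name class <tab>|
    """
    names = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Drop the trailing "<tab>|" terminator if present.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue  # skip lines that do not follow the assumed layout
        taxid, name, _unique_name, name_class = fields[:4]
        # Each taxid has many names; keep only the scientific one.
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


if __name__ == "__main__":
    example = [
        "1\t|\tall\t|\t\t|\tsynonym\t|\n",
        "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    ]
    print(read_scientific_names(example))
```

For a real file, the same function could be called with an open file handle, for example `read_scientific_names(open("names.dmp"))`.
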
Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 08 May 2014 08:43:08 -0000

--- v6
+++ v7
@@ -1,5 +1,3 @@
-[Home]
-
 Data preparation
 ================

@@ -33,3 +31,30 @@
 * [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/) contains high quality reference genomes
 * [NCBI Bacteria](ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) is a large and up to date collection of bacterial genomes
 * [HMP](http://www.hmpdacc.org/HMREFG/) contains human microbiome associated genomes (bacteria, viruses, eukaryotes, archaea)
+
+
+Reference genome information
+----------------------------
+
+The identifiers of the FASTA reference sequences are often cryptic and hinder interpretation. Further, multiple reference sequences often belong to one organism, e.g. a chromosome and a plasmid, or multiple contigs. In these cases it is desirable to
+
+* put sequences belonging to one organism into a single group
+* and give meaningful names to the group.
+
+Both are done in the _Calculate Reference Table_ step. Here, you can choose between two options:
+
+__GI map:__ If your reference sequences contain GI numbers, we recommend using this option. The GI numbers must occur in the sequence identifier, start with `gi|`, and end with a `TAB` or `|`. This option requires a file that maps the GI numbers to NCBI taxid numbers. MicrobeGPS comes with a built-in GI map that comprises mappings for the NCBI bacterial reference genomes. You can also download a more comprehensive GI mapping file for all known bacterial sequences from the download section. The GI mapping files are excerpts from the extremely large `gi_taxid_nucl.dmp` file from the [NCBI Taxonomy FTP site](ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). With this option, all sequences belonging to the same organism (represented by the taxid) are grouped and the correct scientific name is fetched from the taxonomy (see below).
+
+__Mapping file:__ We recommend this approach only if the reference sequences do not have the `gi|` tag. You need to provide a tab-separated file that maps sequence IDs to reference names. Note that multiple IDs may map to one reference name. Example:
+    contig_1.0001[tab]Organism One
+    contig_1.0002[tab]Organism One
+    contig_2.0001[tab]Organism Two
+    contig_2.0002[tab]Organism Two
+    contig_3.0001[tab]Another One
+
+
+
+Taxonomic information
+---------------------
+
+

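The mapping file described above can be read with a few lines of Python. This is only a sketch of the grouping idea, not the MicrobeGPS implementation. `group_by_reference` is a hypothetical name, and real tab characters are assumed where the example shows `[tab]`.

```python
from collections import defaultdict


def group_by_reference(lines):
    """Group sequence IDs by reference name.

    Each line is expected to hold: sequence ID, a tab, reference name.
    """
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        seq_id, tab, ref_name = line.partition("\t")
        if not tab or not ref_name.strip():
            raise ValueError(f"line {number}: expected '<id><tab><name>'")
        # Several IDs may share one reference name.
        groups[ref_name.strip()].append(seq_id.strip())
    return dict(groups)


if __name__ == "__main__":
    example = [
        "contig_1.0001\tOrganism One\n",
        "contig_1.0002\tOrganism One\n",
        "contig_2.0001\tOrganism Two\n",
    ]
    for name, ids in group_by_reference(example).items():
        print(name, ids)
```

Raising an error on malformed lines is a design choice of this sketch. Silently skipping them would hide sequences that end up without a group.
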
Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:52:55 -0000

--- v5
+++ v6
@@ -1,3 +1,5 @@
+[Home]
+
 Data preparation
 ================

Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:52:22 -0000

--- v4
+++ v5
@@ -27,6 +27,7 @@
 * __Create bundles of reference genomes.__ Although MicrobeGPS has no limitation on the number of SAM files to analyze, bundling will save you a lot of computation time. Note: some mappers have limitations on the size of reference genome databases. For example, if you are using Bowtie2 together with the NCBI Bacteria database, you will have to create 3 or 4 databases and map against each one separately.

 Further, it is important to use an appropriate database. When in doubt, we recommend including more genomes rather than fewer. MicrobeGPS can deal with duplicate and superfluous genomes. Here you can find some popular collections of genomes:
+
 * [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/) contains high quality reference genomes
 * [NCBI Bacteria](ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) is a large and up to date collection of bacterial genomes
 * [HMP](http://www.hmpdacc.org/HMREFG/) contains human microbiome associated genomes (bacteria, viruses, eukaryotes, archaea)

Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:51:44 -0000

--- v3
+++ v4
@@ -4,7 +4,7 @@
 Input data
 ----------

-MicrobeGPS requires SAM files as input. This means you have to map the reads to the reference genomes before you can start the MicrobeGPS analysis. The [SAM](http://genome.sph.umich.edu/wiki/SAM) format is a standard format to report the alignment of reads to reference genomes. The majority of current read mappers support this format.
+MicrobeGPS requires SAM files as input. This means you have to map the reads to the reference genomes before you can start the MicrobeGPS analysis. The [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format is a standard format to report the alignment of reads to reference genomes. The majority of current read mappers support this format.

 We implemented our own SAM parser, which is compatible with the SAM format specification. As a consequence, MicrobeGPS cannot load binary BAM files, as would be possible with SAMtools. MicrobeGPS calculates the mapping quality of each alignment and therefore searches for the CIGAR string and the `NM` tag (number of mismatches). If the `NM` tag is missing, MicrobeGPS will assume 0 mismatches. If the CIGAR string is missing, the behavior is undefined.

@@ -14,14 +14,19 @@
 The read mapping step is essential for MicrobeGPS to work correctly. Please respect the following hints:

 * __Avoid the Best mode__ of your mapper, which only reports the best match for each read. MicrobeGPS relies on the information from shared reads. Unfortunately, this mode is very common and is, for example, the default for Bowtie2 (the `-k` mode does the trick here).
-* Report sufficient, but not too many matches. As a rule of thumb, count the number of reference genomes in the database that are highly similar to each other (the largest cluster). This should be about the maximum number of matches you should allow for each read. For example, the NCBI Bacteria genomes database contains about 80 genomes which are highly similar to _E. coli_. Here, `-k 80` would be appropriate.
+* Report sufficient, but __not too many matches__. As a rule of thumb, count the number of reference genomes in the database that are highly similar to each other (the largest cluster). This should be about the maximum number of matches you should allow for each read. For example, the NCBI Bacteria genomes database contains about 80 genomes which are highly similar to _E. coli_. For Bowtie2, `-k 80` would be appropriate.

 Reference genome databases
 --------------------------

 Composing the reference genome database is as essential as the read mapping step. The following hints can help you compose your own database:

-- not too many contigs
-- unique ids
-- GI in name
-- put in bundles
+* Use __high quality genomes__ where possible. Draft genomes that are fragmented into thousands of contigs make it harder for MicrobeGPS to calculate the quality metrics, estimate the coverage, and so on.
+* Use a __unique name for each sequence__ (genome or contig).
+* Include the __GI number in the name__. The GI is a unique sequence identifier and can be mapped to the NCBI taxid. __This is essential if you want to use all the taxonomy-related features of MicrobeGPS!__ MicrobeGPS searches for the standard pattern: `gi|` followed by numbers and terminated by `|` or any non-number character.
+* __Create bundles of reference genomes.__ Although MicrobeGPS has no limitation on the number of SAM files to analyze, bundling will save you a lot of computation time. Note: some mappers have limitations on the size of reference genome databases. For example, if you are using Bowtie2 together with the NCBI Bacteria database, you will have to create 3 or 4 databases and map against each one separately.
+
+Further, it is important to use an appropriate database. When in doubt, we recommend including more genomes rather than fewer. MicrobeGPS can deal with duplicate and superfluous genomes. Here you can find some popular collections of genomes:
+* [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/) contains high quality reference genomes
+* [NCBI Bacteria](ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) is a large and up to date collection of bacterial genomes
+* [HMP](http://www.hmpdacc.org/HMREFG/) contains human microbiome associated genomes (bacteria, viruses, eukaryotes, archaea)

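The GI pattern described in the hints above (`gi|` followed by digits and terminated by `|` or any non-digit character) can be sketched as a regular expression in Python. This is an illustration under that description only. The actual pattern used inside MicrobeGPS may differ, and `extract_gi` is a hypothetical helper name. The sequence names in the example are made up.

```python
import re

# "gi|" followed by one or more digits; matching stops at the first
# non-digit character, which covers "|" and any other terminator.
GI_PATTERN = re.compile(r"gi\|(\d+)")


def extract_gi(sequence_name):
    """Return the GI number found in a sequence name, or None if absent."""
    match = GI_PATTERN.search(sequence_name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Illustrative names, not real database entries.
    for name in ["gi|12345|ref|EXAMPLE_1| some genome", "contig_1.0001"]:
        print(name, "->", extract_gi(name))
```

Names for which the function returns `None` carry no GI number and would need the mapping file approach instead.
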
Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:17:42 -0000

--- v2
+++ v3
@@ -11,10 +11,16 @@
 Read mapping
 ------------

-- avoid best mode
-- report as many matches as necessary
+The read mapping step is essential for MicrobeGPS to work correctly. Please respect the following hints:

-reference genomes
+* __Avoid the Best mode__ of your mapper, which only reports the best match for each read. MicrobeGPS relies on the information from shared reads. Unfortunately, this mode is very common and is, for example, the default for Bowtie2 (the `-k` mode does the trick here).
+* Report sufficient, but not too many matches. As a rule of thumb, count the number of reference genomes in the database that are highly similar to each other (the largest cluster). This should be about the maximum number of matches you should allow for each read. For example, the NCBI Bacteria genomes database contains about 80 genomes which are highly similar to _E. coli_. Here, `-k 80` would be appropriate.
+
+Reference genome databases
+--------------------------
+
+Composing the reference genome database is as essential as the read mapping step. The following hints can help you compose your own database:
+
 - not too many contigs
 - unique ids
 - GI in name

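One way to check whether a mapper really reported multiple matches per read, as the hints above recommend, is to count the alignment lines per read name in the SAM output. The following Python sketch assumes single-end reads and standard SAM columns (read name in column 1, flag in column 2). `alignments_per_read` is a hypothetical helper and not part of MicrobeGPS.

```python
from collections import Counter


def alignments_per_read(sam_lines):
    """Count reported alignments per read name in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        # Header lines start with "@"; skip them and empty lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        read_name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            continue  # flag bit 0x4 marks an unmapped read
        counts[read_name] += 1
    return counts


if __name__ == "__main__":
    example = [
        "@SQ\tSN:genomeA\tLN:100\n",
        "r1\t0\tgenomeA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n",
        "r1\t256\tgenomeB\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\n",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    ]
    counts = alignments_per_read(example)
    print(max(counts.values(), default=0))
```

If the maximum count is 1 for a large data set, the mapper was probably run in a best-match mode.
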
Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:06:23 -0000

--- v1
+++ v2
@@ -1,11 +1,16 @@
-sam file format
+Data preparation
+================

-about text
+Input data
+----------

-own parser
-`NM` tag/cigar
+MicrobeGPS requires SAM files as input. This means you have to map the reads to the reference genomes before you can start the MicrobeGPS analysis. The [SAM](http://genome.sph.umich.edu/wiki/SAM) format is a standard format to report the alignment of reads to reference genomes. The majority of current read mappers support this format.

-read mapping
+We implemented our own SAM parser, which is compatible with the SAM format specification. As a consequence, MicrobeGPS cannot load binary BAM files, as would be possible with SAMtools. MicrobeGPS calculates the mapping quality of each alignment and therefore searches for the CIGAR string and the `NM` tag (number of mismatches). If the `NM` tag is missing, MicrobeGPS will assume 0 mismatches. If the CIGAR string is missing, the behavior is undefined.
+
+Read mapping
+------------
+
 - avoid best mode
 - report as many matches as necessary

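The handling of the CIGAR string and the `NM` tag described above can be sketched in Python as follows. This is not the MicrobeGPS parser. It only mirrors the stated behavior (a missing `NM` tag counts as 0) on standard SAM columns, and `read_cigar_and_nm` is a hypothetical name. Note that the SAM specification defines `NM` as the edit distance to the reference.

```python
def read_cigar_and_nm(sam_line):
    """Return (cigar, nm) for one SAM alignment line.

    cigar is None when the CIGAR column holds "*" (not available).
    nm defaults to 0 when no NM tag is present.
    """
    fields = sam_line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory columns")
    cigar = fields[5] if fields[5] != "*" else None
    nm = 0  # default when the NM tag is missing
    # Optional tags follow the 11 mandatory columns as TAG:TYPE:VALUE.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[len("NM:i:"):])
            break
    return cigar, nm


if __name__ == "__main__":
    line = "r1\t0\tgenomeA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\n"
    print(read_cigar_and_nm(line))
```

Returning `None` for an unavailable CIGAR string is a choice made for this sketch, so that callers can decide how to treat such alignments.
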
Data preparation modified by Martin S. Lindner

Martin S. Lindner — Thu, 20 Mar 2014 21:00:38 -0000

sam file format

about text

own parser
NM tag/cigar

read mapping
- avoid best mode
- report as many matches as necessary

reference genomes
- not too many contigs
- unique ids
- GI in name
- put in bundles