BPGA **Documentation**
A tool for ultra-fast pan-genome analysis of microbes.
Brought to you by: encoderman, guptabpga
The results of all analyses are generated as PDF images. Plots may occasionally fail to generate because of missing dependencies or version problems. In such instances, they can be plotted manually from the raw text files generated during the analysis.
A: Basic Pan-genome Analysis
Analysis | Image file | Data file |
---|---|---|
Gene Family Distribution | Histogram.pdf | histogram.txt |
New Gene Distribution | New_Genes_Plot.pdf | new_genes_count.txt |
Basic Pan/core Genome Trend | Default_Core_Pan_Plot.pdf | pan_default.txt, core_default.txt |
Genome-wise Pan-genome Statistics | - | stats.xls |
B: Advanced Pan-genome Analysis
Analysis | Image file | Data file |
---|---|---|
Pan/core Genome Profile (Scatter plot) | Core_Pan_Dot_Plot.pdf | pan_genome.txt, core_genome.txt |
Pan/core Genome Profile (Box plot) | Core_Pan_Plot.pdf | pan_box.txt, core_box.txt |
Pan genome Profile Trendlines | - | curve.xls |
Pan Phylogeny | Pan_phylogeny.pdf | PAN_PHYLOGENY_MOD.ph, PAN_PHYLOGENY_MOD.nwk |
Core Phylogeny | Core_phylogeny.pdf | CORE_PHYLOGENY_MOD.ph |
Functional Distribution (Major COG categories) | COG_DISTRIBUTION.pdf | Major_Cog_Category1.txt |
Functional Distribution (COG sub-categories) | COG_DISTRIBUTION_DETAILS.pdf | Cog_Category1.txt |
Pathway Distribution (Major KEGG categories) | KEGG_DISTRIBUTION.pdf | kegg_histogram1.txt |
Pathway Distribution (KEGG sub-categories) | KEGG_DISTRIBUTION_DETAILS.pdf | kegg_histogram1.txt |
Pathway Distribution (Pathway-wise Counts) | - | Kegg_count_details1.txt |
C: Sequence Retrieval
Sequence | File | Details |
---|---|---|
Representatives of Core Gene Families | REPSEQ_CORE.txt | Header has Status and Gene ID, Protein FASTA |
Representatives of Accessory Gene Families | REPSEQ_ACCESSORY.txt | Header has Status and Gene ID, Protein FASTA |
Representatives of Unique Gene Families | REPSEQ_UNIQUE.txt | Header has Status and Gene ID, Protein FASTA |
Core Gene Families (All Members from all genomes) | core_seq.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
Acc. Gene Families (All Members from 2 or more genomes) | accessory_seq.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
Unique Gene Families (All Members from individual genomes) | unique_seq.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
Gene Families with Exclusive Absence | exclusively_absent_seq.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA (a gene from any one genome is missing from each of these gene families) |
Core genes with Atypical GC | core_genes_with_atypical_GC_content.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
Acc. genes with Atypical GC | accessory_genes_with_atypical_GC_content.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
Unique genes with Atypical GC | unique_genes_with_atypical_GC_content.txt | Header has Status, Gene ID, Gene Family ID, and Genome ID. Protein FASTA |
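
The sequence files listed above are described as protein FASTA. As a minimal sketch, they could be read in Python as shown below, assuming standard FASTA conventions (each record starts with a `>` header line followed by one or more sequence lines). The header is kept as a raw string because its exact field layout is not specified here. The function name `read_fasta` and the sample records are made up for illustration and are not part of BPGA.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Invented records, only to show the call; real BPGA headers will differ.
sample = [">record_1", "MKT", "AYI", "", ">record_2", "MSE"]
records = list(read_fasta(sample))
print(len(records), records[0])
```

With a real output file, the same function can be given an open file handle, for example `with open("REPSEQ_CORE.txt") as fh: records = list(read_fasta(fh))`.
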
Other Supporting Files
File | Description | Comment |
---|---|---|
DATASET.xls | Details about selected organisms | - |
list | List of selected organisms; contains Genome ID and Genome Name | Used as the reference for genome IDs that appear in other files. |
INPUT_all.faa/seq | Database Protein FASTA file for clustering | Contains all protein sequences from all genomes, with Genome ID, Gene ID and Organism Name (and GC content if generated with the GenBank option). |
INPUT_all.ffn | Nucleotide FASTA | Contains all the coding sequences from all the genomes |
gi_name | Reference gene names | Contains Gene ID and Standard Gene Name. |
matrix.txt | Binary (1/0) presence/absence matrix | Each column represents a genome (in the order given in the list file) and each row represents a gene family. 1 marks presence and 0 marks absence of a gene from the respective genome in that gene family. |
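
As a minimal sketch of how matrix.txt might be used, the following Python snippet classifies each gene family (row) as core, accessory, or unique from the number of genomes in which it is present. It assumes each row holds only 0/1 values separated by whitespace, with no header row or row labels; the actual delimiter and layout should be checked against your own file. The function names are hypothetical and not part of BPGA.

```python
from collections import Counter


def classify_row(row):
    """Classify one gene family from its 0/1 presence values."""
    present = sum(row)
    if present == len(row):
        return "core"        # present in every genome
    if present >= 2:
        return "accessory"   # present in 2 or more, but not all, genomes
    if present == 1:
        return "unique"      # present in a single genome
    return "empty"           # all zeros; not expected in a valid matrix


def parse_matrix(text):
    """Parse whitespace-separated 0/1 rows (assumed layout) into lists of ints."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:
            rows.append([int(t) for t in tokens])
    return rows


def summarize(rows):
    """Count gene families per category."""
    return Counter(classify_row(r) for r in rows)


# Toy matrix with 3 genomes (columns) and 4 gene families (rows).
example = "1 1 1\n1 0 1\n0 0 1\n1 1 1\n"
counts = summarize(parse_matrix(example))
print(dict(counts))
```

On a real run, the file contents can be passed in directly, for example `summarize(parse_matrix(open("matrix.txt").read()))`.
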