GARM documentation

Alejandro Sanchez-Flores

GARM basics

Genome Assembly Reconciliation and Merging (GARM) is a Perl script-based pipeline for merging assembly results. The great advantage of GARM is that it can store the scaffolding information of the best assembly and use it later to reconstruct the scaffolds. For GARM, a scaffold is two or more contigs joined by at least 20 N characters. The scaffolding information is used to verify whether merged contigs that belonged to a given scaffold were placed at a distance similar to the calculated gap. This is the reconciliation part, which relies on the assumption that the scaffolding information is correct; this is not necessarily true. The reconciliation part is a work in progress, since the distribution of gap lengths is difficult to determine, especially when libraries with different insert sizes were used in the assembly.
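
The scaffold definition above can be illustrated with a short Python sketch. This is not GARM's own code (GARM is written in Perl); the function names split_scaffold and is_scaffold are made up for this example, and treating lowercase n as a gap character is an assumption.

```python
import re

# Assumption: a gap is a run of 20 or more N characters (upper or lower case).
GAP = re.compile(r"[Nn]{20,}")


def split_scaffold(sequence):
    """Split a sequence into contigs at gaps of at least 20 N characters.

    Returns a list of (start, end, contig_sequence) tuples, where start and
    end are 1-based inclusive coordinates on the input sequence.
    """
    contigs = []
    pos = 0
    for gap in GAP.finditer(sequence):
        if gap.start() > pos:
            contigs.append((pos + 1, gap.start(), sequence[pos:gap.start()]))
        pos = gap.end()
    if pos < len(sequence):
        contigs.append((pos + 1, len(sequence), sequence[pos:]))
    return contigs


def is_scaffold(sequence):
    """A sequence counts as a scaffold if it holds two or more contigs."""
    return len(split_scaffold(sequence)) >= 2
```

With this sketch, a run of 19 N characters is kept as part of a single contig, while a run of 20 or more separates two contigs.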

GARM input

GARM needs 3 files:

Assembly 1: Assembly results in FASTA format. It could be just contigs or scaffolds or both.
Assembly 2: Assembly results in FASTA format. It could be just contigs or scaffolds or both.
Genome assembly list: This is a plain text file (with UNIX line endings) with two columns per line:
Column 1: the absolute path to the assembly
Column 2: a prefix that indicates what type of technology or assembler was used. It can be any prefix that helps you identify the origin of the contig.

Example (my_genomes.txt):

/scratch01/alexsf/ITV/ION_S30/454AllContigs.fna ION
/scratch01/alexsf/ITV/S30/ASSEMBLY_ABYSS_27/contigs.fa ILLUM

In this case I have one assembly using Ion Torrent data with Newbler 3.0 and a second assembly from ABySS using Illumina reads.

You can add a third column containing a * to mark the assembly whose scaffolding information should be used when both inputs contain scaffolds. Otherwise, the program chooses the best one based on a greater N50, average scaffold length and total bases.
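
As a rough illustration of the list format described above, the following Python sketch reads such a file into records. It is not part of GARM; the function name parse_genome_list and the dictionary keys are hypothetical, and it assumes columns are separated by whitespace, so paths containing spaces would not work here.

```python
def parse_genome_list(text):
    """Parse the contents of a genome assembly list.

    Each non-empty line holds: absolute path, prefix, and an optional
    third column with a * that marks the assembly to use for scaffolding.
    Assumption: columns are separated by whitespace.
    """
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {number}: expected a path and a prefix")
        entries.append({
            "path": fields[0],
            "prefix": fields[1],
            "use_for_scaffolding": len(fields) > 2 and fields[2] == "*",
        })
    return entries


example = """\
/scratch01/alexsf/ITV/ION_S30/454AllContigs.fna ION
/scratch01/alexsf/ITV/S30/ASSEMBLY_ABYSS_27/contigs.fa ILLUM *
"""

for entry in parse_genome_list(example):
    print(entry["prefix"], entry["use_for_scaffolding"], entry["path"])
```

The example text reuses the paths from my_genomes.txt, with a * added to the second line only to show the optional third column.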

GARM output

The output files depend on whether the input files contained scaffolds.

CONTIGS ONLY AS INPUT

final_bin.fasta:
This file contains the "leftover" sequences from both assemblies. These sequences are contigs from one assembly that didn't find any significant overlap when compared to the other assembly. Generally, this file should be smaller than the merged contigs file. This file represents the sequences outside of the intersection of both assemblies.

final_merged_contigs.fasta:
This file contains the merged contigs from both assemblies. Usually this file contains a slightly reduced number of bases compared to the best input assembly. However, the statistics should be better and the number of fragments (contigs and scaffolds) should be lower.

SCAFFOLDS AS ONE OR BOTH INPUTS

final_merged_with_reovl_contigs.fasta:

final_merged_with_reovl_scaffolds.agp:

final_merged_with_reovl_scaffolds.fasta:

final_merged.read.placed:

A 9-column file with useful information, such as the names and coordinates of the contigs placed in a certain scaffold.

Column 1: for GARM is always *
Column 2: Contig name
Column 3: start of contig
Column 4: total bases in the contig
Column 5: orientation on the merged contig (0 = forward, 1 = reverse)
Column 6: merge contig name
Column 7: currently is * (should have scaffold name in the future)
Column 8: approximate start of contig on merged contig
Column 9: is * always

For Columns 3 and 8 the first position is always 1 (not 0). For Column 8, the start of a contig on a merged contig is always the smallest position on the merged contig which the contig covers, regardless of its orientation.

final.stats:

The assembly stats for the bin and the merged files (scaffolds and/or contigs). The format is:

sum = number of total bases
n = number of total fragments (contigs or scaffolds)
ave = average fragment length, in bases
largest = length of the largest fragment, in bases
NXX = the length of the shortest fragment in the smallest group of the largest fragments that together add up to XX% of the total bases.
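
As a worked example of these statistics, here is a minimal Python sketch that computes them from a list of fragment lengths. It is not GARM's implementation; the function name assembly_stats is hypothetical, and using "at least XX%" as the NXX threshold is an assumption about how ties are handled.

```python
def assembly_stats(lengths, percentages=(50, 90)):
    """Compute sum, n, ave, largest and NXX for a list of fragment lengths.

    Assumption: NXX is the length of the fragment at which the running
    total of lengths, taken from largest to smallest, first reaches at
    least XX% of the total bases.
    """
    if not lengths:
        raise ValueError("at least one fragment length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    stats = {
        "sum": total,
        "n": len(ordered),
        "ave": total / len(ordered),
        "largest": ordered[0],
    }
    for xx in percentages:
        running = 0
        for length in ordered:
            running += length
            # Integer comparison avoids floating point rounding issues.
            if running * 100 >= total * xx:
                stats[f"N{xx}"] = length
                break
    return stats


# Five fragments totalling 300 bases.
print(assembly_stats([100, 80, 60, 40, 20]))
```

For these five fragments, the two largest (100 + 80 = 180) already cover half of the 300 bases, so N50 is 80; reaching 90% requires the four largest, so N90 is 40.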