Menu

Mimhap2Fasta

Robert Kofler Christos Vlachos

1 Introduction

E&R studies usually estimate the allele frequencies or the haplotypes based on mapped sequencing data (e.g. Illumina or PacBio). MimicrEE2 allows circumventing this laborious step by directly outputting the allele frequencies or the haplotypes. However to estimate the full error of E&R studies it may sometimes be necessary to simulate reads and to map them back to the genome.
To enable this, MimicrEE2 allows converting the haplotypes to fasta files. These fasta files may than be used with common read simulators and the resulting reads may be mapped back to the genome.

2 Converting haplotypes to fasta sequences

Conversion of haplotypes requires two input files, a reference genome and the haplotypes

reference genome

We store the following toy example of a reference genome in the file refgenome.fasta

>2L
AAAAAAAA
>2R
TTTTTTTT
>3L
CCCCCCCC

We specify three chromosomes, the first consists of 8 Adenins the second of 8 Thymins and the third of 8 Cytosins

haplotypes

2L  2   A   A/C CC AA
2L  7   A   A/C CC CC
2R  2   T   T/G GG GG
2R  7   T   T/G GG AA
3L  2   C   C/A CC AA
3L  7   G   C/A CC CC

We specify six SNPs, two in each of the three reference chromosomes. We also specify 4 haploid genomes (haplotypes), with varying combinations of the 6 SNPs.
Note at chromosome 3L at position 7, we specify the reference character G instead of the actual C. MimicrEE2 allows to either ignore this kind of inconsistent data or to report an error and crash (--stringent)

converting haplotypes to fasta sequences

With the following command we convert the haplotypes into fasta files:

java -jar mim2.jar mimhap2fasta --mimhap haps.mimhap --reference refgenome.fasta --output-fasta output.fasta

Note the mismatch of the reference characters at chromosome 3L position 7 (G in haplotype file vs. C in reference fasta) is ignored unless the --stringent option is used

result fasta

We obtain 12 fasta sequence, since we have 3 reference chromosomes (2R,3L,2L) and 4 haploid genomes.
We add an identifier to the reference sequences (e.g. _mimhap1) which allows to identify sequences of the same haploid genome.

>2R_mimhap1
TGTTTTGT
>3L_mimhap1
CCCCCCCC
>2L_mimhap1
ACAAAACA
>2R_mimhap2
TGTTTTGT
>3L_mimhap2
CCCCCCCC
>2L_mimhap2
ACAAAACA
>2R_mimhap3
TGTTTTGT
>3L_mimhap3
CACCCCCC
>2L_mimhap3
AAAAAACA
>2R_mimhap4
TGTTTTGT
>3L_mimhap4
CACCCCCC
>2L_mimhap4
AAAAAACA

Related

Wiki: Home
Wiki: Manual

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.