E&R studies usually estimate the allele frequencies or the haplotypes based on mapped sequencing data (e.g. Illumina or PacBio). MimicrEE2 allows circumventing this laborious step by directly outputting the allele frequencies or the haplotypes. However to estimate the full error of E&R studies it may sometimes be necessary to simulate reads and to map them back to the genome.
To enable this, MimicrEE2 allows converting the haplotypes to fasta files. These fasta files may than be used with common read simulators and the resulting reads may be mapped back to the genome.
Conversion of haplotypes requires two input files, a reference genome and the haplotypes
We store the following toy example of a reference genome in the file refgenome.fasta
>2L AAAAAAAA >2R TTTTTTTT >3L CCCCCCCC
We specify three chromosomes, the first consists of 8 Adenins the second of 8 Thymins and the third of 8 Cytosins
2L 2 A A/C CC AA 2L 7 A A/C CC CC 2R 2 T T/G GG GG 2R 7 T T/G GG AA 3L 2 C C/A CC AA 3L 7 G C/A CC CC
We specify six SNPs, two in each of the three reference chromosomes. We also specify 4 haploid genomes (haplotypes), with varying combinations of the 6 SNPs.
Note at chromosome 3L at position 7, we specify the reference character G instead of the actual C. MimicrEE2 allows to either ignore this kind of inconsistent data or to report an error and crash (--stringent)
With the following command we convert the haplotypes into fasta files:
java -jar mim2.jar mimhap2fasta --mimhap haps.mimhap --reference refgenome.fasta --output-fasta output.fasta
Note the mismatch of the reference characters at chromosome 3L position 7 (G in haplotype file vs. C in reference fasta) is ignored unless the --stringent option is used
We obtain 12 fasta sequence, since we have 3 reference chromosomes (2R,3L,2L) and 4 haploid genomes.
We add an identifier to the reference sequences (e.g. _mimhap1) which allows to identify sequences of the same haploid genome.
>2R_mimhap1 TGTTTTGT >3L_mimhap1 CCCCCCCC >2L_mimhap1 ACAAAACA >2R_mimhap2 TGTTTTGT >3L_mimhap2 CCCCCCCC >2L_mimhap2 ACAAAACA >2R_mimhap3 TGTTTTGT >3L_mimhap3 CACCCCCC >2L_mimhap3 AAAAAACA >2R_mimhap4 TGTTTTGT >3L_mimhap4 CACCCCCC >2L_mimhap4 AAAAAACA