To study genetic variants in repetitive regions using GeneticThesaurus, you will need a file called a thesaurus. A thesaurus file is a table in which entries link pairs of genomic regions that have similar sequence.
A thesaurus table is calculated starting from a genome fasta file.
(Instead of computing a thesaurus table from scratch, you can also choose to work with precomputed tables. See this page about handling precomputed downloadable tables.)
The first step is to generate a set of unaligned reads in which the position of origin is encoded into the read name, e.g.
java -jar GeneticThesaurus.jar generate
--genome mygenome.fa
--readlen 100
--divisor 10
--output thesaurus.reads
This should create one or more files with names such as thesaurus.reads.X.fa.gz, where X is a number, containing short sequences in fasta format. The reads are split into multiple files by default, and the files are automatically compressed. In this example, reads are 100bp long and span the whole genome at regular intervals of 10bp. In order to later catch similarities between short sequence fragments, it is important to keep the interval small compared to the read length.
(Note: the read-generating program can also be configured to produce paired-end reads, but for creating a thesaurus use only single-end reads from the positive genome strand.)
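As a quick sanity check (assuming gzip and standard Unix tools are available; the filename is taken from the example above), you can peek at the generated reads and count them:

```shell
reads=thesaurus.reads.0.fa.gz
if [ -e "$reads" ]; then
  zcat "$reads" | head -4            # first two reads: the headers encode positions of origin
  zcat "$reads" | grep -c '^>'       # total number of reads in this file
fi
```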
The second step is to align the generated reads onto the reference genome and record all the alternate
mapping positions:
java -jar GeneticThesaurus.jar align
--genome mygenome.fa
--input thesaurus.reads.0.fa.gz
--output thesaurus.reads.0
--mismatches 4
--earlength 5
--threads 8
This should align the reads in one input file and produce output in a file thesaurus.reads.0.bam. The output will contain indel-free alignments with, in this example, up to 4 mismatches. The earlength argument determines the length of the seeds used to start an alignment (exactly two seeds are used, one from each side of the input read). The threads argument allows you to parallelize the computation.
The alignment stage can take a long time, so you may want to run several such commands on a compute farm, one for each input file. This is a primitive and manual form of load distribution - but you only have to do this once per genome.
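One hypothetical way to generate those commands is the loop below, assuming the read files sit in the current directory. It only prints each command, so you can pipe the output to sh or wrap each line for your farm's scheduler (qsub, sbatch, etc.):

```shell
for reads in thesaurus.reads.*.fa.gz; do
  [ -e "$reads" ] || continue          # skip if the pattern matched nothing
  prefix=${reads%.fa.gz}               # e.g. thesaurus.reads.0
  echo java -jar GeneticThesaurus.jar align \
    --genome mygenome.fa \
    --input "$reads" \
    --output "$prefix" \
    --mismatches 4 --earlength 5 --threads 8
done
```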
The third and final step is to collect the multiple alignments into a single thesaurus table. This
is performed using
java -jar GeneticThesaurus.jar write
--genome mygenome.fa
--bam thesaurus.reads.0.bam
--bam thesaurus.reads.1.bam
--output thesaurus
--tempdisk /use/a/fast/disk/
--maxpenalty 3
--readlen 100
--threads 4
The purpose of this command is to collect information from multiple bam files (provide all the bam files from the alignment step using multiple --bam arguments) and to assemble them into a single table (thesaurus.tsv.gz).
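If you have many bam files, typing each --bam argument by hand is tedious. A small shell sketch (filenames assumed from the examples above) can assemble the argument list; it prints the final command so you can inspect it before running:

```shell
bams=""
for b in thesaurus.reads.*.bam; do
  [ -e "$b" ] || continue              # skip when the pattern matches nothing
  bams="$bams --bam $b"
done
# $bams is deliberately unquoted so it expands into separate --bam arguments.
# Remove 'echo' to actually run the command.
echo java -jar GeneticThesaurus.jar write \
  --genome mygenome.fa $bams \
  --output thesaurus \
  --maxpenalty 3 --readlen 100 --threads 4
```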
At this stage you can again specify the read length and a mismatch penalty - this information is interpreted slightly differently here than before: it determines how neighboring thesaurus entries are merged into longer entries. It is best to keep --readlen at the actual read length and --maxpenalty at or below the number of mismatches you used during alignment.
Optionally, you can specify a path to a directory for temporary file storage (--tempdisk). It helps if this directory is on a very fast disk, e.g. an SSD. If it is not explicitly set, temporary files will be created next to the final output file.
The threads parameter helps parallelize one of the stages of the procedure. However, much of the computation is actually done serially.
The thesaurus writing process can take a long time and use considerable resources, so make sure you have ample disk and memory space (e.g. 200GB of disk space, 32GB of memory). Apologies for this heavy resource use - but you only have to do this once per genome.
(The program will actually produce multiple output files: one final thesaurus.tsv.gz table and several temporary files containing "raw" in the filename. Unless you want to recompute thesaurus tables from the same alignments with different settings, you can delete the "raw" files.)
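A cautious way to remove those intermediate files is sketched below (the "raw" filename pattern is assumed from the note above; review what matches before deleting):

```shell
for f in thesaurus*raw*; do
  [ -e "$f" ] || continue    # no-op if nothing matches the pattern
  echo "removing $f"
  rm "$f"
done
```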
The above procedure is intentionally split into multiple parts so that you can customize each step for your own needs. You can also substitute other tools (e.g. for read mapping) as long as you can format their output to be compatible with the subsequent steps.