Home

Authors:

Lep-Anchor: Documentation

Pasi Rastas (C), 2019-2023, pasi.rastas@gmail.com

Lep-Anchor is a novel software for anchoring genome assemblies using linkage map. The software Lep-Anchor has been developed to efficiently anchor genomes by using all the information provided by Lep-MAP3. I can use additional information provided by read and contig-contig alignments to link contigs and to collapse haplotype contigs from the assembly.

The following wiki will document its features. The example data used to evaluate Lep-Anchor can be downloaded from the "Files" of this sourceforge page.

Introduction

Lep-Anchor consists of two main modules, CleanMap and PlaceAndOrientContigs. These and other modules can run by (and typically in this order)

java -cp bin CleanMap [parameters]

java -cp bin Map2Bed [parameters]

java -cp bin PlaceAndOrientContigs [parameters]

and

java -cp bin LiftoverHaplotypes [parameters]

(sometimes you call LiftoverHaplotypes before other modules)

, where "bin" is the directory where the class files are located. Each module requires some parameters, simply running "java -cp bin moduleName" lists these parameters and a short description of each parameter.

Installation

Lep-Anchor is implemented in java. To run it, java runtime environment is required (java.com). The compiled java classes and source code can be downloaded from this sourceforge page.

Lep-Anchor wrapper

There are now three wrappers (lepanchor_wrapper.sh and lepanchor_wrapper2.sh and lepanchor_wrapper3.sh) that do basic two- and three-iteration run of Lep-Anchor. With these wrappers, you only need one file per map, you can obtain such files as follows, assuming 21 chromosomes (see examples below on how to get snps.txt and order$X.int from Lep-MAP3 output):

for X in {1..21}
do
awk -vn=$X '(NR==FNR){map[NR-1]=$0}(NR!=FNR){$1=map[$1] "\t" n;print}' snps.txt order$X.int >order${X}_i.input
done >map_all_chr.txt

and then you can run Lep-Anchor as

./lepanchor_wrapper2.sh -t num_threads -T num_threads_per_run -f ref.fasta -n num_chr -c chain_file -p paf_file -m map_file1 -m map_file2

in this case as follows

./lepanchor_wrapper2.sh -t 8 -f ref.fa.gz -m map_all_chr.txt -n 21 -c all.chain.gz -p all.paf

This script outputs two new reference fastas (LA_REF.fa.gz and REF_LA_scaffolds.fa.gz) and Marey maps (mareyX.png) for each chromosome and other useful information. In the output reference LA_REF.fa.gz, each chromosome is in one scaffold, whereas in REF_LA_scaffolds.fa.gz, contigs/scaffolds with uncertain orientation and/or position are kept as separate scaffolds.

The third wrapper (lepanchor_wrapper3.sh) is especially suitable for allopolyploid genomes where haplotype removal is difficult before anchoring due to sequence similarity. It removes haplotypes only within each (pseudo-)chromosome.

Converting Lep-MAP3 data to Lep-Anchor

This is how you can convert your Lep-MAP3 data to Lep-Anchor:

Consider Lep-MAP3 input is the output file from ParentCall2 or Filtering2, say data.pc.gz in compressed format.

First, create file with snp names

zcat data.pc.gz|awk '(NR>=7){print $1"\t"$2}' >snps.txt

The snps.txt looks like this (output of head snps.txt)
CHR POS
Contig0 204
Contig0 231
Contig0 4848
Contig0 4921
Contig0 4938
Contig0 4946
Contig0 4989
Contig0 5003
Contig0 5020

If the marker assignment to chromosomes is in mapLG.txt (from SeparateChromosomes2 or JoinSingles2All),
the input for CleanMap can be obtained by

paste snps.txt mapLG.txt|awk '(NR>1)' >cleanMap.input

The required bed for PlaceAndOrientContigs can be created as shown in Modules part.

The input for PlaceAndOrientContigs can be obtained as follows.
Let orderX.int be the output from "java OrderMarkers2 ... calculateIntervals=orderX.int chromosome=X ..." for each chromosome X=1, 2, ..., N

awk -vn=X '(NR==FNR){map[NR-1]=$0}(NR!=FNR){$1=map[$1] "\t" n;print}' snps.txt orderX.int >orderX_i.input

Note that you have to replace X with 1..N, e.g. N=21

for X in {1..21}
do
awk -vn=$X '(NR==FNR){map[NR-1]=$0}(NR!=FNR){$1=map[$1] "\t" n;print}' snps.txt order$X.int >order${X}_i.input
done

and then

java -cp bin/ PlaceAndOrientContigs map=orderX_i.input bed=chrX.bed  ...

Or the output orderX.txt from "java OrderMarkers2 ... chromosome=X ... >orderX.txt" for each chromosome X

awk -vn=X '(NR==FNR){map[NR-1]=$0}(NR!=FNR && /^[^#]/){print map[$1],n,$2,$3}' snps.txt orderX.txt >orderX_m.input

This requires noIntervals=1 flag

java -cp bin/ PlaceAndOrientContigs map=orderX_m.input bed=chrX.bed noIntervals=1 ...

Run Lep-Anchor without a linkage map

Lep-Anchor can be run without any map by creating a dummy map like this,
still using all other information (e.g. contig and read mappings, alignment chains and Hi-C data):

awk -f contigLength.awk reference.fasta >reference.fasta.sizes
awk '{print $1,int($2/2),1,1}' reference.fasta.sizes >dummy.map

Creating alignment chains for haplotype removal

I have used the HaploMerger2 pipeline (see below) for creating the chain file (all.chain.gz).

Another option is to use minimap2, implemented in a new wrapper lepanchor_chain.sh:

./lepanchor_chain.sh -t 20 -f ref.fasta.gz

This is what happens inside the hood:

#minimap2 --no-long-join -c -PD -g 20000 -r 1000,40000 -k15 -w8 -A3 -B6 -O12 -E1 -s200 -z600 -N50 --min-occ-floor=100 -t 20 ref.fa ref.fa >ref-self.paf
#chaining
#java -cp bin/ ChainPaf paf=ref-self.paf|awk -f sortpaf.awk| sort -T . -n -r -k 1,1 | #cut -f 2- > ref-self_sorted.paf
#create the chain file
#awk -f paf2chain.awk rf-self_sorted.paf | gzip > paf.chain.gz

On some genomes, this can be very slow. Increasing k and w might speed up the minimap2 part.

Note: you can use multiple chain files in Lep-Anchor. For example you can provide both, the minimap2 chain (paf.chain.gz) and the HaploMerger2 chain (all.chain.gz) files via the wrapper as

./lepanchor_wrapper2_.sh -c "all.chain.gz paf2.chain.gz"

The use of multiple chains (computed with different software) can increase the number of detected haplotypes.

Filtering on sequencing coverage

You can use the sequencing coverage to filter out duplications from the chain files (typically you should not remove duplications as haplotypes).
The scripts lepanchor_remap.sh and lepanchor_chain_mask.sh will do the coverage filtering:

#map hifi data to the genome
./lepanchor_remap.sh -f ref.fasta -r reads.fast? -o hifi
#mask the chain (all.chain.gz) with hifi coverage (hifi.cov)
./lepanchor_chain_mask.sh -c all.chain.gz -d hifi.cov -o all
#now the masked chain is all.masked_masked.gz and this can be used with Lep-Anchor wrappers

Alternatively, If you have a bam file (data.bam) (re)mapping data to the genome you can calculate the depth with samtools like this:

samtools depth -a data.bam|java -cp bin/ CoverageAnalyser depth=- >ca.txt
#you might want to set parameters zeta and normalDepth manually, e.g. "zeta=1.0 normalDepth=210" of achieve a better fit of columns 1 (observed coverage) and 2 (estimated coverage) in the ca.txt 

#Now you can filter the chain with lepanchor_chain_mask.sh

#to be continued...

Using proximity data (Hi-C)

Mapping Hi-C reads (reads1.fq.gz and reads2.fq.gz) using bwa and samblaster:

bwa mem -t32 -5SPM reference.fasta.gz read1.fq.gz read2.fq.gz|samblaster -r|awk -f hic.awk >hic.txt

Sort data based on contig name:

sort -V -k 1,1 hic.txt >hic_sorted.txt

#alternative solution:
#mkdir split
#awk '{fn="split/"$1; print $0>fn}' hic.txt
#cat split/* >hic_sorted.txt

Normalise data:

#iteration 1
java -cp bin Normalise data=hic_sorted.txt contigLength=reference.fasta.sizes >density.txt
#iteration 2
java -cp bin Normalise data=hic_sorted.txt contigLength=reference.fasta.sizes prevDensity=density.txt >density2.txt
#iteration 3
java -cp bin Normalise data=hic_sorted.txt contigLength=reference.fasta.sizes prevDensity=density2.txt >density3.txt
#output proximity data
java -cp bin Normalise data=hic_sorted.txt contigLength=reference.fasta.sizes prevDensity=density3.txt calculateProximity=1 >proximity.txt
#get support data
java -cp bin Support data=hic_sorted.txt contigLength=reference.fasta.sizes prevDensity=density3.txt >support.txt

You can output proximity via parameter -r in the wrapper as

lepanchor_wrapper2.sh ... -r "proximity.txt 10000 25 0.2"

(parameters are bin size=10000 (bp), max distance=25 bins = 250 000 (bp), and scale is 0.2)

Find possible assembly errors with the proximity data:

awk -f support2intervals.awk reference.fasta.sizes support.txt >support_i.txt
awk -f errorIntervals2.awk support_i.txt
#mean coverage = XX mean density = YY
#take values, say between 1-10% of means as limit
awk -vminCov=0.01*XX -vminDens=0.1*YY -f errorIntervals2.awk support_i.txt support_i.txt > errors.txt
#create a paf to fix possible false positives
awk -f cuts2paf.awk errors.txt errors.txt >errors.paf

Input the possible contig errors for Lep-Anchor as

lepanchor_wrapper2.sh ... -r "proximity.txt 10000 25 0.2" -e errors.txt -p errors.paf

Including long read data

You can use minimap2 to map long read data to your genome and use the paf file as input for Lep-Anchor. Mapping can be done as follows:

#minion
minimap2 -c -x map-ont -t 16 reference.fasta.gz reads.fasta.gz >alnONT.paf
#./lepanchor_remap.sh -f reference.fasta.gz -r reads.fasta.gz -p map-ont -t 16

#pacbio
minimap2 -c -x map-pb -t 16 reference.fasta.gz reads.fasta.gz >alnPB.paf
#./lepanchor_remap.sh -f reference.fasta.gz -r reads.fasta.gz -t 16

You can provide the paf file to the lepanchor_wrappers as parameter -p, e.g.

lepanchor_wrapper2.sh ... -p alnPB.paf

or you can provide multiple pafs as this

lepanchor_wrapper2.sh ... -p <(cat alnPB.paf alnONT.paf)

If you are not using the wrapper, the paf file(s) can be provided for PlaceAndOrientContigs module as shown in below.

Modules

CleanMap

Example input, columns contig, position and chromosome (sorted on contig and position, e.g. sort -V -k 1,1 -k 2,2)

Contig1 100  4
Contig1 1315 4
Contig1 2224 4
Contig2 123 5
...

Run CleanMap

java -cp bin/ CleanMap map=map.txt >map.clean

Make a bed file and chromosomal beds (chr1.bed, chr2.bed, ...) (NEW VERSION)

java -cp /bin Map2Bed map=map.clean contigLength=reference.fasta.sizes >map.bed

You can give map.bed to PlaceAndOrientContigs with parameter chromosome=1,2,...

java -cp bin/ PlaceAndOrientContigs bed=map.bed chromosome=1 ... >chr1.la
java -cp bin/ PlaceAndOrientContigs bed=map.bed chromosome=2 ... >chr2.la
...

reference.fasta.sizes contains the contig lengths

Contig1 123001
Contig2 432002
...

This file can be obtained with contigLength.awk from the reference.fasta

awk -f contigLength.awk reference.fasta >reference.fasta.sizes

PlaceAndOrientContigs

Example input, columns contig, position, chromosome and map position interval(s)

Contig1 100  4 10 10
Contig1 1315 4 10 12 22 30
Contig1 2224 4 11 11
Contig3 123  4 12 12
...

Run PlaceAndOrientContigs

java -cp bin/ PlaceAndOrientContigs map=map.int bed=chr4.bed >chr4.la 2>chr4.la.err

Example input, columns contig, position, chromosome and map position(s) in cM

Contig1 100  4 5.5 4.0
Contig1 1315 4 6.0 4.5
Contig1 2224 4 6.2 4.5
Contig3 123  4 6.2 4.5
...

Run PlaceAndOrientContigs

java -cp bin/ PlaceAndOrientContigs map=map.cm bed=chr4.bed noIntervals=1 >chr4.la 2>chr4.la.err

To build liftover chains to collapse haplotype contigs, see Haplomerger2 pipeline here.
Only the steps m.batchA1.initiation_and_all_lastz + hm.batchA2.chainNet_and_netToMaf of HaploMerger2 are required to run.
Note that in most systems you have to increase the maximum number of open files (e.g. ulimit -n 655350), the errors caused too low ulimit are hard to notice. After successful HaploMerger2 run (check all log files carefully!), the aligment chain all.chain.gz can be found from the folder genome_name.genome_namex.result
This chain can be given to Lep-Anchor as this:

zcat all.chain.gz|java -cp bin/ PlaceAndOrientContigs chain=- ...

Create paf from long reads

#minion
minimap2 -c -x map-ont -t 16 reference.fasta.gz reads.fasta.gz >alnONT.paf

#pacbio
minimap2 -c -x map-pb -t 16 reference.fasta.gz reads.fasta.gz >alnPB.paf

Run PlaceAndOrientContigs with the paf (you can combine several pafs with cat):

java -cp bin/ PlaceAndOrientContigs paf=file.paf ...

Make agp files (puts contigs with unknown orientation in + orientation)

awk -vlg=4 -f makeagp_full2.awk chr4.la >chr4.agp

or (puts contigs with unknown orientation as separate contigs or scaffolds)

awk -vlg=4 -f makeagp2.awk chr4.la >chr4.agp

Removing haplotypes before Lep-Anchor

Let all.chain.gz be the alignment chain, the following finds candidate haplotype contigs

#default parameters -vminScore=50 -vmaxGap=10000
#alignment score >50*length and max 10kb not aligning
zcat all.chain.gz|awk -f findFullHaplotypes.awk >fullHaplotypes50.txt

To remove these haplotypes from map.bed (from Map2Bed), simple grep could be used

grep -w -v -f <(cut -f 2 fullHaplotypes50.txt) map.bed >map_nh.bed

And then PlaceAndOrientContigs should be run with map_nh.bed.

Note: to be sure on the haplotypes, one could check the sequencing depth for the haplotypes. Also the markers in the haplotype contigs could be mapped to non-haplotype contigs (e.g. by LiftoverHaplotypes).

Removing haplotypes after Lep-Anchor

Let chr*.err be the error stream output from PlaceAndOrientContigs. The lines ending with haplotype give the score of leaving these contigs away

#haplotype score 1 per 5kb, first 5kb free (parameter limit=5000)
sort -n -r chr*.err|awk -vlimit=5000 '($NF=="haplotype" && ($1>=($4-$3+1-limit)/limit) && (!(($5 SUBSEP $6 SUBSEP $7) in h))){h[$2,$3,$4]; print}' >new_haplotypes.txt

awk -f removeHaplotypes.awk map_nh.bed new_haplotypes.txt >map_nh2.bed

Then PlaceAndOrientContigs should be run again with map_nh2.bed

Removing contig overlaps

As Lep-Anchor processes each chromsome separately, there can be some overlaps in the (chimeric) contigs after PlaceAndOrientContigs. These can be removed as follows:
Let chrX.txt (chr*.txt) be the output from PlaceAndOrientContigs for each chromosome (1,2,...,X) separately

awk -f removeOverlaps map_nh.bed chr*.txt >chr_all.txt

awk '($5==1)' chr_all.txt|awk -vlg=1 -f makeagp2.awk >chr1.agp
...
awk '($5==X)' chr_all.txt|awk -vlg=X -f makeagp2.awk >chrX.agp

Include contigs without markers

Providing a parameter keepEmptyIntervals=1 to PlaceAndOrientContigs includes also the contigs without any markers. Note that you must provide additional evidence (PacBio/Nanopore data via paf or chain file via chain parameter) to place such contigs. The idea is to include all non-haplotype contigs without markers to all linkage groups (chromosomes) and after the running PlaceAndOrientContigs pick those contigs that are confidently linked to only one linkage group. The propagate.awk script can be used to find such linked contigs.

It is recommended to re-run PlaceAndOrientContigs with only the confident contigs. The sketch of the pipeline looks following:

#map.txt contains all map information for all markers
#columns contig,pos,chr,map-interval

#LiftoverHaplotypes
java -cp bin LiftoverHaplotypes map=map.txt haplotypes=fullHaplotypes50.txt >map_liftover.txt
#Use map_liftover for CleanMap
java -cp bin CleanMap map=map_liftover.txt >map.clean

#Map2Bed
java -cp bin Map2Bed map=map.clean contigLength=scaffold.lengths >map.bed

#find contigs not haplotypes and not included in map.bed
cut -f 1 scaffold.lengths|grep -v -w -F -f <(cut -f 2 fullHaplotypes50.txt; cut -f 1 map.bed) >not_used.txt

#provide parameter N (number of chromosomes) and each contig in not_used.txt is added to each chr 1..N
grep -w -F -f not_used.txt scaffold.lengths|awk -vn=N '{s=$1"\t1\t"$2"\t?\t"; for (i=1;i<=n;++i) print s i}' >chr0.bed

#combine map.bed and chr0.bed
cat map.bed chr0.bed >map_extra.bed

#run PlaceAndOrientContigs
for i in {1..N}
do
java -cp bin/ PlaceAndOrientContigs ... chromsosome=$i bed=map_extra.bed keepEmptyIntervals=1 ... >chr$i.la 2>chr$i.la.err
done

#propagate
awk -f propagate chr*.la >tmp1.la
awk -f propagate tmp1.la >tmp2.la
awk -f propagate tmp2.la >tmp3.la
awk -f propagate tmp3.la >tmp4.la
#can be iteratated until output does not change anymore

#create prop*.la
awk '/^[^#]/{++d[$1 "\t" $7+0 "\t" $8+0]; data[++line]=$0}END{for (i = 1; i <= line; ++i) {$0=data[i];if (d[$1 "\t" $7+0 "\t" $8+0] == 1) fn="prop"$5".la"; else if ($5==1) fn="prop0.la"; else fn=""; if (fn != "") print $0>fn}}' tmp4.la

#create a new bed by combining prop[1-9]*.la and map.bed
awk '(NR==FNR){print;c[$1]}(NR!=FNR && !($1 in c)){print $1 "\t" $7+0 "\t" $8+0"\t?\t"$5}' map.bed prop[1-9]*.la >map_prop.bed

#re-run LA
for i in {1..N}
do
java -cp bin/ PlaceAndOrientContigs ... chromsosome=$i bed=map_prop.bed keepEmptyIntervals=1 ... >ichr$i.la 2>ichr$i.la.err
done

Create physical marker order for final validation (and for other purposes) for Lep-MAP3

This is for chr1:

awk -f liftover chr1.agp order1.input|sort -V|grep CHR >order1.liftover
#assuming scaffolds/contigs are named CHR.... (default for makeagp*)
awk -vinverse=1 -f liftover chr1.agp order1.liftover|awk '(NR==FNR){m[$1"\t"($2+0)]=NR-1}(NR!=FNR){print m[$1"\t"($2+0)]}' snps.txt - >order1.phys

The order1.phys should contains only integers, pointing to markers in the physical order. Then you can run OrderMarkers2 as before (with map and data parameters as before. You can try higher scale parameter if you have many markers, thus scale < 1)

java OrderMarkers2 ... evaluateOrder=order1.phys improveOrder=0 hyperPhaser=1 phasingIterations=3 map=map.txt chromosome=1 ... >order1.eval
#recommended extra parameters hyperPhaser=1 and phasingIterations > 1.

Create the final fasta file

Final output from Lep-Anchor is in chr1.la, chr2.la, ...

for i in {1..X}
do
awk -vlg=$i -f makeagp2 chr$i.la >chr$i.agp
done

or (put ? orientation as +) to obtain chromosomal scaffolds

for i in {1..X}
do
awk -vlg=$i -f makeagp_full2 chr$i.la >chr$i.agp
done

assuming there are no other chr*.agp files than just created

awk -f makefasta contig.fasta chr*.agp >final.fasta

Lifting markers from haplotype regions

If you remove contigs before running Lep-Anchor, you can make a liftover any possible markers wihin these contigs as follows:

#find full haplotype contigs based on alignments alone
zcat all.chain.gz|awk -f findFullHaplotypes.awk >fullHaplotypes50.txt
#Liftover
java -cp bin LiftoverHaplotypes map=map.txt haplotypes=fullHaplotypes50.txt >map_liftover.txt
#Use map_liftover for CleanMap
java -cp bin CleanMap map=map_liftover.txt >map.clean

Similarly you can use liftover on actual linkage maps:

java -cp bin LiftoverHaplotypes map=orderX.int haplotypes=fullHaplotypes50.txt >orderX_liftover.int
java -cp bin PlaceAndOrientContigs ... map=orderX_liftover.txt ... >chrX.la 2>chrX.la.err

And you can include new_haplotypes.txt as well if you re-run Lep-Anchor:

java -cp bin LiftoverHaplotypes map=orderX.int haplotypes=<(cat fullHaplotypes50.txt new_haplotypes.txt) >orderX_liftover2.int

OLD STUFF

Make bed files (chr1.bed, chr2.bed, ...) (OLD VERSION using simple scripts)

awk '($5>=1.0)' map.clean|awk -f cleanmap.awk|awk -f joinIntervals.awk reference.fasta.sizes - >map.bed

awk '{fn="chr" $4 ".bed";print $1"\t"$2"\t"$3 >fn}' map.bed

The wiki uses Markdown syntax.

Project Members:

Pasi Rastas (admin)

Lep-Anchor Wiki

Linkage map guided genome anchoring