Hello,
Perhaps you can help with an issue I have encountered.
I am running with CrossMap v0.2.3 installed via pip2.7 as per instructions. I have performed a liftover on a .vcf.gz file going from hg19 to hg38, and I have observed that in the output_file, for SNV's, the REF allele column is empty (two adjacent tabs). The output_file.unmap does contain the REF allele.
Example cmd line:
CrossMap.py vcf /opt/lab/genomeLiftover/chain/hg19ToHg38.over.chain.gz test.hg19.vcf.gz /opt/genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa test.hg38.vcf
For this example, I used the test.hg19.vcf.gz file and chain file from the Crossmap sourceforge site download.
Partial output, first two lines after #CHROM...
chr13 31742298 14125 A . . description=a variant linked to breast cancer
chr22 15927609 rs149201999 C 100 PASS LDAF=0.0649;RSQ=0.8652;AN=2184;ERATE=0.0046;VT=SNP;AA=.;AVGPOST=0.9799;THETA=0.0149;SNPSOURCE=LOWCOV;AC=134;AF=0.06;A...
Only the ALT alleles are emitted.
Thanks for your help, and the effort to build Crossmap.
Tim
Hello,
As an update I have resolved the issue.
This code block around line 412 in CrossMap.py
sets fields[3] to '' in my case. I am running with pysam (0.9.1.1). The most simple fix is below, making tmp a list and not a str, and fixes the issue for my case. (rstrip is not required any more either)
I am not sure if this issue was due to a change in specification for the return type for the pysam.faidx(), or a bug in CrossMap for all versions of pysam.faidx
One last point, as I was digging into the pysam library, I thought replacing pysam.faidx(...) with somthing like this (after opening the fasta file refFastaFile=pysam.FastaFile(reference))
provides a huge performance speedup - 1/5th the time for a 4M record .vcf.gz, and 1/10th the time for a 500K record .vcf.gz file.
Best regards,
Tim