Timothy Beck - 2016-06-25

Hello,
As an update I have resolved the issue.
This code block around line 412 in CrossMap.py

# update ref allele
                                tmp = pysam.faidx(refgenome,str(a[1][0]) + ':' + str(a[1][1]+1) + '-' + str(a[1][2]))
                                fields[3] =tmp[-1].rstrip('\n\r').upper()

sets fields[3] to '' in my case. I am running with pysam (0.9.1.1). The most simple fix is below, making tmp a list and not a str, and fixes the issue for my case. (rstrip is not required any more either)

# update ref allele
                                tmp = pysam.faidx(refgenome,str(a[1][0]) + ':' + str(a[1][1]+1) + '-' + str(a[1][2])).splitlines()
                                fields[3] =tmp[-1].rstrip('\n\r').upper()

I am not sure if this issue was due to a change in specification for the return type for the pysam.faidx(), or a bug in CrossMap for all versions of pysam.faidx

One last point, as I was digging into the pysam library, I thought replacing pysam.faidx(...) with somthing like this (after opening the fasta file refFastaFile=pysam.FastaFile(reference))

region = refFastaFile.fetch(region=str(a[1][0]) + ':' + str(a[1][1]+1) + '-' + str(a[1][2])).upper()

provides a huge performance speedup - 1/5th the time for a 4M record .vcf.gz, and 1/10th the time for a 500K record .vcf.gz file.

Best regards,

Tim