I am trying to use lofreq for somatic indel calling using the somatic and --call-indels handler for my normal and tumor bam files. I downloaded the version lofreq_star-2.1.0_linux-x86-64.tgz. I am using it on our institute cluster Linux version 3.2.0-4-amd64. I have just unpacked the software and then tried to run it. The command am using is below
Since it is an exome data am using the bed file of the SureSelect V4 and also am using the dbSNP file to remove the variants from the output.
I am getting the following error.
WARNING [2015-01-14 17:09:00,626]: Using stringent normal for SQ which does not however contain indels!
CRITICAL [2015-01-14 17:09:00,724]: The following command failed with code 1: lofreq call-parallel --pp-threads 8 -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 -l /scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed --call-indels -a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
CRITICAL [2015-01-14 17:09:00,724]: Received the following on stderr:
INFO [2015-01-14 17:09:00,653]: Using 8 threads with following basic args: lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 --call-indels -a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
Traceback (most recent call last):
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 751, in <module>
main()
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 563, in main
bed_bins = [Region._make(x) for x in read_bed_coords(bed_file)]
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 106, in read_bed_coords
this looks like an unhandled error triggered in the bed reading
function. Is your bed file in any way unusual? Would you mind to share
it, either here or via PM (wilma@gis.a-star.edu.sg)? The problem
should be easy to fix then.
I am trying to use lofreq for somatic indel calling using the somatic and
--call-indels handler for my normal and tumor bam files. I downloaded the
version lofreq_star-2.1.0_linux-x86-64.tgz. I am using it on our institute
cluster Linux version 3.2.0-4-amd64. I have just unpacked the software and
then tried to run it. The command am using is below
Since it is an exome data am using the bed file of the SureSelect V4 and
also am using the dbSNP file to remove the variants from the output.
I am getting the following error.
WARNING [2015-01-14 17:09:00,626]: Using stringent normal for SQ which does
not however contain indels!
CRITICAL [2015-01-14 17:09:00,724]: The following command failed with code
1: lofreq call-parallel --pp-threads 8 -d 101000 -f
/scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1
-l
/scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed
--call-indels -a 0.100000 --use-orphan -B -N -A
/scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
-o
/scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
CRITICAL [2015-01-14 17:09:00,724]: Received the following on stderr:
INFO [2015-01-14 17:09:00,653]: Using 8 threads with following basic args:
lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose
--no-default-filter -b 1 --call-indels -a 0.100000 --use-orphan -B -N -A
/scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
Traceback (most recent call last):
File
"/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py",
line 751, in <module>
main()
File
"/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py",
line 563, in main
bed_bins = [Region._make(x) for x in read_bed_coords(bed_file)]
File
"/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py",
line 106, in read_bed_coords
I am not sure how I can share the bed file, can i host it up anywhere? its almost 180 mb I guess. it is the sure select bed file which I have been supplied with the kit. I also have a .list of the same file, can that be used or I have to use the .bed format itself?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I have used the S03723314_Covered.bed file which you can download from the agilent website. And this does not work. So you recommend to just create a custom bed file with coordinates and then run the command?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Oh ok. That looks like an extension of the bed format. LoFreq (and
samtools) expect something much simpler as defined here http://genome.ucsc.edu/FAQ/FAQformat.html#format1:
1st column chromosome name
2nd column region start (zero based)
3rd column region end (inclusive)
All other columns are optional.
You have unexpected rows in your data needed for the UCSC track definition.
Anyway, the following will produce a version of your file that should work:
$ grep -v ^browser S03723314_Covered.bed | grep -v ^track > new.bed
Try the created file called "new.bed" instead of the original bed
file. Let me know if that works and I will make corresponding changes
in the LoFreq code.
Yes the tool is working now, I modified the bed file of the company to a general bed format as of mentioned in UCSC just keeping the chr start and end coordinates and it seem to run , am waiting for the output now. Just to ask you a thing, this vcf are zipped , to annotate I usually use VEP for retrieving maximum annotated variants but it does not work with pakced vcf. I will try to unzip it once the run is over. Just to ask you any annotator that can directly accept vcf zipped files?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I am watching 3 different outout, 2 is of snvs ( one is relaxed and other is stringent) and one stringent for the indel file. I uncompressed the stringent and relaxed file of my output to see how the contents are placed. In some SNVs the QUAL filter does not have any values and it is having (.) and in other they have values. Usually I chose variants having QUAL above 50. ALso here I see the stringent SNV file has over 20k variants. in the info column I could see something as CONSVAR, may I know what it stands for? It seems I need to filter the vcf files as well from the lofreq output. I could not find much details on these front in any of the documentation. If still I had missed I would ask for apology. Please let me know then shall I go for the filter handler to filter the vcf files that are output here from LoFreq?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Also I carefully noticed some log comments where I see these comments
WARNING [2015-01-15 11:59:20,322]: Using stringent normal for SQ which does not however contain indels!
CRITICAL [2015-01-15 14:44:52,604]: The following command failed with code 1: lofreq vcfset -a complement -2 /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -1 /scratch/GT/vdas/pietro/exome_s
eq/results/lofreq_out/out_LG_somatic_final.snvs.vcf.gz --only-snvs -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_LG_somatic_final_minus-dbsnp.snvs.vcf.gz
CRITICAL [2015-01-15 14:44:52,604]: Received the following on stderr:
FATAL(lofreq_vcfset.c|main_vcfset:301): Couldn't load tabix index for /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf
-a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
stderr: INFO [2015-01-15 11:59:13,416]: Using 8 threads with following basic args: lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 --call-indels -a 0.10
0000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
stderr:
stderr: INFO [2015-01-15 11:59:14,418]: Adding 27 commands to mp-pool
stderr: Number of substitution tests performed: 17946093
stderr: Number of indel tests performed: 424952
stderr: INFO [2015-01-15 12:51:21,951]: Copying concatenated vcf file to final destination
Can you please guide me where am going wrong? I mean we should not have so many SNVs or indels. Both are around 20k and 14k respectively after filtering for variants with QUAL score. So am in a fix now. I have already used this bam files on GATK and VarScan and I have got respectable counts for SNVs and indels but I want to make my analysis mroe strong using LoFreq indel caller to build a comprehensive list of genes that are indels. Also I have very few samples and I work individually on them.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
When you call somatic SNVs then you only need to look at the file that
ends in "somatic_final.snvs.vcf.gz", or if you used dbsnp then
"somatic_final_minus-dbsnp.snvs.vcf.gz". All the other files are more
or less temporary and you should ignore them. See also http://csb5.github.io/lofreq/commands/#somatic
You are getting a "tabix index" error because LoFreq expects your
dbsnp file to be tabix indexed (see 'lofreq somatic --help'). This is
to speed up things. Just run 'bgzip' and 'tabix' on the dbsnp vcf file
to achieve this. This will also save you some disk space.
Regarding the quality filtering: LoFreq filters variants for you
already. Just use the files called final (see above). So there is no
need for you to do that again. We call variants with more than 50%
allele frequency consensus variants (CONSVAR) and don't assign a
quality to them, which is why they get a "dot" (for missing) instead.
These are "obvious" variants i.e. you could say they are high quality.
You will want to keep them.
You asked for a variant annotator that directly accepts zipped vcf
files: I don't know if SNPeff etc. support this, but if it doesn't,
then it's just a matter of running 'gunzip' on the file first.
And finally the bed file issue: samtools actually does support the
format you used and I fixed LoFreq to do the same. The fix will be
included in the next version (2.1.2). For the sake of completeness,
could you please send me a link to the bed file you used (you said you
downloaded it from somewhere), so that I can test this thoroughly?
Let me know if you need more information or if you have any more problems,
Andreas
Also I carefully noticed some log comments where I see these comments
WARNING [2015-01-15 11:59:20,322]: Using stringent normal for SQ which does
not however contain indels!
CRITICAL [2015-01-15 14:44:52,604]: The following command failed with code
1: lofreq vcfset -a complement -2
/scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -1
/scratch/GT/vdas/pietro/exome_s
eq/results/lofreq_out/out_LG_somatic_final.snvs.vcf.gz --only-snvs -o
/scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_LG_somatic_final_minus-dbsnp.snvs.vcf.gz
CRITICAL [2015-01-15 14:44:52,604]: Received the following on stderr:
FATAL(lofreq_vcfset.c|main_vcfset:301): Couldn't load tabix index for
/scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf
-a 0.100000 --use-orphan -B -N -A
/scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
-o
/scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
stderr: INFO [2015-01-15 11:59:13,416]: Using 8 threads with following basic
args: lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa
--verbose --no-default-filter -b 1 --call-indels -a 0.10
0000 --use-orphan -B -N -A
/scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
stderr:
stderr: INFO [2015-01-15 11:59:14,418]: Adding 27 commands to mp-pool
stderr: Number of substitution tests performed: 17946093
stderr: Number of indel tests performed: 424952
stderr: INFO [2015-01-15 12:51:21,951]: Copying concatenated vcf file to
final destination
Can you please guide me where am going wrong? I mean we should not have so
many SNVs or indels. Both are around 20k and 14k respectively after
filtering for variants with QUAL score. So am in a fix now. I have already
used this bam files on GATK and VarScan and I have got respectable counts
for SNVs and indels but I want to make my analysis mroe strong using LoFreq
indel caller to build a comprehensive list of genes that are indels. Also I
have very few samples and I work individually on them.
Error running lofreq somatic and indel calling handler on normal tumor bam
files
Hi Andreas,
Thank you for your reply. I would like to say that I had a successful run. Actually while it was runnnig the *final.vcf.gz were still not generated so I could not see them. I would look carefully now on them. One note I would like to say is I do not see any file with somatic_final_minus-dbsnp.snvs.vcf.gz. Is this because my dbsnp file is not indexed? Should I index it and run it once more? I will now go through the outputs but meanwhile would like to inform you that I have used -d dbsnps file but I do not see any output with final_minus-dbsnps. So shall I run again with the indexed dbsnp and then making a zip file?
Also the bed file you are talking about is hosted in agilent website. You have to log in
https://earray.chem.agilent.com/earray/
If you are new to earray, you will first need to register and then it is free to use. In the top right corner of the screen choose Application type Sureselect Target enrichment. From there choose “Libraries” and then “Browse Libraries”. Choose the catalogue kit that you are interested in and click download. A list of the annotation files available will then appear, choose the file type you require and click download.
If you are facing issues let me know I can put my file in dropbox and link it to you. Thanks.
Chris
Last edit: chris.cornor 2015-01-16
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
yes, the file somatic_final_minus-dbsnp.snvs.vcf.gz is not there,
because the dbsnp removal step failed, since your dbsnp was not
indexed. I changed the code so that it will now complain directly.
Anyway, you can just run bgzip and tabix on dbsnp and then rerun
LoFreq somatic using the same parameters as before and add --continue.
That will pick up things were they failed and continue. If you however
unzipped some files while LoFreq was still running, then it's probably
safest to remove all intermediate files and start all over again.
Hi Andreas,
Thank you for your reply. I would like to say that I had a successful run.
Actually while it was runnnig the *final.vcf.gz were still not generated so
I could not see them. I would look carefully now on them. One note I would
like to say is I do not see any file with
somatic_final_minus-dbsnp.snvs.vcf.gz. Is this because my dbsnp file is not
indexed? Should I index it and run it once more? I will now go through the
outputs but meanwhile would like to inform you that I have used -d dbsnps
file but I do not see any output with final_minus-dbsnps. So shall I run
again with the indexed dbsnp and then making a zip file?
Also the bed file you are talking about is hosted in agilent website. You
have to log in
https://earray.chem.agilent.com/earray/
If you are new to earray, you will first need to register and then it is
free to use. In the top right corner of the screen choose Application type
Sureselect Target enrichment. From there choose “Libraries” and then “Browse
Libraries”. Choose the catalogue kit that you are interested in and click
download. A list of the annotation files available will then appear, choose
the file type you require and click download.
If you are facing issues let me know I can put my file in dropbox and link
it to you. Thanks.
Chris
Error running lofreq somatic and indel calling handler on normal tumor bam
files
Thank you very much for the reply. I would like to let you know that in the final indel file (which is my primary requirement now), I have annotated and there I found some indels with rsid after annotation which says that they are indeed present in dbsnp database so I guess rerunning it wont be that necessary right? However sometimes some variants even after discounting we see they do not show any rsid at vcf level but on annotation they do get some ids. So do you think at this point it will be necessary to do it all over again. My idea is to find variants in tumor and in its reprogrammed clone and match them , provided if some of them are not novel it is not an issue as of now. When I will go for tumor driving variants then probably discounting the dbsnp variants will be useful. That can also be done filtering for annotated variants which has some #rsid after annotating them. Please let me know your views on this.
Thanks,
Chris
Last edit: chris.cornor 2015-01-19
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
if you get a final vcf file, then there is no need to rerun LoFreq.
Whether you want to remove dbsnp or not ,depends a bit on the question
you want to answer. The default is be to remove dbsnp, simply to avoid
cases where a germline call was erroneously missed in the normal,
which then produces a (false) somatic call. If in doubt, go with the
default!
Thank you very much for the reply. I would like to let you know that in the
final indel file (which is my primary requirement now, I have not annotated
them and there I found some indels with rsid after annotation which says
that they are indeed present in dbsnp database so I guess rerunning it wont
be that necessary right? However sometimes some variants even after
discounting we see they do not show any rsid at vcf level but on annotation
they do get some ids. So do you think at this point it will be necessary to
do it all over again. My idea is to find variants in tumor and in its
reprogrammed clone and match them , provided if some of them are not novel
it is not an issue as of now. When I will go for tumor driving variants then
probably discounting the dbsnp variants will be useful. That can also be
done filtering for annotated variants which has some #rsid after annotating
them. Please let me know your views on this.
Thanks,
Chris
Error running lofreq somatic and indel calling handler on normal tumor bam
files
Hi Andreas,
I am trying to use lofreq for somatic indel calling using the somatic and --call-indels handler for my normal and tumor bam files. I downloaded the version lofreq_star-2.1.0_linux-x86-64.tgz. I am using it on our institute cluster Linux version 3.2.0-4-amd64. I have just unpacked the software and then tried to run it. The command am using is below
/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq somatic --threads 8 -n /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam -t /scratch/GT/vdas/pietro/exome_seq/results/T_S7998/T_S7998.realigned.recal.bam -f /scratch/GT/vdas/test_exome/exome/hg19.fa --call-indels -l /scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed -d /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_
N_S8980.realigned.recal.bam is my normal bam
T_S7998.realigned.recal.bam is my tumor bam
Since it is an exome data am using the bed file of the SureSelect V4 and also am using the dbSNP file to remove the variants from the output.
I am getting the following error.
WARNING [2015-01-14 17:09:00,626]: Using stringent normal for SQ which does not however contain indels!
CRITICAL [2015-01-14 17:09:00,724]: The following command failed with code 1: lofreq call-parallel --pp-threads 8 -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 -l /scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed --call-indels -a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
CRITICAL [2015-01-14 17:09:00,724]: Received the following on stderr:
INFO [2015-01-14 17:09:00,653]: Using 8 threads with following basic args: lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 --call-indels -a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
Traceback (most recent call last):
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 751, in <module>
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 563, in main
File "/scratch/GT/softwares/lofreq_star-2.1.0/bin/lofreq2_call_pparallel.py", line 106, in read_bed_coords
ValueError: need more than 1 value to unpack
CRITICAL [2015-01-14 17:09:00,724]: Somatic SNV caller failed. Exiting
Can you please tell me where am getting wrong and how to get rid of this error. I guess the software I downloaded if fine for my system.
Regards,
Chris
Hi Chris,
this looks like an unhandled error triggered in the bed reading
function. Is your bed file in any way unusual? Would you mind to share
it, either here or via PM (wilma@gis.a-star.edu.sg)? The problem
should be easy to fix then.
Andreas
On 15 January 2015 at 01:45, chris.cornor vd4mmind@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
I am not sure how I can share the bed file, can i host it up anywhere? its almost 180 mb I guess. it is the sure select bed file which I have been supplied with the kit. I also have a .list of the same file, can that be used or I have to use the .bed format itself?
I have used the S03723314_Covered.bed file which you can download from the agilent website. And this does not work. So you recommend to just create a custom bed file with coordinates and then run the command?
The format of the bed file looks like this
head -5 S03723314_Covered.bed
browser position chr1:762097-762270
track name="Covered" description="Agilent SureSelect DNA - SureSelect #Human All Exon V4- Genomic regions covered by probes" color=0,0,128
chr1 762097 762270 #ref|LINC00115,ens|ENST00000473798,ens|ENST00000536430,ref|NR_024321,mRNA|AK026292,mRNA|BC017762,mRNA|BC017762,mRNA|AK026292,ref|NR_024321,ens|ENST00000536430,ens|ENST00000473798,ref|LINC00115
chr1 861281 861490 #ref|SAMD11,ccds|CCDS2.2,ens|ENST00000420190,ens|ENST00000437963,ens|ENST00000342066,ref|NM_152486,mRNA|AF161376,mRNA|AF161376,ref|NM_152486,ens|ENST00000342066,ens|ENST00000437963,ens|ENST00000420190,ccds|CCDS2.2,ref|SAMD11
chr1 865591 865791 #ref|SAMD11,ccds|CCDS2.2,ens|ENST00000420190,ens|ENST00000437963,ens|ENST00000342066,ens|ENST00000341065,ref|NM_152486,mRNA|AF161376,mRNA|AF161376,ref|NM_152486,ens|ENST00000341065,ens|ENST00000342066,ens|ENST00000437963,ens|ENST00000420190,ccds|CCDS2.2,ref|SAMD11
This is a standard bed format for the agilent kits.
Last edit: chris.cornor 2015-01-15
Oh ok. That looks like an extension of the bed format. LoFreq (and
samtools) expect something much simpler as defined here
http://genome.ucsc.edu/FAQ/FAQformat.html#format1:
1st column chromosome name
2nd column region start (zero based)
3rd column region end (inclusive)
All other columns are optional.
You have unexpected rows in your data needed for the UCSC track definition.
Anyway, the following will produce a version of your file that should work:
$ grep -v ^browser S03723314_Covered.bed | grep -v ^track > new.bed
Try the created file called "new.bed" instead of the original bed
file. Let me know if that works and I will make corresponding changes
in the LoFreq code.
Thanks,
Andreas
On 15 January 2015 at 19:15, chris.cornor vd4mmind@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Yes the tool is working now, I modified the bed file of the company to a general bed format as of mentioned in UCSC just keeping the chr start and end coordinates and it seem to run , am waiting for the output now. Just to ask you a thing, this vcf are zipped , to annotate I usually use VEP for retrieving maximum annotated variants but it does not work with pakced vcf. I will try to unzip it once the run is over. Just to ask you any annotator that can directly accept vcf zipped files?
I am watching 3 different outout, 2 is of snvs ( one is relaxed and other is stringent) and one stringent for the indel file. I uncompressed the stringent and relaxed file of my output to see how the contents are placed. In some SNVs the QUAL filter does not have any values and it is having (.) and in other they have values. Usually I chose variants having QUAL above 50. ALso here I see the stringent SNV file has over 20k variants. in the info column I could see something as CONSVAR, may I know what it stands for? It seems I need to filter the vcf files as well from the lofreq output. I could not find much details on these front in any of the documentation. If still I had missed I would ask for apology. Please let me know then shall I go for the filter handler to filter the vcf files that are output here from LoFreq?
Also I carefully noticed some log comments where I see these comments
WARNING [2015-01-15 11:59:20,322]: Using stringent normal for SQ which does not however contain indels!
CRITICAL [2015-01-15 14:44:52,604]: The following command failed with code 1: lofreq vcfset -a complement -2 /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf -1 /scratch/GT/vdas/pietro/exome_s
eq/results/lofreq_out/out_LG_somatic_final.snvs.vcf.gz --only-snvs -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_LG_somatic_final_minus-dbsnp.snvs.vcf.gz
CRITICAL [2015-01-15 14:44:52,604]: Received the following on stderr:
FATAL(lofreq_vcfset.c|main_vcfset:301): Couldn't load tabix index for /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf
CRITICAL [2015-01-15 14:44:52,604]: Somatic SNV caller failed. Exiting
Does this mean that the tool did not work properly? do I have to index even te dbsnp file?
Also in the log file which comes as an output I see the below comments which shows that the run was fine
lofreq call-parallel --pp-threads 8 -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 -l /scratch/GT/vdas/referenceBed/hg19/ss_v4/Exon_SSV4_clean.bed --call-indels
-a 0.100000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam -o /scratch/GT/vdas/pietro/exome_seq/results/lofreq_out/out_HG_normal_relaxed.vcf.gz
stderr: INFO [2015-01-15 11:59:13,416]: Using 8 threads with following basic args: lofreq call -d 101000 -f /scratch/GT/vdas/test_exome/exome/hg19.fa --verbose --no-default-filter -b 1 --call-indels -a 0.10
0000 --use-orphan -B -N -A /scratch/GT/vdas/pietro/exome_seq/results/N_S8980/N_S8980.realigned.recal.bam
stderr:
stderr: INFO [2015-01-15 11:59:14,418]: Adding 27 commands to mp-pool
stderr: Number of substitution tests performed: 17946093
stderr: Number of indel tests performed: 424952
stderr: INFO [2015-01-15 12:51:21,951]: Copying concatenated vcf file to final destination
Can you please guide me where am going wrong? I mean we should not have so many SNVs or indels. Both are around 20k and 14k respectively after filtering for variants with QUAL score. So am in a fix now. I have already used this bam files on GATK and VarScan and I have got respectable counts for SNVs and indels but I want to make my analysis mroe strong using LoFreq indel caller to build a comprehensive list of genes that are indels. Also I have very few samples and I work individually on them.
Hi Chris,
When you call somatic SNVs then you only need to look at the file that
ends in "somatic_final.snvs.vcf.gz", or if you used dbsnp then
"somatic_final_minus-dbsnp.snvs.vcf.gz". All the other files are more
or less temporary and you should ignore them. See also
http://csb5.github.io/lofreq/commands/#somatic
You are getting a "tabix index" error because LoFreq expects your
dbsnp file to be tabix indexed (see 'lofreq somatic --help'). This is
to speed up things. Just run 'bgzip' and 'tabix' on the dbsnp vcf file
to achieve this. This will also save you some disk space.
Regarding the quality filtering: LoFreq filters variants for you
already. Just use the files called final (see above). So there is no
need for you to do that again. We call variants with more than 50%
allele frequency consensus variants (CONSVAR) and don't assign a
quality to them, which is why they get a "dot" (for missing) instead.
These are "obvious" variants i.e. you could say they are high quality.
You will want to keep them.
You asked for a variant annotator that directly accepts zipped vcf
files: I don't know if SNPeff etc. support this, but if it doesn't,
then it's just a matter of running 'gunzip' on the file first.
And finally the bed file issue: samtools actually does support the
format you used and I fixed LoFreq to do the same. The fix will be
included in the next version (2.1.2). For the sake of completeness,
could you please send me a link to the bed file you used (you said you
downloaded it from somewhere), so that I can test this thoroughly?
Let me know if you need more information or if you have any more problems,
Andreas
On 16 January 2015 at 01:30, chris.cornor vd4mmind@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Hi Andreas,
Thank you for your reply. I would like to say that I had a successful run. Actually while it was runnnig the *final.vcf.gz were still not generated so I could not see them. I would look carefully now on them. One note I would like to say is I do not see any file with somatic_final_minus-dbsnp.snvs.vcf.gz. Is this because my dbsnp file is not indexed? Should I index it and run it once more? I will now go through the outputs but meanwhile would like to inform you that I have used -d dbsnps file but I do not see any output with final_minus-dbsnps. So shall I run again with the indexed dbsnp and then making a zip file?
Also the bed file you are talking about is hosted in agilent website. You have to log in
https://earray.chem.agilent.com/earray/
If you are new to earray, you will first need to register and then it is free to use. In the top right corner of the screen choose Application type Sureselect Target enrichment. From there choose “Libraries” and then “Browse Libraries”. Choose the catalogue kit that you are interested in and click download. A list of the annotation files available will then appear, choose the file type you require and click download.
If you are facing issues let me know I can put my file in dropbox and link it to you. Thanks.
Chris
Last edit: chris.cornor 2015-01-16
Hey Chris,
yes, the file somatic_final_minus-dbsnp.snvs.vcf.gz is not there,
because the dbsnp removal step failed, since your dbsnp was not
indexed. I changed the code so that it will now complain directly.
Anyway, you can just run bgzip and tabix on dbsnp and then rerun
LoFreq somatic using the same parameters as before and add --continue.
That will pick up things were they failed and continue. If you however
unzipped some files while LoFreq was still running, then it's probably
safest to remove all intermediate files and start all over again.
Andreas
On 16 January 2015 at 17:49, chris.cornor vd4mmind@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC
Hi Andreas,
Thank you very much for the reply. I would like to let you know that in the final indel file (which is my primary requirement now), I have annotated and there I found some indels with rsid after annotation which says that they are indeed present in dbsnp database so I guess rerunning it wont be that necessary right? However sometimes some variants even after discounting we see they do not show any rsid at vcf level but on annotation they do get some ids. So do you think at this point it will be necessary to do it all over again. My idea is to find variants in tumor and in its reprogrammed clone and match them , provided if some of them are not novel it is not an issue as of now. When I will go for tumor driving variants then probably discounting the dbsnp variants will be useful. That can also be done filtering for annotated variants which has some #rsid after annotating them. Please let me know your views on this.
Thanks,
Chris
Last edit: chris.cornor 2015-01-19
Hey Chris,
if you get a final vcf file, then there is no need to rerun LoFreq.
Whether you want to remove dbsnp or not ,depends a bit on the question
you want to answer. The default is be to remove dbsnp, simply to avoid
cases where a germline call was erroneously missed in the normal,
which then produces a (false) somatic call. If in doubt, go with the
default!
Andreas
On 19 January 2015 at 17:48, chris.cornor vd4mmind@users.sf.net wrote:
--
Andreas Wilm
andreas.wilm@gmail.com | mail@andreas-wilm.com | 0x7C68FBCC