From: James B. <jk...@sa...> - 2022-08-04 08:19:28
On Thu, Aug 04, 2022 at 08:43:58AM +0100, Thomas Juettemann wrote:
> Hi Rob,
> Thanks for looking into it. Unfortunately isec keeps only the first record.

If you're not after a full intersection of only-in-A, only-in-B and
in-both, then it's possible you could use filtering options instead.

E.g. "bcftools view -T A.vcf.gz B.vcf.gz" will report records from B
that overlap locations listed in A. It doesn't need to be a BED file
as it'll auto-detect the file format.

> > On Tue, 2 Aug 2022, Thomas Juettemann wrote:
> >
> > > I came across a "transcript-based" VCF file, meaning a variant can be
> > > present multiple times but belonging to a different transcript. See
> > > "File 1" below as an example. I am finding myself in the unfortunate
> > > situation of having to intersect ("File 2") and retain all records
> > > with the same position and REF/ALT ("Desired output").
> > > Long shot: Is that possible?
> >
> > Does "bcftools isec" (https://www.htslib.org/doc/bcftools.html#isec) do
> > what you want? The "Extract and write records from A shared by both A and
> > B using exact allele match" example in the manual page sounds like it
> > might:
> >
> >   bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz

I think this means that with the "bcftools view -T" example above,
multiple transcripts in B that overlap the coordinates in A will still
be shown. If you need the reverse, then it'd need another command with
A and B swapped around.

I'm not sure this is exactly the same thing, but it's worth an
experiment with a few simple examples to validate it. (Take care to
test complex variants and not just SNPs, to check that "overlap" works
as you expect when indels are present.)

James

-- 
James Bonfield (jk...@sa...)
The Sanger Institute, Hinxton, Cambs, CB10 1SA

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is
215 Euston Road, London, NW1 2BE.
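
Before experimenting on real files, it can help to write down what "keep every record of B at a site listed in A" and "keep every record of B with the same position and REF/ALT as a record in A" would each return on a tiny example. The Python sketch below is only a toy model and not how bcftools is implemented: the Record tuple, both function names and the sample records are hypothetical, and it compares the exact POS only, so it says nothing about how overlaps are decided when indels are present.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a VCF data line.
Record = namedtuple("Record", "chrom pos ref alt info")


def filter_by_position(a_records, b_records):
    """Keep every record of B whose (chrom, pos) is listed in A."""
    sites = {(r.chrom, r.pos) for r in a_records}
    return [r for r in b_records if (r.chrom, r.pos) in sites]


def filter_by_allele(a_records, b_records):
    """Keep every record of B whose (chrom, pos, ref, alt) is listed in A."""
    keys = {(r.chrom, r.pos, r.ref, r.alt) for r in a_records}
    return [r for r in b_records if (r.chrom, r.pos, r.ref, r.alt) in keys]


# A lists one variant; B is "transcript-based", so the same variant
# appears once per transcript.
A_RECORDS = [Record("chr1", 100, "A", "G", "")]
B_RECORDS = [
    Record("chr1", 100, "A", "G", "TX1"),
    Record("chr1", 100, "A", "G", "TX2"),
    Record("chr1", 100, "A", "T", "TX3"),  # same site, different ALT
    Record("chr1", 200, "C", "T", "TX4"),  # site not listed in A
]

if __name__ == "__main__":
    print([r.info for r in filter_by_position(A_RECORDS, B_RECORDS)])
    print([r.info for r in filter_by_allele(A_RECORDS, B_RECORDS)])
```

In this toy data the position-based filter keeps TX1, TX2 and TX3, while the allele-aware filter keeps only TX1 and TX2. Neither drops the duplicate transcript records. Whether a real command behaves like one or the other is exactly what the small validation experiment should confirm.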