Extract PE Reads (with their mates) supporting variants in vcf file
2
3
Entering edit mode
6.4 years ago
pitithat ▴ 30

Hi, Suppose I have a bam file and a vcf file containing variant calling result. I want to extract only reads with their mate that support variant allele in the vcf. It would be nice to get those reads in bam format. I tried googling such tools to do this and found like VariantBAM but it reports both reads that supporting and not supporting variant.

Thanks

sequence vcf bam • 5.5k views
ADD COMMENT
1
Entering edit mode

I am going to tag Pierre Lindenbaum

He may already have something written to do this.

ADD REPLY
0
Entering edit mode

Hello,

have a look at this thread. It might be useful for you.

fin swimmer

ADD REPLY
0
Entering edit mode

Hello, Yeah, I found this thread however they don't give me the mate reads.

ADD REPLY
0
Entering edit mode

If it gives you the read names that support the variant, you can take this list as an input for FilterSamReads or a simple grep. Depending on your aligner the mate have the same read name and should be returned as well.

fin swimmer

ADD REPLY
0
Entering edit mode

This would be a nice tool indeed. For visualization, the program I've written, ASCIIGenome, has the option filterVariantReads that gets close to what you need if combined with the print command. It could be scripted and automated but it's not quite what you ask for.

ADD REPLY
0
Entering edit mode

hi Pierre, Excellent!!! This is something I was trying to do a couple of weeks ago and settled for VariantBam. But now I see this discussion, I would like to ask if this is applicable for structural variants as well. As we all know, the vcf for structural variants is a bit different compared to SNVs. Here are how the vcf records look like

> 10      89877751        3829:2  G       [10:89652084[G  37      PASS    DISC_MAPQ=60;EVDNC=ASDIS;MAPQ=60;MATEID=3829:1;MATENM=0;NM=0;NUMPARTS=2;SCTG=c_10_89651707_89652771_3C;SPAN=225667;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO      0/0:0:683:99:0,184.9,2030:0:0:185.2:0  0/1:60:592:37.7:37.7,0,1401:50:14:-37.5:116.7

> 21      17979635        8995:1  A       ]21:39877316]A  50      PASS    DISC_MAPQ=59;EVDNC=TSI_G;HOMSEQ=AAA;INSERTION=TAGATTGGGGAAGAATTTATACATTCATTCAAC;MAPQ=60;MATEID=8995:2;MATENM=0;NM=0;NUMPARTS=3;SCTG=c_21_39875939_39879467_17C;SPAN=21897681;SVTYPE=BND GT:AD:DP:GQ:PL:SR:DR:LR:LO      0/0:0:662:99:0,179.2,1967:0:0:179.6:0  0/1:61:558:50.3:50.3,0,1305:42:34:-50.02:120.8

Getting the supporting reads for these variants is a little confusing. And at the moment, I am taking the reads from both breakpoints separately and then combining. This is a little laborious way but as I was on a tight deadline, I just did it manually. It would be nice to have a way to extract just supporting reads for such structural variants (helps visualization and validation).

Thanks in advance, Venkatesh Chellappa (Venki)

ADD REPLY
0
Entering edit mode

it's too complicated for now, but anyway, why would you need to extract the variant of a SV ? Isn't visualization enough to validate a variant ?

ADD REPLY
0
Entering edit mode

I need the "reads supporting the variant" to visualize in IGV. I am currently using an "evidence bam" that contains reads that support breakpoints on left and right ends of the structural variants.

so yeah, me manually extracting the reads based on the loci of breakpoints is too laborious and I want to know if anyone has working solution for this.

I am thinking of opening a new discussion!

ADD REPLY
6
Entering edit mode
6.4 years ago

I quickly wrote something : http://lindenb.github.io/jvarkit/Biostar322664.html

caveat: BAM files must be sorted with picard SortSam/queryname and variants are loaded in memory and only SNP are considered.

$ java -jar picard.jar SortSam I=src/test/resources/S1.bam O=query.bam SO=queryname
$ java -jar dist/biostar322664.jar -V src/test/resources/S1.vcf.gz query.bam  

(...)
RF02_358_926_2:0:0_2:1:0_83 83  RF02    857 60  70M =   358 -569    GACGTGAACTATATAATTAAAATGGACAGAAATCTGCCATCAACAGCTAGATATATAAGACCTAATTTAC  2222222222222222222222222222222222222222222222222222222222222222222222  RG:Z:S1 NM:i:3  AS:i:55 XS:i:0
RF02_362_917_2:0:0_2:1:0_6f 147 RF02    848 60  70M =   362 -556    ATAAGGAATCACGTTAACTATATACTTAAAATGGACTGAAATCTGCCATCAACAGCTAGATATATAAGAC  2222222222222222222222222222222222222222222222222222222222222222222222  RG:Z:S1 NM:i:3  AS:i:55 XS:i:0
(...)

it's not fully tested, tell me if something is looks wrong.

ADD COMMENT
1
Entering edit mode

This was just what I needed as well! It was a huge help, at just the right time! Thank you Pierre!

All I had to do was manually edit the VCF file 'src/test/resources/S1.vcf.gz' so that it would reflect my target variant.

ADD REPLY
0
Entering edit mode

enter image description here

ADD REPLY
1
Entering edit mode

Haha! Unfortunately, there will be no article written, but seeds will sprout and thrive thanks to your help.

ADD REPLY
0
Entering edit mode

Thank you much. But I still find that your program also report reads that don't support the variant allele.

unsupport reads

ADD REPLY
0
Entering edit mode

can you please send a minimal VCF (one variant) and a minimal bam. Thanks. https://github.com/lindenb/jvarkit/issues/new

ADD REPLY
2
Entering edit mode
6.4 years ago

Just as a warning, strictly speaking, it is not possible to extract reads supporting a variant in a VCF since you don't know what reads the underlying variant caller has used. Mutect2, for example, discard reads failing some configurable options about base quality, mapping quality, flags etc plus reads are locally re-aligned. So the number of reads supporting a variant in the VCF output of Mutect2 is typically lower than the one you would "see" in a genome viewer. This gets even more complicated for structural variants. In fact, what my program does is simply to extract reads in a region if the read bases in that region do not match the reference.

ADD COMMENT
0
Entering edit mode

Hello, I wanna make sure how your program work. Is this the command "filterVariantReads -r 1000 vars.*vcf" to do so? If it is, what will it give me and in what format?

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 1879 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6