Extracting Reads Containing A Specific Variant From A Bam File
2
2
Entering edit mode
11.0 years ago
Nick Stoler ▴ 70

Perhaps my Google-fu is failing me, because this doesn't seem like an uncommon need, but I can't find any answer here.

Say I have a SAM/BAM file, and a variant at a known location. I'd like to extract all the alignments from the BAM file which contain that variant. That is, I'd like the ALT, not the REF allele. I do not want all the reads at that location. Is there an existing tool or workflow out there that can accomplish this?

I would honestly be surprised if there wasn't, since I would imagine it's a common situation to be interested in a certain variant and investigate the reads supporting it. Specifically, I'd like to make sure there's no bias in where along the read it falls, or the strand of the read. I should note that my coverage is very high, so it's not feasible to extract all the reads covering the site and manually sort them out.

I've already started a script that does this by parsing CIGAR strings, but I wanted to check that it doesn't already exist before going further down this road.

bam sam format cigar • 14k views
ADD COMMENT
0
Entering edit mode

Maybe something like this:

samtools view foo.bam | grep -E 'pattern'

ADD REPLY
1
Entering edit mode

@PhilS: your solution won't help parsing the cigar string.

ADD REPLY
0
Entering edit mode

Sry I thought you need the whole information of the read given by a bam file. If you need the cigar string only you can do it like:

samtools view foo.bam1:1660404-1660404 | awk '{print $6"\t"$10'} | grep 'pattern' | cut -f 1

this will give you the cigar strings of all reads containing 'pattern'

ADD REPLY
2
Entering edit mode

no, you have to walk over the cigar string to see if the read contains or not the variation. You cannot use a regex for this.

ADD REPLY
3
Entering edit mode
11.0 years ago
Rm 8.3k

See if this oneliner helps: For example: at given position say "1:1660404" get reads with Variant from the bam file:

samtools view -b INPUT.bam 1:1660404-1660404 | samtools fillmd -e - REFERENCE.fasta | grep -v "^@"| awk -v pos="1660404" 'BEGIN {OFS = FS = "\t" } ; {n=split($10,a,"") ; if(a[(pos-$4)+1] != "=" ) print pos,(pos-$4)+1, a[(pos-$4)+1], $1, $4, $10 }'

Potential output:

Position Distance_from_read_start Variant READ_NAME Read_Start READ

    1660404 31      G       HI-SW1050R:121:D2C53BCXX:8:2211:2278:38508      1660374 ==============================G===================================================
    1660404 30      G       HI-SW1050R:121:D2C53BCXX:8:2308:3759:48204      1660375 =============================G====================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:2:2208:13607:73456     1660380 =======C================G=========================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:6:2201:3216:84210      1660380 =======C================G=========================================================
    1660404 25      G       HI-SW1050R:121:D2C53BCXX:8:1316:1264:91116      1660380 =======C================G======N==N===================N=N=========================
    1660404 16      G       HI-SW1050R:121:D2C53BCXX:2:1301:15715:89711     1660389 ===============G==================================================================

Note: READ base changed to '=' if identical to reference base: it doesn't show indels and soft-clipped bases

ADD COMMENT
0
Entering edit mode

That's good to know about fillmd's -e option, but unfortunately it seems it doesn't show indels. Also, soft-clipped bases appear just like SNV's.

ADD REPLY
0
Entering edit mode

@Nick : thats True, I will edit it

ADD REPLY
0
Entering edit mode

lovely solution!

I wanted to filter the read pairs also, in which case I used your command to get the read names, stored them in an intermediate file and then extracted the same info (using fillmd) to get the read pairs into a .sam file

ADD REPLY
1
Entering edit mode
11.0 years ago

My previous answer might help: position of mismatches per read from a sam/bam file : you'll get the content of each read (as XML) for each base.

ADD COMMENT
0
Entering edit mode

Thanks! So it looks like that can get the information I need. It's a custom script like I've been writing (except certainly much nicer and faster), instead of an existing tool, so I'll take that as an answer to whether I'm reinventing the wheel.

ADD REPLY
0
Entering edit mode

Hey Nick,

Did you ever figure out how to extract reads with a specific variant? I'm looking to do something similar.

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2513 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6