Question

Extract reads with variant at given position

2

Entering edit mode

10.5 years ago

philip.jonsson ▴ 30

I want to be able to extract, from BAM files, the reads that, for a given position, contain the variant (and conversely but separately, the ones that contain the reference). I've played around with various tools and searched the tubes, without finding any convenient way to do this. It doesn't seem like too odd a task to perform, so maybe my google fu is just failing me here.

sequencing alignment • 9.5k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by philip.jonsson ▴ 30

0

Entering edit mode

Are you looking for all reads containing some variant at, say, chr1:1000 or instead all those that contain a variant in base 20 of their alignment? I assume the former, which is much quicker to do, but figured I'd ask.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

The former, if I understand you correctly. I have a specific variant and I want to examine the reads that support that variant and compare to the reads that instead show the reference allele at the same position.

ADD REPLY • link 10.5 years ago by philip.jonsson ▴ 30

Ram · Answer 1 · 2015-02-05

Another option, for querying any ad hoc range (chrN:X-Y):

$ echo -e "chrN\tX\tY" | bedops --element-of 1 <(bam2bed < reads.bam) - > overlapping_reads.bed

To query any given set of genomic ranges in variants.bed:

$ bedops --element-of 1 <(bam2bed < reads.bam) variants.bed > overlapping_reads.bed

The bam2bed tool will preserve all fields, which may be useful.

If variants are in VCF, then a second bash subshell to call vcf2bed can convert them to BED, e.g.:

$ bedops --element-of 1 <(bam2bed < reads.bam) <(vcf2bed < variants.vcf) > overlapping_reads.bed

Ram · Answer 2 · 2015-02-05

2

Entering edit mode

10.5 years ago

Pierre Lindenbaum 166k

I would use my tool sam2tsv to extract the name of the reads having (or not) the mutation.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

Since this can take input via a pipe, it should be added that the simplest method would be to take the BAM file, sort and index it, and then:

samtools view -bu some_file.bam chr1:123 | java -jar dist/sam2tsv.jar -r ref.fa stdin > blah.txt

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

1

Entering edit mode

@Devon: added -bu

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Good catch!

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Why is it better to use a BAM file as stdin?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 9.6 years ago by emd07 • 0

score 1 · Answer 3 · 2020-07-20

1

Entering edit mode

5.1 years ago

sacha ★ 2.4k

For a specific variant, you can use bamql

For instance, if you want to extract all reads from a bam file :

which support the alternate 'A' in snp : chr17:29827429 G>A

bamql -f yourfile.bam 'chr(17) & nt_exact(29827429,A)' -o A.bam

which support the reference 'G' in snp : chr17:29827429 G>A

bamql -f yourfile.bam 'chr(17) & nt_exact(29827429,G)' -o B.bam

I guess you can loop over variant file and do this for each variant. You can also filter with mate information. See documentation Here

enter image description here

ADD COMMENT • link 5.1 years ago by sacha ★ 2.4k

0

Entering edit mode

man this! thank you very much! this is specifically what i needed :)

ADD REPLY • link 4.8 years ago by morovatunc ▴ 560

0

Entering edit mode

BAMQL is unfortunetely not very HPC friendly. I kinda found this maybe it helps.

https://github.com/broadinstitute/VariantBam

ADD REPLY • link 4.8 years ago by morovatunc ▴ 560

score 0 · Answer 4 · 2015-02-05

In the off chance that you really do need to have the full alignments in SAM format in separate files, then you can code something up in either python with pysam or C with HTSlib. In both cases you can use the pileup functions to just grab all alignments that overlap a given position (or set of them) and then iterate over the results. The pileup functions will tell you which base of each alignment overlaps a given position, so you can check what genotype it supports and treat it accordingly.

Edit: But really, if you think you want the full alignments in separate files then you should probably rethink what you're doing. The two answers posted before mine are vastly simpler and should suffice for most normal needs.

score 0 · Answer 5 · 2015-02-06

0

Entering edit mode

10.5 years ago

philip.jonsson ▴ 30

Thanks for the input, guys. I've realized that bam-readcount ultimately is the most appropriate tool for my purposes.

ADD COMMENT • link 10.5 years ago by philip.jonsson ▴ 30