Extract reads from BAM file with known variant
1
0
Entering edit mode
2.3 years ago
pablo ▴ 310

Hello,

I have a bam file from which I'd like to extract reads which contain one specific variant at a specific position. I used :

samtools mpileup -uf ref.fa my.bam -v -r chr7:151490948-151490948

From IGV, I can spot 1681 "A" nucleotide variant (cf. IGV image below). I would like to extract the read containing this variant. The upper command givec back :

#CHROM POS      ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  my.bam
chr7      151490948       .       T       A,C,<*> 0       .       DP=8074;I16=1395,5186,333,1160,1.67816e+06,4.2793e+08,380715,9.70823e+07,45921,2.31924e+06,9838,499526,164525,4.11312e+06,37325,933125;QS=0.82102,0.178825,0.000155298,0;VDB=1.02295e-43;SGB=-0.693147;RPB=0.78422;MQB=0.997427;MQSB=1.49326e-36;BQB=1;MQ0F=0.00606886  PL      0,99,255,255,255,255,255,255,255,255

I can see the A variants but also the C variants. I know the option --output-QNAME extract the read names. In my case, I want to extract only the "A" variant reads, not the "C" ones.

I tried :

samtools mpileup --output-QNAME  -f ref.fa my.bam  -r CM000669.2:151490948-151490948 |awk '{print $NF}' | sed 's/,/\n/g' | wc -l

I get 8157 matching reads. Do these reads correspond to the reference and the variant (A or C) alleles?

Any help to extract my 1681 variant "A" reads?

bam variant samtools • 4.8k views
ADD COMMENT
0
Entering edit mode

how did you find the variant ? show us the igv screenshot.

ADD REPLY
0
Entering edit mode

you may use python programming for this task. here are some simple steps: open the file, read line by line, add the "if" condition variant you need, store these variant data in a list and print the list. that's all. i hope this help...

ADD REPLY
1
Entering edit mode

this doesn't explain why OP cannot find a read carrying the variant with samtools.

ADD REPLY
0
Entering edit mode

This user seems to be pushing python on multiple threads.

ADD REPLY
0
Entering edit mode

but there is any output, whereas

what does that mean ?

are you sure it's 'Chr19' and not 'chr19' or '19' ?

ADD REPLY
0
Entering edit mode

enter image description here

I expect a variant T/A (with 1681 reads matching the A variant) as you in the IGV window. I would like to extract those 1681 reads ID.

ADD REPLY
0
Entering edit mode

I am sure of the nomenclature. I add this an IGV window, for another locus.

ADD REPLY
0
Entering edit mode

for another locus

???

ADD REPLY
0
Entering edit mode

The IGV window I showed is for another locus which is Chr7:151490948-151490948) (not Chr9:151490946-151490946 as mentionned first)

ADD REPLY
0
Entering edit mode

You are not answering questions: A) show the locus that is relevant, b) confirm that "Chr7" rather than "chr7" is correct.

ADD REPLY
0
Entering edit mode

You're right, it is always because I try to modify a bit my post to do not divulge on what I work. Anyway, I edited the post.

ADD REPLY
1
Entering edit mode

It just makes it harder to help. Do you really think people could threaten your project by seeing a random screenshot not even knowing who you are and who your work for? Whatever, your choice.

ADD REPLY
0
Entering edit mode

Just right click the read in IGV and click "copy read details to clipboard" then just grep for the read name using something like samtools view your_file.bam | grep 'M00119:58:000000000-JV6N9:1:2116:17863:22678'

ADD REPLY
0
Entering edit mode

This works only for one read right? I need to do it for the 1681 reads..

ADD REPLY
0
Entering edit mode

start with one read... show us that this read is present in the samtools view output...

ADD REPLY
0
Entering edit mode

On IGV, as I showed with the screenshot, I have 1681 "A" variant reads. But when I scroll all the reads, I can't spot any "A" variant . Is that normal? So, it is impossible for me to detect them and get the corresponding "copy read details to clipboard".

ADD REPLY
0
Entering edit mode

But when I scroll all the reads, I can't spot any "A" variant . Is that normal?

see IGV downsampling reads params https://software.broadinstitute.org/software/igv/Preferences .

ADD REPLY
0
Entering edit mode

Thanks for the doc. I tried to dealt with it and the "Preferences" parameters (I also unchecked the "Downsample reads") but I can't get the "A" variant. Other mutations are however visible as you can see on the screenshot....

enter image description here

ADD REPLY
0
Entering edit mode

Why don't you copy the sequence of the read surrounding the A like +/-10 bp and then grep for the sequence instead of the read name e.g. 'AGAAGATA' etc... make sure it's long enough to be unique

ADD REPLY
0
Entering edit mode

I already tried. When I grep +/- 10 bp around the variant, I only get 39 BAM sequences. Which is very much low than the 1681 expecting reads of the IGV screenshot.

ADD REPLY
0
Entering edit mode

I installed the last version of IGV and I am now able to spot the "A" variant on the reads.

enter image description here

About the "copy read details to clipboard" , I can grep the read name in my BAM file. It works for one :

samtools view my.bam | grep "m64071_220512_054244/103417836/ccs"
ADD REPLY
0
Entering edit mode

ok i can write you a script to identify these if that's all you need... are you able to use R?

ADD REPLY
0
Entering edit mode

Indeed, I am able to use R.

ADD REPLY
0
Entering edit mode
2.3 years ago

This should extract reads from an index'd BAM file at a location of interest and transfer them into a valid SAM file. Modify the variables in the script to work according to your specific variant.

library(GenomicAlignments)
## your bam file
mybam <- 'your_file.bam'
## the variant you want to extract reads for
##location of variant chromosome:position (make sure to check if your genome is 1 or chr1 style)
varposition <- '10:89720633'
## allele of variant
var <- 'T'
## convert to GRanges
var.gr <- GRanges(varposition)

## read your BAM file
aln <- readGAlignments(mybam,param=ScanBamParam(which=var.gr,what=c('qname','strand','seq')))
## find variants at location! WATCH OUT MAKE SURE YOUR VARIANT IS BASED ON THE plus strand
aln.seq <- stackStringsFromGAlignments(aln,region=var.gr)
## extract read names with variant
reads <- mcols(aln[aln.seq == var])$qname
## print them
print(reads)
## save to file - you can change the name of this file if you want it will be written to your current working dir
write(reads,'pattern_file.txt')

## now you can do this on the command line to make a subset sam file
## header copy
print(paste0('samtools view -H ',normalizePath(mybam),' > ',gsub('\\.bam$','_subset.sam',mybam)))
## extract relevant reads
print(paste0("samtools view ",normalizePath(mybam)," '",varposition,"' | grep -f pattern_file.txt - >> ",gsub("\\.bam$","_subset.sam",mybam)))
ADD COMMENT
0
Entering edit mode

Here is the output of the last two print statements for me:

samtools view -H /home/biostars/my.bam > my_subset.sam
samtools view /home/biostars/my.bam '10:89720633' | grep -f pattern_file.txt - >> my_subset.sam
ADD REPLY
0
Entering edit mode

Also the strand param I had originally added isn't really used but you could probably fix the function up to take the actual variant (in the mRNA) and flip the sequence/strand for the reads when needed instead of being forced to use the '+' strand variant definition.

ADD REPLY
0
Entering edit mode

Thanks a lot for your reply. But I got this error message :

> aln.seq <- stackStringsFromGAlignments(aln,region=var.gr)
   Error in .normarg_at2(at, x) :
   some ranges in 'at' are off-limits with respect to their corresponding
   sequence in 'x'

Could it be because I work on long pacbio reads? My BAM file is OK I guess.

ADD REPLY
0
Entering edit mode

what does this give you? width(mcols(aln)$seq)

ADD REPLY
0
Entering edit mode

This gives me :

width(mcols(aln)$seq)

 [1] 1689    0    0 1563 1613 1601 1602 1614 1614 1614 1614 1614 1745 1476
 [15] 1478 1476 1408 1407 1405 1405  927  926  927  976  622  612  758 2030
 [29] 2030 2021 2081 1967 1967 1163 1692 1135 1135 1369 1369 1369  918  921
 [43]  684  972  924 1248 1265 1253 1254 1251 1254 1248 1249    0    0    0
 [57]    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 [71]    0    0    0    0    0  746  747  746  746  746 2406 2129 1927 1922
 [85] 1918 1923 1919 1925 1928 1922 1976 1941 2029 2070 2071 1915 1943 2061
 [99] 1944 2576 2049 1916 2041 1941 1918 1942 1947 1941 1823 2048 2092 1923
 [113] 2046 1823 2053 1946 2048 2104 2089 1827 1825 1736 1699 1698 2167 1983
 [127] 1701 1802 2162 1701 1977 1980 2164 1978 1980 1976 2167 1701 2170 1738
    ...
 [19979]  971 1234  728  733  804  703  506  825  757  474 1128  204  203  611
 [19993]  611  854    0    0    0    0  637  637  627    0    0    0    0    0
 [20007]    0    0  629    0    0    0    0    0  626  653    0  576  421    0
 [20021]    0  580  589  363    0    0  509  615    0  616    0  613  601

I remove the data where width(mcols(aln)$seq) gives back "0" . From my initial BAM file, I removed the alignments where the length was equal to "0" :

samtools view -hbq 1 my.bam  > filtered.my.bam 
ADD REPLY
0
Entering edit mode

So did it work after that? You must have done aln <- aln[width(mcols(aln)$seq) > 0]

ADD REPLY
1
Entering edit mode

Both way works : either with aln <- aln[width(mcols(aln)$seq) > 0] or samtools view -hbq 1 my.bam > filtered.my.bam

ADD REPLY

Login before adding your answer.

Traffic: 1769 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6