As with almost every life problem, this is one that can [surprisingly] be handled by mpileup
:
cat rsBED/rs2857845.bed
chr4 3026385 3026386
bcftools mpileup --regions-file rsBED/rs2857845.bed \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
input.bam \
--fasta-ref refdata-GRCh38-2.1.0/fasta/genome.fa
chr4 3026386 . A T,<*> 0 . DP=20;ADF=6,6,0;ADR=2,4,0;AD=8,10,0;I16=6,2,6,4,295,11299,365,13851,480,28800,600,36000,183,4439,203,4917;QS=0.44697,0.55303,0;VDB=0.558487;SGB=-0.670168;RPB=0.906029;MQB=1;MQSB=1;BQB=0.991158;MQ0F=0 PL:DP:SP:ADF:ADR:AD 224,0,174,248,204,255:18:2:6,6,0:2,4,0:8,10,0
Let's pipe it into bcftools query
to tidy it up nicely:
bcftools mpileup --regions-file rsBED/rs2857845.bed \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
input.bam \
--fasta-ref refdata-GRCh38-2.1.0/fasta/genome.fa |\
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT[\t%AD]\n'
chr4 3026386 A T,<*> 8,10,0
We don't even need the [really] biased Buffalo New York reference genome - just add --no-reference
to bcftools mpileup
bcftools mpileup --regions-file rsBED/rs2857845.bed \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
input.bam \
--no-reference |\
bcftools query --format '%CHROM\t%POS\t%ALT[\t%AD]\n'
chr4 3026386 T,A,<*> 0,11,8,0
Be also weary of filter flags, such as:
-q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT skip bases with baseQ/BAQ smaller than INT [13]
Kevin
Here's a one-liner that will do it:
thank you chris, your solution is perfect for what i have asked but when i saw the output, i realized that the position i need is not with respect to read, but it is based on reference. i have commented to Arun above in a bit detail.. can you suggest any solution for this?
I ran across a problem similar to yours, and when googling to find a solution I found this thread. Since Pierre's solution did not work for me I wrote a GATKWalker of my own to solve the issue.
You run it like this:
And it will generate output that looks like this:
It's available in my fork of gatk at github. Clone it from there, and then build it and you should be able to run it:
Is it something like this you were looking for? I realize that this a old thread, but I still think that I would leave this answer for completeness sake.
NOTE: This has now been fixed, and the
<>
are no longer an issues, so this comment may be disregarded - @RamRS on 22-Aug-2020You have to remove the "<>" around the git url, Biostars seems to add it automatically, and I don't know how to remove them.
Thank you for your reply. But i am not looking to get for a interval. you can read my comment from above.. where i am not just looking for bases from read. I am looking for bases after mapping to reference i.e it has to take gaps in to consideration and get as output.. I have some specific snp positions and like to get bases on reads for those positions.. hope i havent confused you..
I got it that you wanted the bases relative to the reference, and that is what this will give you. so if you specify --intervalToGet chr1:1-3, you would get the first three bases from chromosome 1. However, it won't be able to handle gaps. I don't think that it should be to much work to check if it is a gap on that position by looking at the cigar string, but I don't have the time for it at the moment. Feel free to fork the code and do it your self how ever. :) Hope you find a solution.
thank you johan, also can i give multiple locations? like instead of chr1:1-3, i wanted the bases for following positions on chr:1 - 1, 10, 15 , 20 , 25 , 30. how can i get this information?
It's not possible in the current implementation, I'm afraid. But I might fix that later on, since I think I might need it myself. But right now you have to run it once for each locus, which isn't very practical if you have a lot of loci.
I've been looking at this walker since I needed exactly what is described here; however I'm slightly worried about the correctness of the output.
Can you indicate if your walker handles orientation and indels correctly?
If I count the base frequencies from pileup or IGV and compare these to the output of the walker, the latter values can be way off.
I'm aware of the problem and have sent you a PM.
Hi Johan,
I have cloned the link to the command line but the syntax you've written doesn't work. Do I need to clone this first (GenomeAnalysisTK.jar) into the command line?
Did you managed to find a way to recover specific bases from your reads based on position from bam file ? I have the same problem ! Thx
thank you for the answer. But this doesnt help. because when i give this command as input, it again prints the bam file at that positions but not the bases. For example, i have few positions where i got the SNPS, lets say 5, 10 ,15 20 and not all reads will have this variation right !! so for read 1 at 5, 10,15,20 - ATGC , Read 2 : ATCG , Read3 : AACG, Read4: AACC .. like that.. i have some 15- 20 positions and i have to extract all the bases on all reads for those positions.
Its not very simple as it looks like. Using this command, i will get the bases on the reads just by giving that position, but i want it according to reference. please read my comments under my post above. that might give you better idea of what i am looking for and help me if you figure out a way to do it.. so in simple if i have an indel in a read for example from pos 5-10 (5bases) and i wanted to extract the 11th base.
Technically using the above command i will get 11th base on read but when it maps to reference, the 11th base on reference will be 11+5 i.e 16th base pair on the read as we have 5bp indel in between the read. So it should take gaps, mismatches, indels in to account for calculating that base.
Hope this is clear.. i have been searching for this and to my surprise no program is able to achieve this :(
did you find a solution ? I have the same problem !
Chris' solution is neat! However, could you explain why you'd want to extract bases at the same location from all the reads, given that they have different start and end mapped positions?
thank you but this not what i need. I guess i should have made my question clear..sorry about that.. So once the read maps to the reference, i wanted to give input a position based on the reference say 100th position on reference, i need all bases from the reads which are mapping at that particular position including the readname. if i use the solution below, its printing the 100th base on the reads which is not equal to what i need because if there are indels in reference or reads, then the position of bases changes completely. So is there anyway to get what i need? i tried with mpileup. Its printing out all the bases at that positions but not the readnames.
Can you explain why you need the read name and not just the list of bases via mpileup?
ok sure.. I am working on a polyploid organism which has two genomes in it. goal is to seperate them by identifying the markers. So when i identify the snps between them i could not find good heterozygous snps between them. So i wanted to get a haplotype for the read. for that, i use the snps information and build this. As i have known the snp positions, i would like to get bases on other reads as well . this is amplicon sequencing so the read length is around 380bp. this is one part to seperate two genomes in a genotype. Also i would like to seperate two genotypes by using this. I rename reads according to specific genotype and once i get the haplotype information for the reads, i can seperate two genotypes easily based on read names.
i tried several haplotype callers but they doesnt seem to solve my problem. So i thought of starting from base i.e bam file to get such information out.. hope i am clear with what i need. Can you suggest a better approach for this?
How to do this in R?