Is there an easy way to extract the allele (monomorphic or polymorphic wrt the reference) for a sample across all reads at specific locations, given its BAM file?
Example: The base at chr1 pos 45800 for that sample from its BAM file?
I do not want the allele from each read, but only the consensus. It is similar to an old post here: Get all alleles of 1 column in bam file
Thanks in advance
Update-1
I figure that samtools mpileup..
is the way to go, but I've been fiddling around for the past 2 days, but still haven't figured out the right set of flags.
This is what I got so far... (SNP_loci.txt is the file with the positions (chr pos) I want to extract the allele/genotype at)
$samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam
chr1 1000379 T 22 cCCcCcCCCCcCCCCcCccccc IIIIIIIIIIIIIIIIIIIIII
**chr1 1000400 A 19 .,...,....,.,,,,,,. IIIIIIIIIIIIIIIIIII**
chr1 100017386 G 6 .$aAAAa IIIIII
chr1 100019654 T 19 CcCCccCccCC$cCcccccc IIIIIIIIIIIIIIIIIII
chr1 100019657 G 18 TtTTttTttTtTtttttt IIIIIIIIIIIIIIIIII
chr1 100019756 A 12 gggGgggGGGGg IIIIIIIIIIII
chr1 100020740 A 5 cCccc IIIII
chr1 100022994 G 7 aAaaaaa IIIIIII
chr1 100023027 G 7 aAaaaaa IIIIIII
chr1 100030383 G 11 AAAAaAAaaAA IIIIIIIIIII
Now, do I need to manually parse this based on the pileup format?
Wait, I suppose it could be piped with bcftools this way.. and then use vcftools
samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam | bcftools view -bvcg - > sample1.raw.bcf
But it ONLY reports variants/SNPs, however I need the match with reference too... basically the allele/genotype, whatever it is! I tried without the -vcg options, but the output is too filthy.
Basically I wish I could have something report this for the above data, with some quality attribute...
chr1 1000379 T 22 C
**chr1 1000400 A 19 A**
chr1 100017386 G 6 A
chr1 100019654 T 19 C
chr1 100019657 G 18 T
chr1 100019756 A 12 G
chr1 100020740 A 5 C
chr1 100022994 G 7 A
chr1 100023027 G 7 A
chr1 100030383 G 11 A
I know this is totally possible, but just can't figure how using samtools? Lack of knowledge & hence trying to reach out the experts out thr! Any suggestions please?
Update-2
Here are my attempts in more detail:
(1) With -vcg option for bcf tools (-v option reporting only variants/SNPs) - This format is perfect, however I'll miss the monomorphic alleles, which I NEED to keep (the second line above i.e., chr1:1000400)...
$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -bvcg - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf
chr1 1000379 . T C 99 . DP=22;AF1=1;CI95=1,1;DP4=0,0,12,10;MQ=60 PL:GT:GQ 255,66,0:1/1:99
chr1 100017386 . G A 99 . DP=6;AF1=0.5016;CI95=0.5,0.5;DP4=1,0,3,2;MQ=60;PV4=1,1,1,1 PL:GT:GQ 156,0,22:0/1:25
chr1 100019654 . T C 99 . DP=19;AF1=1;CI95=1,1;DP4=0,0,7,12;MQ=60 PL:GT:GQ 255,57,0:1/1:99
chr1 100019657 . G T 99 . DP=18;AF1=1;CI95=1,1;DP4=0,0,6,12;MQ=60 PL:GT:GQ 255,54,0:1/1:99
chr1 100019756 . A G 99 . DP=12;AF1=1;CI95=1,1;DP4=0,0,5,7;MQ=60 PL:GT:GQ 255,36,0:1/1:99
chr1 100020740 . A C 99 . DP=5;AF1=1;CI95=0.5,1;DP4=0,0,1,4;MQ=60 PL:GT:GQ 164,15,0:1/1:75
chr1 100022994 . G A 99 . DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60 PL:GT:GQ 198,21,0:1/1:84
chr1 100023027 . G A 99 . DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60 PL:GT:GQ 198,21,0:1/1:84
chr1 100030383 . G A 99 . DP=11;AF1=1;CI95=1,1;DP4=0,0,8,3;MQ=60 PL:GT:GQ 255,33,0:1/1:99
(2) Without the -vcg option, reports all alleles. Yes, the monomorphic one too!! However, what does this X mean? There is no quality scores attached to the calls. How do I parse this into something similar to in (1)? Can I trust those other SNP calls??
$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -b - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf
chr1 1000379 . T C,X 0 . DP=22;I16=0,0,12,10,0,0,880,35200,0,0,1320,79200,0,0,390,8052 PL 255,66,255,0,66,255
**chr1 1000400 . A X 0 . DP=19;I16=10,9,0,0,760,30400,0,0,1140,68400,0,0,344,7816,0,0 PL 0,57,255**
chr1 100017386 . G A,X 0 . DP=6;I16=1,0,3,2,40,1600,200,8000,60,3600,300,18000,0,0,117,2759 PL 156,0,159,22,37,188
chr1 100019654 . T C,X 0 . DP=19;I16=0,0,7,12,0,0,760,30400,0,0,1140,68400,0,0,367,8105 PL 255,57,255,0,57,255
chr1 100019657 . G T,X 0 . DP=18;I16=0,0,6,12,0,0,720,28800,0,0,1080,64800,0,0,370,8338 PL 255,54,255,0,54,255
chr1 100019756 . A G,X 0 . DP=12;I16=0,0,5,7,0,0,480,19200,0,0,720,43200,0,0,208,4430 PL 255,36,255,0,36,255
chr1 100020740 . A C,X 0 . DP=5;I16=0,0,1,4,0,0,200,8000,0,0,300,18000,0,0,125,3125 PL 164,15,164,0,15,164
chr1 100022994 . G A,X 0 . DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,160,3696 PL 198,21,198,0,21,198
chr1 100023027 . G A,X 0 . DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,145,3325 PL 198,21,198,0,21,198
chr1 100030383 . G A,X 0 . DP=11;I16=0,0,8,3,0,0,440,17600,0,0,660,39600,0,0,239,5415 PL 255,33,255,0,33,255
Can someone please throw their experience at this? I'm really struggling with this... when the answer doesn't seem too far... It might just take a few mins for you, but will help me a lot!!
Thanks a bunch in advance!!!
This would seem to be the same as normal SNP calling then, yes?
Ryan, thanks again... yes and no... I explain better below! I want to call the genotype - polymorphic or monomorphic (same as reference)
I moved your answer to an update on your question.
You used the
-v
option, which explicitly tells bcftools to only report variant sites...Ryan, I understand. But without using
-v
(meaning leaving out-cg
too)... I'm unsure of the output format like reported above (in update-2). Thanks for correcting my post btw.-v
forces-c
to be on, but you can use-cg
without using-v
.Yes, I used just
-cg
too. It doesn't report the monomorphic alleles. Output same as ONLY SNPs/variants