Extract Allele At Particular Positions From Bam Files
4
4
Entering edit mode
10.7 years ago
guns.guru ▴ 80

Is there an easy way to extract the allele (monomorphic or polymorphic wrt the reference) for a sample across all reads at specific locations, given its BAM file?

Example: The base at chr1 pos 45800 for that sample from its BAM file?

I do not want the allele from each read, but only the consensus. It is similar to an old post here: Get all alleles of 1 column in bam file

Thanks in advance


Update-1

I figure that samtools mpileup.. is the way to go, but I've been fiddling around for the past 2 days, but still haven't figured out the right set of flags.

This is what I got so far... (SNP_loci.txt is the file with the positions (chr pos) I want to extract the allele/genotype at)

$samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam
chr1        1000379 T       22      cCCcCcCCCCcCCCCcCccccc  IIIIIIIIIIIIIIIIIIIIII
**chr1        1000400 A       19      .,...,....,.,,,,,,.     IIIIIIIIIIIIIIIIIII**
chr1        100017386       G       6       .$aAAAa IIIIII
chr1        100019654       T       19      CcCCccCccCC$cCcccccc    IIIIIIIIIIIIIIIIIII
chr1        100019657       G       18      TtTTttTttTtTtttttt      IIIIIIIIIIIIIIIIII
chr1        100019756       A       12      gggGgggGGGGg    IIIIIIIIIIII
chr1        100020740       A       5       cCccc   IIIII
chr1        100022994       G       7       aAaaaaa IIIIIII
chr1        100023027       G       7       aAaaaaa IIIIIII
chr1        100030383       G       11      AAAAaAAaaAA     IIIIIIIIIII

Now, do I need to manually parse this based on the pileup format?

Wait, I suppose it could be piped with bcftools this way.. and then use vcftools

samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam | bcftools view -bvcg - > sample1.raw.bcf

But it ONLY reports variants/SNPs, however I need the match with reference too... basically the allele/genotype, whatever it is! I tried without the -vcg options, but the output is too filthy.

Basically I wish I could have something report this for the above data, with some quality attribute...

chr1        1000379 T       22      C
**chr1        1000400 A       19      A**
chr1        100017386       G       6       A
chr1        100019654       T       19      C
chr1        100019657       G       18      T
chr1        100019756       A       12      G
chr1        100020740       A       5       C
chr1        100022994       G       7       A
chr1        100023027       G       7       A
chr1        100030383       G       11      A

I know this is totally possible, but just can't figure how using samtools? Lack of knowledge & hence trying to reach out the experts out thr! Any suggestions please?


Update-2

Here are my attempts in more detail:

(1) With -vcg option for bcf tools (-v option reporting only variants/SNPs) - This format is perfect, however I'll miss the monomorphic alleles, which I NEED to keep (the second line above i.e., chr1:1000400)...

$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -bvcg - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf
chr1        1000379 .       T       C       99      .       DP=22;AF1=1;CI95=1,1;DP4=0,0,12,10;MQ=60        PL:GT:GQ        255,66,0:1/1:99
chr1        100017386       .       G       A       99      .       DP=6;AF1=0.5016;CI95=0.5,0.5;DP4=1,0,3,2;MQ=60;PV4=1,1,1,1      PL:GT:GQ        156,0,22:0/1:25
chr1        100019654       .       T       C       99      .       DP=19;AF1=1;CI95=1,1;DP4=0,0,7,12;MQ=60 PL:GT:GQ        255,57,0:1/1:99
chr1        100019657       .       G       T       99      .       DP=18;AF1=1;CI95=1,1;DP4=0,0,6,12;MQ=60 PL:GT:GQ        255,54,0:1/1:99
chr1        100019756       .       A       G       99      .       DP=12;AF1=1;CI95=1,1;DP4=0,0,5,7;MQ=60  PL:GT:GQ        255,36,0:1/1:99
chr1        100020740       .       A       C       99      .       DP=5;AF1=1;CI95=0.5,1;DP4=0,0,1,4;MQ=60 PL:GT:GQ        164,15,0:1/1:75
chr1        100022994       .       G       A       99      .       DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60   PL:GT:GQ        198,21,0:1/1:84
chr1        100023027       .       G       A       99      .       DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60   PL:GT:GQ        198,21,0:1/1:84
chr1        100030383       .       G       A       99      .       DP=11;AF1=1;CI95=1,1;DP4=0,0,8,3;MQ=60  PL:GT:GQ        255,33,0:1/1:99

(2) Without the -vcg option, reports all alleles. Yes, the monomorphic one too!! However, what does this X mean? There is no quality scores attached to the calls. How do I parse this into something similar to in (1)? Can I trust those other SNP calls??

$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -b - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf

chr1        1000379 .       T       C,X     0       .       DP=22;I16=0,0,12,10,0,0,880,35200,0,0,1320,79200,0,0,390,8052   PL      255,66,255,0,66,255
**chr1        1000400 .       A       X       0       .       DP=19;I16=10,9,0,0,760,30400,0,0,1140,68400,0,0,344,7816,0,0    PL      0,57,255**
chr1        100017386       .       G       A,X     0       .       DP=6;I16=1,0,3,2,40,1600,200,8000,60,3600,300,18000,0,0,117,2759        PL      156,0,159,22,37,188
chr1        100019654       .       T       C,X     0       .       DP=19;I16=0,0,7,12,0,0,760,30400,0,0,1140,68400,0,0,367,8105    PL      255,57,255,0,57,255
chr1        100019657       .       G       T,X     0       .       DP=18;I16=0,0,6,12,0,0,720,28800,0,0,1080,64800,0,0,370,8338    PL      255,54,255,0,54,255
chr1        100019756       .       A       G,X     0       .       DP=12;I16=0,0,5,7,0,0,480,19200,0,0,720,43200,0,0,208,4430      PL      255,36,255,0,36,255
chr1       100020740       .       A       C,X     0       .       DP=5;I16=0,0,1,4,0,0,200,8000,0,0,300,18000,0,0,125,3125        PL      164,15,164,0,15,164
chr1        100022994       .       G       A,X     0       .       DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,160,3696       PL      198,21,198,0,21,198
chr1        100023027       .       G       A,X     0       .       DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,145,3325       PL      198,21,198,0,21,198
chr1        100030383       .       G       A,X     0       .       DP=11;I16=0,0,8,3,0,0,440,17600,0,0,660,39600,0,0,239,5415      PL      255,33,255,0,33,255

Can someone please throw their experience at this? I'm really struggling with this... when the answer doesn't seem too far... It might just take a few mins for you, but will help me a lot!!

Thanks a bunch in advance!!!

bam consensus allele • 15k views
ADD COMMENT
1
Entering edit mode

This would seem to be the same as normal SNP calling then, yes?

ADD REPLY
0
Entering edit mode

Ryan, thanks again... yes and no... I explain better below! I want to call the genotype - polymorphic or monomorphic (same as reference)

ADD REPLY
0
Entering edit mode

I moved your answer to an update on your question.

ADD REPLY
1
Entering edit mode

You used the -v option, which explicitly tells bcftools to only report variant sites...

ADD REPLY
0
Entering edit mode

Ryan, I understand. But without using -v (meaning leaving out -cg too)... I'm unsure of the output format like reported above (in update-2). Thanks for correcting my post btw.

ADD REPLY
1
Entering edit mode

-v forces -c to be on, but you can use -cg without using -v.

ADD REPLY
0
Entering edit mode

Yes, I used just -cg too. It doesn't report the monomorphic alleles. Output same as ONLY SNPs/variants

ADD REPLY
4
Entering edit mode
10.7 years ago
guns.guru ▴ 80

I think I figured the problem... Woohoo!! The vcfutils.pl is filtering out these monomorphic reference alleles, while bcftools reports them OK! Was only using vcfutils for varFilter -D max depth purposes, but who knew it was also the culprit here...

So, like Ryan suggested, do not use the -v option & I'm rerunning my SNP calling routine as follows. No vcfutils!!

samtools mpileup -I -uBf ref.fa sample1.bam | bcftools view -bcg - > sample1.bcf
bcftools view sample1.bcf > sample1.vcf

Thanks so much Ryan!!

ADD COMMENT
0
Entering edit mode

Glad you got it working and thanks for posting the update!

ADD REPLY
2
Entering edit mode
10.7 years ago
hliang ▴ 100

The new samtools mpileup can generate VCF output, although without genotype calling.

For those who might be interested in using a newer version (0.2.0-rc6) of samtools and bcftools, you can try this way:

samtools mpileup -uvDV -l sites.bed -f ref.fa input.bam | bcftools call -c - > output.vcf
ADD COMMENT
0
Entering edit mode

Note that this command generates some warnings with the latest version of samtools

[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.
[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.
ADD REPLY
0
Entering edit mode

This solution works, the previous answers no longer work with the new version of samtools.

ADD REPLY
0
Entering edit mode
10.7 years ago
guns.guru ▴ 80

Thanks for the tip hliang! That's good to know. Actually, with my existing Version: 0.1.12a (r862) a single line command seems to work too..

samtools mpileup -uBI -l SNP_sites.bed -f ref.fa input.bam | bcftools view -cg - > output.vcf
ADD COMMENT
0
Entering edit mode

I've recently suggested Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File on a similar post. Then you could check easily check the genotype.

ADD REPLY

Login before adding your answer.

Traffic: 2152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6